[Bro-Dev] Broker raw throughput
dominik.charousset at haw-hamburg.de
Tue Mar 8 07:09:30 PST 2016
> the benchmark no longer
> terminates and the server quickly stops getting data, and I would like
> to know why.
I'll have a look at it.
> I've tried various parameters for the scheduler throughput, but they do
> not seem to make a difference. Would you mind taking a look at what's
> going on here?
The throughput parameter does not apply to network inputs, so you only modify how many integers per scheduler run the server receives. You could additionally try to tweak caf#middleman.max_consecutive_reads, which configures how many new_data_msg messages a broker receives from the backend in a single shot. It makes sense to have the two separated, because one configures fairness in the scheduling and the other fairness of connection multiplexing.
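To illustrate the second knob, here is a minimal sketch of what a cap on consecutive reads does for connection multiplexing. This is a toy model, not CAF's multiplexer: the `connection` struct, the `multiplex` function, and the integer "chunks" are all hypothetical names made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical stand-in for one TCP connection with pending readable chunks.
struct connection {
  int id;
  std::deque<int> pending; // each element models one chunk of incoming data
};

// Serves connections round-robin and reads at most `max_consecutive_reads`
// chunks from one connection before moving on to the next one.
// Returns the connection ids in the order their chunks were handled.
std::vector<int> multiplex(std::vector<connection> conns,
                           std::size_t max_consecutive_reads) {
  assert(max_consecutive_reads > 0);
  std::vector<int> order;
  bool progress = true;
  while (progress) {
    progress = false;
    for (auto& c : conns) {
      for (std::size_t i = 0;
           i < max_consecutive_reads && !c.pending.empty(); ++i) {
        c.pending.pop_front();
        order.push_back(c.id);
        progress = true;
      }
    }
  }
  return order;
}
```

With a cap of 1, a busy connection and a quiet one take turns. With a larger cap, the busy connection is drained first and the quiet one has to wait. The scheduler throughput setting plays the analogous role for messages per actor run, which this sketch does not model.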
> It looks like the "sender overload protection" you
> mentioned is not working as expected.
The new feature "merely" allows (CAF) brokers to receive messages from the backend when data is transferred. This basically lifts TCP's backpressure up to the broker level. When blindly throwing messages at remote actors, there's nothing CAF could do about it. However, the new broker feedback will be one piece of the puzzle when implementing flow control in CAF later on.
> I'm also attaching a new gperftools profiler output from the client and
> server. The server is not too telling, because it was spinning idle for
> a bit until I ran the client, hence the high CPU load in nanosleep.
> Looking at the client, it seems that only 67.3% of time is spent in
> local_actor::resume, which would mean that the runtime adds 33.7%
The call to resume() happens in the BASP broker, which dumps the messages to its output buffer. So the 67% load includes serialization, etc. 28.3% of the remaining load is accumulated in main().
> Still, why is intrusive_ptr::get consuming 27.9%?
The 27.9% is accumulating all load down the path, isn't it? intrusive_ptr::get itself simply returns a pointer: https://github.com/actor-framework/actor-framework/blob/d5f43de65c42a74afa4c979ae4f60292f71e371f/libcaf_core/caf/intrusive_ptr.hpp#L128
> Looking on the left tree, it looks like this workload stresses the
> allocator heavily:
> - 20.4% tc_malloc_skip_new_handler
> - 7% std::vector::insert in the BASP broker
> - 13.5% CAF serialization (adding two out-edges from
> basp::instance::write, 5.8 + 7.5)
Not really surprising. You are sending integers around. Each integer has to be wrapped in a heap-allocated message which gets enqueued to an actor's mailbox. By using many small messages, you basically maximize the messaging overhead.
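As a rough sketch of why this is expensive, consider a toy mailbox that needs one heap-allocated envelope per value. The `envelope` and `mailbox` types and the `send_individually` function are hypothetical and only model the pattern. They are not CAF's message or mailbox classes.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <queue>
#include <vector>

// Hypothetical message type: one heap-allocated envelope per integer.
struct envelope {
  int value;
};

// Toy mailbox that counts how many envelopes were allocated for it.
struct mailbox {
  std::queue<std::unique_ptr<envelope>> queue;
  std::size_t allocations = 0;

  void enqueue(int value) {
    queue.push(std::make_unique<envelope>(envelope{value}));
    ++allocations;
  }
};

// Sends each integer as its own message and returns the number of
// envelope allocations this caused (one per integer).
std::size_t send_individually(mailbox& mb, const std::vector<int>& values) {
  for (int v : values)
    mb.enqueue(v);
  return mb.allocations;
}
```

The payload here is four bytes per message, while each message still pays for an allocation, a queue operation, and later a deallocation. That fixed cost is what dominates when the payload is this small.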
> Switching gears to your own performance measurements: it sounded like
> that you got gains at the order 400% when comparing just raw byte
> throughput (as opposed to message throughput). Can you give us an
> intuition how that relates to the throughput measurements we have been
At the lowest level, a framework like CAF ultimately needs to efficiently manage buffers and events provided by the OS. That's the functionality of recv/send/poll/epoll and friends. That's what I was looking at, since you can't get good performance if you have problems at that level (which, as it turned out, CAF had).
Moving a few layers up, some overhead is inherent in a messaging framework, for example stressing the heap (see the 20% load in tc_malloc_skip_new_handler) when sending many small messages.
From the gperf output (just looking at the client), I don't see that much CPU time spent in CAF itself. If I sum up CPU load from std::vector (6.2%), tcmalloc (20.4%), atomics (8%) and serialization (12.4%), I'm already at 47% out of 70% total for the multiplexer (default_multiplexer::run).
Pattern matching (caf::detail::try_match) causes less than 6% CPU load, so that does not seem to be an issue. Serialization has 12% CPU load, which probably mostly results from std::copy (cut out after std::function, unfortunately). So, I don't see that many optimization opportunities in these components.
Tackling the "many small messages problem" isn't going to be easy. CAF could try to wrap multiple messages from the network into a single heap-allocated storage that is then shipped to an actor as a whole, but this optimization would have a high complexity.
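Just to sketch the idea, and not how CAF would actually do it: the receiving side could decode everything that arrived in one read into a single batch and ship that as a whole. The wire format below (consecutive native-endian 32-bit integers) and the names `batch` and `drain` are assumptions made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <vector>

// Hypothetical batch envelope: one envelope carrying many values.
struct batch {
  std::vector<std::int32_t> values;
};

// Decodes every complete 4-byte integer in `buf` into a single batch.
// Assumed wire format: consecutive native-endian int32 values.
// Trailing bytes that do not form a whole integer stay in `buf`
// until more data arrives.
std::unique_ptr<batch> drain(std::vector<char>& buf) {
  const std::size_t n = buf.size() / sizeof(std::int32_t);
  const std::size_t bytes = n * sizeof(std::int32_t);
  auto result = std::make_unique<batch>();
  result->values.resize(n);
  if (n > 0)
    std::memcpy(result->values.data(), buf.data(), bytes);
  buf.erase(buf.begin(), buf.begin() + static_cast<std::ptrdiff_t>(bytes));
  return result;
}
```

This reduces the number of envelopes from one per integer to one per read. The hard part, which the sketch ignores, is that a real receiver would have to handle messages of different types and different destination actors within the same buffer.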
These are of course just some thoughts after looking at the gperf output you provided. I'll hopefully have new insights after looking at the termination problem in detail.