Variability of BBR throughput #1758

Open · huitema opened this issue Sep 30, 2024 · 22 comments
@huitema commented Sep 30, 2024

A simple throughput test can be done using picoquicdemo over the local loopback on a Windows PC. (The same test on Unix devices is less interesting, because the loopback path on Linux is very different from what happens with actual network sockets.) The test works by starting a server in one terminal window with `picoquicdemo.exe -p 4433`, then running a client from another window with `picoquicdemo.exe -D -G {bbr|cubic} -n test ::1 4433 /1000000000`. After running 5 tests using BBR and another 5 using Cubic on a Dell laptop (Dell XPS 16, Intel(R) Core(TM) Ultra 9 185H 2.50 GHz, Windows 11 23H2), we get the following results:

| CC (on PC) | Gbps | Pkts/train | Loss rate | CWIN (pkts) | Pacing (Gbps) |
| ---------- | ---- | ---------- | --------- | ----------- | ------------- |
| cubic | 3.2 | 16 | 0.11% | 878 | 226.5 |
| cubic | 3.1 | 17 | 0.05% | 4477 | 645.9 |
| cubic | 2.9 | 16 | 0.09% | 466 | 415.5 |
| cubic | 2.6 | 17 | 0.17% | 359 | 750.9 |
| cubic | 3.1 | 17 | 0.11% | 345 | 802.2 |
| bbr | 1.5 | 2 | 0.08% | 137 | 2.3 |
| bbr | 2.4 | 6 | 0.05% | 279 | 3.4 |
| bbr | 1.1 | 2 | 0.11% | 269 | 1.8 |
| bbr | 1.8 | 3 | 0.08% | 182 | 1.7 |
| bbr | 1.7 | 3 | 0.20% | 273 | 2.7 |

For each connection, we list:

  • the congestion control algorithm;
  • the average throughput measured when sending 1 GB of data;
  • the number of packets sent in a single sendmsg call, which we call packets per train;
  • the packet loss rate, defined as the number of packets repeated divided by the total number of packets;
  • the congestion window, expressed in packets, observed at the end of the session;
  • the pacing rate observed at the end of the session.

The obvious conclusion is that Cubic is much faster in this specific test, with an average data rate of about 3 Gbps, versus about 1.7 Gbps for BBR. Much of the slowdown in the BBR tests is due to the conservative pacing rate, which leads to a small number of packets per sendmsg call. The pacing rate is the main control parameter in BBR, and it does not evolve to reflect the capacity of the loopback path.

We expect some variability between tests, for example because other programs and services running on the laptop may cause variations in available CPU. We do see some of that variability when using Cubic, with observed data rates between 2.6 and 3.2 Gbps, but we see a much higher variability using BBR, with data rates between 1.1 Gbps and 2.4 Gbps. This confirms other observations that small changes in the environment can produce big variations in BBR performance. We need to investigate the cause of these variations.

@huitema commented Oct 1, 2024

Investigating BBR requires logging in a way that does not impact performance. The current options, qlog or even the binary log, are not adequate: turning the binary log on for a 1 GB transfer drops the data rate to 400 Mbps while producing a 200 MB file. Instead, we should log the key variables in memory and save them to a CSV file at the end of the connection.
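
A minimal sketch of what such in-memory logging could look like. All names here are illustrative, not existing picoquic APIs: sampling is a plain memory write on the hot path, and the file I/O happens only once, after the connection completes.

```c
/* Hypothetical sketch of low-overhead in-memory logging; perf_sample_t,
 * perf_log_sample and perf_log_flush are illustrative names, not picoquic APIs. */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define PERF_LOG_MAX 65536

typedef struct st_perf_sample_t {
    uint64_t time_us;         /* microsecond timestamp of the sample */
    uint64_t pacing_rate;     /* pacing rate, bytes per second */
    uint64_t cwin;            /* congestion window, bytes */
    uint64_t bytes_in_flight;
    uint64_t smoothed_rtt;    /* microseconds */
} perf_sample_t;

static perf_sample_t perf_log[PERF_LOG_MAX];
static size_t perf_log_count = 0;

/* Called on every ACK (or on a timer); just a memory write, no I/O. */
static void perf_log_sample(uint64_t now, uint64_t pacing_rate,
    uint64_t cwin, uint64_t in_flight, uint64_t srtt)
{
    if (perf_log_count < PERF_LOG_MAX) {
        perf_sample_t* s = &perf_log[perf_log_count++];
        s->time_us = now;
        s->pacing_rate = pacing_rate;
        s->cwin = cwin;
        s->bytes_in_flight = in_flight;
        s->smoothed_rtt = srtt;
    }
}

/* Called once after the connection closes; all file I/O happens here. */
static int perf_log_flush(const char* path)
{
    FILE* f = fopen(path, "w");
    if (f == NULL) {
        return -1;
    }
    fprintf(f, "time_us,pacing_rate,cwin,bytes_in_flight,smoothed_rtt\n");
    for (size_t i = 0; i < perf_log_count; i++) {
        fprintf(f, "%llu,%llu,%llu,%llu,%llu\n",
            (unsigned long long)perf_log[i].time_us,
            (unsigned long long)perf_log[i].pacing_rate,
            (unsigned long long)perf_log[i].cwin,
            (unsigned long long)perf_log[i].bytes_in_flight,
            (unsigned long long)perf_log[i].smoothed_rtt);
    }
    return fclose(f);
}
```

At 40 bytes per sample, the fixed buffer above costs about 2.6 MB of memory, which should not perturb the transfer the way per-event file I/O does.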

@alexrabi commented Dec 7, 2024

I am not sure whether this is related to this problem in particular or not, but I have also noticed some strange issues with BBR, likely related to the pacing rate.

In my scenario, I am using MP-QUIC together with QUIC datagrams in a VPN-like setup, where traffic is tunneled across an MP-QUIC connection. If I let the traffic going into the tunnel be split across all paths, everything seems to work well. However, splitting the traffic is not necessarily a good idea for all types of flows, because heterogeneous path characteristics may lead to excessive reordering, which in turn may cause issues such as spurious retransmissions and unnecessary CWND reductions if the traffic going into the tunnel is itself congestion controlled. It may therefore be preferable to limit the tunnel to a single MP-QUIC path in certain scenarios. However, limiting the use of paths (i.e. using the picoquic_mark_datagram_ready_path API to explicitly tell picoquic which path to send the datagrams on, while keeping the other paths "up" but inactive) seems to have some strange effects on the congestion control.
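
(For context, the path pinning would be set up roughly as below. This is a hedged sketch: the exact signature of picoquic_mark_datagram_ready_path is assumed here and should be checked against picoquic.h.)

```c
/* Hedged sketch: pin datagram traffic to one path while keeping the other
 * paths up but idle. The signature of picoquic_mark_datagram_ready_path is
 * assumed to be (connection, unique path id, ready flag); verify against
 * picoquic.h before use. */
#include <stddef.h>
#include "picoquic.h"

static void pin_datagrams_to_path(picoquic_cnx_t* cnx,
    uint64_t preferred_path_id,
    const uint64_t* other_path_ids, size_t nb_others)
{
    /* Offer datagrams only on the preferred path. */
    picoquic_mark_datagram_ready_path(cnx, preferred_path_id, 1);
    /* The other paths remain validated and probed, but are never marked
     * ready, so no datagram payload is scheduled on them. */
    for (size_t i = 0; i < nb_others; i++) {
        picoquic_mark_datagram_ready_path(cnx, other_path_ids[i], 0);
    }
}
```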

In one scenario, where I am sending UDP-based traffic with a constant bit rate (not congestion controlled) across the tunnel, I have noticed a massive amount of packet loss and very high latency for the few packets that actually made it through (~333 ms RTT, when the base RTT is ~15 ms), despite the fact that the data rate of the UDP flow is much lower than the capacity of the network. Again, I do not see the same behavior if I split the traffic across all paths, only when limiting the sending of the datagrams to one path. Setting up a connection with only a single path available (while still negotiating and agreeing to use MP-QUIC) also does not display the issue. Curiously, the problem only seems to happen on the server side. Using cubic on the server instead of BBR solves this issue, so I am fairly certain that it is not a problem with the tunnel framework, but rather that something is off with BBR; likely the pacing rate.

It could just be a coincidence of course, but the ~333 ms RTT that I have been seeing seems oddly specific, as if some default value is being used for the pacing rate. Is the pacing rate calculated for the "inactive" paths being applied to ALL paths? And why does it only seem to occur on the server side?

@huitema commented Dec 7, 2024

That's a bizarre result. The pacing rate is definitely per path. We would need some traces to understand what is happening in your scenario. Just to check, did you try setting the CC to "cubic" to see whether the issue persists?

@alexrabi commented Dec 8, 2024

I will try to produce some traces to help you figure out what is going on. As I mentioned, setting CC to cubic does not show the same issue, so it is definitely something related to BBR.

@alexrabi commented

Alright, so I have done some more testing, and it does seem to be reproducible in a few different environments. You can find logs here: logs.zip

Some interesting observations:

  • The problem occurs for BBRv1 and BBRv3 on the server side, but I have not encountered it when using any other CCA.
  • The choice of CCA on the client side does not appear to matter.
  • The problem only occurs when the path is application limited. However, if the path has been under load (i.e. the path capacity has been saturated at some point prior, e.g. by running an iperf test through the tunnel), everything seems to be working as expected even when running application limited traffic.
  • Looking at the logs, it seems like the client side is correctly identifying that it is being application limited, but that does not appear to be the case for the server side.

This may hint at some issue with the BBR startup phase, coupled with the server incorrectly assuming that it is not application limited?

@huitema commented Dec 16, 2024

The "application limited" test with BBR is indeed more difficult than with Cubic or Reno. Cubic and Reno have a single control variable, the congestion window, so a simple test of bytes_in_flight < cwnd will return whether the traffic is app limited. BBR has two control variables. The main control variable is the pacing rate, but BBR also uses the congestion window, either as a safety to not send too much traffic in the absence of feedback, or as a short term limiter if packet losses were detected. Testing on the congestion window is imprecise, because in normal scenarios pacing dominates and the bytes in flight remain below the congestion window. Testing on the pacing rate is also imprecise, because pacing is implemented with a leaky bucket. There will be brief period in which the leaky bucket will be full, letting packets go without pacing, only limiting the last packet out of a train -- but that can happen whether the application is "limiting" or not. For example, an application that sends a large frame periodically will experience pacing, even though the average traffic is well below capacity.

That's a bit of a mess. Suggestions are welcome.

@alexrabi commented

How about using the RTT as a test? BBR is different from Cubic or Reno in that it actively probes for the "non-congested" round-trip propagation time ("min_rtt") by attempting to drain the queue at the bottleneck. If the sender is application limited, it should not contribute to any queue build-up at the bottleneck, and therefore the SRTT should not differ significantly from the min_rtt. This is not foolproof, of course, as the RTT probing phase is only done periodically. Changes to path characteristics, such as increased latency due to mobility or to the addition of competing flows at the bottleneck, could therefore be misinterpreted as no longer being application limited. It might be worth combining several tests using different metrics to determine whether a path is application limited or not: e.g., only if both the pacing rate method and the method described above indicate that we are not application limited should we conclude that we are no longer application limited. A sketch of such a combined test follows.
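
A sketch of what such a combined test might look like. All structure names and thresholds below are illustrative choices, not picoquic internals:

```c
/* Hedged sketch of a combined app-limited test, per the discussion above.
 * The path is treated as app-limited only if BOTH signals agree: neither
 * control variable is binding, AND the SRTT sits close to min_rtt. */
#include <stdint.h>

typedef struct st_cc_state_t {
    uint64_t bytes_in_flight;
    uint64_t cwin;            /* congestion window, bytes */
    uint64_t pacing_bucket;   /* leaky-bucket credit, bytes */
    uint64_t pacing_quantum;  /* max burst the bucket allows, bytes */
    uint64_t smoothed_rtt;    /* microseconds */
    uint64_t min_rtt;         /* microseconds */
} cc_state_t;

static int is_app_limited(const cc_state_t* cc)
{
    /* Test 1: neither control variable is binding. Imprecise on its own,
     * because a full leaky bucket lets short trains through unpaced. */
    int not_cwnd_bound = cc->bytes_in_flight < cc->cwin;
    int not_pacing_bound = cc->pacing_bucket >= cc->pacing_quantum;

    /* Test 2: no standing queue, i.e. SRTT within ~1/8 of min_rtt.
     * The 9/8 threshold is an illustrative choice, not a derived constant. */
    int no_queue = cc->smoothed_rtt <= cc->min_rtt + cc->min_rtt / 8;

    return not_cwnd_bound && not_pacing_bound && no_queue;
}
```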

@huitema commented Dec 17, 2024

The problem is time span. Suppose an application that sends 5 Mbps of traffic on a 20 Mbps path -- pretty much the definition of "app limited". Suppose now that the 5 Mbps of traffic consists of periodic video frames, i.e., one large message 30 times per second. Each message will be sent in 8 or 9 ms, occupying the link at full rate for about a quarter of each frame interval. There will be no long term queue, but there will be an 8 or 9 ms queue built up for each frame. We can have a transient measurement that indicates saturation when the long term measurement does not.
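
For concreteness, the arithmetic behind the 8-9 ms figure:

$$\text{frame size} = \frac{5\ \text{Mb/s}}{30\ \text{fps}} \approx 167\ \text{kb}, \qquad \text{burst duration} = \frac{167\ \text{kb}}{20\ \text{Mb/s}} \approx 8.3\ \text{ms}$$

So each 33.3 ms frame interval contains an 8.3 ms burst at full link rate, a 25% duty cycle: a short term measurement sees a saturated link during the burst, while the long term average is only 5 Mbps.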

@huitema commented Dec 17, 2024

But the handling of app limited is really a different issue. The issue here was opened after tests of transfers on a loopback interface, where the application was trying to send 1 GB or 10 GB of data -- definitely not an app limited scenario. The issue is analyzed in detail in the blog "Loopback ACK delays cause Cubic slowdowns." The Cubic issue was traced to bad behavior of the "ACK Frequency" implementation when running over very low latency paths, such as the loopback path. Fixing that fixed a lot of the BBR issues, but some remain. BBR is very complex, and thus hard to tune.

@alexrabi commented

Indeed, these seem to be different issues entirely. I will create a separate issue for the application limited scenarios.

Speaking of the issue at hand, you say that the test is less interesting to run on Linux since the loopback path is very different from what happens with actual network sockets. Perhaps the issue could be replicated on Linux using network namespaces instead (e.g. Docker containers, Mininet), since that should actually make the packets go through the kernel "normally"?

@huitema commented Dec 18, 2024 via email

@joergdeutschmann-i7 commented
Very interesting discussion here. I just wanted to chime in regarding the following:

> Perhaps the issue could be replicated on Linux using network namespaces instead (e.g. Docker containers, Mininet), since that should actually make the packets go through the kernel "normally"?

I think also lightweight virtualization may impact your timings. Example (pdf).
NetEm can be really fun, too. (A colleague of mine recently fixed some NetEm issues.)
Out of helplessness, I sometimes do weird cabling setups in order to have more "realistic" topologies. Not sure if this is really a good approach. Also, my colleague created (yet another?) testbed framework based on full operating system virtualization, but we have not evaluated detailed packet timings with that framework yet.

@alexrabi commented

> I think also lightweight virtualization may impact your timings. Example (pdf).
> NetEm can be really fun, too. (A colleague of mine recently fixed some NetEm issues.)

This seems to be an issue with the netem qdisc rather than Mininet itself, no? To emulate a "loopback" interface this way we would not need to impose any artificial delays or artificial bandwidth (i.e. netem would not be required), as the CPU should be the main bottleneck, so I am not sure how applicable this problem is to this particular scenario. Good to keep in mind though!

@joergdeutschmann-i7 commented
Mininet relies on NetEm, according to the first reference and my knowledge. You're right, if you don't need artificial link emulation this might all not matter (to me it is then more a test of implementation performance than of congestion control performance), but maybe you also want to test the congestion control for different types of paths.

@alexrabi commented

Mininet only uses netem for emulating latency (the first reference even says as much). If you do not add any artificial latency to the links, then Mininet can use any "normal" qdisc such as fq or fq_codel. In any case, one has to be careful not to draw too many conclusions from emulated results; you are absolutely right.

As for the particular scenario in the first reference, I would argue that picoquic would be slightly more robust against such issues, being implemented in userspace rather than kernelspace.

@alexrabi commented

Alright, so here are some quick results from Mininet. This is on a Dell laptop (Latitude 7420, 11th Gen Intel(R) Core(TM) i5-1145G7 @ 2.60GHz, Ubuntu 22.04.5, kernel 5.14.0-1048-oem). The general trends seem reproducible.

| ccalgo | Mbps | pkt_sent | trains | retrans. | cwin | p_rate (bps) |
| ------ | ---- | -------- | ------ | -------- | ---- | ------------ |
| 2 (cubic) | 6061.915101 | 712374 | 53363 | 1095 | 150755 | 8090590000 |
| 2 (cubic) | 5147.571400 | 822327 | 61514 | 1282 | 150424 | 6005652173 |
| 2 (cubic) | 5347.354288 | 822421 | 61560 | 1373 | 140626 | 6829250000 |
| 2 (cubic) | 5160.694531 | 822263 | 61482 | 1217 | 134296 | 6027047244 |
| 2 (cubic) | 5352.337893 | 822248 | 61507 | 1200 | 144586 | 5991727941 |
| 5 (bbr) | 3789.168892 | 712562 | 249145 | 1085 | 188496 | 6285644891 |
| 5 (bbr) | 4696.508843 | 821630 | 198869 | 587 | 178785 | 1631850605 |
| 5 (bbr) | 2597.563774 | 822187 | 223876 | 1132 | 169207 | 1698051441 |
| 5 (bbr) | 2807.170316 | 822027 | 236272 | 968 | 187299 | 1589616462 |
| 5 (bbr) | 2438.591033 | 821913 | 247647 | 856 | 170272 | 1464113728 |

I did manually filter out a few outliers from the BBR measurements, however, as performance tanked to ~74 Mbps in a few runs. I'll need to investigate that further.

@huitema commented Dec 19, 2024

Interesting results. Both Cubic and BBR show variations, but Cubic varies from 5 to 6 Gbps, about ±10%, while BBR is at least 30% slower and varies from 2.4 to 4.7 Gbps, about ±33%. That's in line with what we see on Windows.

This may be due to a bug in my implementation, but the bug itself would be due to BBR being very complex. It may also be due to BBR being "model based", and a high bandwidth low latency path does not fit the model. IIRC, Google does use a very different profile of BBR in their data centers.

@alexrabi commented

Comparing apples to oranges here, obviously, but here are some results using HTTPS (i.e. TCP) in the same setup.

TCP Cubic: 13.1Gbps, 12.5Gbps, 12.6Gbps, 13.5Gbps, 13.1Gbps
TCP BBR: 12.6Gbps, 11.6Gbps, 12.5Gbps, 12.4Gbps, 13.0Gbps

And here are the results for picoquic for the same run:

QUIC Cubic: 5.0Gbps, 5.2Gbps, 5.0Gbps, 5.2Gbps, 4.7Gbps
QUIC BBR: 3.7Gbps, 0.1Gbps, 0.1Gbps, 2.2Gbps, 2.2Gbps

@huitema commented Dec 20, 2024

A point of precision -- you mentioned "using HTTPS (i.e. TCP)". Did you actually use HTTPS, including performing key negotiation, certificate verification, etc.?

@alexrabi commented

Yes, this is using "actual" HTTPS. It's not really a "fair" comparison though, since I did not adjust any of the offloading mechanisms, which likely gives a massive edge to TCP.

@huitema commented Dec 23, 2024

Yes, the hardware offload probably explains a lot of the difference. Using GSO helps a lot, but the socket API is still the largest part of the CPU consumption. On the other hand, 5 Gbps for a single CPU is not bad. Practical deployments may use several parallel processes, one per CPU, to get past that.
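
For reference, a minimal sketch of the Linux UDP GSO mechanism alluded to here. This is the standard Linux socket API (UDP_SEGMENT, kernel 4.18+), not picoquic's actual socket loop: one sendmsg() call hands the kernel a large buffer plus a segment size, and the kernel (or the NIC) splits it into MTU-sized datagrams, amortizing the per-syscall cost across the whole train.

```c
/* Sketch: sending a packet train with Linux UDP GSO (UDP_SEGMENT). */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/udp.h>

#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103 /* Linux UAPI value, if the libc headers lack it */
#endif
#ifndef SOL_UDP
#define SOL_UDP 17
#endif

static ssize_t send_train_gso(int fd, const struct sockaddr* dest,
    socklen_t dest_len, const void* buf, size_t len, uint16_t seg_size)
{
    struct iovec iov = { (void*)buf, len };
    /* Union guarantees the control buffer is aligned for cmsghdr. */
    union {
        char raw[CMSG_SPACE(sizeof(uint16_t))];
        struct cmsghdr align;
    } ctrl = { 0 };
    struct msghdr msg = { 0 };
    struct cmsghdr* cmsg;

    msg.msg_name = (void*)dest;
    msg.msg_namelen = dest_len;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.raw;
    msg.msg_controllen = sizeof(ctrl.raw);

    /* The kernel carves buf into ceil(len / seg_size) UDP datagrams. */
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_UDP;
    cmsg->cmsg_type = UDP_SEGMENT;
    cmsg->cmsg_len = CMSG_LEN(sizeof(uint16_t));
    memcpy(CMSG_DATA(cmsg), &seg_size, sizeof(uint16_t));

    return sendmsg(fd, &msg, 0);
}
```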

@alexrabi commented

Indeed, getting 5 Gbps on a single CPU is good, so I would not be too worried about the performance of picoquic in and of itself. Clearly, something is off about the BBR implementation in picoquic though, as the throughput variability is too great. Why do we sometimes get as low as 0.1 Gbps in this test? Anecdotally, I have not seen that behavior on the "first" run, only on subsequent runs. Restarting the server application between runs seems to avoid the issue.
