Variability of BBR throughput #1758
Investigating BBR requires logging in a way that does not impact performance. The current options, qlog or even the binary log, are not adequate: turning the binary log on for a 1 GB transfer drops the data rate to 400 Mbps while producing a 200 MB file. Instead, we should log the key variables in memory, and save them to a CSV file at the end of the connection.
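A minimal sketch of that approach, assuming a fixed-size ring buffer and illustrative field names (`bbr_log_sample` and friends are hypothetical, not existing picoquic APIs): the hot path only writes to memory, and the CSV file is produced once at connection close.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical in-memory BBR sample log. The hot path never touches
 * the file system; samples go into a fixed-size ring buffer. */
#define BBR_LOG_MAX 4096

typedef struct {
    uint64_t time_us;
    uint64_t pacing_rate;      /* bytes per second */
    uint64_t min_rtt_us;
    uint64_t bytes_in_flight;
} bbr_sample_t;

static bbr_sample_t bbr_log[BBR_LOG_MAX];
static size_t bbr_log_count = 0;

/* O(1) per call: overwrites the oldest entries when full, so logging
 * cost stays constant no matter how long the connection runs. */
static void bbr_log_sample(uint64_t t, uint64_t rate,
    uint64_t min_rtt, uint64_t inflight)
{
    bbr_sample_t *s = &bbr_log[bbr_log_count % BBR_LOG_MAX];
    s->time_us = t;
    s->pacing_rate = rate;
    s->min_rtt_us = min_rtt;
    s->bytes_in_flight = inflight;
    bbr_log_count++;
}

/* Called once at connection close; all file I/O is paid off the data
 * path. For simplicity, entries are not reordered after a wrap. */
static int bbr_log_flush_csv(const char *path)
{
    FILE *f = fopen(path, "w");
    if (f == NULL) {
        return -1;
    }
    fprintf(f, "time_us,pacing_rate,min_rtt_us,bytes_in_flight\n");
    size_t n = (bbr_log_count < BBR_LOG_MAX) ? bbr_log_count : BBR_LOG_MAX;
    for (size_t i = 0; i < n; i++) {
        const bbr_sample_t *s = &bbr_log[i];
        fprintf(f, "%llu,%llu,%llu,%llu\n",
            (unsigned long long)s->time_us,
            (unsigned long long)s->pacing_rate,
            (unsigned long long)s->min_rtt_us,
            (unsigned long long)s->bytes_in_flight);
    }
    fclose(f);
    return 0;
}
```

With 4096 entries of 32 bytes each, the buffer costs 128 KB of memory, which should not disturb the measurement the way a 200 MB binary log does.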
I am not sure whether this is related to this problem in particular or not, but I have also noticed some strange issues with BBR, likely related to the pacing rate. In my scenario, I am using MP-QUIC together with QUIC datagrams in a VPN-like setup, where traffic is tunneled across an MP-QUIC connection. If I let the traffic going into the tunnel be split across all paths, everything seems to work well. However, splitting the traffic is not necessarily a good idea for all types of flows, because heterogeneous path characteristics may lead to excessive reordering, which in turn may cause issues such as spurious retransmissions and unnecessary CWND reductions if the traffic going into the tunnel is itself congestion controlled. It may therefore be preferable to limit the tunnel to a single MP-QUIC path in certain scenarios. However, limiting the number of paths used seems to trigger the problem.
In one scenario, where I am sending UDP-based traffic with a constant bit rate (not congestion controlled) across the tunnel, I have noticed a massive amount of packet losses and very high latency for the few packets that actually made it through (~333 ms RTT, when the base RTT is ~15 ms), despite the fact that the data rate of the UDP flow is much lower than the capacity of the network. Again, I do not see the same behavior if I split the traffic across all paths, only when limiting the sending of the datagrams to one path. Setting up a connection with only a single path available (while still negotiating and agreeing to use MP-QUIC) also doesn't display the same issue. Curiously, the problem only seems to happen on the server side. Using Cubic on the server instead of BBR solves this issue, so I am fairly certain that it is not a problem with the tunnel framework, but rather that something is off with BBR; likely the pacing rate.
It could just be a coincidence of course, but the ~333 ms RTT that I have been seeing seems oddly specific, as if some default value is being used for the pacing rate. Is the pacing rate calculated for the "inactive" paths being applied to ALL paths? And why does it only seem to occur on the server side?
That's a bizarre result. The pacing rate is definitely per path. We would need some traces to understand what is happening in your scenario. Just to check, did you try setting the CC to "cubic" to see whether the issue persists?
I will try to produce some traces to help you figure out what is going on. As I mentioned, setting CC to cubic does not show the same issue, so it is definitely something related to BBR.
Alright, so I have done some more testing, and it does seem to be reproducible in a few different environments. You can find logs here: logs.zip Some interesting observations:
This may hint at some issue with the BBR startup phase, coupled with the server incorrectly assuming that it is not application limited?
The "application limited" test with BBR is indeed more difficult than with Cubic or Reno. Cubic and Reno have a single control variable, the congestion window, so a simple test of whether the sender filled that window is enough. BBR is driven by several coupled variables, notably the pacing rate and the congestion window, plus a multi-phase state machine, so there is no equally simple test. That's a bit of a mess. Suggestions are welcome.
How about using the RTT as a test? BBR is different from Cubic or Reno in that it actively probes for the "non-congested" round-trip propagation time ("min_rtt") by attempting to drain the queue at the bottleneck. If the sender is application limited, it should not contribute to any queue build-up at the bottleneck, and therefore the SRTT should not differ significantly from the min_rtt. This is not foolproof, of course, as the RTT probing phase is only done periodically. Changes to path characteristics, such as increased latency due to mobility or the addition of competing flows at the bottleneck, could therefore be misinterpreted as the sender no longer being application limited. It might be worth combining a few tests using different metrics to determine whether a path is application limited or not. I.e., only if both the pacing rate method and the method described above indicate that we are not application limited should we consider ourselves to no longer be application limited.
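The combined test suggested above could be sketched roughly as follows. All names and thresholds here are hypothetical, chosen for illustration, and do not correspond to picoquic's actual BBR state:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical per-path statistics over a recent measurement interval. */
typedef struct {
    uint64_t bytes_sent_in_interval;
    uint64_t pacing_budget_in_interval; /* pacing_rate * interval length */
    uint64_t smoothed_rtt_us;
    uint64_t min_rtt_us;
} path_stats_t;

/* Signal 1: the sender used clearly less than its pacing budget
 * (here, under 3/4 of it -- an arbitrary illustrative threshold). */
static bool pacing_says_app_limited(const path_stats_t *p)
{
    return p->bytes_sent_in_interval * 4 < p->pacing_budget_in_interval * 3;
}

/* Signal 2: SRTT stays within 25% of min_rtt, i.e. the sender is not
 * building a queue at the bottleneck (threshold again illustrative). */
static bool rtt_says_app_limited(const path_stats_t *p)
{
    return p->smoothed_rtt_us * 4 <= p->min_rtt_us * 5;
}

/* Conservative combination: only declare the path app limited when
 * BOTH signals agree, as proposed in the discussion above. */
static bool is_app_limited(const path_stats_t *p)
{
    return pacing_says_app_limited(p) && rtt_says_app_limited(p);
}
```

The integer comparisons avoid floating point on the data path; the thresholds would need tuning against real traces.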
The problem is time span. Suppose an application that sends 5 Mbps of traffic on a 20 Mbps path -- kind of the definition of "app limited". Suppose now that the 5 Mbps of traffic consists of periodic video frames, i.e., one large message 30 times per second. That message will be sent in 8 or 9 ms at the full link rate -- the application only uses 1/4th of the link capacity on average. There will be no long-term queue, but there will be an 8 or 9 ms queue built up for each frame. We can have a transient measurement that indicates saturation when the long-term measurement does not.
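A quick check of the burst arithmetic above (the stream, frame rate, and link rate are the example numbers from the discussion, not measured values):

```c
/* A 5 Mbps stream sent as 30 frames per second yields
 * 5e6 / 30 ~= 166,667 bits per frame. Draining one frame at the
 * 20 Mbps link rate takes 166667 / 20e6 ~= 8.3 ms, during which the
 * link is fully busy even though average utilization is only 25%. */
static double frame_drain_ms(double stream_bps, double fps, double link_bps)
{
    double frame_bits = stream_bps / fps;
    return frame_bits / link_bps * 1e3;
}
```

So a measurement window shorter than ~8 ms sees a saturated link, while one spanning several frame intervals sees a 25% utilized one.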
But the handling of app limited is really a different issue. The issue here was opened after tests of transfers on a loopback interface, when the application was trying to send 1 GB or 10 GB of data -- definitely not an app limited scenario. The issue is analyzed in detail in the blog "Loopback ACK delays cause Cubic slowdowns." The Cubic issue was traced to bad behavior of the "ACK Frequency" implementation when running over very low latency paths, such as the loopback path. Fixing that fixed a lot of the BBR issues, but some remain. BBR is very complex, and thus hard to tune.
Indeed, these seem to be different issues entirely. I will create a separate issue for the application limited scenarios. Speaking of the issue at hand, you say that it is less interesting to run the test on Linux, since the loopback interface is very different from what happens with actual network sockets. Perhaps the issue could be replicated on Linux using network namespaces instead (e.g. Docker containers, Mininet), since that should actually make the packets go through the kernel "normally"?
Maybe. I have no experience with that, but if you want to step in, you are welcome!
-- Christian Huitema
Very interesting discussion here. I just wanted to chime in regarding the following:
I think lightweight virtualization may also impact your timings. Example (pdf).
This seems to be an issue with the netem qdisc rather than Mininet itself, no? To emulate a "loopback" interface this way we would not need to impose any artificial delays nor artificial bandwidth limits (i.e. netem would not be required), as the CPU should be the main bottleneck, so I am not sure how applicable this problem is to this particular scenario. Good to keep in mind, though!
Mininet relies on netem, according to the first reference and my knowledge. You're right that if you don't need artificial link emulation this might all not matter (to me it is then more a test of implementation performance than of congestion control performance), but maybe you also want to test the congestion control for different types of paths.
Mininet only uses netem for emulating latency (the first reference even says as much). If you do not add any artificial latency to the links, then Mininet can use any "normal" qdisc such as fq or fq_codel. In any case, you are absolutely right that one has to be careful not to draw too many conclusions from emulated results. As for the particular scenario in the first reference, I would argue that picoquic would be slightly more robust against such issues, being implemented in userspace rather than kernel space.
Alright, so here are some quick results from Mininet. This is with a Dell laptop (Latitude 7420, 11th Gen Intel(R) Core(TM) i5-1145G7 @ 2.60GHz, Ubuntu 22.04.5, kernel 5.14.0-1048-oem). Seems like the general trends are reproducible.
I did manually filter out a few outliers from the BBR measurements, however, as the performance tanked to ~74 Mbps in a few runs. I'll need to investigate that further.
Interesting results. Both Cubic and BBR show variations, but Cubic varies from 5 to 6 Gbps, about ±10%, while BBR is at least 30% slower and varies from 2.4 to 4.7 Gbps, about ±33%. That's in line with what we see on Windows. This may be due to a bug in my implementation, but the bug itself would be due to BBR being very complex. It may also be due to BBR being "model based", and a high bandwidth, low latency path does not fit the model. IIRC, Google uses a very different profile of BBR in their data centers.
Comparing apples to oranges here, obviously, but here are some results using HTTPS (i.e. TCP) in the same setup.
TCP Cubic: 13.1 Gbps, 12.5 Gbps, 12.6 Gbps, 13.5 Gbps, 13.1 Gbps
And here are the results for picoquic for the same run:
QUIC Cubic: 5.0 Gbps, 5.2 Gbps, 5.0 Gbps, 5.2 Gbps, 4.7 Gbps
One point of precision -- you mentioned "using HTTPS (i.e. TCP)". Did you actually use HTTPS, including performing key negotiation, certificate verification, etc.?
Yes, this is using "actual" HTTPS. It's not really a "fair" comparison, since I did not adjust any of the offloading mechanisms, which likely gives a massive edge to TCP.
Yes, the hardware offload probably explains a lot of the difference. Using GSO helps a lot, but the socket API is still the largest part of the CPU consumption. On the other hand, 5 Gbps for a single CPU is not bad. Practical deployments may use several parallel processes, one per CPU, to get past that.
Indeed, getting 5 Gbps on a single CPU is good, so I would not be too worried about the performance of picoquic in and of itself. Clearly, there is something off about the BBR implementation in picoquic though, as the throughput variability is too great. Why do we sometimes get as low as 0.1 Gbps in this test? Anecdotally, I haven't seen that behavior on the "first" run, only on subsequent runs. Restarting the server application between runs seems to avoid the issue.
A simple test of throughput can be done using picoquicdemo on local loopback on a Windows PC. (The same test on Unix devices is less interesting, because the loopback path on Linux is very different from what happens with actual network sockets.) The test works by starting a server in one terminal window using

picoquicdemo.exe -p 4433

and running a client from another window as

picoquicdemo.exe -D -G {bbr|cubic} -n test ::1 4433 /1000000000

After running 5 tests using BBR and another 5 using Cubic on a Dell laptop (Dell XPS 16, Intel(R) Core(TM) Ultra 9 185H 2.50 GHz, Windows 11 23H2), we get the following results. For each connection, we listed:
The obvious conclusion is that Cubic is much faster in that specific test, with an average data rate of about 3 Gbps, versus about 1.7 Gbps for BBR. We observe that much of the slowdown in the BBR tests is due to the conservative pacing rate, leading to a small number of packets per "sendmsg" call. The pacing rate is the main control parameter in BBR, and it does not evolve to reflect the capacity of the loopback path.
We expect some variability between tests, for example because other programs and services running on the laptop may cause variations in available CPU. We do see some of that variability when using Cubic, with observed data rates between 2.6 and 3.2 Gbps, but we see a much higher variability using BBR, with data rates between 1.1 Gbps and 2.4 Gbps. This confirms other observations that small changes in the environment can produce big variations in BBR performance. We need to investigate the cause of these variations.