Variability of BBR throughput #1758

Open · huitema opened this issue Sep 30, 2024 · 22 comments
@huitema commented Sep 30, 2024

A simple throughput test can be done using picoquicdemo over the local loopback on a Windows PC. (The same test on Unix devices is less interesting, because the loopback path on Linux is very different from what happens with actual network sockets.) The test works by starting a server in one terminal window with `picoquicdemo.exe -p 4433`, then running a client from another window with `picoquicdemo.exe -D -G {bbr|cubic} -n test ::1 4433 /1000000000`. After running 5 tests using BBR and another 5 using Cubic on a Dell laptop (Dell XPS 16, Intel(R) Core(TM) Ultra 9 185H 2.50 GHz, Windows 11 23H2), we get the following results:

| CC (on PC) | Gbps | Pkts/train | Loss rate | CWIN (pkts) | Pacing (Gbps) |
| ---------- | ---- | ---------- | --------- | ----------- | ------------- |
| cubic | 3.2 | 16 | 0.11% | 878 | 226.5 |
| cubic | 3.1 | 17 | 0.05% | 4477 | 645.9 |
| cubic | 2.9 | 16 | 0.09% | 466 | 415.5 |
| cubic | 2.6 | 17 | 0.17% | 359 | 750.9 |
| cubic | 3.1 | 17 | 0.11% | 345 | 802.2 |
| bbr | 1.5 | 2 | 0.08% | 137 | 2.3 |
| bbr | 2.4 | 6 | 0.05% | 279 | 3.4 |
| bbr | 1.1 | 2 | 0.11% | 269 | 1.8 |
| bbr | 1.8 | 3 | 0.08% | 182 | 1.7 |
| bbr | 1.7 | 3 | 0.20% | 273 | 2.7 |

For each connection, we list:

  • the congestion control algorithm;
  • the average throughput measured when sending 1 GB of data;
  • the number of packets sent in a single sendmsg call, which we call packets per train;
  • the packet loss rate, defined as the number of packets repeated divided by the total number of packets;
  • the congestion window, expressed in packets, observed at the end of the session;
  • the pacing rate observed at the end of the session.

The obvious conclusion is that Cubic is much faster in this specific test, with an average data rate of about 3 Gbps, versus about 1.7 Gbps for BBR. Much of the slowdown in the BBR tests is due to the conservative pacing rate, which leads to a small number of packets per sendmsg call. The pacing rate is the main control parameter in BBR, and it does not evolve to reflect the capacity of the loopback path.

We expect some variability between tests, for example because other programs and services running on the laptop may cause variations in available CPU. We do see some of that variability when using Cubic, with observed data rates between 2.6 and 3.2 Gbps, but we see a much higher variability using BBR, with data rates between 1.1 Gbps and 2.4 Gbps. This confirms other observations that small changes in the environment can produce big variations in BBR performance. We need to investigate the cause of these variations.

@huitema commented Oct 1, 2024

Investigating BBR requires logging in a way that does not impact performance. The current options, qlog or even the binary log, are not adequate: turning the binary log on for a 1 GB transfer drops the data rate to 400 Mbps while producing a 200 MB file. Instead, we should log the key variables in memory and save them to a CSV file at the end of the connection.
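
A minimal sketch of what such in-memory logging could look like. All names here are illustrative, not existing picoquic APIs: sampling is a plain memory write on the hot path, and the file I/O happens only once, after the connection completes.

```c
/* Hypothetical sketch of low-overhead in-memory logging; perf_sample_t,
 * perf_log_sample and perf_log_flush are illustrative names, not picoquic APIs. */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define PERF_LOG_MAX 65536

typedef struct st_perf_sample_t {
    uint64_t time_us;         /* microsecond timestamp of the sample */
    uint64_t pacing_rate;     /* pacing rate, bytes per second */
    uint64_t cwin;            /* congestion window, bytes */
    uint64_t bytes_in_flight;
    uint64_t smoothed_rtt;    /* microseconds */
} perf_sample_t;

static perf_sample_t perf_log[PERF_LOG_MAX];
static size_t perf_log_count = 0;

/* Called on every ACK (or on a timer); just a memory write, no I/O. */
static void perf_log_sample(uint64_t now, uint64_t pacing_rate,
    uint64_t cwin, uint64_t in_flight, uint64_t srtt)
{
    if (perf_log_count < PERF_LOG_MAX) {
        perf_sample_t* s = &perf_log[perf_log_count++];
        s->time_us = now;
        s->pacing_rate = pacing_rate;
        s->cwin = cwin;
        s->bytes_in_flight = in_flight;
        s->smoothed_rtt = srtt;
    }
}

/* Called once after the connection closes; all file I/O happens here. */
static int perf_log_flush(const char* path)
{
    FILE* f = fopen(path, "w");
    if (f == NULL) {
        return -1;
    }
    fprintf(f, "time_us,pacing_rate,cwin,bytes_in_flight,smoothed_rtt\n");
    for (size_t i = 0; i < perf_log_count; i++) {
        fprintf(f, "%llu,%llu,%llu,%llu,%llu\n",
            (unsigned long long)perf_log[i].time_us,
            (unsigned long long)perf_log[i].pacing_rate,
            (unsigned long long)perf_log[i].cwin,
            (unsigned long long)perf_log[i].bytes_in_flight,
            (unsigned long long)perf_log[i].smoothed_rtt);
    }
    return fclose(f);
}
```

At 40 bytes per sample, the fixed buffer above costs about 2.6 MB of memory, which should not perturb the transfer the way per-event file I/O does.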

@alexrabi commented Dec 7, 2024

I am not sure whether this is related to this problem in particular or not, but I have also noticed some strange issues with BBR, likely related to the pacing rate.

In my scenario, I am using MP-QUIC together with QUIC datagrams in a VPN-like setup, where traffic is tunneled across an MP-QUIC connection. If I let the traffic going into the tunnel be split across all paths, everything seems to work well. However, splitting the traffic is not necessarily a good idea for all types of flows, because heterogeneous path characteristics may lead to excessive reordering, which in turn may cause issues such as spurious retransmissions and unnecessary CWND reductions if the traffic going into the tunnel is itself congestion controlled. It may therefore be preferable to limit the tunnel to a single MP-QUIC path in certain scenarios. However, limiting the use of paths (i.e. using the picoquic_mark_datagram_ready_path API to explicitly tell picoquic which path to send the datagrams on, while keeping the other paths "up" but inactive) seems to have some strange effects on the congestion control.
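
(For context, the path pinning would be set up roughly as below. This is a hedged sketch: the exact signature of picoquic_mark_datagram_ready_path is assumed here and should be checked against picoquic.h.)

```c
/* Hedged sketch: pin datagram traffic to one path while keeping the other
 * paths up but idle. The signature of picoquic_mark_datagram_ready_path is
 * assumed to be (connection, unique path id, ready flag); verify against
 * picoquic.h before use. */
#include <stddef.h>
#include "picoquic.h"

static void pin_datagrams_to_path(picoquic_cnx_t* cnx,
    uint64_t preferred_path_id,
    const uint64_t* other_path_ids, size_t nb_others)
{
    /* Offer datagrams only on the preferred path. */
    picoquic_mark_datagram_ready_path(cnx, preferred_path_id, 1);
    /* The other paths remain validated and probed, but are never marked
     * ready, so no datagram payload is scheduled on them. */
    for (size_t i = 0; i < nb_others; i++) {
        picoquic_mark_datagram_ready_path(cnx, other_path_ids[i], 0);
    }
}
```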

In one scenario, where I am sending UDP-based traffic with a constant bit rate (not congestion controlled) across the tunnel, I have noticed a massive amount of packet loss and very high latency for the few packets that actually made it through (~333 ms RTT, when the base RTT is ~15 ms), despite the fact that the data rate of the UDP flow is much lower than the capacity of the network. Again, I do not see the same behavior if I split the traffic across all paths, only when limiting the sending of the datagrams to one path. Setting up a connection with only a single path available (while still negotiating and agreeing to use MP-QUIC) also does not display the issue. Curiously, the problem only seems to happen on the server side. Using cubic on the server instead of BBR solves this issue, so I am fairly certain that it is not a problem with the tunnel framework, but rather that something is off with BBR; likely the pacing rate.

It could just be a coincidence of course, but the ~333 ms RTT that I have been seeing seems oddly specific, as if some default value is being used for the pacing rate. Is the pacing rate calculated for the "inactive" paths being applied to ALL paths? And why does it only seem to occur on the server side?

@huitema commented Dec 7, 2024

That's a bizarre result. The pacing rate is definitely per path. We would need some traces to understand what is happening in your scenario. Just to check, did you try setting the CC to "cubic" to see whether the issue persists?

@alexrabi commented Dec 8, 2024

I will try to produce some traces to help you figure out what is going on. As I mentioned, setting CC to cubic does not show the same issue, so it is definitely something related to BBR.

@alexrabi commented

Alright, so I have done some more testing, and it does seem to be reproducible in a few different environments. You can find logs here: logs.zip

Some interesting observations:

  • The problem occurs for BBRv1 and BBRv3 on the server side, but I have not encountered it when using any other CCA.
  • The choice of CCA on the client side does not appear to matter.
  • The problem only occurs when the path is application limited. However, if the path has been under load (i.e. the path capacity has been saturated at some point prior, e.g. by running an iperf test through the tunnel), everything seems to be working as expected even when running application limited traffic.
  • Looking at the logs, it seems like the client side is correctly identifying that it is being application limited, but that does not appear to be the case for the server side.

This may hint at some issue with the BBR startup phase, coupled with the server incorrectly assuming that it is not application limited?

@huitema commented Dec 16, 2024

The "application limited" test with BBR is indeed more difficult than with Cubic or Reno. Cubic and Reno have a single control variable, the congestion window, so a simple test of bytes_in_flight < cwnd will return whether the traffic is app limited. BBR has two control variables. The main control variable is the pacing rate, but BBR also uses the congestion window, either as a safety to not send too much traffic in the absence of feedback, or as a short term limiter if packet losses were detected. Testing on the congestion window is imprecise, because in normal scenarios pacing dominates and the bytes in flight remain below the congestion window. Testing on the pacing rate is also imprecise, because pacing is implemented with a leaky bucket. There will be brief period in which the leaky bucket will be full, letting packets go without pacing, only limiting the last packet out of a train -- but that can happen whether the application is "limiting" or not. For example, an application that sends a large frame periodically will experience pacing, even though the average traffic is well below capacity.

That's a bit of a mess. Suggestions are welcome.

@alexrabi commented

How about using the RTT as a test? BBR is different from Cubic or Reno in that it actively probes for the "non-congested" round-trip propagation time ("min_rtt") by attempting to drain the queue at the bottleneck. If the sender is application limited, it should not contribute to any queue build-up at the bottleneck, and therefore the SRTT should not differ significantly from the min_rtt. This is not foolproof, of course, as the RTT probing phase is only done periodically. Changes to path characteristics, such as increased latency due to mobility or to the addition of competing flows at the bottleneck, could therefore be misinterpreted as no longer being application limited. It might be worth combining several tests using different metrics to determine whether a path is application limited or not: e.g., only if both the pacing rate method and the method described above indicate that we are not application limited should we conclude that we are no longer application limited. A sketch of such a combined test follows.
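
A sketch of what such a combined test might look like. All structure names and thresholds below are illustrative choices, not picoquic internals:

```c
/* Hedged sketch of a combined app-limited test, per the discussion above.
 * The path is treated as app-limited only if BOTH signals agree: neither
 * control variable is binding, AND the SRTT sits close to min_rtt. */
#include <stdint.h>

typedef struct st_cc_state_t {
    uint64_t bytes_in_flight;
    uint64_t cwin;            /* congestion window, bytes */
    uint64_t pacing_bucket;   /* leaky-bucket credit, bytes */
    uint64_t pacing_quantum;  /* max burst the bucket allows, bytes */
    uint64_t smoothed_rtt;    /* microseconds */
    uint64_t min_rtt;         /* microseconds */
} cc_state_t;

static int is_app_limited(const cc_state_t* cc)
{
    /* Test 1: neither control variable is binding. Imprecise on its own,
     * because a full leaky bucket lets short trains through unpaced. */
    int not_cwnd_bound = cc->bytes_in_flight < cc->cwin;
    int not_pacing_bound = cc->pacing_bucket >= cc->pacing_quantum;

    /* Test 2: no standing queue, i.e. SRTT within ~1/8 of min_rtt.
     * The 9/8 threshold is an illustrative choice, not a derived constant. */
    int no_queue = cc->smoothed_rtt <= cc->min_rtt + cc->min_rtt / 8;

    return not_cwnd_bound && not_pacing_bound && no_queue;
}
```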

@huitema commented Dec 17, 2024

The problem is time span. Suppose an application that sends 5 Mbps of traffic on a 20 Mbps path -- pretty much the definition of "app limited". Suppose now that the 5 Mbps of traffic consists of periodic video frames, i.e., one large message 30 times per second. Each message will be sent in 8 or 9 ms, occupying the link at full rate for about a quarter of each frame interval. There will be no long term queue, but there will be an 8 or 9 ms queue built up for each frame. We can have a transient measurement that indicates saturation when the long term measurement does not.
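
For concreteness, the arithmetic behind the 8-9 ms figure:

$$\text{frame size} = \frac{5\ \text{Mb/s}}{30\ \text{fps}} \approx 167\ \text{kb}, \qquad \text{burst duration} = \frac{167\ \text{kb}}{20\ \text{Mb/s}} \approx 8.3\ \text{ms}$$

So each 33.3 ms frame interval contains an 8.3 ms burst at full link rate, a 25% duty cycle: a short term measurement sees a saturated link during the burst, while the long term average is only 5 Mbps.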

@huitema commented Dec 17, 2024

But the handling of app limited is really a different issue. The issue here was opened after tests of transfers on a loopback interface, where the application was trying to send 1 GB or 10 GB of data -- definitely not an app limited scenario. The issue is analyzed in detail in the blog "Loopback ACK delays cause Cubic slowdowns." The Cubic issue was traced to bad behavior of the "ACK Frequency" implementation when running over very low latency paths, such as the loopback path. Fixing that fixed a lot of the BBR issues, but some remain. BBR is very complex, and thus hard to tune.

@alexrabi commented

Indeed, these seem to be different issues entirely. I will create a separate issue for the application limited scenarios.

Speaking of the issue at hand, you say that the test is less interesting to run on Linux since the loopback path is very different from what happens with actual network sockets. Perhaps the issue could be replicated on Linux using network namespaces instead (e.g. Docker containers, Mininet), since that should actually make the packets go through the kernel "normally"?

@huitema commented Dec 18, 2024 via email

@joergdeutschmann-i7 commented
Very interesting discussion here. I just wanted to chime in regarding the following:

> Perhaps the issue could be replicated on Linux using network namespaces instead (e.g. Docker containers, Mininet), since that should actually make the packets go through the kernel "normally"?

I think also lightweight virtualization may impact your timings. Example (pdf).
NetEm can be really fun, too. (A colleague of mine recently fixed some NetEm issues.)
Out of helplessness, I sometimes do weird cabling setups in order to have more "realistic" topologies. Not sure if this is really a good approach. Also, my colleague created (yet another?) testbed framework based on full operating system virtualization, but we have not evaluated detailed packet timings with that framework yet.

@alexrabi commented

> I think also lightweight virtualization may impact your timings. Example (pdf).
> NetEm can be really fun, too. (A colleague of mine recently fixed some NetEm issues.)

This seems to be an issue with the netem qdisc rather than Mininet itself, no? To emulate a "loopback" interface this way we would not need to impose any artificial delays or artificial bandwidth (i.e. netem would not be required), as the CPU should be the main bottleneck, so I am not sure how applicable this problem is to this particular scenario. Good to keep in mind though!

@joergdeutschmann-i7 commented
Mininet relies on NetEm, according to the first reference and my knowledge. You're right, if you don't need artificial link emulation this might all not matter (to me it is then more a test of implementation performance than of congestion control performance), but maybe you also want to test the congestion control for different types of paths.

@alexrabi commented

Mininet only uses netem for emulating latency (the first reference even says as much). If you do not add any artificial latency to the links, then Mininet can use any "normal" qdisc such as fq or fq_codel. In any case, one has to be careful not to draw too many conclusions from emulated results; you are absolutely right.

As for the particular scenario in the first reference, I would argue that picoquic would be slightly more robust against such issues, being implemented in userspace rather than kernelspace.

@alexrabi commented

Alright, so here are some quick results from Mininet. This is on a Dell laptop (Latitude 7420, 11th Gen Intel(R) Core(TM) i5-1145G7 @ 2.60GHz, Ubuntu 22.04.5, kernel 5.14.0-1048-oem). The general trends seem reproducible.

| ccalgo | Mbps | pkt_sent | trains | retrans. | cwin | p_rate (bps) |
| ------ | ---- | -------- | ------ | -------- | ---- | ------------ |
| 2 (cubic) | 6061.915101 | 712374 | 53363 | 1095 | 150755 | 8090590000 |
| 2 (cubic) | 5147.571400 | 822327 | 61514 | 1282 | 150424 | 6005652173 |
| 2 (cubic) | 5347.354288 | 822421 | 61560 | 1373 | 140626 | 6829250000 |
| 2 (cubic) | 5160.694531 | 822263 | 61482 | 1217 | 134296 | 6027047244 |
| 2 (cubic) | 5352.337893 | 822248 | 61507 | 1200 | 144586 | 5991727941 |
| 5 (bbr) | 3789.168892 | 712562 | 249145 | 1085 | 188496 | 6285644891 |
| 5 (bbr) | 4696.508843 | 821630 | 198869 | 587 | 178785 | 1631850605 |
| 5 (bbr) | 2597.563774 | 822187 | 223876 | 1132 | 169207 | 1698051441 |
| 5 (bbr) | 2807.170316 | 822027 | 236272 | 968 | 187299 | 1589616462 |
| 5 (bbr) | 2438.591033 | 821913 | 247647 | 856 | 170272 | 1464113728 |

I did manually filter out a few outliers from the BBR measurements, however, as performance tanked to ~74 Mbps in a few runs. I'll need to investigate that further.

@huitema commented Dec 19, 2024

Interesting results. Both Cubic and BBR show variations, but Cubic varies from 5 to 6 Gbps, about ±10%, while BBR is at least 30% slower and varies from 2.4 to 4.7 Gbps, about ±33%. That's in line with what we see on Windows.

This may be due to a bug in my implementation, but the bug itself would be due to BBR being very complex. It may also be due to BBR being "model based", and a high bandwidth low latency path does not fit the model. IIRC, Google does use a very different profile of BBR in their data centers.

@alexrabi commented

Comparing apples to oranges here, obviously, but here are some results using HTTPS (i.e. TCP) in the same setup.

TCP Cubic: 13.1Gbps, 12.5Gbps, 12.6Gbps, 13.5Gbps, 13.1Gbps
TCP BBR: 12.6Gbps, 11.6Gbps, 12.5Gbps, 12.4Gbps, 13.0Gbps

And here are the results for picoquic for the same run:

QUIC Cubic: 5.0Gbps, 5.2Gbps, 5.0Gbps, 5.2Gbps, 4.7Gbps
QUIC BBR: 3.7Gbps, 0.1Gbps, 0.1Gbps, 2.2Gbps, 2.2Gbps

@huitema commented Dec 20, 2024

A point of precision -- you mentioned "using HTTPS (i.e. TCP)". Did you actually use HTTPS, including performing key negotiation, certificate verification, etc.?

@alexrabi commented

Yes, this is using "actual" HTTPS. It's not really a "fair" comparison though, since I did not adjust any of the offloading mechanisms, which likely gives a massive edge to TCP.

@huitema commented Dec 23, 2024

Yes, the hardware offload probably explains a lot of the difference. Using GSO helps a lot, but the socket API is still the largest part of the CPU consumption. On the other hand, 5 Gbps for a single CPU is not bad. Practical deployments may use several parallel processes, one per CPU, to get past that.
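
For reference, a minimal sketch of the Linux UDP GSO mechanism alluded to here. This is the standard Linux socket API (UDP_SEGMENT, kernel 4.18+), not picoquic's actual socket loop: one sendmsg() call hands the kernel a large buffer plus a segment size, and the kernel (or the NIC) splits it into MTU-sized datagrams, amortizing the per-syscall cost across the whole train.

```c
/* Sketch: sending a packet train with Linux UDP GSO (UDP_SEGMENT). */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/udp.h>

#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103 /* Linux UAPI value, if the libc headers lack it */
#endif
#ifndef SOL_UDP
#define SOL_UDP 17
#endif

static ssize_t send_train_gso(int fd, const struct sockaddr* dest,
    socklen_t dest_len, const void* buf, size_t len, uint16_t seg_size)
{
    struct iovec iov = { (void*)buf, len };
    /* Union guarantees the control buffer is aligned for cmsghdr. */
    union {
        char raw[CMSG_SPACE(sizeof(uint16_t))];
        struct cmsghdr align;
    } ctrl = { 0 };
    struct msghdr msg = { 0 };
    struct cmsghdr* cmsg;

    msg.msg_name = (void*)dest;
    msg.msg_namelen = dest_len;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.raw;
    msg.msg_controllen = sizeof(ctrl.raw);

    /* The kernel carves buf into ceil(len / seg_size) UDP datagrams. */
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_UDP;
    cmsg->cmsg_type = UDP_SEGMENT;
    cmsg->cmsg_len = CMSG_LEN(sizeof(uint16_t));
    memcpy(CMSG_DATA(cmsg), &seg_size, sizeof(uint16_t));

    return sendmsg(fd, &msg, 0);
}
```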

@alexrabi commented

Indeed, getting 5 Gbps on a single CPU is good, so I would not be too worried about the performance of picoquic in and of itself. Clearly, something is off about the BBR implementation in picoquic though, as the throughput variability is too great. Why do we sometimes get as low as 0.1 Gbps in this test? Anecdotally, I have not seen that behavior on the "first" run, only on subsequent runs. Restarting the server application between runs seems to avoid the issue.
