-
Notifications
You must be signed in to change notification settings - Fork 460
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch default protocol to v3 #152
Comments
+1, awesome, thanks. |
Resolved with #11595 |
I have realized a few followups are needed: #155 |
@Pchelolo after testing the latest master with these changes I noticed a big performance degradation with V2 API. Under light load the p999 latency went from 15ms => 22ms which is not ideal as we try to keep p999 under 20ms to avoid as many fail-open rate limit requests. I did a CPU profiling with pprof and suspect it's due to more time being spent in the converting of v2 request to v3: Is there a better way we can convert the request? As v2 is to be officially deprecated at end of this year, moving to v3 is good (and we plan to move our Envoy configs to v3 as well) but I feel there are many orgs still on v2. |
@ysawa0 thank you so much for testing. I have created #155 as a followup to this, which switches converting v2->v3 to be much more explicit. Originally that was done because one of the fields was renamed, thus serializing/unserializing doesn't work any more, it wasn't covered by tests, thus forgotten. But I feel it could be beneficial for performance as a side effect. If you don't mind testing that version, given that you have perf tests setup, it would be great. If you don't have time for that, please let me know, I'll set myself up for performance test and see if #155 mitigates this problem. A performance degradation is expected, since we now need to do the conversion and no matter how the conversion is done, it will cost something, but 7ms increase indeed seems too high of a price to pay. Perhaps if we don't come up with a solution, #153 should be reverted and only reapplied after v2 was deprecated? |
Thanks @Pchelolo I tried out the PR and that looks better in pprof. Let me do more long-term tests and get back to you with a definitive answer! |
@ysawa0 thank you! |
#155 did indeed fix a lot of the issues but after testing ratelimit before the v2/v3 api code was even merged at all, I still see performance degradation. I compared two builds of ratelimit, one before #137 was merged and one after. |
#137 make redis client do pipelining more efficiently to improve throughput (by buffering as much as possible commands from concurrent requests in a time window), so it is possible to introduce more latency. Please try to set
I will try to investigate this. |
@caitong93 I tested with
this would turn the feature completely off, correct? |
Yes, in this case, redis commands will be sent immediately |
Apologizes for delay in this update. Been trying to find the time to do this test the right way. Here's what I saw after trying 3 builds. For Redis, I let an Envoy sidecar do the routing via the Redis filter and turned off pipelining. All reported latencies are the p999 under ~2k rps. With latest (after #158 merge, commit d7d563a) 20ms latency. Not sure if anyone else is having the same experience as us but that's what I recorded. Of course I could be configuring something incorrectly or it's due to some uniqueness with our system. In that case, a second set of eyes would be great! I'll do some profiling to see if there's any low hanging fruit to optimize. I think one thing could be replacing logrus with something more performant like zap. |
Comparing benchmark stats for no pipeline case (by run https://github.com/envoyproxy/ratelimit/blob/master/test/redis/bench_test.go#L68), I do observe cpu performance degradation.
Actually before #137 , the client use explicitly pipelining, this may lead to different main reasons, will try to send a PR to optimize in this week |
@caitong93 Beautiful. I've run #163 vs pre #137 and they are reporting basically same p999 latency and timeouts! Thank you! |
Rational
In order to support newer features of the ratelimit protocol (like rate limit overrides), we’d need the service to use v3 protocol instead of v2 protocol. Thus, we propose to upgrade the ratelimiter core to v3 protocol. Since there doesn’t seem to be any real changes between v2 and v3, this seems to be as easy as replacing the imports and referencing newer versions in the data plane.
However, to support complete compatibility with v2, we’d need to have a legacy layer, just like it’s done for v1 now. Supporting 3 versions of the protocol with 2 legacies seems like an overkill, so we propose to simultaneously drop support for v1 ratelimit.proto. Seems like it should be ok given that the release schedule proposed doing so in Q32018 [2]
Proposal
Thus, the concrete proposal:
Protocol v1 support is dropped
Protocol v2 becomes legacy and converted to v3. Given that the protocols are only different in message names, this should have almost no performance penalty
The service core is updated to use protocol v3
We intend to contribute all the code. We are also planning to use v3 protocol in production in our own fork in the beginning, thus we can provide some battle testing. What do you think?
The text was updated successfully, but these errors were encountered: