Memory leak when parsing many Bloblang templates #193
Comments
Hey. Thanks for this super thorough investigation! Glad to see you found the root cause and a reasonable workaround. Since bloblang is quite stable, I'm hesitant to change things around too much (at least outside of major version changes). Would you suggest we disable memory profiling by default when building + distributing Bento? Also, would you be able to share some more details about what bloblang queries you're running and how long it typically takes for the pod to be OOM killed?
I totally understand not wanting to make changes; this is an odd problem triggered by a non-standard use of Bloblang. To fix it, I think the Bloblang parser would need to be effectively rewritten with a strong focus on limiting memory allocations. That is a very large undertaking, and at the end of the day not likely to make a huge difference for the primary Bloblang use case. My primary goal in posting is to help others save time if they end up hitting this edge case, and issues seemed the best place to put it. I could see a simple warning in the docs, but I don't think much more is reasonable on the Bento/Bloblang side of things at this point.

I highly doubt it is necessary to disable memory profiling for Bento itself, as running it does not likely involve parsing a ton of templates. I could see very large Bloblang files causing a higher-than-expected memory footprint. The real problem is when the process using Bloblang is long-running, template parsing happens regularly or templates are very large, and the environment is memory-constrained (which is pretty much our use case).

We are currently using Bloblang to facilitate data transformation while sitting between various different APIs. Each available integration we have contains a different set of Bloblang templates to normalize incoming data into a common format. Some of these templates are dynamically generated based on how the integration is configured. We do cache templates for a time, but release them when they have not been used for a while; when the API is accessed again, the template is parsed again. We may want to change this behavior to account for this problem, but with mapping customization available we have a potentially unbounded number of templates that could be required/parsed.

As for our infrastructure, we would parse somewhere on the order of 50+ templates similar to the included sample in any given hour, depending on usage. Our pods could allocate between 200-500MB of memory, and it took 2-3 days before exhausting that and OOMing.

Sample Bloblang template, with specifics genericized/removed:
Overview
We have discovered that when Bloblang parses a large number of templates and Go memory profiling is enabled, there is a memory leak in Go's internal profiling allocations. Specifically, the size of `BuckHashSys` grows steadily. The more complex the parsed templates, the faster the memory profiling buckets grow. However, even very basic templates cause this.

Investigation
We noticed this memory leak in our application, which uses Bloblang templates to manage data transformation. We use a number of different templates throughout the application. We cache each template for 5 minutes, then release the memory until the transformation in question is required again. Over time we noticed memory usage growing for the application; however, memory in the heap (HeapAlloc) did not grow, only the overall memory usage of the container hosting the application.
We finally discovered that we could see the leak in Go's `BuckHashSys` measurement. Go core describes this as "bytes of memory in profiling bucket hash tables." To determine where this was coming from, we isolated portions of the application and tested them until we discovered Bloblang parsing as the final culprit causing the leak. This allowed us to create a minimal demonstration case to show the leak.
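The demonstration looks roughly like the sketch below. This is a reconstruction rather than the exact demo code, and it assumes Bento's public Bloblang package path and its `Parse` function:

```go
package main

import (
	"fmt"
	"runtime"

	"github.com/warpstreamlabs/bento/public/bloblang"
)

// DoTest parses a trivial Bloblang mapping over and over and periodically
// reports BuckHashSys, the bytes held by Go's profiling bucket hash tables.
func DoTest() {
	var stats runtime.MemStats
	for i := 0; i < 100000; i++ {
		if _, err := bloblang.Parse(`root = {}`); err != nil {
			panic(err)
		}
		if i%50 == 0 {
			runtime.ReadMemStats(&stats)
			fmt.Println(stats.BuckHashSys)
		}
	}
}

func main() {
	// Enable memory profiling explicitly (512 * 1024 is the default sample
	// rate in Go core).
	runtime.MemProfileRate = 512 * 1024
	DoTest()
}
```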
Two important things to note in this example:

First, memory profiling must be on. In this case, it is enabled by setting `runtime.MemProfileRate = 512 * 1024` (this is the default value found in Go core). However, this is not the only way: we found that simply importing `net/http/pprof` activates memory profiling. Pprof does not need to be added to a serve mux, it just needs to be imported, which is not uncommon in various packages out there. You can also write a very simple test file that runs `DoTest`. When running through `go test`, you can remove the `runtime.MemProfileRate = 512 * 1024` line and the leak will still show, because profiling is active when testing.
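For reference, such a test file might look like the following sketch (the file layout and test name are illustrative, assuming `DoTest` is defined in the same package):

```go
package main

import "testing"

// Running this via `go test` is enough to reproduce the growth in
// BuckHashSys even without setting runtime.MemProfileRate explicitly,
// because profiling is active under the test binary.
func TestBloblangParseLeak(t *testing.T) {
	DoTest()
}
```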
Second, this uses a very simple template, `root = {}`. The large number of iterations is required because the leak is slower when simpler templates are used. Our templates tend to be much more complex, and the leak easily shows up when taking a snapshot of `BuckHashSys` every iteration rather than every 50 as in this example; far fewer iterations are required. But for the sake of demonstration I wanted to show that the leak is present even in the most basic of templates.

The output of this application is simply the `BuckHashSys` reading printed on each line, and it shows a steady increase in the memory used by the profiling bucket hash tables.

Conclusion
Reviewing the total allocations in our application, we saw that while the heap usage remained steady, Bloblang template parsing allocates a very large amount of memory; basically, there is a large amount of memory churn. In the example above, the heap allocations after running were around 2MB, while total allocations for the application were 122MB. Again, this becomes more extreme as the template increases in complexity. At the end of parsing the garbage collector can easily clean this up, but that does not touch the records stored in the memory profiling buckets. The memory used there never goes away.
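For reference, the heap versus total allocation figures come from the same `runtime.MemStats` snapshot used above; a small helper along these lines (illustrative, not the original code) prints all three values:

```go
package main

import (
	"fmt"
	"runtime"
)

// reportMemStats prints the three figures discussed here: HeapAlloc stays
// small, TotalAlloc keeps climbing with parser churn, and BuckHashSys
// never shrinks.
func reportMemStats() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("HeapAlloc=%d MB  TotalAlloc=%d MB  BuckHashSys=%d KB\n",
		m.HeapAlloc/(1024*1024), m.TotalAlloc/(1024*1024), m.BuckHashSys/1024)
}

func main() {
	reportMemStats()
}
```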
This led to the conclusion that there is an inherent leak in Go's memory profiling buckets, but for most applications it is never noticed because of the very small number of buckets needed to store the data. However, with the memory churn introduced by parsing multiple Bloblang templates over time, enough allocations are performed that the memory profiling buckets fill quickly and new ones are allocated with unexpected frequency. It is the combination of the two systems which results in a memory leak.
Workaround
Fixing this would require either Bloblang to modify its parsing behavior, optimizing memory usage and bringing down the overall memory churn, or Go to change the way it records memory profiling data. Neither seems likely. In the meantime, you can manually turn memory profiling off by setting `runtime.MemProfileRate` to 0 in the application.
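A minimal sketch of what that looks like (not the exact code from our application):

```go
package main

import "runtime"

func main() {
	// A rate of 0 disables memory profiling entirely, so no new profiling
	// buckets are allocated for the lifetime of the process.
	runtime.MemProfileRate = 0

	// ... rest of the application ...
}
```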
This should be done as early as possible. The Go runtime documentation for MemProfileRate notes that programs which change the rate should do so just once, as early as possible in the execution of the program (for example, at the beginning of main).
The side effect is that you no longer have access to memory profile stats, because Go is not keeping them. We ended up turning profiling off by default, but expose an application flag that allows it to be turned on. If we need to track down a memory problem, we can redeploy with memory profiling enabled and access the profiles. The leak is slow enough that running the application for a time with memory profiling on is not an issue. When done, we redeploy once more to turn it back off; otherwise the pod eventually runs out of memory.
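The flag wiring is roughly along these lines (the flag name and structure are hypothetical, not our actual application code):

```go
package main

import (
	"flag"
	"runtime"
)

// memProfiling gates Go memory profiling; it defaults to off so the
// profiling bucket hash tables never grow in normal operation.
var memProfiling = flag.Bool("mem-profiling", false, "enable Go memory profiling")

func main() {
	flag.Parse()
	if !*memProfiling {
		runtime.MemProfileRate = 0
	}

	// ... start the application ...
}
```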