Memory leak when parsing many Bloblang templates #193

Open
lkwdwrd opened this issue Dec 20, 2024 · 2 comments
Comments

@lkwdwrd

lkwdwrd commented Dec 20, 2024

Overview

We have discovered that when Bloblang parses a large number of templates and Go memory profiling is enabled, memory leaks inside Go's internal profiling allocations. Specifically, the size of BuckHashSys grows steadily. The more complex the parsed templates, the faster the profiling buckets grow, but even very basic templates trigger the growth.

Investigation

We noticed this memory leak in our application, which uses Bloblang templates to manage data transformation. We use a number of different templates throughout the application. We cache each template for 5 minutes, then release the memory until the transformation in question is required again. Over time we noticed the application's memory usage growing; however, heap memory (HeapAlloc) did not grow, only the overall memory used by the container hosting the application.

We finally discovered that we could see the leak in Go's BuckHashSys measurement. Go core describes this as "bytes of memory in profiling bucket hash tables."

To determine where this was coming from, we isolated portions of the application and tested them until we identified Bloblang parsing as the culprit. This allowed us to create a minimal case demonstrating the leak.

package main

import (
	"fmt"
	"runtime"

	"github.com/warpstreamlabs/bento/public/bloblang"
)

func DoTest() {
	bEnv := bloblang.NewEnvironment()
	// Enable memory profiling at the runtime's default sampling rate.
	runtime.MemProfileRate = 512 * 1024
	for i := 0; i < 1000; i++ {
		content := `root = {}`
		_, err := bEnv.Parse(content)
		if err != nil {
			panic(err)
		}

		// Report the size of the profiling bucket hash tables every 50 parses.
		if i%50 == 0 {
			var m runtime.MemStats
			runtime.ReadMemStats(&m)
			fmt.Println(m.BuckHashSys)
		}
	}
}

func main() {
	DoTest()
}

Two important things to note in this example:

First, memory profiling must be on. In this case it is enabled by setting runtime.MemProfileRate = 512 * 1024 (the default value used in Go core). However, this is not the only way: we found that simply importing net/http/pprof activates memory profiling. pprof does not need to be added to a serve mux, it just needs to be imported -- not uncommon in various packages out there. You can also make a very simple test file that runs DoTest. When running through go test, you can remove the runtime.MemProfileRate = 512 * 1024 line and the leak will still show, because profiling is active when testing.
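
As a minimal sketch of those two points (assuming the DoTest function from the example above, with the runtime.MemProfileRate assignment removed), a test file like the following still shows BuckHashSys climbing; the blank pprof import illustrates the import-only route as well:

package main

import (
	"testing"

	// Blank import only; it is never registered on a serve mux. As noted
	// above, linking the package is enough to keep memory profiling active.
	_ "net/http/pprof"
)

// Run with `go test`; profiling is active during testing even without
// setting runtime.MemProfileRate explicitly.
func TestDoTest(t *testing.T) {
	DoTest()
}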

Second, this uses a very simple template, root = {}. The large number of iterations is required because the leak is slower with simpler templates. Our templates tend to be much more complex, and with those the leak easily shows up when taking a snapshot of BuckHashSys every iteration rather than every 50 as in this example; far fewer iterations are required. But for the sake of demonstration I wanted to show the leak is present even in the most basic of templates. Output of this application looks something like this:

./main
1444041
1447065
1450945
1454241
1457713
1461041
1463185
1467825
1472121
1475385
1479241
1481505
1486065
1491417
1495233
1498961
1503529
1505457
1508561
1509977

The number on each line shows a steady increase in the memory used by BuckHashSys.

Conclusion

Reviewing the total allocations in our application, we saw that while heap usage remained steady, Bloblang template parsing allocates a very large amount of memory; in other words, there is a large amount of memory churn. In the example above, heap allocations after running sat at around 2MB, while total allocations for the application were 122MB. Again, this becomes more extreme as templates increase in complexity. At the end of parsing the garbage collector can easily clean this up, but that does not touch the records stored in the memory profiling buckets; the memory used there never goes away.
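
For reference, all three measurements are standard runtime.MemStats fields and can be read together. This is a small helper that could be dropped into the example above (the name reportMem is made up for illustration):

// reportMem prints the three MemStats fields compared above: HeapAlloc
// (live heap bytes), TotalAlloc (cumulative bytes ever allocated, a proxy
// for churn) and BuckHashSys (bytes held by the profiling bucket hash
// tables, which only ever grows).
func reportMem() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("heap=%d total=%d buckhash=%d\n",
		m.HeapAlloc, m.TotalAlloc, m.BuckHashSys)
}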

This led to the conclusion that there is an inherent leak in Go's memory profiling buckets, but for most applications it is never noticed because only a small number of buckets are needed to store the data. However, with the memory churn introduced by parsing multiple Bloblang templates over time, enough allocations are performed that the profiling buckets fill quickly and new ones are allocated with unforeseen frequency. It is the combination of the two systems that results in a memory leak.

Workaround

Fixing this would require either Bloblang to modify its parsing behavior, optimizing memory usage to bring down the overall churn, or Go to change the way it records memory profiling data. Neither seems likely. In the meantime, you can manually turn memory profiling off by setting the profiling rate to 0 in the application:

runtime.MemProfileRate = 0

This should be done as early as possible, as noted in Go core:

Programs that change the memory profiling rate should do so just once, as early as possible in the execution of the program (for example, at the beginning of main).

The side effect is that you no longer have access to memory profile stats, because Go is not keeping them. We ended up turning profiling off by default but exposing an application flag that allows it to be turned on. If we do need to track down a memory problem, we can redeploy with memory profiling enabled and access profiles; the leak is slow enough that running the application with profiling on for a while is not an issue. When done, we redeploy once more to turn it back off, otherwise the pod eventually runs out of memory.
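
A rough sketch of that flag approach (the flag name and wiring here are illustrative, not our actual code):

package main

import (
	"flag"
	"runtime"
)

func main() {
	// Hypothetical flag: memory profiling stays off unless explicitly requested.
	memProfiling := flag.Bool("mem-profiling", false, "enable Go memory profiling")
	flag.Parse()

	if !*memProfiling {
		// Disable as early as possible, before any profiled allocations happen.
		runtime.MemProfileRate = 0
	}

	// ... start the application ...
}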

@gregfurman
Collaborator

Hey. Thanks for this super thorough investigation! Glad to see you found the root cause and a reasonable workaround. Since bloblang is quite stable, I'm hesitant to change things around too much (at least outside of major version changes).

Would you suggest we disable memory profiling by default when building + distributing Bento?

Also, would you be able to share some more details about what bloblang queries you're running and how long it typically takes for the pod to be OOM killed?

@lkwdwrd
Author

lkwdwrd commented Jan 8, 2025

I totally understand not wanting to make changes -- this is an odd problem triggered by a non-standard use of Bloblang. To fix it, I think the Bloblang parser would need to be effectively rewritten with a strong focus on limiting memory allocations: a very large undertaking, and at the end of the day not likely to make a huge difference for the primary Bloblang use case. My primary goal in posting is to help others save time if they hit this edge case, and issues seemed the best place to put it. I could see a simple warning in the docs, but I don't think much more is reasonable on the Bento/Bloblang side of things at this point.

I highly doubt it is necessary to disable memory profiling for Bento itself, as running it does not typically involve processing a large number of templates, although I could see very large Bloblang files causing a higher-than-expected memory footprint. The real problem is when the process using Bloblang is long-running, template parsing happens regularly or templates are very large, and the environment is memory-constrained (which is pretty much our use case).

We are currently using Bloblang to facilitate data transformation while sitting between various APIs. Each available integration we have contains a different set of Bloblang templates to normalize incoming data into a common format, and some of these templates are dynamically generated based on how the integration is configured. We cache templates for a time, but release them when they go unused for a while; when the API is accessed again, the template is parsed again. We may want to change this behavior to account for this problem, but with mapping customization available we have a potentially unbounded number of templates that could be required and parsed.
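
For context, the caching just described looks roughly like the sketch below (a simplified illustration; the real cache and its expiry policy are more involved, and all of the names here are made up). An actual implementation would also need a background sweep that drops expired entries so idle executors are actually released.

package transform

import (
	"sync"
	"time"

	"github.com/warpstreamlabs/bento/public/bloblang"
)

type cacheEntry struct {
	exec     *bloblang.Executor
	lastUsed time.Time
}

// templateCache keeps parsed executors for a while and re-parses them once
// they have been idle past the TTL.
type templateCache struct {
	mu      sync.Mutex
	env     *bloblang.Environment
	ttl     time.Duration
	entries map[string]cacheEntry
}

func newTemplateCache(ttl time.Duration) *templateCache {
	return &templateCache{
		env:     bloblang.NewEnvironment(),
		ttl:     ttl,
		entries: map[string]cacheEntry{},
	}
}

// get returns a cached executor for a mapping, re-parsing it when the cached
// copy has expired or was never parsed. Every re-parse is another burst of
// allocations for the profiling buckets to record.
func (c *templateCache) get(mapping string) (*bloblang.Executor, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if e, ok := c.entries[mapping]; ok && time.Since(e.lastUsed) < c.ttl {
		e.lastUsed = time.Now()
		c.entries[mapping] = e
		return e.exec, nil
	}

	exec, err := c.env.Parse(mapping)
	if err != nil {
		return nil, err
	}
	c.entries[mapping] = cacheEntry{exec: exec, lastUsed: time.Now()}
	return exec, nil
}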

As for our infrastructure, we would parse on the order of 50+ templates similar to the included sample in any given hour, depending on usage. Our pods could allocate between 200 and 500MB of memory, and it took 2-3 days before exhausting that and OOMing.

Sample Bloblang template, with specifics genericized/removed

root = {
    "unmapped": this
}

root.objectId = 3001

root.actionId = match this.tag {
    this.contains("created") => 1
    this.contains("deleted") => 2
    this.contains("changed") => 3
    _ => 0
}
root.actionName = match root.actionId {
    1 => "Create"
    2 => "Delete"
    3 => "Change"
    _ => "Unknown"
}

root.requestStatusId = match this.request_status {
    "success" => 1
    "failure" => 2
    "pending" => 3
    _ => 0
}
root.requestStatus = match root.requestStatusId {
    1 => "Success"
    2 => "Failure"
    3 => "Pending"
    _ => "Unknown"
}

root.severityId = match this.evt_severity {
    "informational" => 1
    "low" => 2
    "medium" => 3
    "high" => 4
    "critical" => 5
    _ => deleted()
}
root.severity = match root.severityId {
    1 => "Information"
    2 => "Low"
    3 => "Medium"
    4 => "High"
    5 => "Critical"
    _ => deleted()
}


root.appName = this.app_name | deleted()
root.ua = this.userAgent | deleted()
root.ep.hostname = this.host | deleted()
root.ep.ip = this.ip | deleted()
root.ep.mac = this.mac | deleted()
root.ep.name = this.name | deleted()
root.ep.org.name = this.business_unit | deleted()
root.ep.port = this.port | deleted()
root.ep.zone = this.zone | deleted()
root.domain = this.domain1 | this.domain2 | deleted()
root.account.uid = this.user_id | this.user_alt_id | deleted()
root.account.name = this.user | deleted()
root.account.org.name = this.user_business_unit | deleted()
root.account.typeId = if (this.user_cat | this.user_alt_cat) != null { 0 } else { deleted() }
root.account.type = if (this.user_cat | this.user_alt_cat) != null { (this.user_cat | this.user_alt_cat) } else { deleted() }
root.msg = this.message | deleted()
root.datetime = this.evt_time.ts_unix_milli() | deleted()
root.duration = this.duration | deleted()
root.metadata.uid = this.evt_id | deleted()
root.metadata.eventCode = this.signature | deleted()
root.metadata.logName = this.source | deleted()
root.metadata.logProvider = this.index | deleted()
root.metadata.logVersion = this.st | deleted()
root.metadata.processed = this.pt.ts_unix_milli() | deleted()
root.metadata.productName = "Log Provider"
root.metadata.productVendorName = "ACME"
root.metadata.version = "x.x.x"
root.metadata.labels = this.tag | deleted()

# remove mapped and private values from unmapped
root.unmapped = root.unmapped.without(
  "tag",
  "request_status",
  "evt_severity",
  "app_name",
  "user_agent",
  "host",
  "ip",
  "mac",
  "name",
  "business_unit",
  "port",
  "zone",
  "domain1",
  "domain2",
  "user_id",
  "user_alt_id",
  "user",
  "user_business_unit",
  "user_cat",
  "user_alt_cat",
  "message",
  "evt_time",
  "duration",
  "evt_id",
  "signature",
  "source",
  "index",
  "st",
  "pt",

  "_bucket",
  "_internal_meta"
)

root.catId = root.catId | if root.objectId != 0 {match (root.objectId / 1000).floor() {
    1 => 1,
    2 => 2,
    3 => 3,
    _ => deleted()
}} else { 0 }

root.cat = root.cat | match root.catId {
    0 => "General"
    1 => "Cat1",
    2 => "Cat2",
    3 => "Cat3",
    _ => deleted()
}

root.objectName = match root.objectId {
    0 => "Base Object"
    101 => "Object Name 101",
    102 => "Object Name 102",
    103 => "Object Name 103",
    104 => "Object Name 104",
    105 => "Object Name 105",
    106 => "Object Name 106",
    107 => "Object Name 107",
    108 => "Object Name 108",
    201 => "Object Name 201",
    202 => "Object Name 202",
    203 => "Object Name 203",
    204 => "Object Name 204",
    205 => "Object Name 205",
    206 => "Object Name 206",
    301 => "Object Name 301",
    302 => "Object Name 302",
    303 => "Object Name 303",
    304 => "Object Name 304",
    305 => "Object Name 305",
    306 => "Object Name 306",
    401 => "Object Name 401",
    402 => "Object Name 402",
    403 => "Object Name 403",
    404 => "Object Name 404",
    405 => "Object Name 405",
    406 => "Object Name 406",
    407 => "Object Name 407",
    408 => "Object Name 408",
    409 => "Object Name 409",
    410 => "Object Name 410",
    411 => "Object Name 411",
    412 => "Object Name 412",
    413 => "Object Name 413",
    414 => "Object Name 414",
    501 => "Object Name 501",
    502 => "Object Name 502",
    503 => "Object Name 503",
    504 => "Object Name 504",
    506 => "Object Name 506",
    507 => "Object Name 507",
    508 => "Object Name 508",
    509 => "Object Name 509",
    510 => "Object Name 510",
    511 => "Object Name 511",
    512 => "Object Name 512",
    513 => "Object Name 513",
    514 => "Object Name 514",
    515 => "Object Name 515",
    516 => "Object Name 516",
    517 => "Object Name 517",
    518 => "Object Name 518",
    519 => "Object Name 519",
    520 => "Object Name 520",
    601 => "Object Name 601",
    602 => "Object Name 602",
    603 => "Object Name 603",
    604 => "Object Name 604",
    605 => "Object Name 605",
    606 => "Object Name 606",
    607 => "Object Name 607",
    701 => "Object Name 701",
    702 => "Object Name 702",
    703 => "Object Name 703",
    704 => "Object Name 704",
    _ => deleted()
}

let activityId = root.activityId.or(0)
let activityName = if $activityId == 0 {
    root.activityName.or("Unknown")
}
root.objectTypeId = root.objectTypeId | if root.objectId != null { (root.objectId.or(0) * 100) + $activityId }
root.objectTypeName = root.objectTypeName | if root.objectName != null && root.objectName != "" && $activityName != "" {
    "%v: %v".format(root.objectName, $activityName)
}
