Bazel and Glibc Versions
Imagine this scenario: your team uses Bazel for fast, distributed C++ builds. A developer builds a change on their workstation, all tests pass, and the change is merged. The CI system picks it up, gets a cache hit from the developer’s build, and produces a release artifact. Everything looks green. But when you deploy to production, the service crashes with a mysterious error: version 'GLIBC_2.28' not found.
What went wrong? The answer lies in the subtle but dangerous interaction between Bazel's caching, remote execution, and differing glibc versions across your fleet.
In this article, we'll dive deep into how glibc versions can break build reproducibility and present several ways to fix it—from an interesting hack (which spawned this whole series) to the ultimate, most robust solution.
Suppose you have a pretty standard (corporate?) development environment like the following:
- Developer workstations (WS). This is where Bazel runs during daily development, and Bazel can execute build actions both locally and remotely.
- A CI system. This is a distributed cluster of machines that run jobs, including PR merge validation and production release builds. These jobs execute Bazel too, which in turn executes build actions both locally and remotely.
- The remote execution (RE) system. This is a distributed cluster of worker machines that execute individual Bazel build actions remotely.
- The production environment (PROD). This is where you deploy binary artifacts to serve your users. No build actions run here.
All of the systems above run some version of Linux, and it is tempting to keep that version in sync across them all: doing so would keep operations simpler and ensure that build actions behave consistently no matter where they execute. Unfortunately, this wish is both misguided and plain impossible.
It is misguided because you may not want to run the same Linux distribution everywhere: the desktop distribution you run on WS is probably not the best choice for the RE workers, the CI nodes, or production.
And it is plain impossible because, even if you aligned versions to the dot, you would have to take upgrades at some point: distributed upgrades must be rolled out over a period of time (weeks or even months) for reliability, so you'd have to deal with version skew anyway.
To make matters more complicated, the remote action cache (AC) is writable from WS, CI, and RE alike to maximize Bazel cache hits and optimize build times. This goes against security best practices (so there are mitigations in place to protect PROD), but it's a necessity to support an ongoing onboarding onto Bazel and RE.
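For concreteness, a WS or CI build in this setup might point at the shared RE cluster and AC along these lines (the endpoint is a placeholder, and the target pattern is just an example):

# Hypothetical invocation: execute actions on the RE cluster, read from its AC,
# and also upload the results of locally executed actions to that AC.
bazel build //... \
  --remote_executor=grpcs://re.example.com \
  --remote_cache=grpcs://re.example.com \
  --remote_upload_local_results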
The question becomes: can the Linux version skew among all machines involved cause problems with remote caching? It sure can because C and C++ build actions tend to pick up system-level dependencies in a way that Bazel is unaware of (by default), and those influence the output the actions produce.
glibc versions its symbols so that it can change their internal details while remaining backwards compatible at runtime. A binary records the version of every glibc symbol it was linked against, which means that binaries built against a newer glibc may not run on systems that only ship an older one. How is this a problem though?
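If you want to see this in action, you can list the versioned glibc symbols a binary imports; for example (the binary name is a placeholder):

# Show the dynamic symbols the binary needs, together with the glibc symbol
# version each one requires (e.g. GLIBC_2.28).
objdump -T ./prod-binary | grep 'GLIBC_'

If any of those versions is newer than what the target system's glibc provides, the dynamic loader refuses to start the binary with exactly the error shown at the beginning.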
We'll take a look by making the problem specific. Consider the following environment:
- Developers run Bazel on WS for their day-to-day work. These machines run a recent distribution that ships glibc 2.28.
- CI-1 runs Bazel to support development flows (PR merge-time checks) and to produce the binaries that ship to PROD. CI-2 sometimes runs builds too. The CI-1 machines and the RE workers run an older distribution that ships glibc 2.17.
- All of these systems can write to the AC that lives in RE.
- As it so happens, one of the C++ actions involved in the build of prod-binary, say //base:my_lib, has a local tag which forces the action to bypass remote execution.
This can lead to the following sequence of events:
- A developer runs a build on a WS. //base:my_lib has changed so it is rebuilt on the WS. The action uses the C++ compiler, so the object files it produces pick up the dependency on glibc 2.28.
- The result of the action is injected into the remote cache.
- CI-1 schedules a job to build prod-binary for release. This job runs Bazel on a machine with glibc 2.17 and leverages the RE cluster, which also runs glibc 2.17.
- Many C++ actions get rebuilt but //base:my_lib is reused from the cache.
- The production artifact now has a dependency on symbols from glibc 2.28.
Release engineering picks up the output of CI-1, deploys the production binary to PROD, and… boom, PROD explodes with the version 'GLIBC_2.28' not found error we saw at the beginning.
The fact that the developer WS could write to the AC is very problematic on its own, but we could encounter this same scenario if we first ran the production build on CI-2 for testing purposes and then reran it on CI-1 to generate the final artifact.
So, what do we do now? In a default Bazel configuration, C and C++ action keys are underspecified, and this leads to non-deterministic behavior when a mixture of host systems compiles them.
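You can see this underspecification for yourself by dumping one of the affected actions; its inputs and command line typically carry no trace of the host's glibc (the target label comes from the scenario above):

# Print the C++ compile action for //base:my_lib, including its inputs and
# command line; the host's glibc does not show up in either, so two machines
# with different glibc versions compute the same action key.
bazel aquery 'mnemonic("CppCompile", //base:my_lib)'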
Solution A: manually partition the AC
Let's start with the case where you aren't yet ready to lock down who can write to the AC, yet you want to prevent the obvious mistakes that lead to production breakage.
The idea here is to capture the glibc version that is used in the local and remote environments, pick the higher of the two, and make that version number an input to the C/C++ toolchain. This causes the version to become part of the cache keys and should prevent the majority of the mistakes we may see. WARNING: This is The Hack I recently implemented and that drove me to write this article series!
To implement this hack, the first thing we have to do is capture the local glibc version and expose it to the build. A minimal sketch, assuming the rule lives in a hypothetical tools/glibc package:
# tools/glibc/BUILD.bazel (hypothetical location).
# Extract the glibc version recorded by the workspace status command into a
# regular build output that other rules can depend on.
genrule(
    name = "glibc_version",
    outs = ["glibc_version.txt"],
    # stamp = 1 exposes bazel-out/stable-status.txt to this action, so the
    # action reruns whenever the recorded version changes. The grep fails if
    # the status key is missing, which is what we want.
    stamp = 1,
    cmd = "grep STABLE_GLIBC_VERSION bazel-out/stable-status.txt > $@",
)
Going through the workspace status file is necessary to keep the captured version fresh: the status command reruns on every build, so we never hit the case of reusing an old bazel-out tree against an upgraded system.
As a consequence, we need to modify the script pointed at by --workspace_status_command (you have one, right?) to emit the glibc version. A sketch, assuming a bash status script and assuming we know the RE workers' glibc version (2.17) out of band:
# Addition to the --workspace_status_command script (bash).
# getconf prints something like "glibc 2.28"; keep only the number.
local_glibc="$(getconf GNU_LIBC_VERSION | awk '{print $2}')"
# The glibc version of the RE worker image, assumed known and hard-coded here.
remote_glibc="2.17"
# Pick the higher of the two and publish it as a *stable* status key so that
# changes to it invalidate the actions that depend on it.
echo "STABLE_GLIBC_VERSION $(printf '%s\n%s\n' "${local_glibc}" "${remote_glibc}" | sort -V | tail -n 1)"
This captures the effective glibc version (the higher of the local and remote ones) and makes it available as a regular build output.
To finish the hack, we also have to make that captured version an input to the C/C++ toolchain so that it becomes part of every C/C++ action key, no matter whether the action runs locally or on the RE workers. A sketch, using hypothetical toolchain target names:
# BUILD file of your C++ toolchain (target names are illustrative).
# Adding the generated glibc_version.txt to the toolchain's file groups makes
# it an input of every C/C++ action, and therefore part of every action key.
filegroup(
    name = "compiler_files_and_glibc_version",
    srcs = [
        ":original_compiler_files",     # whatever your toolchain already used
        "//tools/glibc:glibc_version",  # the genrule defined earlier
    ],
)

cc_toolchain(
    name = "k8_toolchain",
    all_files = ":compiler_files_and_glibc_version",
    compiler_files = ":compiler_files_and_glibc_version",
    # ... all other attributes remain exactly as they were ...
)
With this, results produced against different glibc versions land under different cache keys: a WS on glibc 2.28 can no longer poison the AC entries that a glibc 2.17 release build will consume.
Solution B: securing the AC
Let's now look at actually securing the AC. The hack above only partitions the cache; it does not address the root cause, which is that any machine, regardless of its glibc version, can write to it. What we want is to ensure that only results built against the glibc version we control (that of the RE workers, 2.17 in our example) ever land in the cache, and the simplest way to get there is to stop WS and CI from uploading the results of locally executed actions. On the client side this is a one-line change; the flag is a real Bazel option, but the overall policy is just a sketch:
# .bazelrc fragment for WS and CI invocations.
# Locally executed actions (such as //base:my_lib with its "local" tag) no
# longer publish their results to the shared AC; only actions that actually
# ran on the RE workers, whose glibc version we control, get cached.
build --noremote_upload_local_results
This client-side flag is only a convention, though: for a real guarantee, the cache service itself must reject writes that do not come from the RE workers. Once that is in place, a developer's glibc 2.28 objects can never end up in the entries that a release build consumes.
Solution C: sysroot solution
Let's move on to the sysroot solution. This is the only solution that can give you 100% safety against the problem presented in this article. A sysroot is a self-contained tree of headers and libraries (glibc included) that the compiler and linker resolve against instead of the host's own files: if every C/C++ action builds against the same pinned sysroot, the glibc version of the machine running the action stops mattering altogether. A sketch of the wiring, where the URL, hash, and names are placeholders:
# WORKSPACE (or the equivalent module extension): fetch a pinned sysroot.
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

http_archive(
    name = "linux_sysroot",
    urls = ["https://example.com/sysroot-glibc-2.17.tar.xz"],  # placeholder
    sha256 = "<pinned hash>",                                  # placeholder
    build_file_content = """
filegroup(
    name = "sysroot",
    srcs = glob(["**"]),
    visibility = ["//visibility:public"],
)
""",
)

# In the cc_toolchain_config rule (the one that calls
# cc_common.create_cc_toolchain_config_info), point the compiler at the
# fetched tree and add @linux_sysroot//:sysroot to the toolchain's file
# groups so that every action, local or remote, sees the same glibc:
#
#   builtin_sysroot = "external/linux_sysroot"
With the compiler and linker resolving glibc headers and libraries from the pinned sysroot, the outputs no longer depend on whichever glibc the executing machine happens to run: every cache entry is valid everywhere, and the binaries you ship only ever require the glibc version the sysroot provides.
All in all, you need to take action. I'd strongly recommend that you go towards the sysroot solution because it's the only one that'll give you a stable path for years to come, but I also understand that it's hard to implement.
Therefore, take the solutions in the order I gave them to you: start with the hack to mitigate obvious problems, follow that up with securing the AC, and finally go down the sysroot rabbit hole.