When Software Relies on Undocumented Behavior: The Tale of Restartable Sequences and TCMalloc

From Yogawife, the free encyclopedia of technology

The Unseen Consequences of Hyrum's Law

Hyrum's Law, a principle from software engineering, states that any observable behavior—even if not part of the official documentation—will eventually be depended upon by someone. This law often creates a tension between the desire to evolve systems and the need to maintain backward compatibility. The Linux kernel community is currently grappling with a vivid example of this phenomenon, involving the restartable sequences mechanism, Google's TCMalloc memory allocator, and the kernel's strict no-regressions policy.
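A small, generic illustration may help make the principle concrete. The following Python sketch is unrelated to the kernel or to TCMalloc; the names list_ids_v1, list_ids_v2, and largest_id are hypothetical and exist only for this example. Both versions of the function honor the same documented contract, yet a caller that leaned on an unpromised detail (sorted output) silently changes behavior when the implementation is optimized.

```python
def list_ids_v1(records):
    """Documented contract: return every record id, in unspecified order."""
    # Implementation detail: this version happens to sort its result.
    return sorted(r["id"] for r in records)


def list_ids_v2(records):
    """Same documented contract, with the sort removed as an optimization."""
    return [r["id"] for r in records]


def largest_id(list_ids, records):
    # Hyrum's Law in action: the caller assumes the last element is the
    # largest, which the contract never promised.
    return list_ids(records)[-1]


records = [{"id": 3}, {"id": 1}, {"id": 2}]

# Both versions honor the documented contract (same ids, any order)...
assert set(list_ids_v1(records)) == set(list_ids_v2(records))

# ...but only v1 satisfies the caller's undocumented assumption.
print(largest_id(list_ids_v1, records))  # 3
print(largest_id(list_ids_v2, records))  # 2
```

Neither version is wrong with respect to its documentation, which is what makes this kind of breakage hard to assign blame for.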

What Are Restartable Sequences?

Restartable sequences (rseq) are a Linux kernel feature introduced to speed up per-CPU operations. They let user-space code execute a short sequence of instructions atomically with respect to specific kernel events, such as preemption, signal delivery, or migration to another CPU. If the sequence is interrupted before it commits, the kernel transfers control to an abort handler supplied by user space, which typically retries the sequence from the beginning. This is particularly useful for libraries that manage per-CPU data structures, such as memory allocators, because it lets them avoid costly locks and atomic instructions.

The documented API for restartable sequences is small. Each thread uses the rseq() system call to register a memory area that it shares with the kernel, and it describes a critical section to the kernel by its start address, its length, and its abort address. The kernel guarantees that the critical section either runs to its final commit instruction without interruption or is aborted. The documentation lists the events that can trigger an abort: preemption, signal delivery, and migration to another CPU.
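The interrupt-then-retry behavior described above can be modeled in a few lines. The following Python sketch is a toy simulation under stated assumptions, not the kernel interface: ToyKernel, maybe_interrupt, and percpu_increment are hypothetical names, an exception stands in for the kernel redirecting execution, and real rseq critical sections are written in assembly so that the kernel can reason about instruction addresses.

```python
class Aborted(Exception):
    """Raised by the toy kernel when a critical section is interrupted."""


class ToyKernel:
    def __init__(self, migrations):
        # Each entry is the CPU the thread lands on after one simulated
        # interruption; entries are consumed in order.
        self.cpu = 0
        self._pending = list(migrations)

    def maybe_interrupt(self):
        if self._pending:
            self.cpu = self._pending.pop(0)
            raise Aborted


def percpu_increment(kernel, counters):
    """Increment the counter of whichever CPU the thread ends up on."""
    attempts = 0
    while True:
        attempts += 1
        cpu = kernel.cpu                # read the current CPU number
        new_value = counters[cpu] + 1   # prepare the update privately
        try:
            kernel.maybe_interrupt()    # an interruption may land here
        except Aborted:
            continue                    # abort path: discard work, retry
        counters[cpu] = new_value       # commit with a single store
        return cpu, attempts


counters = [0, 0, 0, 0]
kernel = ToyKernel(migrations=[1, 2])   # two interruptions, then quiet
cpu, attempts = percpu_increment(kernel, counters)
print(cpu, attempts, counters)          # 2 3 [0, 0, 1, 0]
```

The point of the model is that nothing is written to shared state until the final store, so an aborted attempt leaves no trace and can simply be repeated on the new CPU.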

TCMalloc and Its Reliance on Undocumented Details

Google's TCMalloc is a high-performance memory allocator designed for multithreaded applications. To achieve low overhead, TCMalloc heavily uses the restartable sequences feature for per-CPU caching of memory blocks. However, over time, TCMalloc began to depend not only on the documented API but also on specific implementation details of the kernel's restartable sequences handling.

Specifically, TCMalloc relied on the kernel not restarting a sequence under certain conditions where the documented API suggested it might. For example, TCMalloc assumed that if a sequence was interrupted by a CPU migration immediately after a particular memory write, the restart would preserve certain register values. This dependency was never part of the kernel's official contract, yet it became a critical assumption for TCMalloc's performance and correctness.

As Hyrum's Law predicts, this undocumented behavior became a de facto part of the system's interface—until kernel developers decided to change it.

The 6.19 Kernel Changes and the Clash

In the 6.19 release, kernel developers improved the restartable sequences implementation to fix performance problems and prepare for future enhancements. The changes were carefully designed to preserve the documented API: all promised behaviors remained intact. However, the internal mechanics shifted slightly—most notably, the conditions under which a sequence would be restarted became more precise.

This subtle shift broke TCMalloc's undocumented assumptions. On 6.19, sequences were occasionally no longer restarted in the way TCMalloc expected, which led to data corruption and crashes. The kernel had not violated its documented contract; TCMalloc had come to depend on a specific, unintended implementation detail.

The kernel's no-regressions rule, which mandates that new kernel versions must not break existing user-space applications, put the development team in a bind. Even though TCMalloc was the one violating the documented API by relying on implementation-specific behavior, the kernel could not simply tell users to fix their allocator. The no-regressions policy requires finding a multi-year transition path or a compatibility mechanism.

Accommodating TCMalloc Without Breaking Others

To resolve the conflict, kernel developers had to introduce a compatibility mode that mimics the old implementation details for processes that use TCMalloc. A new kernel configuration option provides this mode, and the old restart behavior can then be enabled through a sysctl or a per-process attribute. Applications that rely only on the documented API get the improved performance, while TCMalloc (and any other software with similar undocumented dependencies) can continue to work unchanged.
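One way to picture how a system-wide switch and a per-process attribute could combine is the sketch below. It is purely illustrative: the Process class, the legacy_rseq field, and the rule that the per-process setting takes precedence are all assumptions made for this example, and the article does not specify how the actual mechanism resolves the two.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Process:
    name: str
    # Hypothetical per-process attribute: None means "no preference".
    legacy_rseq: Optional[bool] = None


def uses_legacy_rseq(proc: Process, system_default: bool) -> bool:
    """Resolve the effective mode; the per-process setting wins if present."""
    if proc.legacy_rseq is not None:
        return proc.legacy_rseq
    return system_default


system_default = False  # stands in for the system-wide switch (sysctl)
procs = [
    Process("uses-documented-api"),
    Process("uses-tcmalloc", legacy_rseq=True),
]
for p in procs:
    mode = "legacy" if uses_legacy_rseq(p, system_default) else "new"
    print(f"{p.name}: {mode}")
```

Under these assumptions, only the process that opts in pays the cost of the old behavior, which matches the stated goal of not penalizing software that stays within the documented API.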

This solution is not ideal—it adds complexity and maintenance burden—but it aligns with the principle that user space must not be broken by kernel changes. The kernel community hopes that over time, TCMalloc will be updated to rely only on the documented API, allowing the compatibility mode to be deprecated.

Lessons for Software Developers

This incident serves as a cautionary tale about the dangers of relying on undocumented behavior. While Hyrum's Law reminds us that such dependencies are inevitable, developers can mitigate risks by:

  • Strictly adhering to official APIs and avoiding assumptions about implementation internals.
  • Participating in the development of those APIs to ensure they meet performance needs without requiring undocumented shortcuts.
  • Using abstraction layers that can be updated independently of the underlying system.
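The last point can be sketched in code. In the following Python example, which is a generic illustration and not how any real allocator is structured, callers see only increment() and total(); the class names, the make_counter factory, and the fast_path_available flag are hypothetical. Because no caller touches backend internals, a backend that turns out to depend on fragile behavior can be swapped for the portable one without changing calling code.

```python
import threading


class LockedCounter:
    """Portable backend that relies only on documented locking primitives."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def increment(self):
        with self._lock:
            self._value += 1

    def total(self):
        with self._lock:
            return self._value


class ShardedCounter:
    """Alternative backend: one slot per shard, summed on read."""

    def __init__(self, shards):
        self._slots = [0] * shards
        self._locks = [threading.Lock() for _ in range(shards)]

    def increment(self, shard=0):
        with self._locks[shard]:
            self._slots[shard] += 1

    def total(self):
        return sum(self._slots)


def make_counter(fast_path_available, shards=4):
    # Callers depend only on increment() and total(), so the backend can be
    # replaced when the underlying system changes.
    return ShardedCounter(shards) if fast_path_available else LockedCounter()


for available in (True, False):
    counter = make_counter(available)
    for _ in range(1000):
        counter.increment()
    print(type(counter).__name__, counter.total())
```

The design choice is to keep the fast path an internal detail chosen at construction time, so that a kernel or library change affects one class instead of every call site.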

The kernel's no-regressions rule, while sometimes frustrating, ultimately protects the stability that millions of users rely on. The challenge is balancing innovation with compatibility—a challenge that the restartable sequences saga illustrates perfectly.

For more on how the kernel community handles similar issues, see the sections on Hyrum's Law and the 6.19 kernel changes above.