Uber has widely adopted Go as a primary programming language for implementing backend services and libraries due to its high performance. The Go monorepo is the largest codebase at Uber, comprising 90 million lines of code (and growing). This makes tooling for writing reliable Go code a critical part of our development infrastructure.
Pointers (variables that hold the memory addresses of other variables instead of their actual values) are an integral part of the Go programming language and facilitate efficient memory management and effective data manipulation. Therefore, programmers use pointers extensively in writing Go programs for various purposes such as in-place data modification, concurrent programming, easy data sharing, optimizing memory usage, and facilitating interfaces and polymorphism. While pointers are powerful and widely used, it is essential to use them carefully and judiciously to avoid common pitfalls like nil pointer dereferences causing nil panics.
The Nil Panic Problem
A nil panic is a runtime panic that occurs when a program attempts to dereference a nil pointer. When a pointer is nil, it means that it does not point to any valid memory address, and attempting to access the value it points to will result in a panic (i.e., a runtime error) with the error message shown in Figure 1.
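As a minimal illustration (not taken from Uber's code), the sketch below dereferences a nil pointer and recovers the resulting panic so that its message can be inspected. The helper name `deref` is hypothetical and exists only for this example.

```go
package main

import "fmt"

// deref returns the value p points to. If the dereference panics, the
// deferred function recovers and stores the panic message in msg instead.
// The function name is hypothetical and used only for this illustration.
func deref(p *int) (value int, msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	return *p, "" // evaluating *p panics when p is nil
}

func main() {
	var p *int // the zero value of a pointer type is nil
	_, msg := deref(p)
	fmt.Println(msg)
	// Prints: runtime error: invalid memory address or nil pointer dereference

	x := 42
	value, _ := deref(&x)
	fmt.Println(value) // Prints: 42
}
```

Recovering from the panic is done here only to make the message observable; in a real program an unrecovered nil panic terminates the process.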
Figure 2 shows an example of a recent nil panic problem in the implementation of the Go standard library, particularly its net package, that was discovered and resolved. The panic was caused on line 1859 due to a direct call to the method String() on the return value of method RemoteAddr(), assuming it to be always non-nil, as shown in Figure 2. This is problematic when the field c.rwc of interface type net.Conn is assigned with the struct net.conn, since its concrete implementation of RemoteAddr() can return a nil value if the connection c is found to be not OK (shown in Figure 3). Specifically, RemoteAddr() can return a nil interface value on L225, leading to a nil panic when a method is called (.String() here) on it, since the nil value contains no pointer to any concrete method that can be invoked.
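The pattern described above can be sketched with simplified stand-in types. The types `Addr`, `tcpAddr`, and `conn` and the functions `describeUnsafe` and `describeSafe` below are hypothetical and only mimic the shape of the real net package code; the guard shown is one possible defensive pattern, not necessarily the fix that was applied upstream.

```go
package main

import "fmt"

// Addr mimics the shape of an address interface with a String method.
type Addr interface{ String() string }

type tcpAddr string

func (a tcpAddr) String() string { return string(a) }

// conn is a hypothetical stand-in for a connection type.
type conn struct {
	open  bool
	raddr Addr
}

func (c *conn) ok() bool { return c != nil && c.open }

// RemoteAddr returns a nil interface value when the connection is not OK.
func (c *conn) RemoteAddr() Addr {
	if !c.ok() {
		return nil
	}
	return c.raddr
}

// describeUnsafe calls String() directly on the result, assuming it is non-nil.
func describeUnsafe(c *conn) string { return c.RemoteAddr().String() }

// describeSafe checks the returned interface value before calling a method on it.
func describeSafe(c *conn) string {
	if addr := c.RemoteAddr(); addr != nil {
		return addr.String()
	}
	return "<unknown>"
}

// panics reports whether calling f results in a panic.
func panics(f func()) (p bool) {
	defer func() { p = recover() != nil }()
	f()
	return false
}

func main() {
	closed := &conn{}
	fmt.Println(panics(func() { describeUnsafe(closed) })) // Prints: true
	fmt.Println(describeSafe(closed))                      // Prints: <unknown>

	open := &conn{open: true, raddr: tcpAddr("10.0.0.1:22")}
	fmt.Println(describeUnsafe(open)) // Prints: 10.0.0.1:22
}
```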
Nil panics are found to be an especially pervasive form of runtime errors in Go programs. Uber’s Go monorepo is no exception to this, and has witnessed several runtime errors in production because of nil panics, with effects ranging from incorrect program behavior to app outages, affecting Uber customers. Therefore, in order to maximize reliability and code quality, it is crucial for Uber to enable programmers to detect and fix nil panics early, before the buggy code gets deployed in production.
Nil panics can also be exploited for denial-of-service attacks. For example, CVE-2020-29652 is caused by a nil pointer dereference in the golang.org/x/crypto/ssh package that allows remote attackers to cause a denial of service against SSH servers.
There exists an automated tool, nilness, offered by the Go distribution for detecting nil panics. This nilness checker is a lightweight static analysis technique that reports only simple errors, such as obvious sites of nil dereferences (e.g., if x == nil { print(*x) }). However, such simple checks fail to capture the complex nil flows in real programs, such as the one shown in Figure 2. Therefore, we need a technique that performs rigorous analysis and is effective on production code.
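To make the distinction concrete, the sketch below contrasts a same-function pattern like the one quoted above with a nil flow that crosses a function boundary. All names (`user`, `localDeref`, `lookup`, `greet`) are hypothetical, and the example makes no claim about exactly which cases any particular checker reports; it only shows why the second case requires reasoning across functions.

```go
package main

import "fmt"

type user struct{ name string }

// localDeref has the nil check and the dereference in the same function,
// so the problem is visible without looking anywhere else.
func localDeref(x *int) int {
	if x == nil {
		return *x // always panics: x is known to be nil on this branch
	}
	return *x
}

// lookup returns nil for unknown names, because indexing a map of pointers
// with a missing key yields the zero value of the element type.
func lookup(users map[string]*user, name string) *user {
	return users[name]
}

// greet dereferences lookup's result without a check. Seeing the problem
// requires knowing what lookup can return, i.e. inter-procedural reasoning.
func greet(users map[string]*user, name string) string {
	return "hello, " + lookup(users, name).name
}

// panics reports whether calling f results in a panic.
func panics(f func()) (p bool) {
	defer func() { p = recover() != nil }()
	f()
	return false
}

func main() {
	users := map[string]*user{"ada": {name: "Ada"}}
	fmt.Println(greet(users, "ada"))                    // Prints: hello, Ada
	fmt.Println(panics(func() { localDeref(nil) }))     // Prints: true
	fmt.Println(panics(func() { greet(users, "bob") })) // Prints: true
}
```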
To deal with NullPointerExceptions (NPEs) in Java, Uber has developed NullAway. NullAway requires the code to be annotated with @Nullable annotations to guarantee NPE freedom at compile time. This limits the feasibility of directly adapting a NullAway-like technique for our purpose, since, unlike Java, Go does not have language support for annotations. Moreover, annotating a large codebase (e.g., Uber’s Go monorepo with 90 million lines of code) is a cumbersome task. Besides, Go’s unique features and idiosyncrasies present their own challenges.
Our answer to overcome these limitations? NilAway.
We designed and developed NilAway to automatically detect nil panics by employing sophisticated interprocedural static analysis and inference techniques. The design goals of NilAway were to place no annotation burden on developers, to keep the impact on local and CI build times minimal, and to address the many challenges posed by Go language idioms in ways that are natural to Go developers.
Core Idea of NilAway
Our main idea is that nilability flows in code can be modeled as a system of global typing constraints, which can then be solved using a 2-SAT algorithm to determine potential contradictions. At a high level, we capture both nilable and nonnil constraints at various program sites for struct fields, function parameters, and return values. An example of a nilable constraint is return x, where x is an uninitialized pointer, while the dereference, *x, is an example of a nonnil constraint. We then build a global implication graph modeling these program site-specific constraints. Finally, we traverse the implication graph – forward propagating known nilable values and backward propagating known nonnil values – to find contradictions. For a site, S, if a contradiction nilable(S) ∧ nonnil(S) is discovered in a program path of the implication graph, then it implies that a nil value is witnessed to flow from a nil source to the site S, from where it reaches a dereference point, which can likely cause a nil panic. NilAway collects and reports these contradictions as potential nil panics to the developer.
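The following toy sketch illustrates the general idea of finding contradictions in a graph of nil flows. It is not NilAway's algorithm: it only forward-propagates witnessed nil sources with a breadth-first search and omits backward propagation and 2-SAT solving entirely. The types `Site` and `Graph`, their methods, and the site labels are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Site is a hypothetical label for a program site (field, parameter, return).
type Site string

// Graph is a toy model: flows[a] lists the sites a value at a can flow to.
type Graph struct {
	flows   map[Site][]Site
	nilable map[Site]bool // sites where a nil value is witnessed, e.g. `return nil`
	nonnil  map[Site]bool // sites that require a nonnil value, e.g. a dereference
}

func NewGraph() *Graph {
	return &Graph{flows: map[Site][]Site{}, nilable: map[Site]bool{}, nonnil: map[Site]bool{}}
}

func (g *Graph) AddFlow(from, to Site) { g.flows[from] = append(g.flows[from], to) }
func (g *Graph) MarkNilable(s Site)    { g.nilable[s] = true }
func (g *Graph) MarkNonnil(s Site)     { g.nonnil[s] = true }

// Contradictions forward-propagates every witnessed nil source along the flow
// edges and returns one path for each nonnil site that the source can reach.
func (g *Graph) Contradictions() [][]Site {
	sources := make([]Site, 0, len(g.nilable))
	for s := range g.nilable {
		sources = append(sources, s)
	}
	sort.Slice(sources, func(i, j int) bool { return sources[i] < sources[j] })

	var paths [][]Site
	for _, src := range sources {
		parent := map[Site]Site{src: src} // doubles as the visited set
		queue := []Site{src}
		for len(queue) > 0 {
			cur := queue[0]
			queue = queue[1:]
			if g.nonnil[cur] {
				paths = append(paths, pathTo(parent, src, cur))
			}
			for _, next := range g.flows[cur] {
				if _, seen := parent[next]; !seen {
					parent[next] = cur
					queue = append(queue, next)
				}
			}
		}
	}
	return paths
}

// pathTo rebuilds the path from src to dst by following parent links backward.
func pathTo(parent map[Site]Site, src, dst Site) []Site {
	path := []Site{dst}
	for cur := dst; cur != src; {
		cur = parent[cur]
		path = append([]Site{cur}, path...)
	}
	return path
}

func main() {
	g := NewGraph()
	g.MarkNilable("conn.RemoteAddr: return nil")
	g.AddFlow("conn.RemoteAddr: return nil", "Conn.RemoteAddr result")
	g.AddFlow("Conn.RemoteAddr result", "receiver of .String()")
	g.MarkNonnil("receiver of .String()")
	for _, path := range g.Contradictions() {
		fmt.Println(len(path), "sites on unsafe flow:", path)
	}
}
```

Each returned path plays the role of a reported nil flow: it starts at a site where nil is witnessed and ends at a site that requires a nonnil value.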
Figure 4 shows the path through the implication graph built by NilAway for the nil flow for the example presented in Figure 2. Here the nodes are program sites that could be a nilable type and edges are the nil flows between them. NilAway traverses the implication graph to find unsafe flows modeling them as contradictions. A flow is deemed unsafe if a witnessed nil value is found to flow through different program paths to a destination where that same value is expected to be nonnil, such as in the case of the nil value flowing from the concrete implementation net.conn.RemoteAddr() to its dereference via method invocation on interface declaration net.Conn.RemoteAddr(). NilAway reports a detailed error message for this nil panic (as shown in Figure 5) that allows developers to easily debug through the exact nil flow from evidenced nilability to its dereference, and apply the necessary fix to prevent the nil panic.
Note that, in general, for practical static type systems, with or without global inference of types, there will always exist error-free programs that do not satisfy a valid static typing. In the case of NilAway, note that the above algorithm doesn’t capture cases where subtle inter-procedural invariants in the execution of the program would prevent the nil to nonnil flow from happening at runtime. For example, in Figure 3, it is possible that some shared program state is set up such that whenever c.ok() is called from conn.RemoteAddr(), it always returns true, in which case no nil panic exists in that code. However, in practice, NilAway’s false positive rate is low and the cases where such complex execution invariants inherently prevent inferring proper nilness constraints tend to be associated with likely code smells.
Design and Implementation of NilAway
We designed and developed NilAway around the following four key requirements to make it a practical tool for Uber scale:
- Low latency: NilAway should incur only a low overhead in performing its analysis on the large Go codebase. We want NilAway to give developers immediate feedback when they introduce a potential nil panic, thereby requiring NilAway to be fast enough to run with low latency at every stage of our development pipeline, even during local builds. A high overhead would mean higher latency (delayed feedback), thereby reducing developer productivity.
- High effectiveness: NilAway should have a low false positive rate; inspecting false positive nil panics wastes developer time.
- Fully automated: NilAway should be fully automated, requiring no additional input from developers (e.g., annotations as in NullAway, or contrived coding patterns).
- Tailored to Go’s idiosyncrasies: NilAway should treat the idiosyncrasies in Go as first-class citizens and devise a system tailored to Go.
NilAway is implemented in Go and uses the go/analysis framework for the analysis of code. Figure 6 shows an overview of NilAway’s architecture. NilAway takes as input standard Go code, in the form of a target package path containing the code, and returns as output the potential nil panic errors that it identifies through its analysis. NilAway is implemented as an analyzer that can be used as an independent tool or, optionally, can also be easily integrated into a build system, such as Bazel, with existing analyzer drivers, such as nogo.
Broadly, the implementation of NilAway can be divided into 3 components: the Analyzer Engine, the Inference Engine, and the Error Engine. The Analyzer Engine is responsible for identifying all potential nil flows within a function independently (i.e., intra-procedurally), while the Inference Engine is responsible for collecting witnessed nilability values for different program sites and propagating this information through inter-procedural flows by building the implication graph. Finally, the Error Engine accumulates the information from both the Analyzer Engine and the Inference Engine, and marks each potential nil flow (intra- and inter-procedural) as safe or unsafe. Unsafe nil flows are then reported to the user as potential nil panic errors.
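To give a flavor of what a per-function pass over Go source can look like, the sketch below uses only the standard go/parser and go/ast packages to list functions that contain a literal `return nil`. This is a deliberately tiny illustration under stated assumptions: NilAway itself is built on the go/analysis framework and tracks far more than this, and the function `nilReturners` is hypothetical.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// nilReturners parses src and returns, in source order, the names of the
// declared functions and methods whose body contains a return statement with
// a literal nil operand. Nested function literals are skipped so that their
// returns are not attributed to the enclosing function.
func nilReturners(src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, 0)
	if err != nil {
		return nil, err
	}
	var names []string
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Body == nil {
			continue
		}
		found := false
		ast.Inspect(fn.Body, func(n ast.Node) bool {
			switch n := n.(type) {
			case *ast.FuncLit:
				return false // do not descend into closures
			case *ast.ReturnStmt:
				for _, res := range n.Results {
					if id, ok := res.(*ast.Ident); ok && id.Name == "nil" {
						found = true
					}
				}
			}
			return true
		})
		if found {
			names = append(names, fn.Name.Name)
		}
	}
	return names, nil
}

func main() {
	const src = `package p

type T struct{}

func find(ok bool) *T {
	if !ok {
		return nil
	}
	return &T{}
}

func build() *T { return &T{} }
`
	names, err := nilReturners(src)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(names) // Prints: [find]
}
```

Because each function body is inspected independently, a pass of this shape can be run on many functions in parallel.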
Powered by this constraint-based approach to detecting nil panics, NilAway satisfies the requirements listed above:
- NilAway is fast. Independent analysis of each function in the Analyzer Engine makes it amenable to parallelization, which is a major performance enhancer. Furthermore, we have designed NilAway to construct the global implication graph incrementally by leveraging build cache, avoiding expensive re-building of the dependencies. This careful engineering makes NilAway fast and scalable, making it suitable for large codebases. In our measurements at Uber, we have observed that NilAway added only a small overhead (less than 5%) to the normal build process.
- NilAway is practical. To keep NilAway precise, the Analyzer Engine is designed and implemented to support many common Go language idiosyncrasies. Our Error Engine is also carefully designed to report errors only when an unsafe nil flow is evidenced. That said, we don’t claim our approach to be either sound or complete; our north star is practical bug finding. NilAway may incur both false positives and false negatives, and we are continuously working to reduce them. NilAway has been observed to work well in practice when deployed at Uber (as discussed subsequently), catching most of the potential nil panics in new code while maintaining a good balance between usefulness and performance overhead.
- NilAway is fully automated. Our constraint-based approach makes it a natural fit for inference, which allows NilAway to operate in a fully automated mode with no annotations required.
Using NilAway at Uber
NilAway is deployed centrally in the Go monorepo, integrating tightly with the Bazel+Nogo framework, allowing it to run as a default linter on every build in the CI pipeline and local builds. The error reporting is, however, in the testing phase, where nil panic errors are only reported for services in the Go monorepo that are onboarded onto NilAway.
For service owners, we currently offer two options of error reporting: (1) comprehensive and blocking, and (2) stop-the-bleed and non-blocking.
In the first option, NilAway causes the build to fail if any errors are found (suppressions are possible if needed, through //nolint:nilaway). NilAway comprehensively reports errors on all code, existing and new. This option is preferable for ensuring a nil-panic-free codebase. However, it requires all reported nil panics in the service’s code to be addressed before any build is allowed to pass. This may incur a high upfront cost for the service’s development, which can cause friction among service owners.
To address the above problem, we offer a lightweight version in Option 2, in which we only report NilAway errors for changed code in the service. These errors are directly reported in a non-blocking way on every differential code revision (i.e., a pull request) of the onboarded service. This stop-the-bleed approach helps to prevent new nil panics from being introduced into the service code, while allowing teams to gradually address nil panics in existing code without the need for a development-slowing upfront onboarding effort.
We have onboarded several services at Uber onto NilAway, across both options, and the overall feedback that we have received from the teams has been positive. One user says that NilAway has helped their team catch issues early, preventing deployment rollbacks, while another says, “The comments left by NilAway are very actionable and it hasn’t caused any noise.” Users also actively report the false positives that they encounter and suggest usability improvements, which we actively work on.
Impactful Example
We now discuss one interesting case, where NilAway reported an important error in a service that was logging over 3,000 nil panics per day in production. Figure 7 shows a simplified and redacted excerpt of the code causing the nil panic. This example uses Go’s message-passing construct, the channel. On line L16, the function call to t.s.foo(…) returns a channel ch, from which a value is then received into the variable a. Unfortunately, Go allows receiving from a closed channel, in which case the zero value of the element type (here, nil) is returned. If the code path L7->L8->L5 is taken in the function foo, the channel is closed without anything written to it. This causes a nil panic at the dereference point a.Items[*id] on line L17. NilAway correctly reported this error since it witnessed an unsafe dereference on a variable that may be received from a closed channel.
The fix for this problem is to properly guard the receive from a closed channel, either using the ok construct of Go (e.g., if a, ok := <-t.s.foo(…); ok { … }) or by a nilness check on the result variable a (e.g., if a != nil { … }) before the dereference on L17. Our developers applied the nilness check fix right after NilAway reported this error, and the impact was remarkable: the service went from logging 3,000+ nil panics daily to 0, as shown in Figure 8.
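The failure mode and both guards can be reproduced in a small self-contained sketch. The names `result`, `fetch`, `lookupUnsafe`, `lookupWithOk`, and `lookupWithNilCheck` are hypothetical and only mirror the shape of the redacted service code.

```go
package main

import "fmt"

type result struct{ Items map[string]int }

// fetch returns a channel of results. When fail is true, the channel is
// closed without anything having been sent on it.
func fetch(fail bool) <-chan *result {
	ch := make(chan *result, 1)
	if !fail {
		ch <- &result{Items: map[string]int{"a": 1}}
	}
	close(ch)
	return ch
}

// lookupUnsafe dereferences whatever it receives. Receiving from a closed,
// empty channel yields the zero value (nil), so a.Items panics in that case.
func lookupUnsafe(fail bool, id string) int {
	a := <-fetch(fail)
	return a.Items[id]
}

// lookupWithOk uses the two-value receive to detect a closed channel.
func lookupWithOk(fail bool, id string) (int, bool) {
	if a, ok := <-fetch(fail); ok {
		return a.Items[id], true
	}
	return 0, false
}

// lookupWithNilCheck checks the received value before dereferencing it.
func lookupWithNilCheck(fail bool, id string) (int, bool) {
	if a := <-fetch(fail); a != nil {
		return a.Items[id], true
	}
	return 0, false
}

// panics reports whether calling f results in a panic.
func panics(f func()) (p bool) {
	defer func() { p = recover() != nil }()
	f()
	return false
}

func main() {
	fmt.Println(panics(func() { lookupUnsafe(true, "a") })) // Prints: true
	fmt.Println(lookupWithOk(true, "a"))                    // Prints: 0 false
	fmt.Println(lookupWithNilCheck(false, "a"))             // Prints: 1 true
}
```

Note that the two guards are not equivalent in general: the two-value receive detects a closed channel, while the nil check also covers a nil value that was explicitly sent.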
Using NilAway for Your Code
We are happy to announce that NilAway is now open source at https://github.com/uber-go/nilaway/. We believe NilAway will be useful for any individual or team that implements code in Go and wants to ensure a nil-panic-free codebase.
Setting up NilAway is fairly straightforward. It can be used as a standalone checker or integrated with existing drivers. Refer to the README and wiki for more details.
Try NilAway today and let us know your experience. We also welcome contributions from the community.
Acknowledgements
NilAway began as the internship project of Joshua Turcotti (Uber intern ’22) and benefited from the very significant contributions of the following Uber Ph.D. interns: Shubham Ugare, Narges Shadab, and Zhiqiang Zang. We also would like to thank the Go monorepo team at Uber for collaborating with us in building NilAway, with special thanks to Dmitriy Shirchenko.
Header image by Tanmayee Deshprabhu via flickr under the Creative Commons license.
Sonal Mahajan
Sonal Mahajan is a Senior Software Engineer on Uber’s Programming Systems team. Her research interests cover software engineering and artificial intelligence, with a particular focus on using program analysis and machine learning to develop automated tools for improving code quality, reliability, and developer productivity.
Yuxin Wang
Yuxin Wang is a Software Engineer on Uber’s Programming Systems team. His research interests include general static analysis, software verification, and program repair and synthesis. His current work includes building fast tooling for improving software reliability, as well as automated code review tools for developer productivity.
Lazaro Clapp
Lazaro Clapp is a Staff Engineer and TLM on Uber's Programming Systems team. His current focus is on improving application reliability using fast type-system based tools. His broader research interests include general static and dynamic analysis, modeling of third-party code behavior, and automated testing and code repair. https://lazaroclapp.com/
Raj Barik
Raj Barik is a former Principal Engineer and TLM in the Programming Systems group at Uber. He led the Programming Systems group and delivered a number of impactful program analysis tools to reduce infrastructure cost, improve code quality, and increase developer velocity. His broad research interests include Programming Languages, Program analysis, Compilers, and Performance optimization tooling.
Posted by Sonal Mahajan, Yuxin Wang, Lazaro Clapp, Raj Barik