This document is relevant for: Inf2, Trn1, Trn2
Neuron Kernel Interface (NKI) - Beta
Neuron Kernel Interface (NKI) is a bare-metal language and compiler for directly programming NeuronDevices available on AWS Trn/Inf instances. You can use NKI to develop, optimize, and run new operators directly on NeuronCores while making full use of available compute and memory resources. NKI empowers ML developers to self-serve and invent new ways to use the NeuronCore hardware, starting with NeuronCore v2 (Trainium1) and later generations.
NKI gives developers direct access to the NeuronCore ISA (Instruction Set Architecture) from a Python-based programming environment whose syntax and tile-level semantics are similar to Triton and NumPy. This lets developers get started quickly and optimize performance in a familiar environment while retaining full control of the underlying hardware. At the hardware level, NeuronCore's tensorized memory access capability enables efficient reading and writing of multi-dimensional arrays on a per-instruction basis, which makes NKI's tile-based programming model a natural fit for the NeuronCore instruction set.
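To make the tile-level style concrete, here is a minimal sketch in plain NumPy rather than NKI itself. It emulates the load, compute, store pattern that tile-based programming follows: each step operates on a whole multi-dimensional tile instead of a single element. The function name `tiled_add` and the tile height of 128 rows are assumptions chosen for illustration, not part of the NKI API.

```python
import numpy as np

# Assumed tile height for this sketch only; it is not taken from the text above.
TILE_ROWS = 128


def tiled_add(a, b, tile_rows=TILE_ROWS):
    """Add two 2-D arrays one row-tile at a time (hypothetical helper)."""
    assert a.ndim == 2 and a.shape == b.shape
    out = np.empty_like(a)
    for start in range(0, a.shape[0], tile_rows):
        rows = slice(start, min(start + tile_rows, a.shape[0]))
        a_tile = a[rows, :]        # "load" one tile of each input
        b_tile = b[rows, :]
        c_tile = a_tile + b_tile   # compute on the whole tile at once
        out[rows, :] = c_tile      # "store" the result tile
    return out


if __name__ == "__main__":
    x = np.arange(300 * 4, dtype=np.float32).reshape(300, 4)
    print(np.array_equal(tiled_add(x, x), 2 * x))
```

In an actual NKI kernel the loads and stores would move data between device memories and the compute step would map to hardware instructions; the sketch only mirrors the shape of the code.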
For comparison, before NKI was introduced, the only way to program NeuronDevices was by defining high-level ML models in frameworks such as PyTorch and JAX. The Neuron Compiler takes such high-level model definitions as input, performs multiple rounds of optimization, and eventually generates a NEFF (Neuron Executable File Format) that is executable on NeuronDevices. At a high level, the Neuron Compiler runs the following optimization stages in order:
1. Hardware-agnostic graph-level optimizations. These transformations are done in the compiler front-end, using XLA, and include optimizations such as constant propagation, re-materialization, and operator fusion.
2. Loop-level optimization. The compiler turns the optimized graph from Step 1 into a series of loop nests and performs layout, tiling, and loop fusion optimizations.
3. Hardware intrinsics mapping. The compiler maps the architecture-agnostic loop nests from Step 2 into architecture-specific instructions.
4. Hardware-specific optimizations. These optimizations are mainly done at the instruction level [1] in the compiler back-end, with the key goals of reducing memory pressure and improving instruction-level parallelism. For example, memory allocation and instruction scheduling are done in this stage.
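As an illustration of what tiling in the loop-level stage means, the sketch below rewrites a matrix multiplication as a loop nest over fixed-size tiles. It is a generic NumPy example, not the Neuron Compiler's actual transformation; the function name `matmul_tiled` and the default tile size of 64 are assumptions made for this example.

```python
import numpy as np


def matmul_tiled(a, b, tile=64):
    """Matrix multiply expressed as a loop nest over tiles (hypothetical helper)."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((m, n), dtype=np.result_type(a, b))
    for i in range(0, m, tile):          # tiles along output rows
        for j in range(0, n, tile):      # tiles along output columns
            for p in range(0, k, tile):  # tiles along the contraction dimension
                # NumPy clips slices at the array edge, so partial tiles work.
                out[i:i + tile, j:j + tile] += (
                    a[i:i + tile, p:p + tile] @ b[p:p + tile, j:j + tile]
                )
    return out


if __name__ == "__main__":
    a = np.arange(70 * 50).reshape(70, 50)
    b = np.arange(50 * 30).reshape(50, 30)
    print(np.array_equal(matmul_tiled(a, b, tile=16), a @ b))
```

Choosing the tile size and loop order is the kind of decision a compiler makes automatically at this stage, and the kind an NKI developer makes by hand.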
NKI kernels bypass the first three steps and are compiled into IRs (intermediate representations) that the compiler's back-end (Step 4 above) can consume directly. Advanced NKI features, such as direct allocation, also allow programmers to bypass certain compiler passes in Step 4. As a result, NKI developers have fine-grained control over NeuronDevices, down to the instruction level. We highly recommend that developers study the underlying hardware architecture before optimizing the performance of their NKI kernels. See the NKI guide below to learn more.
Guide
The NKI guide is organized in four parts:
API Reference Guide has the NKI API reference manual.
Writing Functional NKI Kernels includes guides designed to help NKI beginners learn key NKI concepts and implement kernels that meet functional requirements.
Writing Performant NKI Kernels includes a deep dive into the NeuronDevice architecture and programmer's guides for optimizing the performance of NKI kernels.
General Resources includes miscellaneous guides.
API Reference Guide
Writing Functional NKI Kernels
Writing Performant NKI Kernels
General Resources
Footnotes