This document is relevant for: Inf2, Trn1, Trn2
NKI Known Issues
This document outlines some of the known issues and limitations for the NKI beta release.
Unsupported Syntax:
Top-level tensors must be on HBM. The input and output tensors of the top-level NKI kernel (the kernel function decorated with nki_jit/nki.baremetal or called by JAX nki_call) must be located in HBM. We currently do not support using tensors stored in SBUF or PSUM as the input or output of the top-level kernel. Tensors must be loaded from HBM into SBUF before use, and output tensors must be stored from SBUF back into HBM. See nl.load and nl.store.
Indexing:
Tiles on SBUF/PSUM must have at least 2 dimensions as described here. If using a 1D tile on SBUF/PSUM, users may get an “Insufficient rank” error. Work around this by creating a 2D tile, e.g.:
    buf = nl.zeros((128, ), dtype=dtype, buffer=nl.sbuf)   # this won't work
    buf = nl.zeros((128, 1), dtype=dtype, buffer=nl.sbuf)  # this works
Users must index their [N, 1] or [1, M] shaped 2D buffers with both indices: use my_sbuf[0:N, 0] or my_sbuf[0, 0:M] to access them, since 1D access such as my_sbuf[0:N] won’t work.
Use nl.arange for indirect load/store access indexing; nl.mgrid won’t work. See code examples in nl.load and nl.store.
If indexing with [0, 0] gets internal errors, try using [0:1, 0:1] or nl.mgrid[0:1, 0:1] instead.
If indexing with [0:1, ...] gets internal errors, try using [0, ...] instead.
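The 2D-tile workaround above can be wrapped in a small shape helper. The following is a minimal sketch in plain Python. The name as_2d_shape is hypothetical and not part of NKI; it only normalizes a shape tuple before it is passed to an allocation call such as nl.zeros.

```python
def as_2d_shape(shape):
    """Return a shape with at least two dimensions.

    A 1D shape (N,) becomes (N, 1), mirroring the workaround of
    allocating a [N, 1] tile instead of a 1D tile. Shapes that already
    have two or more dimensions are returned unchanged.
    Hypothetical helper for illustration; not an NKI API.
    """
    shape = tuple(shape)
    if len(shape) == 0:
        raise ValueError("shape must have at least one dimension")
    if len(shape) == 1:
        return (shape[0], 1)
    return shape


# Intended use inside a kernel (shown as a comment because it needs NKI):
#   buf = nl.zeros(as_2d_shape((128,)), dtype=dtype, buffer=nl.sbuf)
```

A buffer allocated this way still has to be indexed with both indices, for example buf[0:128, 0].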
Masks conjunction: Use the & operator to combine masks. We do not support using the Python keyword “and” for masks. See examples in NKI API Masking.
nisa.bn_stats does not support masking on the reduction dimension: the mask passed to bn_stats must not contain any indices from the reduction dimension.
Partition dimension broadcasting is not supported on operator overloads (i.e., +, -, *, /, <<, >>, etc.); use nki.language APIs instead (e.g., nl.add, nl.multiply, …).
When the direct allocation API is used, non-IO HBM tensors are not supported.
All tensors declared with buffer=nl.shared_hbm must be returned as the result of the kernel.
Tensors declared with buffer=nl.hbm or buffer=nl.private_hbm are not allowed.
An error “[NKI005] (float32 [128, 512] %'<name of the hbm tensor>':5)0: DRAM location of kind Internal mapping failed. Only input/output/const DRAM location is supported!” will be thrown when such a tensor is encountered.
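For the mask-conjunction item in the list above, a plain-Python sketch may help show why & and the “and” keyword are not interchangeable for mask-like objects. The Mask class below is a toy stand-in and not an NKI type; it illustrates general Python semantics only and makes no claim about NKI internals.

```python
class Mask:
    """Toy element-wise boolean mask (illustrative stand-in, not an NKI type)."""

    def __init__(self, bits):
        self.bits = list(bits)

    def __and__(self, other):
        # `&` dispatches here, so the conjunction is computed per element.
        return Mask(a and b for a, b in zip(self.bits, other.bits))


rows = Mask([True, True, False, False])
cols = Mask([True, False, True, False])

# Element-wise conjunction: both conditions are combined.
combined = rows & cols

# `and` cannot be overloaded: Python checks the truthiness of `rows`
# (a plain object is truthy) and then returns `cols` unchanged, so the
# first condition is silently dropped.
shortcut = rows and cols
```

Here combined.bits is the per-element conjunction, while shortcut is simply the cols object.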
Unexpected Behavior:
Simulation using nki.simulate_kernel:
Custom data types like nl.float32r, nl.bfloat16, nl.float8_e4m3, and nl.float8_e5m2 simulate in fp32 precision. Also, NumPy API calls outside of the NKI kernel, such as np.allclose, may not work with the above types.
nl.rand generates the same values for subsequent calls to nl.rand().
nl.random_seed is a no-op in simulation.
nisa.dropout is a no-op in simulation.
Masks don’t work in simulation, and garbage data is generated in tensor elements that are supposed to be untouched based on API masking.
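The note about custom data types simulating in fp32 precision can be made concrete with a small sketch using only the Python standard library. It does not call NKI. The helpers to_float32 and to_bfloat16 are hypothetical names introduced here; to_bfloat16 rounds an fp32 value to bfloat16 precision with round-to-nearest-even and ignores NaN handling. The point is only to show the size of rounding difference that fp32 arithmetic would not exhibit.

```python
import struct


def to_float32(x):
    """Round a Python float to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bfloat16(x):
    """Round a Python float to bfloat16 precision, returned as a float.

    bfloat16 keeps the upper 16 bits of the fp32 encoding. This sketch
    uses round-to-nearest-even on the dropped bits and does not handle
    NaN specially. Hypothetical helper for illustration only.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


value = 1.001
as_fp32 = to_float32(value)   # stays close to 1.001
as_bf16 = to_bfloat16(value)  # collapses to 1.0 at bfloat16 precision
```

When comparing simulated results against a reference, it may therefore be worth choosing tolerances with the target data type's precision in mind rather than fp32's.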
Execution:
Profiler:
When using neuron-profile, use the flag --disable-dge to work around a temporary issue with DMA information. See the Profile using neuron-profile section for more details.
Optimization:
Users need to declare their NKI buffers as small as possible to avoid buffer overflow errors. An error “[GCA046] Some infinite-cost nodes remain” may mean there’s a buffer overflow; work around this by creating smaller local buffers.
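One way to reason about buffer sizes before compiling is to estimate how many bytes each partition holds for a given tile. The sketch below is plain Python and assumes the first axis of the tile shape is the partition dimension. The function names and the budget_bytes parameter are hypothetical; the actual per-partition capacity depends on the target device and should be taken from its documentation.

```python
from math import prod


def bytes_per_partition(shape, itemsize):
    """Bytes one partition holds for a tile.

    Assumes shape[0] is the partition dimension, so only the remaining
    (free) dimensions contribute to the per-partition footprint.
    """
    return prod(shape[1:]) * itemsize


def fits_budget(shape, itemsize, budget_bytes):
    """Check a tile against a caller-supplied per-partition budget.

    budget_bytes is an assumption supplied by the caller, not a value
    defined by NKI.
    """
    return bytes_per_partition(shape, itemsize) <= budget_bytes


# Example with an arbitrary 4 KiB budget chosen for illustration:
small = fits_budget((128, 512), itemsize=4, budget_bytes=4096)
large = fits_budget((128, 2048), itemsize=4, budget_bytes=4096)
```

If a tile does not fit, splitting the free dimension into smaller chunks is one way to follow the advice above.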
Compiler passes:
NKI ISA API calls may not map one-to-one to the generated hardware ISA instructions. The compiler may insert additional instructions to support these calls.