This document is relevant for: Inf2, Trn1, Trn2
nki.language.loop_reduce

nki.language.loop_reduce(x, op, loop_indices, *, dtype=None, mask=None, **kwargs)
Apply a reduce operation over a loop. This instruction is well suited to computing a high-performance reduce_max or reduce_min.
Note: The destination tile is also the rhs input to op. For example,

```
b = nl.zeros((N_TILE_SIZE, M_TILE_SIZE), dtype=nl.float32, buffer=nl.sbuf)
for k_i in affine_range(NUM_K_BLOCKS):
    # Skipping over multiple nested loops here.
    # a is a PSUM tile from a matmul accumulation group.
    b = nl.loop_reduce(a, op=np.add, loop_indices=[k_i], dtype=nl.float32)
```

is the same as:

```
b = nl.zeros((N_TILE_SIZE, M_TILE_SIZE), dtype=nl.float32, buffer=nl.sbuf)
for k_i in affine_range(NUM_K_BLOCKS):
    # Skipping over multiple nested loops here.
    # a is a PSUM tile from a matmul accumulation group.
    b = nisa.tensor_tensor(data1=b, data2=a, op=np.add, dtype=nl.float32)
```
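Outside a NeuronCore, the feedback pattern above can be modeled in plain NumPy. This is only a sketch of the semantics, not NKI code: the sizes and the `a_tiles` list are illustrative stand-ins for the per-iteration PSUM tiles, and the point is that the destination tile `b` is fed back as the rhs input of `op` on every loop iteration.

```python
import numpy as np

# Illustrative sizes, standing in for N_TILE_SIZE, M_TILE_SIZE, NUM_K_BLOCKS.
N_TILE_SIZE, M_TILE_SIZE, NUM_K_BLOCKS = 4, 8, 3
rng = np.random.default_rng(0)

# One stand-in "PSUM tile" per loop iteration (e.g. a matmul result).
a_tiles = [rng.standard_normal((N_TILE_SIZE, M_TILE_SIZE)) for _ in range(NUM_K_BLOCKS)]

# Model of loop_reduce with op=np.add: b is both the destination and the
# rhs operand, so each iteration computes b = op(a, b).
b = np.zeros((N_TILE_SIZE, M_TILE_SIZE), dtype=np.float32)
for k_i in range(NUM_K_BLOCKS):
    b = np.add(a_tiles[k_i], b)
```

After the loop, `b` holds the elementwise sum of all per-iteration tiles, which is exactly what the tensor_tensor formulation above produces.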
If you are trying to use this instruction only for accumulating results on SBUF, consider simply using the += operator instead.

The loop_indices list enables the compiler to recognize which loops this reduction can be optimized across as part of any aggressive loop-level optimizations it may perform.

- Parameters:
x – a tile.
op – numpy ALU operator to use to reduce over the input tile.
loop_indices – a single loop index or a tuple of loop indices along which the reduction operation is performed. Can be numbers or loop_index objects coming from nl.affine_range.
dtype – (optional) data type to cast the output type to (see Supported Data Types for more information); if not specified, it will default to be the same as the data type of the input tile.
mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)
- Returns:
the reduced resulting tile
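Since the description highlights reduce_max as a primary use case, the same feedback pattern can be sketched with op=np.maximum. Again this is a plain-NumPy model of the semantics under assumed illustrative shapes, not runnable NKI code:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in per-iteration input tiles (illustrative shape).
tiles = [rng.standard_normal((2, 4)) for _ in range(5)]

# Model of loop_reduce with op=np.maximum: each iteration folds the new
# tile into the destination, yielding an elementwise running maximum.
b = np.full((2, 4), -np.inf)
for tile in tiles:
    b = np.maximum(tile, b)
```

The result matches an elementwise max taken across all iterations at once; on hardware, expressing it as a loop reduction lets the compiler keep the accumulator resident and optimize across the loop.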