This document is relevant for: Inf2, Trn1, Trn2

nki.isa.activation_reduce

nki.isa.activation_reduce(op, data, *, reduce_op, reduce_res, bias=None, scale=1.0, mask=None, dtype=None, **kwargs)

Perform the same computation as nisa.activation and also reduce the nisa.activation result along its free dimension, using the Scalar Engine.

Refer to nisa.activation for the semantics of op, data, bias, and scale.

In addition to the nisa.activation computation, this API reduces the nisa.activation result along its free dimension(s), at a small additional performance cost. The reduction result is written in place into reduce_res, which must be an SBUF or PSUM tile with the same partition axis size as the input tile data and exactly one element per partition. On NeuronCore-v2, reduce_op can only be addition (np.add or nl.add).

The reduction axis is not configurable in this API. If the input tile has multiple free axes, the API reduces across all of them.
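To make the reduction semantics above concrete, here is a plain-Python reference sketch. It is not NKI code and not how the hardware computes the result. It assumes a tile is modeled as nested Python lists whose outermost index is the partition axis. It also assumes reduce_res is modeled as a list of one-element lists, mirroring its (partition, 1) shape. The helper names `_flatten` and `reduce_free_axes_into` are hypothetical, and `operator.add` stands in for np.add / nl.add.

```python
import functools
import operator


def _flatten(x):
    """Yield every scalar in a nested list (all free axes of one partition)."""
    if isinstance(x, list):
        for item in x:
            yield from _flatten(item)
    else:
        yield x


def reduce_free_axes_into(tile, reduce_res, reduce_op=operator.add):
    """Hypothetical model: reduce all free axes of `tile`, writing one value
    per partition into the preallocated `reduce_res` in place."""
    assert len(reduce_res) == len(tile), "partition axis sizes must match"
    for p, partition in enumerate(tile):
        # Every free axis is reduced; there is no way to pick a single axis.
        reduce_res[p][0] = functools.reduce(reduce_op, _flatten(partition))


# A tile with partition size 2 and two free axes of shape (2, 3).
tile = [
    [[1, 2, 3], [4, 5, 6]],
    [[0, 1, 0], [1, 0, 1]],
]
reduce_res = [[0], [0]]  # shape (2, 1): one element per partition
reduce_free_axes_into(tile, reduce_res)
print(reduce_res)  # [[21], [3]]
```

Both free axes collapse into a single value per partition, so the output shape is (2, 1) regardless of how many free axes the input has.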

Mathematically, this API performs the following computation:

\[\begin{split}\text{output} = f_{\text{act}}(\text{data} \cdot \text{scale} + \text{bias}) \\ \text{reduce\_res} = \text{reduce\_op}(\text{output}, \text{axis}=\langle\text{FreeAxis}\rangle)\end{split}\]
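The formula above can be read as the following plain-Python reference model. This is a hedged sketch for a 2-D tile of shape (P, N), not the NKI implementation. The function name `activation_reduce_reference` is hypothetical, and any Python callable stands in for the activation op. Summation stands in for reduce_op, since addition is the only option on NeuronCore-v2. The sketch assumes scale is either a scalar or a per-partition list, and bias is either a per-partition list or None.

```python
def activation_reduce_reference(act, data, reduce_res, bias=None, scale=1.0):
    """Hypothetical reference: returns act(data * scale + bias) and writes the
    per-partition sum of that result into reduce_res in place."""
    num_partitions = len(data)
    # Broadcast a scalar scale to every partition; a list is used as-is.
    scales = [scale] * num_partitions if isinstance(scale, (int, float)) else scale
    biases = [0.0] * num_partitions if bias is None else bias
    output = []
    for p, row in enumerate(data):
        # Multiply by scale first, then add bias, then apply the activation.
        out_row = [act(x * scales[p] + biases[p]) for x in row]
        output.append(out_row)
        # Sum along the free axis into the (P, 1) reduction tile.
        reduce_res[p][0] = sum(out_row)
    return output


data = [[0.0, 1.0], [2.0, 3.0]]
reduce_res = [[0.0], [0.0]]
out = activation_reduce_reference(lambda x: x * x, data, reduce_res,
                                  bias=[1.0, -1.0], scale=2.0)
print(out)         # [[1.0, 9.0], [9.0, 25.0]]
print(reduce_res)  # [[10.0], [34.0]]
```

The returned list plays the role of the instruction's output tile, which keeps the same layout as data. The reduction, by contrast, arrives only through the reduce_res argument.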

Estimated instruction cost:

max(MIN_II, N) + MIN_II Scalar Engine cycles, where

  • N is the number of elements per partition in data, and

  • MIN_II is the minimum instruction initiation interval for small input tiles. MIN_II is roughly 64 engine cycles.
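As a worked example of the cost formula above, the sketch below evaluates it in Python. It assumes MIN_II = 64, the approximate value quoted above, so the results are rough estimates only. `estimated_cycles` is a hypothetical helper, not part of NKI.

```python
MIN_II = 64  # roughly 64 engine cycles, per the estimate above


def estimated_cycles(n_elements_per_partition, min_ii=MIN_II):
    """Estimated Scalar Engine cycles for activation_reduce:
    max(MIN_II, N) + MIN_II."""
    return max(min_ii, n_elements_per_partition) + min_ii


print(estimated_cycles(16))   # 128: small tiles are bounded by MIN_II
print(estimated_cycles(512))  # 576: large tiles scale with N
```

For large N the fixed MIN_II term is amortized. At N = 512, for instance, it accounts for 64 of the 576 estimated cycles.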

Parameters:
  • op – an activation function (see Supported Activation Functions for NKI ISA for supported functions)

  • data – the input tile; layout: (partition axis <= 128, free axis)

  • reduce_op – the reduce operation to perform on the free dimension of the activation result

  • reduce_res – a tile of shape (data.shape[0], 1), where data.shape[0] is the partition axis size of the input data tile. The result of the reduction (reduce_op applied along the free axes of the activation result) is written in place into this tile.

  • bias – a vector with the same partition axis size as data for broadcast add (after broadcast multiply with scale)

  • scale – a scalar or a vector with the same partition axis size as data for broadcast multiply

  • dtype – (optional) data type to cast the output to (see Supported Data Types for more information); if not specified, it defaults to the data type of the input tile.

  • mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

Returns:

output tile of the activation instruction; layout: same as input data tile
