This document is relevant for: Inf1
Neuron compiler CLI Reference Guide (neuron-cc
)#
This document describes the command line interface of the Neuron compiler. This reference is not relevant for applications that run neuron-cc from within a machine learning framework (TensorFlow-Neuron for example) since these options are passed from the framework directly to neuron-cc.
Using neuron-cc on the command line may be desirable for applications that do not use a framework, or customize existing frameworks. It is also possible to supply CLI commands to the framework as options to be passed through to the compiler.
Usage#
Optional parameters are shown in square brackets. See the individual framework guides for the correct syntax.
Neuron Compiler CLI
- neuron-cc [options] <command> [parameters]#
Common options for the Neuron CLI:
--verbose
(string) default=“WARN”:Valid values:
DEBUG
INFO
WARN
ERROR
Use neuron-cc <command> --help
for information on a specific command.
Available Commands:#
compile
list-operators
- neuron-cc compile [parameters]#
Compile a model for use on the AWS Inferentia Machine Learning Accelerator.
neuron-cc compile <file names> --framework <value> --io-config <value> [--neuroncore-pipeline-cores <value>] [--enable-saturate-infinity] [--enable-fast-loading-neuron-binaries] [--enable-fast-context-switch] [--fp32-cast cast-method] [--fast-math cast-method] [--output <value>]
Compile Parameters:
<file names>
: Input containing model specification. The number of arguments required varies between frameworks:TENSORFLOW: A local filename or URI of a TensorFlow Frozen GraphDef (.pb); or the name of a local directory containing a TensorFlow SavedModel.
See tensorflow/tensorflow for the associated .proto schema for TensorFlow Frozen GraphDefs. See https://www.tensorflow.org/guide/saved_model for more information on the SavedModel format.
MXNET: List of local filenames or URIs where input architecture .json file and parameter .param file are stored. These contains information related to the architecture of your graph and associated parameters, respectively.
--framework
(string): Framework in which the model was trained.Valid values:
TENSORFLOW
MXNET
XLA
--neuroncore-pipeline-cores
(int) (default=1): Number of neuron cores to be used in “NeuronCore Pipeline” mode. This is different from data parallel deployment (same model on multiple neuron cores). Refer to Runtime/Framework documentation for data parallel deployment options.Compile for the given number of neuron cores so as to leverage NeuronCore Pipeline mode.
Note
This is not used to define the number of Neuron Cores to be used in a data parallel deployment (ie the same model on multiple Neuron Cores). That is a runtime/framework configuration choice.
--output
(string) (default=“out.neff”): Filename where compilation output (NEFF archive) will be recorded.--io-config
(string): Configuration containing the names and shapes of input and output tensors.The io-config can be specified as a local filename, a URI, or a string containing the io-config itself.
The io-config must be formatted as a JSON object with two members “inputs” and “outputs”. “inputs” is an object mapping input tensor names to an array of shape and data type. “outputs” is an array of output tensor names. Consider the following example:
{ "inputs": { "input0:0": [[1,100,100,3], "float16"], "input1:0": [[1,100,100,3], "float16"] }, "outputs": ["output:0"] }
--enable-saturate-infinity
: Convert +/- infinity values to MAX/MIN_FLOAT for certain computations that have a high risk of generating Not-a-Number (NaN) values. There is a potential performance impact during model execution when this conversion is enabled.--enable-fast-loading-neuron-binaries
: Write the compilation output (NEFF archive) in uncompressed format which results in faster loading of the archive during inference.--enable-fast-context-switch
: Optimize for faster model switching rather than inference latency. This results in overall faster system performance when your application switches between models frequently on the same neuron core (or set of cores). The optimization triggered by this option for example defers loading some weight constants until the start of inference.--fast-math
: Controls tradeoff between performance and accuracy for fp32 operators. See more suggestions on how to use this option with the below arguments in Mixed precision and performance-accuracy tuning (neuron-cc).all
(Default): enables all optimizations that improve performance. This option can potentially lower precision/accuracy.none
: Disables all optimizations that improve performance. This option will provide best precision/accuracy.Tensor transpose options
fast-relayout
: Only enables fast relayout optimization to improve performance by using the matrix multiplier for tensor transpose. The data type used for the transpose is either FP16 or BF16, which is controlled by thefp32-cast-xxx
keyword.no-fast-relayout
: Disables fast relayout optimization which ensures that tensor transpose is bit-accurate (lossless) but slightly slower.
Casting options
fp32-cast-all
(Default): Cast all FP32 operators to BF16 to achieve highest performance and preserve dynamic range. Same as setting--fp32-cast all
.fp32-cast-all-fp16
: Cast all FP32 operators to FP16 to achieve speed up and increase precision versus BF16. Same setting as--fp32-cast all-fp16
.fp32-cast-matmult
: Only cast FP32 operators that use Neuron Matmult engine to BF16 while using FP16 for matmult-based transpose to get better accuracy. Same as setting--fp32-cast matmult
.fp32-cast-matmult-bf16
: Cast only FP32 operators that use Neuron Matmult engine (including matmult-based transpose) to BF16 to preserve dynamic range. Same as setting--fp32-cast matmult-bf16
.fp32-cast-matmult-fp16
: Cast only FP32 operators that use Neuron Matmult engine (including matmult-based transpose) to fp16 to better preserve precision. Same as setting--fp32-cast matmult-fp16
.
Important
all
andnone
are mutually exclusiveall
is equivalent to usingfp32-cast-all fast-relayout
(best performance)none
is equivalent to usingfp32-cast-matmult-bf16 no-fast-relayout
(best accuracy)fp32-cast-*
options are mutually exclusivefast-relayout
andno-fast-relayout
are mutually exclusiveThe
fp32-cast-*
and*-fast-relayout
options will overwrite the default behavior inall
andnone
.For backward compatibility, the
--fp32-cast
option has higher priority over--fast-math
. It will overwrite the FP32 casting options in any of the--fast-math
options if--fp32-cast
option is present explicitly.
--fp32-cast
: Refine the automatic casting of fp32 tensors. This is being replaced by a newer –fast-math.Important
--fp32-cast
option is being deprecated and--fast-math
will replace it in future releases.--fast-math
is introducing theno-fast-relayout
option to enable lossless transpose operation.
The
--fp32-cast
is an interface for controlling the performance and accuracy tradeoffs. Many of the--fast-math
values invoke (override) it.all
(default): Cast all FP32 operators to BF16 to achieve speed up and preserve dynamic range.matmult
: Cast only FP32 operators that use Neuron Matmult engine to BF16 while using fp16 for matmult-based transpose to get better accuracy.matmult-fp16
: Cast only FP32 operators that use Neuron Matmult engine (including matmult-based transpose) to fp16 to better preserve precision.matmult-bf16
: Cast only FP32 operators that use Neuron Matmult engine (including matmult-based transpose) to BF16 to preserve dynamic range.all-fp16
: Cast all FP32 operators to FP16 to achieve speed up and better preserve precision.
Log Levels:
Logs at levels “trace”, “debug”, and “info” will be written to STDOUT.
Logs at levels “warn”, “error”, and “fatal” will be written to STDERR.
Exit Status
0 - Compilation succeeded
>0 - An error occurred during compilation.
Examples
Compiling a saved TensorFlow model:
neuron-cc compile test_graph_tfmatmul.pb --framework TENSORFLOW --io-config test_graph_tfmatmul.config
Compiling a MXNet model:
neuron-cc compile lenet-symbol.json lenet-0001.params --framework MXNET --neuroncore-pipeline-cores 2 --output file.neff
Compiling an XLA HLO:
neuron-cc compile bert-model.hlo --framework XLA --output file.neff
- neuron-cc list-operators [parameters]#
Returns a newline (‘n’) separated list of operators supported by the NeuronCore.
TENSORFLOW: Operators will be formatted according to the value passed to the associated REGISTER_OP(“OperatorName”) macro.
See https://www.tensorflow.org/guide/create_op#define_the_op_interface for more information regarding operator registration in TensorFlow.
MXNET: Operator names will be formatted according to the value passed to the associated NNVM_REGISTER_OP(operator_name) macro.
XLA: Operator names will be formatted according to the value used by XLA compiler in XlaBuilder.
See https://www.tensorflow.org/xla/operation_semantics for more information regarding XLA operator semantics in XLA interface.
neuron-cc list-operators --framework <value>
--framework
(string): Framework in which the operators were registered.Valid values:
TENSORFLOW
MXNET
XLA
Exit Status
0 - Call succeeded
>0 - An error occurred
Example
$ neuron-cc list-operators --framework TENSORFLOW AddN AdjustContrastv2 CheckNumbers ...
This document is relevant for: Inf1