NxD Inference - Production Ready Models#
NxD Inference (NeuronX Distributed Inference) provides production-ready models that you can use directly for deployment. You can view the source code for all supported models in the NxD Inference GitHub repository.
Note
If you want to deploy a model that is not in the supported list, you can follow the model onboarding guide. You can also refer to the source code for supported models in the NxD Inference GitHub repository and adapt it for your use case.
Using Models to Run Inference#
You can run models through vLLM or integrate directly with NxD Inference.
Using vLLM#
If you are using vLLM for production deployment, we recommend that you use the vLLM API to integrate with NxD Inference. The vLLM API automatically chooses the correct model and config classes based on the model’s config file. For more information, refer to the vLLM User Guide for NxD Inference.
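As a minimal offline-inference sketch using vLLM's LLM entry point (the model path, tensor_parallel_size, and sequence limits below are illustrative assumptions; adjust them for your checkpoint and Neuron instance):

from vllm import LLM, SamplingParams

# Prompts to run as one offline batch.
prompts = [
    "The president of the United States is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# device="neuron" routes execution to the Neuron backend; the model path
# and parallelism settings here are placeholders, not requirements.
llm = LLM(
    model="/home/ubuntu/models/Llama-3.1-8B-Instruct",
    device="neuron",
    tensor_parallel_size=32,
    max_num_seqs=4,
    max_model_len=2048,
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)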
Integrating Directly with NxD Inference#
To use NxD Inference directly, you construct model and configuration classes. For more information about which model and configuration classes to use for each model, see Supported Model Architectures. To see an example of how to run inference directly with NxD Inference, see the generation_demo.py script.
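As a rough sketch of that flow for a Llama checkpoint, modeled on generation_demo.py (the module paths follow the repository layout, but treat the parameter values and file paths as illustrative assumptions to verify against the source):

from transformers import AutoTokenizer

from neuronx_distributed_inference.models.config import NeuronConfig
from neuronx_distributed_inference.models.llama.modeling_llama import (
    LlamaInferenceConfig,
    NeuronLlamaForCausalLM,
)
from neuronx_distributed_inference.utils.hf_adapter import (
    HuggingFaceGenerationAdapter,
    load_pretrained_config,
)

# Hypothetical paths; point these at your checkpoint and a writable directory.
model_path = "/home/ubuntu/models/Llama-3.1-8B-Instruct"
compiled_model_path = "/home/ubuntu/traced_model/"

# Neuron-specific settings (illustrative values).
neuron_config = NeuronConfig(tp_degree=32, batch_size=1, seq_len=1024)
config = LlamaInferenceConfig(
    load_pretrained_config(model_path),
    neuron_config=neuron_config,
)

# Compile the model for Neuron, then load it onto the device.
model = NeuronLlamaForCausalLM(model_path, config)
model.compile(compiled_model_path)
model.load(compiled_model_path)

# Generate through the HuggingFace-style adapter.
tokenizer = AutoTokenizer.from_pretrained(model_path)
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
generation_model = HuggingFaceGenerationAdapter(model)
outputs = generation_model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=64,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))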
Supported Model Architectures#
NxD Inference currently supports the following model architectures.
Llama (Text)#
NxD Inference supports Llama text models. The Llama model architecture supports all Llama text models, including Llama 2, Llama 3, Llama 3.1, Llama 3.2, and Llama 3.3. You can also use the Llama model architecture to run any model based on Llama, such as Mistral.
Neuron Classes#
Neuron config class: NeuronConfig
Inference config class: LlamaInferenceConfig
Causal LM model class: NeuronLlamaForCausalLM
Compatible Checkpoint Examples#
https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct (requires Trn2)
Llama (Multimodal)#
NxD Inference supports Llama 3.2 multimodal models. You can use HuggingFace checkpoints or the original Meta checkpoints. To use a Meta checkpoint, you must first convert it to the Neuron format. For more information about how to run Llama 3.2 multimodal inference, including how to convert the original Meta checkpoints to run on NxD Inference, see Tutorial: Deploying Llama3.2 Multimodal Models.
Neuron Classes#
Neuron config class: MultimodalVisionNeuronConfig
Inference config class: MllamaInferenceConfig
Causal LM model class: NeuronMllamaForCausalLM
Compatible Checkpoint Examples#
https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct
https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct
Mixtral#
NxD Inference supports models based on the Mixtral architecture, which is a mixture-of-experts (MoE) architecture.
Neuron Classes#
Neuron config class: MoENeuronConfig
Inference config class: MixtralInferenceConfig
Causal LM model class: NeuronMixtralForCausalLM
Compatible Checkpoint Examples#
https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1
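Construction follows the same pattern as the Llama sketch above, swapping in the MoE classes. A minimal sketch, assuming the module paths below match the repository layout and using illustrative parameter values:

from neuronx_distributed_inference.models.config import MoENeuronConfig
from neuronx_distributed_inference.models.mixtral.modeling_mixtral import (
    MixtralInferenceConfig,
    NeuronMixtralForCausalLM,
)
from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config

# Hypothetical local checkpoint path.
model_path = "/home/ubuntu/models/Mixtral-8x7B-Instruct-v0.1"

# MoENeuronConfig extends the base NeuronConfig with MoE-specific settings.
neuron_config = MoENeuronConfig(tp_degree=32, batch_size=1, seq_len=1024)
config = MixtralInferenceConfig(
    load_pretrained_config(model_path),
    neuron_config=neuron_config,
)
model = NeuronMixtralForCausalLM(model_path, config)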
DBRX#
NxD Inference supports models based on the DBRX model architecture, which is a mixture-of-experts (MoE) architecture.
Neuron Classes#
Neuron config class: DbrxNeuronConfig
Inference config class: DbrxInferenceConfig
Causal LM model class: NeuronDbrxForCausalLM