This document is relevant for: Trn1, Trn2
PyTorch Neuron (torch-neuronx) for Training Troubleshooting Guide#
This document shows common issues users may encounter while using PyTorch Neuron and provides guidance on how to resolve or work around them.
General Troubleshooting#
For setting up EFA, which is needed for multi-node training, please see How to prepare trn1.32xlarge for multi-node execution.
For XLA-related troubleshooting notes, see How to debug models in PyTorch Neuron and the PyTorch-XLA troubleshooting guide.
If your multi-worker training run is interrupted, you may need to kill all the python processes and reload the driver (WARNING: this kills all python processes on the system):
killall -9 python
killall -9 python3
sudo rmmod neuron; sudo modprobe neuron
To turn on RT debug:
os.environ["NEURON_RT_LOG_LEVEL"] = "INFO"
To turn on Neuron NCCL debug:
os.environ["NCCL_DEBUG"] = "WARN"
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"
If a process crashes during training, you can enable core dumps using the ulimit command:
ulimit -S -c unlimited
To see the type of signals that would cause core dumps, see https://www.man7.org/linux/man-pages/man7/signal.7.html.
Note that core dumps take a significant amount of storage, so make sure there is enough free disk space before enabling them.
On Ubuntu, if Apport is not running, the core dump file is by default named “core” and written to the current directory. To change the file location and name format, modify /proc/sys/kernel/core_pattern (see https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#core-pattern for pattern info). For example, to dump to /tmp with the executable filename and process ID:
echo '/tmp/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern
For containers, install the appropriate dependencies during docker build (“apt-get update && apt-get -y install build-essential gdb”) and start the container with --ulimit core=-1 to enable core dumps and -v /tmp/:/tmp/ to ensure core dumps written to /tmp are preserved when the container is stopped or deleted. Dependencies can also be installed after the container is started.
On Ubuntu, core dumps can also be handled by Apport, which is disabled by default. To enable Apport, run sudo service apport start. The /proc/sys/kernel/core_pattern is updated by the Apport service. After a crash, look in /var/log/apport.log for the core dump file name, which should be located in /var/lib/apport/coredump/.
Once you have the core dump, you can use gdb to debug further (for Python applications, <executable> is python or python3):
gdb <executable> <core file>
If a process (e.g. the XRT server) is killed due to out-of-memory on the host (i.e. you see Out of memory: Killed process <PID> in /var/log/syslog or in the output of dmesg), no core dump is generated. However, you can switch to kernel panic mode to trigger a core dump by setting /proc/sys/vm/panic_on_oom to a value of 1 on the host or from inside a container.
On the host, where you need sudo (this change will also be reflected inside the container):
echo 1 | sudo tee /proc/sys/vm/panic_on_oom
From inside a container, where sudo doesn’t work (this change will also be reflected on the host):
echo 1 > /proc/sys/vm/panic_on_oom
Possible Error Conditions#
Eager debug mode fails with “urllib3.exceptions.URLSchemeUnknown: Not supported URL scheme http+unix”#
When running with eager debug mode (NEURON_USE_EAGER_DEBUG_MODE=1) using torch-neuronx and neuronx-cc from releases 2.19.1 and 2.20, you may see the following error:
urllib3.exceptions.URLSchemeUnknown: Not supported URL scheme http+unix
This error is due to requests version >= 2.32. While neuronx-cc pins the requests package version to less than 2.32, installing other packages like transformers could bring in a newer version of requests. To work around this, you can pin requests to version 2.31.0 with the following command, which also pins urllib3 due to the related issue noted in the next section:
pip install requests==2.31.0 urllib3==1.26.20
Eager debug mode fails with “TypeError: HTTPConnection.request() got an unexpected keyword argument ‘chunked’”#
When running with eager debug mode (NEURON_USE_EAGER_DEBUG_MODE=1) using torch-neuronx and neuronx-cc from releases 2.19.1 and 2.20, you may see the following error:
TypeError: HTTPConnection.request() got an unexpected keyword argument 'chunked'
This error is due to urllib3 version >= 2.*, which can be pulled in as a dependency of requests < 2.32. To work around this, you can pin urllib3 to version 1.26.20 with the following command (which also pins requests due to the related issue noted in the previous section):
pip install requests==2.31.0 urllib3==1.26.20
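After applying the pins above, a quick sanity check (a minimal sketch; both packages expose a standard __version__ attribute) can confirm the installed versions:
import requests
import urllib3

# Expect the pinned versions: requests 2.31.0 and urllib3 1.26.20
print("requests:", requests.__version__)
print("urllib3:", urllib3.__version__)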
Non-Fatal Error OpKernel (‘op: “TPU*” device_type: “CPU”’)#
During execution using PyTorch Neuron, you may see these non-fatal error messages:
E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
They don’t affect the operation of PyTorch Neuron and can be ignored.
XLA runtime error: “Invalid argument: Cannot assign a device for operation”#
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:490 : Check failed: session->session()->Run(session_work->feed_inputs, session_work->outputs_handles, &outputs) == ::tensorflow::Status::OK() (INVALID_ARGUMENT: Cannot assign a device for operation XRTAllocateFromTensor: {{node XRTAllocateFromTensor}} was explicitly assigned to /job:localservice/replica:0/task:0/device:TPU:0 but available devices are [ /job:localservice/replica:0/task:0/device:CPU:0, /job:localservice/replica:0/task:0/device:TPU_SYSTEM:0, /job:localservice/replica:0/task:0/device:XLA_CPU:0 ]. Make sure the device specification refers to a valid device.
[[XRTAllocateFromTensor]] vs. OK)
*** Begin stack trace ***
tensorflow::CurrentStackTrace()
xla::util::MultiWait::Complete(std::function<void ()> const&)
clone
*** End stack trace ***
The above error indicates that the framework was not able to initialize the Neuron runtime. If you get the above error, check the following:
No other process is holding the NeuronCores. If one is, you may have to kill that process.
If no process is running, try reloading the driver using
sudo rmmod neuron; sudo modprobe neuron
Error: “Could not start gRPC server”#
If you get a “Could not start gRPC server” error, please check if there are any leftover python processes from a previous interrupted run and terminate them before restarting the run.
E0207 17:22:12.592127280 30834 server_chttp2.cc:40] {"created":"@1644254532.592081429","description":"No address added out of total 1 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":395,"referenced_errors":[{"created":"@1644254532.592078907","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":342,"referenced_errors":[{"created":"@1644254532.592072626","description":"Unable to configure socket","fd":10,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1644254532.592068939","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":189,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1644254532.592078512","description":"Unable to configure socket","fd":10,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1644254532.592077123","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":189,"os_error":"Address already in use","syscall":"bind"}]}]}]}
2022-02-07 17:22:12.592170: E tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:545] Unknown: Could not start gRPC server
Failed compilation result in the cache#
All compilation results are by default saved in the Neuron Persistent Cache. If the Neuron Compiler fails to compile a graph, we save the failed result in the cache. The reason for doing so is that if the user tries to run the same script again, we want them to error out early rather than wait for the compilation to progress and see an error at a later stage. However, there are certain cases in which a failed compilation may be due to environment issues. One possible reason for failure is that the compilation process ran out of memory. This can happen if you are running multiple processes in parallel such that not enough memory is available for compilation of the graph. Failures due to such reasons can be easily mitigated by re-running the compilation. If you want to retry a failed compilation, you can do so by passing --retry_failed_compilation as follows:
os.environ['NEURON_CC_FLAGS'] = os.environ.get('NEURON_CC_FLAGS', '') + ' --retry_failed_compilation'
This would retry the compilation and would replace a failed result in the cache with a successful compilation result.
Compilation errors when placing NeuronCache home directory on NFS/EFS/FSx mounted drive#
Currently, the NeuronCache default root directory is /var/tmp, which is local to the instance you are running on. You can modify the location of the NeuronCache root directory using NEURON_CC_FLAGS='--cache_dir=<root dir>'. However, when the NeuronCache directory is placed in a directory that is part of an NFS-mounted drive shared among multiple instances, you may encounter file errors such as file not found, file corruption, or KeyError when running multi-instance training:
KeyError: 'neff_cache2/neuron-compile-cache/USER_neuroncc-1.0.48875.0+7437fbf18/MODULE_7223055628515330524/MODULE_0_SyncTensorsGraph.14_7223055628515330524_compute1-dy-training-2-1-e859998e-3035-5df63dab5ce63'
This is a result of limitations of file locking on NFS; EFS and FSx exhibit similar limitations. The workaround is to set up a separate NeuronCache root directory for each worker instance, such as NEURON_CC_FLAGS="--cache_dir=$HOME/neuron_cache/bert/`hostname`", where the home directory is shared among worker instances as in ParallelCluster.
Consider the use case of a ParallelCluster with SLURM cluster management. The home directory of the head node is shared via NFS with worker instances. Also, SLURM terminates idle worker instances when the cluster is configured as a dynamic auto-scaling cluster, and the default cache in a terminated worker instance’s /var/tmp is deleted. So to persist the cache across runs separated by a cluster idle period, we use the workaround above to create separate NeuronCache root directories for each worker instance. For example, see the BERT ParallelCluster script.
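As a sketch of the same per-instance cache workaround from inside a Python training script (the cache path below is only an example; adjust it to your shared filesystem layout), the per-host cache directory can be appended to NEURON_CC_FLAGS before the first graph is compiled:
import os
import socket

# Example only: one NeuronCache root per worker host, under a home directory shared via NFS
cache_dir = os.path.expanduser(f"~/neuron_cache/bert/{socket.gethostname()}")
os.makedirs(cache_dir, exist_ok=True)
os.environ["NEURON_CC_FLAGS"] = os.environ.get("NEURON_CC_FLAGS", "") + f" --cache_dir={cache_dir}"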
Compilation error: “Expect ap datatype to be of type float32 float16 bfloat16 uint8”#
If an XLA example fails to run because of failed compilation and one of the error messages is “Expect ap datatype to be of type float32 float16 bfloat16 uint8”, then please set the environment variable XLA_USE_32BIT_LONG=1 in your script:
os.environ['XLA_USE_32BIT_LONG'] = '1'
11/18/2021 04:51:25 PM WARNING 34567 [StaticProfiler]: matmul-based transposes inserted by penguin takes up 93.66 percent of all matmul computation
terminate called after throwing an instance of 'std::runtime_error'
what(): === BIR verification failed ===
Reason: Expect ap datatype to be of type float32 float16 bfloat16 uint8
Instruction: I-545-0
Opcode: Matmult
Input index: 0
Argument AP:
Access Pattern: [[1,8],[1,1],[1,1]]
Offset: 0
Memory Location: {compare.85-t604_i0}@SB<0,0>(8x2)#Internal DebugInfo: <compare.85||uint16||UNDEF||[8, 1, 1]>
NeuronCore(s) not available - Requested:1 Available:0#
When you see “NeuronCore(s) not available”, please terminate processes that may be holding the NeuronCores and terminate any neuron-top sessions that are running. Also check if someone else is using the system. Then do “sudo rmmod neuron; sudo modprobe neuron” to reload the driver.
2021-Nov-15 15:21:28.0231 7245:7245 ERROR NRT:nrt_allocate_neuron_cores NeuronCore(s) not available - Requested:nc1-nc1 Available:0
2021-11-15 15:21:28.231864: F ./tensorflow/compiler/xla/service/neuron/neuron_runtime.h:1037] Check failed: status == NRT_SUCCESS NEURONPOC : nrt_init failed. Status = 1
Often when you run multi-worker training, there can be many python processes left over after a run is interrupted. To kill all python processes, run the following (WARNING: this kills all python processes on the system), then reload the driver:
killall -9 python
killall -9 python3
sudo rmmod neuron; sudo modprobe neuron
TDRV error “TDRV:exec_consume_infer_status_notification”#
If you see the TDRV error “TDRV:exec_consume_infer_status_notification”, try reloading the driver using sudo modprobe -r neuron; sudo modprobe neuron.
2022-Mar-10 18:51:19.07392022-Mar-10 18:51:19.0739 17821:17931 ERROR TDRV:exec_consume_infer_status_notifications 17822:18046 ERROR TDRV:exec_consume_infer_status_notifications Unexpected number of CC notifications: mod->cc_op_count=1, cc_start_cnt=0, cc_end_cnt=0Unexpected number of CC notifications: mod->cc_op_count=1, cc_start_cnt=0, cc_end_cnt=0
2022-Mar-10 18:51:19.07392022-Mar-10 18:51:19.0739 17821:17931 ERROR TDRV:exec_consume_infer_status_notifications 17822:18046 ERROR TDRV:exec_consume_infer_status_notifications (NON-FATAL, Ignoring) inference timeout (180000 ms) on Neuron Device 0 NC 0, waiting for cc status notifications.
(NON-FATAL, Ignoring) inference timeout (180000 ms) on Neuron Device 0 NC 1, waiting for cc status notifications.
TDRV error “TDRV:tdrv_one_tmpbuf_reserve Number of ONE TMPBUF pages requested exceeded the max number of pages allowed (requested: <N>, max allowed: 16).”#
If you see the TDRV error “TDRV:tdrv_one_tmpbuf_reserve Number of ONE TMPBUF pages requested exceeded the max number of pages allowed (requested: <N>, max allowed: 16)”, it may be due to model tensors requiring more device memory than is available. A solution is to try training with a smaller data batch size.
ERROR TDRV:tdrv_one_tmpbuf_reserve Number of ONE TMPBUF pages requested exceeded the max number of pages allowed (requested: 28, max allowed: 16).
ERROR TDRV:copy_and_stage_mr Failed to reserve one tmpbuf memory
ERROR TDRV:kbl_model_add copy_and_stage_mr() error
W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1669183391.155135683","description":"Error received from peer ipv4:172.31.58.24:43941","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
Could not open the ndX, close device failed, TDRV not initialized#
If you see error messages stating “Could not open the ndX” (where X is an integer from 0 to 15), please run neuron-ls and ensure that you are able to see all 16 Neuron devices in the output. If one or more devices are missing, please report the issue to aws-neuron-support@amazon.com with the instance ID and a screen capture of the neuron-ls output.
2021-Nov-11 15:33:20.0161 7912:7912 ERROR TDRV:tdrv_init_mla_phase1 Could not open the nd0
2021-Nov-11 15:33:20.0161 7912:7912 ERROR TDRV:tdrv_destroy_one_mla close device failed
2021-Nov-11 15:33:20.0161 7912:7912 ERROR TDRV:tdrv_destroy TDRV not initialized
2021-Nov-11 15:33:20.0161 7912:7912 ERROR NRT:nrt_init Failed to initialize devices, error:1
2021-11-11 15:33:20.161331: F ./tensorflow/compiler/xla/service/neuron/neuron_runtime.h:1033] Check failed: status == NRT_SUCCESS NEURONPOC : nrt_init failed. Status = 1
Multiworker execution hangs during NCCL init#
When your multi-worker execution hangs during NCCL init, you can try reserving the port used by the environment variable NEURON_RT_ROOT_COMM_ID (here we use host:port localhost:48620 as an example, but you can use any free port and the root node’s host IP):
sudo sysctl -w net.ipv4.ip_local_reserved_ports=48620
Then set the environment variable NEURON_RT_ROOT_COMM_ID in your script:
os.environ["NEURON_RT_ROOT_COMM_ID"] = "localhost:48620"
NRT init error “One or more engines are running. Please restart device by reloading driver”#
If you see an error stating “One or more engines are running. Please restart device by reloading driver”, please follow the instruction and reload the driver using sudo modprobe -r neuron; sudo modprobe neuron.
2021-Nov-15 20:23:27.0280 3793:3793 ERROR TDRV:tpb_eng_init_hals_v2 CRITICAL HW ERROR: One or more engines are running. Please restart device by reloading driver:
sudo modprobe -r neuron; sudo modprobe neuron;
2021-Nov-15 20:23:27.0280 3793:3793 ERROR TDRV:tdrv_init_one_mla_phase2 nd0 nc0 HAL init failed. error:1
NRT error “ERROR TDRV:kbl_model_add Attempting to load an incompatible model!”#
If you see the NRT error “ERROR TDRV:kbl_model_add Attempting to load an incompatible model!”, this means that the neuronx-cc compiler used to compile the model is too old. See the installation instructions to update to the latest compiler.
NRT error “ERROR HAL:aws_hal_sprot_config_remap_entry SPROT remap destination address must be aligned size”#
If you see an NRT error “ERROR HAL:aws_hal_sprot_config_remap_entry SPROT remap destination address must be aligned size”, please check the kernel version and upgrade it to the distribution’s latest kernel.
For example, on Ubuntu 18.04.6 LTS, kernel version 4.15.0-66-generic is known to cause this error when running the MLP tutorial. This is due to a known bug in the kernel in aligned memory allocation. To fix this issue, please upgrade your kernel to the latest version (e.g. 4.15.0-171-generic):
uname -a
sudo apt-get update
sudo apt-get upgrade
sudo apt-get dist-upgrade
Please reboot after the upgrade. Use “uname -a” to check the kernel version again after the reboot.
NCCL warning : “NCCL WARN Timeout waiting for RX (waited 120 sec) - retrying”#
When running multi-worker training, if a graph has a collective communication operator like an all_reduce, it requires all the workers involved in the collective communication to load the graph in the runtime at approximately the same time. If any of the workers doesn’t load the graph within a 120 sec window from the first model load by any worker, you will see warnings like NCCL WARN Timeout waiting for RX (waited 120 sec) - retrying. When you see such warnings, check for the following in the log messages:
1. One of the workers is compiling a graph: In multi-worker training, there is a chance that each worker builds a slightly different graph. This results in a cache miss and can trigger compilation. Since compilations during a training run are serialized, the first worker can compile and load the graph with collective communication. It would then wait 120 secs for the other workers to join. If they don’t show up because they are compiling their own graphs, the first worker starts throwing the warning message above. The warning in this case is non-fatal and goes away once all workers have compiled their respective graphs and loaded them. To identify this scenario, look for No candidate found under .... logs around the warning. You should also see ..... which indicates compilation is in progress.
2. Server on one of the nodes crashed: In distributed training across multiple nodes, if the server on one node crashed, the workers on the other nodes would keep waiting on model load and you would see the above timeout logs on those nodes. To identify whether the server crashed, check if you see the following error on any of the nodes:
`RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1664146011.016500243","description":"Error received from peer ipv4:10.1.24.109:37379","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC`
If you see the above error, then it means there is a server crash and you need to cancel the training run.
Runtime errors “Missing infer_status notification” followed by “inference timeout”#
If you get a timeout error like below:
ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:4)
ERROR TDRV:exec_consume_infer_status_notifications (FATAL-RT-UNDEFINED-STATE) inference timeout (600000 ms) on Neuron Device 4 NC 1, waiting for execution completion notification
It may be due to a long graph execution time causing synchronization delays exceeding the default timeout. Please try increasing the timeout to a larger value using NEURON_RT_EXEC_TIMEOUT (unit in seconds) and see if the problem is resolved.
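For example (the value below is only an illustration; choose a timeout large enough to cover your longest graph execution), the timeout can be raised in the training script before the runtime is initialized:
import os

# Example only: raise the execution timeout to 1200 seconds (NEURON_RT_EXEC_TIMEOUT is in seconds)
os.environ["NEURON_RT_EXEC_TIMEOUT"] = "1200"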
Protobuf Error “TypeError: Descriptors cannot not be created directly.”#
If you install torch-neuronx after neuronx-cc, you may get the Protobuf error “TypeError: Descriptors cannot not be created directly.”. To fix this, please reinstall neuronx-cc using “pip install --force-reinstall neuronx-cc”.
Traceback (most recent call last):
File "./run_glue.py", line 570, in <module>
main()
File "./run_glue.py", line 478, in main
data_collator=data_collator,
File "/home/ec2-user/aws_neuron_venv_pytorch_p37_exp/lib64/python3.7/site-packages/transformers/trainer.py", line 399, in __init__
callbacks, self.model, self.tokenizer, self.optimizer, self.lr_scheduler
File "/home/ec2-user/aws_neuron_venv_pytorch_p37_exp/lib64/python3.7/site-packages/transformers/trainer_callback.py", line 292, in __init__
self.add_callback(cb)
File "/home/ec2-user/aws_neuron_venv_pytorch_p37_exp/lib64/python3.7/site-packages/transformers/trainer_callback.py", line 309, in add_callback
cb = callback() if isinstance(callback, type) else callback
File "/home/ec2-user/aws_neuron_venv_pytorch_p37_exp/lib64/python3.7/site-packages/transformers/integrations.py", line 390, in __init__
from torch.utils.tensorboard import SummaryWriter # noqa: F401
File "/home/ec2-user/aws_neuron_venv_pytorch_p37_exp/lib64/python3.7/site-packages/torch/utils/tensorboard/__init__.py", line 10, in <module>
from .writer import FileWriter, SummaryWriter # noqa: F401
File "/home/ec2-user/aws_neuron_venv_pytorch_p37_exp/lib64/python3.7/site-packages/torch/utils/tensorboard/writer.py", line 9, in <module>
from tensorboard.compat.proto.event_pb2 import SessionLog
File "/home/ec2-user/aws_neuron_venv_pytorch_p37_exp/lib64/python3.7/site-packages/tensorboard/compat/proto/event_pb2.py", line 17, in <module>
from tensorboard.compat.proto import summary_pb2 as tensorboard_dot_compat_dot_proto_dot_summary__pb2
File "/home/ec2-user/aws_neuron_venv_pytorch_p37_exp/lib64/python3.7/site-packages/tensorboard/compat/proto/summary_pb2.py", line 17, in <module>
from tensorboard.compat.proto import tensor_pb2 as tensorboard_dot_compat_dot_proto_dot_tensor__pb2
File "/home/ec2-user/aws_neuron_venv_pytorch_p37_exp/lib64/python3.7/site-packages/tensorboard/compat/proto/tensor_pb2.py", line 16, in <module>
from tensorboard.compat.proto import resource_handle_pb2 as tensorboard_dot_compat_dot_proto_dot_resource__handle__pb2
File "/home/ec2-user/aws_neuron_venv_pytorch_p37_exp/lib64/python3.7/site-packages/tensorboard/compat/proto/resource_handle_pb2.py", line 16, in <module>
from tensorboard.compat.proto import tensor_shape_pb2 as tensorboard_dot_compat_dot_proto_dot_tensor__shape__pb2
File "/home/ec2-user/aws_neuron_venv_pytorch_p37_exp/lib64/python3.7/site-packages/tensorboard/compat/proto/tensor_shape_pb2.py", line 42, in <module>
serialized_options=None, file=DESCRIPTOR),
File "/home/ec2-user/aws_neuron_venv_pytorch_p37_exp/lib64/python3.7/site-packages/google/protobuf/descriptor.py", line 560, in __new__
_message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
1. Downgrade the protobuf package to 3.20.x or lower.
2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
TDRV error “Timestamp program stop timeout”#
If you see the TDRV error “Timestamp program stop timeout”, e.g. when rerunning a training script after it was interrupted, first try reloading the driver using sudo modprobe -r neuron; sudo modprobe neuron (make sure neuron-top and/or neuron-monitor are not running).
2022-Aug-31 04:59:21.0546 117717:117717 ERROR TDRV:tsync_wait_eng_stop nd0 nc0 Timestamp program stop timeout (1000 ms)
2022-Aug-31 04:59:21.0546 117717:117717 ERROR TDRV:tsync_wait_nc_stop nd0 nc0 Error while waiting for timestamp program to end on TPB eng 0
2022-Aug-31 04:59:21.0546 117717:117717 ERROR TDRV:tsync_timestamps_finish nd0 nc0 Failed to stop neuron core
2022-Aug-31 04:59:21.0546 117717:117717 ERROR TDRV:tdrv_tsync_timestamps nd0 nc0 Failed to end timestamp sync programs
2022-Aug-31 04:59:22.0768 117717:117717 ERROR TDRV:tdrv_destroy TDRV not initialized
2022-Aug-31 04:59:22.0768 117717:117717 ERROR NRT:nrt_init Failed to initialize devices, error:5
Compiler error “module ‘numpy’ has no attribute ‘asscalar’”#
When you have a newer version of numpy in the Python environment, compilations may fail with the error “module ‘numpy’ has no attribute ‘asscalar’”. Please note that neuronx-cc has the following dependency on numpy: “numpy<=1.20.0,>=1.13.3”. To work around this error, please run “pip install --force-reinstall neuronx-cc” to reinstall neuronx-cc with the proper dependencies.
ERROR 227874 [neuronx-cc]: ***************************************************************
ERROR 227874 [neuronx-cc]: An Internal Compiler Error has occurred
ERROR 227874 [neuronx-cc]: ***************************************************************
ERROR 227874 [neuronx-cc]:
ERROR 227874 [neuronx-cc]: Error message: module 'numpy' has no attribute 'asscalar'
ERROR 227874 [neuronx-cc]:
ERROR 227874 [neuronx-cc]: Error class: AttributeError
ERROR 227874 [neuronx-cc]: Error location: Unknown
ERROR 227874 [neuronx-cc]: Version information:
ERROR 227874 [neuronx-cc]: NeuronX Compiler version 2.1.0.76+2909d26a2
ERROR 227874 [neuronx-cc]:
ERROR 227874 [neuronx-cc]: HWM version 2.1.0.7-64eaede08
ERROR 227874 [neuronx-cc]: NEFF version Dynamic
ERROR 227874 [neuronx-cc]: TVM not available
ERROR 227874 [neuronx-cc]: NumPy version 1.23.3
ERROR 227874 [neuronx-cc]: MXNet not available
ERROR 227874 [neuronx-cc]:
Import errors ‘generic_type: type “IrValue” is already registered!’ or ‘generic_type: type “XlaBuilder” is already registered!’#
When you encounter a PyTorch import error ‘import _XLAC … generic_type: type “IrValue” is already registered!’ or ‘import _XLAC … generic_type: type “XlaBuilder” is already registered!’, please check that TensorFlow and/or JAX are not installed in the Python environment. If they are installed, please uninstall them.
Import error “import _XLAC ImportError: <>/site-packages/_XLAC.cpython-38-x86_64-linux-gnu.so: undefined symbol”#
When you encounter a PyTorch import error “import _XLAC ImportError: <>/site-packages/_XLAC.cpython-38-x86_64-linux-gnu.so: undefined symbol” during execution, please check:
TensorFlow and/or JAX are not installed in the Python environment. If they are installed, please uninstall them.
The installed PyTorch (torch) package major/minor versions match the installed torch-neuronx package’s major/minor versions (e.g. 1.11). If they don’t match, please install the version of PyTorch that matches torch-neuronx.
Traceback (most recent call last):
File "/opt/ml/mlp_train.py", line 11, in <module>
import torch_xla.core.xla_model as xm
File "/usr/local/lib/python3.8/site-packages/torch_xla/__init__.py", line 117, in <module>
import _XLAC
ImportError: /usr/local/lib/python3.8/site-packages/_XLAC.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZNK3c1010TensorImpl7stridesEv
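To check the version match described above, a minimal sketch like the following can help (assuming the pip distribution names torch and torch-neuronx; comparing the leading major/minor components is an assumption based on the guidance above):
import importlib.metadata as md

torch_version = md.version("torch")
neuronx_version = md.version("torch-neuronx")
print("torch:", torch_version, "| torch-neuronx:", neuronx_version)

# Compare major.minor (e.g. both should report 1.11 for a matching pair)
if torch_version.split(".")[:2] != neuronx_version.split(".")[:2]:
    print("WARNING: torch and torch-neuronx major/minor versions do not match")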
NaNs seen with transformers version >= 4.21.0 when running HF BERT fine-tuning or pretraining with XLA_USE_BF16=1 or XLA_DOWNCAST_BF16=1#
When running the HuggingFace BERT (any size) fine-tuning tutorial or pretraining tutorial with transformers version >= 4.21.0 and using XLA_USE_BF16=1 or XLA_DOWNCAST_BF16=1, you will see NaNs in the loss immediately at the first step. More details on the issue can be found at pytorch/xla#4152. The workaround is to use transformers version 4.20.0 or earlier (the tutorials currently recommend version 4.15.0) or to add transformers.modeling_utils.get_parameter_dtype = lambda x: torch.bfloat16 to the Python script.
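A minimal sketch of the monkeypatch workaround above (placing it near the top of the script, before the model is created, is an assumption):
import torch
import transformers.modeling_utils

# Workaround for NaNs with transformers >= 4.21.0 under XLA_USE_BF16=1 / XLA_DOWNCAST_BF16=1
transformers.modeling_utils.get_parameter_dtype = lambda x: torch.bfloat16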
Network Connectivity Issue on trn1/trn1n 32xlarge with Ubuntu#
Description
Ubuntu distributions have network connectivity issues when multiple interfaces are connected to the same subnet. trn1/trn1n 32xlarge comes with 8/16 network interfaces. (To launch trn1/trn1n with 8/16 interfaces, please follow here.)
AWS publishes a package that installs a helper service to address the issue. This service runs at startup, creates the appropriate netplan files, updates the netplan and the instance networking, and then terminates.
Note that the following fix is only required on instances launched using generic Ubuntu AMIs. Neuron AMIs and instances launched via ParallelCluster do not require the fix.
Patch to fix networking on a multi-interface instance
wget -O /tmp/aws-ubuntu-eni-helper.deb 'https://github.com/aws-samples/aws-efa-nccl-baseami-pipeline/blob/master/nvidia-efa-ami_base/networking/aws-ubuntu-eni-helper_0.3-1_all.deb?raw=true'
sudo apt install /tmp/aws-ubuntu-eni-helper.deb -y
sudo systemctl enable aws-ubuntu-eni-helper.service
sudo systemctl start aws-ubuntu-eni-helper.service
How to apply the patch?
The following steps can be followed to resolve this issue:
Launch trn1.32xl from the AWS console (it starts with a single interface and does not suffer from the multi-interface issue)
Apply the patch on this newly launched single-interface instance
Create a new AMI from this instance
Launch an 8- or 16-interface instance using that AMI.
Note
The patch installs and enables the service but does not run it. This is intentional. The service will run at startup when the AMI is used to launch a multi-interface instance.
FAQs
Note
The Neuron DLAMI has the patch installed; users are encouraged to launch instances using the DLAMI, which does not require any fix. Please refer to the Set Up Guide to learn how to launch an instance using the DLAMI.
“Too many open files” when running training job#
When running large model training with several workers, you may see errors like the following:
2023-Jun-14 19:05:29.0312 4112959:4113326 [23] bootstrap.cc:106 CCOM WARN Call to accept failed : Too many open files
2023-Jun-14 19:05:29.0312 4112959:4113263 [14] include/socket.h:438 CCOM WARN Net : Socket creation failed : Too many open files
2023-Jun-14 19:05:29.0312 4112959:4113326 ERROR ENC:ncclBootstrapRecv failed neuronBootstrapRecv request to NCCL
2023-Jun-14 19:05:29.0312 4112959:4113249 [12] bootstrap.cc:106 CCOM WARN Call to accept failed : Too many open files
2023-Jun-14 19:05:29.0312 4112959:4113263 ERROR ENC:ncclBootstrapSend failed neuronBootstrapSend request to NCCL2023-Jun-14 19:05:29.03122023-Jun-14 19:05:29.0312 4112959:4113270 [15] bootstrap.cc:106 CCOM WARN Call to accept failed : Too many open files
This can happen when the default OS open-file limits are low. The hard and soft limits can be set using the following commands or by manually opening the files and setting the limits.
sudo sed -i 'H;1h;$!d;x;/hard *nofile/!s/$/\n* hard nofile 65536/' /etc/security/limits.conf
sudo sed -i 'H;1h;$!d;x;/soft *nofile/!s/$/\n* soft nofile 65536/' /etc/security/limits.conf
sudo sed -i 's/^#*\(\*\|\s*\*\)\s*soft\s*nofile\s*[0-9]\+$/\1 soft nofile 65536/' /etc/security/limits.conf
sudo sed -i 's/^#*\(\*\|\s*\*\)\s*hard\s*nofile\s*[0-9]\+$/\1 hard nofile 65536/' /etc/security/limits.conf
sudo sed -i 's/^#*\(\*\|\s*\*\)\s*soft\s*nofile\s*[0-9]\+$/\1 soft nofile 65536/' /etc/security/limits.d/01_efa.conf || true
sudo sed -i 's/^#*\(\*\|\s*\*\)\s*hard\s*nofile\s*[0-9]\+$/\1 hard nofile 65536/' /etc/security/limits.d/01_efa.conf || true
The 01_efa.conf file is created as part of the EFA installation and needs to be updated. If the EFA driver is not installed, the file 01_efa.conf doesn’t exist and the corresponding sed commands will fail with No such file or directory. If there are other files under limits.d with file limits, they need to be updated as well.
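As a hedged sanity check from within the training process (a sketch; the soft limit can only be raised up to the configured hard limit without root), the current file-descriptor limits can be inspected and bumped at startup:
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limits: soft={soft}, hard={hard}")

# Raise the soft limit to the hard limit if it is lower
if soft < hard:
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))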
“undefined symbol”#
To maintain compatibility with the packages vended publicly on PyPI, AWS Neuron Python packages contain binary extensions that are compiled with the pre-2011 libstdc++ application binary interface (ABI). If a custom version of a package - such as torch - is compiled using a modern compiler, it can result in “undefined symbol” errors due to mismatches between that package and the AWS Neuron package.
To support this situation, we provide alternative versions of AWS Neuron packages that are compiled according to the newer 2011 ABI. For information on how to use these packages, see Install with support for cxx11 ABI.
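To check which ABI an installed torch build uses before choosing between the package variants (a sketch; torch.compiled_with_cxx11_abi() is a standard PyTorch utility):
import torch

# True means torch was built with the C++11 ABI; False means the pre-2011 ABI
print("cxx11 ABI:", torch.compiled_with_cxx11_abi())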
This document is relevant for: Trn1, Trn2