Replacing unhealthy nodes with Amazon EMR
Amazon EMR periodically uses the NodeManager health checker service in Apache Hadoop to monitor the statuses of core nodes in your
Amazon EMR on Amazon EC2 clusters. If a node is not functioning optimally, the health checker reports that node
to the Amazon EMR controller. The Amazon EMR controller adds the node to a denylist,
preventing the node from receiving new YARN applications until the status of the node improves.
A common reason for a node to become unhealthy is excessive disk utilization.
For more information about identifying unhealthy nodes and recovery, see
Resource errors.
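As a rough illustration of how unhealthy nodes can be spotted, the sketch below filters the node list returned by the YARN ResourceManager REST API (`/ws/v1/cluster/nodes`) for nodes in the `UNHEALTHY` state. It assumes the ResourceManager web service is reachable on the primary node at port 8088; the helper names and the sample payload are hypothetical and hand-written for illustration, not taken from a real cluster.

```python
import json
import urllib.request


def unhealthy_nodes(nodes_response):
    """Return (node id, health report) pairs for nodes YARN marks UNHEALTHY."""
    # The ResourceManager wraps the list as {"nodes": {"node": [...]}}.
    # "nodes" can be null when there is nothing to list, hence the guards.
    nodes = (nodes_response.get("nodes") or {}).get("node") or []
    return [
        (n.get("id"), n.get("healthReport", ""))
        for n in nodes
        if n.get("state") == "UNHEALTHY"
    ]


def fetch_nodes(rm_host, port=8088, timeout=10):
    """Fetch the node list from the ResourceManager (assumed reachable)."""
    url = f"http://{rm_host}:{port}/ws/v1/cluster/nodes"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Hand-written sample in the shape the API returns, usable offline.
sample = {
    "nodes": {
        "node": [
            {"id": "ip-10-0-0-11.ec2.internal:8041", "state": "RUNNING",
             "healthReport": ""},
            {"id": "ip-10-0-0-12.ec2.internal:8041", "state": "UNHEALTHY",
             "healthReport": "local-dirs usable space is below configured "
                             "utilization percentage"},
        ]
    }
}

for node_id, report in unhealthy_nodes(sample):
    print(node_id, "->", report)
```

On a live cluster, `unhealthy_nodes(fetch_nodes("<primary-node-dns>"))` would apply the same filter to real data, provided the port is open to the caller.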
You can choose whether Amazon EMR should terminate unhealthy nodes or keep them in the cluster.
If you turn off unhealthy node replacement, the unhealthy nodes stay in the
denylist and continue to count towards cluster capacity. You can still connect to your Amazon EC2 core instance
for configuration and recovery, so you can resize your cluster to add capacity. Note that
Amazon EMR will replace unhealthy nodes even if termination protection is on.
If unhealthy node replacement is on, Amazon EMR will terminate the unhealthy core node and provision a new instance
based on the number of instances in the instance group or the target capacity for instance fleets. If multiple or
all core nodes are unhealthy for more than 45 minutes, Amazon EMR will gracefully replace the nodes.
To avoid the possibility of permanently losing HDFS data as Amazon EMR gracefully replaces an unhealthy core instance, we recommend that you
always back up your data.
Amazon EMR publishes Amazon CloudWatch Events for unhealthy node replacement, so you can keep track of what's happening
with your unhealthy core instances. For more information, see
unhealthy node replacement events.
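One possible way to receive these events is an Amazon EventBridge (CloudWatch Events) rule. The sketch below builds arguments for the boto3 `put_rule` call with a deliberately broad pattern that matches every event whose source is `aws.emr`. The exact detail type of the unhealthy node replacement events is not stated here, so the pattern does not filter on it; narrowing the pattern is left as an assumption to verify against the events documentation. The rule name and helper names are hypothetical.

```python
import json


def emr_event_rule_args(rule_name):
    """Build put_rule arguments that match all events emitted by Amazon EMR."""
    # Broad on purpose: only the event source is matched, nothing more specific.
    pattern = {"source": ["aws.emr"]}
    return {
        "Name": rule_name,
        "EventPattern": json.dumps(pattern),
        "State": "ENABLED",
    }


def create_rule(rule_name):
    """Create the rule. Requires boto3 and AWS credentials (not run here)."""
    import boto3  # third-party; imported lazily so the builder works without it

    events = boto3.client("events")
    return events.put_rule(**emr_event_rule_args(rule_name))


print(emr_event_rule_args("emr-node-events"))
```

A rule does nothing until it has a target, so a real setup would also attach one (for example, an Amazon SNS topic) with `put_targets`.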
Default node replacement and termination protection settings
Unhealthy node replacement is available for all Amazon EMR releases, but the default settings depend on the
release label you choose. You can change any of these settings by configuring unhealthy node replacement
when creating a new cluster or by going to cluster configuration at any time.
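Because the defaults vary, it can help to check what a given cluster is actually using. The following sketch reads the `UnhealthyNodeReplacement` and `TerminationProtected` fields from a `DescribeCluster` response. The helper names are hypothetical, and the sample response is trimmed to the two fields of interest.

```python
def replacement_settings(describe_cluster_response):
    """Extract the two related settings from a DescribeCluster response."""
    cluster = describe_cluster_response["Cluster"]
    return {
        # .get() is used so a missing field shows up as None, not an error.
        "unhealthy_node_replacement": cluster.get("UnhealthyNodeReplacement"),
        "termination_protected": cluster.get("TerminationProtected"),
    }


def fetch_settings(cluster_id):
    """Query a live cluster. Requires boto3 and AWS credentials (not run here)."""
    import boto3  # third-party; imported lazily so the parser works without it

    emr = boto3.client("emr")
    return replacement_settings(emr.describe_cluster(ClusterId=cluster_id))


# Trimmed, hand-written response for illustration only.
sample_response = {
    "Cluster": {
        "Id": "j-3KVTXXXXXX7UG",
        "TerminationProtected": True,
        "UnhealthyNodeReplacement": False,
    }
}

print(replacement_settings(sample_response))
```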
If you're creating a single-node cluster or high-availability cluster that is running Amazon EMR release 7.0 or lower, the default
setting of unhealthy node replacement is dependent on termination protection:
Configuring unhealthy node replacement when you launch a cluster
You can enable or disable unhealthy node replacement when you launch a cluster using the console, the AWS CLI, or the API.
The default unhealthy node replacement setting depends on how you launch the cluster:
- Amazon EMR console: unhealthy node replacement is enabled by default.
- AWS CLI aws emr create-cluster: unhealthy node replacement is enabled by default unless you specify --no-unhealthy-node-replacement.
- Amazon EMR RunJobFlow API command: unhealthy node replacement is enabled by default unless you set the UnhealthyNodeReplacement Boolean value to True or False.
- Console
To turn unhealthy node replacement on or off when you create a cluster with the console
- Sign in to the AWS Management Console, and open the Amazon EMR console at https://console.aws.amazon.com/emr.
- Under EMR on EC2 in the left navigation pane, choose Clusters, and then choose Create cluster.
- For EMR release version, choose the Amazon EMR release label you want.
- Under Cluster termination and node replacement, make sure that Unhealthy node replacement (recommended) is pre-selected, or clear the selection to turn it off.
- Choose any other options that apply to your cluster.
- To launch your cluster, choose Create cluster.
- AWS CLI
To turn unhealthy node replacement on or off when you create a cluster using the AWS CLI
- With the AWS CLI, you can launch a cluster with unhealthy node replacement enabled by using the create-cluster command with the --unhealthy-node-replacement parameter. Unhealthy node replacement is on by default.
The following example creates a cluster with unhealthy node replacement enabled:
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace them with a caret (^).
aws emr create-cluster --name "SampleCluster" --release-label emr-7.5.0 \
--applications Name=Hadoop Name=Hive Name=Pig \
--use-default-roles --ec2-attributes KeyName=myKey --instance-type m5.xlarge \
--instance-count 3 --unhealthy-node-replacement
For more information about using Amazon EMR commands in the AWS CLI,
see Amazon EMR AWS CLI commands.
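For the API path, the sketch below builds a `RunJobFlow` request roughly equivalent to the CLI example in this section, setting `UnhealthyNodeReplacement` inside the `Instances` structure. The role names are assumed to be the ones `--use-default-roles` refers to (`EMR_DefaultRole` and `EMR_EC2_DefaultRole`), `KeepJobFlowAliveWhenNoSteps` is set on the assumption that the cluster should stay up as it does with the CLI, and the helper names are hypothetical.

```python
def run_job_flow_args(replace_unhealthy_nodes=True):
    """Build RunJobFlow arguments mirroring the create-cluster example."""
    return {
        "Name": "SampleCluster",
        "ReleaseLabel": "emr-7.5.0",
        "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Pig"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "Ec2KeyName": "myKey",
            # Assumption: keep the cluster running when it has no steps.
            "KeepJobFlowAliveWhenNoSteps": True,
            # The setting this section is about.
            "UnhealthyNodeReplacement": replace_unhealthy_nodes,
        },
        # Assumed default role names, as used by --use-default-roles.
        "ServiceRole": "EMR_DefaultRole",
        "JobFlowRole": "EMR_EC2_DefaultRole",
    }


def launch(replace_unhealthy_nodes=True):
    """Launch the cluster. Requires boto3 and AWS credentials (not run here)."""
    import boto3  # third-party; imported lazily so the builder works without it

    emr = boto3.client("emr")
    response = emr.run_job_flow(**run_job_flow_args(replace_unhealthy_nodes))
    return response["JobFlowId"]


print(run_job_flow_args(False)["Instances"]["UnhealthyNodeReplacement"])
```

Keeping the request builder separate from the call makes it easy to inspect or log the exact settings before a cluster is created.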
Configuring unhealthy node replacement in a running cluster
You can turn unhealthy node replacement on or off for a running cluster using the console, the AWS CLI, or the API.
- Console
To turn unhealthy node replacement on or off for a running cluster with the console
- Sign in to the AWS Management Console, and open the Amazon EMR console at https://console.aws.amazon.com/emr.
- Under EMR on EC2 in the left navigation pane, choose Clusters, and select the cluster that you want to update.
- On the Properties tab on the cluster details page, find Cluster termination and node replacement and select Edit.
- Select or clear the unhealthy node replacement check box to turn the feature on or off. Then select Save changes to confirm.
- AWS CLI
To turn unhealthy node replacement on or off for a running cluster using the AWS CLI
- To turn on unhealthy node replacement on a running cluster with the AWS CLI, use the modify-cluster-attributes command with the --unhealthy-node-replacement parameter. To disable it, use the --no-unhealthy-node-replacement parameter.
The following example turns on unhealthy node replacement on the cluster with ID j-3KVTXXXXXX7UG:
aws emr modify-cluster-attributes --cluster-id j-3KVTXXXXXX7UG --unhealthy-node-replacement
The following example turns off unhealthy node replacement on the
same cluster:
aws emr modify-cluster-attributes --cluster-id j-3KVTXXXXXX7UG --no-unhealthy-node-replacement
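The same change can be made through the API. The sketch below uses the `SetUnhealthyNodeReplacement` action as exposed by boto3 (`set_unhealthy_node_replacement`), which takes a list of cluster IDs and a Boolean. The helper names are hypothetical, and the call itself is not executed here because it needs AWS credentials and a real cluster.

```python
def set_replacement_args(cluster_id, enabled):
    """Build SetUnhealthyNodeReplacement arguments for a single cluster."""
    return {
        "JobFlowIds": [cluster_id],  # the API accepts a list of cluster IDs
        "UnhealthyNodeReplacement": bool(enabled),
    }


def set_replacement(cluster_id, enabled):
    """Apply the setting. Requires boto3 and AWS credentials (not run here)."""
    import boto3  # third-party; imported lazily so the builder works without it

    emr = boto3.client("emr")
    emr.set_unhealthy_node_replacement(**set_replacement_args(cluster_id, enabled))


# Arguments equivalent to the two CLI examples in this section.
print(set_replacement_args("j-3KVTXXXXXX7UG", True))
print(set_replacement_args("j-3KVTXXXXXX7UG", False))
```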