Generative AI troubleshooting for Apache Spark in AWS Glue

The generative AI troubleshooting for Apache Spark preview is available for jobs running on AWS Glue 4.0 in the following AWS Regions: US East (N. Virginia), US East (Ohio), US West (Oregon), US West (N. California), Europe (Ireland), Europe (Stockholm), Asia Pacific (Tokyo), Asia Pacific (Mumbai), and Asia Pacific (Sydney). Preview features are subject to change.

Generative AI Troubleshooting for Apache Spark jobs in AWS Glue is a new capability that helps data engineers and scientists diagnose and fix issues in their Spark applications. Using machine learning and generative AI, the feature analyzes issues in Spark jobs and provides a detailed root cause analysis along with actionable recommendations to resolve them.

How does Generative AI Troubleshooting for Apache Spark work?

For your failed Spark jobs, Generative AI Troubleshooting analyzes the job metadata and the precise metrics and logs associated with the error signature of your job to generate a root cause analysis, and recommends specific solutions and best practices to help address job failures.

Setting up Generative AI Troubleshooting for Apache Spark for your jobs

Note

During preview, this feature helps troubleshoot AWS Glue 4.0 jobs that fail within the first 30 minutes of their execution time.
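The preview limits stated on this page (AWS Glue 4.0, the listed Regions, and failure within the first 30 minutes) can be checked before you open the console. The following is a minimal sketch, not part of the feature itself. It assumes the input is a job run dictionary shaped like the JobRun structure returned by the AWS Glue GetJobRun API (fields GlueVersion, JobRunState, and ExecutionTime in seconds), and it assumes that "fails within the first 30 minutes" can be approximated as ExecutionTime of at most 1800 seconds. The function name is_preview_eligible is hypothetical.

```python
# Preview limits copied from this page. Region codes are the standard
# codes for the Region names listed in the availability notice.
PREVIEW_REGIONS = {
    "us-east-1",       # US East (N. Virginia)
    "us-east-2",       # US East (Ohio)
    "us-west-2",       # US West (Oregon)
    "us-west-1",       # US West (N. California)
    "eu-west-1",       # Europe (Ireland)
    "eu-north-1",      # Europe (Stockholm)
    "ap-northeast-1",  # Asia Pacific (Tokyo)
    "ap-south-1",      # Asia Pacific (Mumbai)
    "ap-southeast-2",  # Asia Pacific (Sydney)
}

# Assumption: "first 30 minutes" is approximated by ExecutionTime (seconds).
MAX_EXECUTION_SECONDS = 30 * 60


def is_preview_eligible(job_run: dict, region: str) -> bool:
    """Return True if a job run appears to fall within the preview limits.

    job_run is assumed to look like the JobRun structure from GetJobRun.
    """
    return (
        region in PREVIEW_REGIONS
        and job_run.get("GlueVersion") == "4.0"
        and job_run.get("JobRunState") == "FAILED"
        and job_run.get("ExecutionTime", 0) <= MAX_EXECUTION_SECONDS
    )
```

A run that passes this check is only a candidate for analysis. The console remains the place where the analysis is started.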

Configuring IAM permissions

Spark Troubleshooting calls AWS Glue APIs for your jobs, so your IAM identity needs permission to use those APIs. You can grant these permissions by attaching the following custom policy to your IAM identity (such as a user, role, or group).

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:StartCompletion",
        "glue:GetCompletion"
      ],
      "Resource": [
        "arn:aws:glue:*:*:completion/*"
      ]
    }
  ]
}

Note

During preview, Spark Troubleshooting does not expose APIs through the AWS SDK for programmatic use. The IAM policy lists two API actions, StartCompletion and GetCompletion, which enable this experience in the AWS Glue Studio console.
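If you manage IAM permissions from scripts, the policy document above can be built and attached in code. The following is a minimal sketch under stated assumptions: it uses the IAM put_role_policy call from boto3 to attach the document as an inline role policy, and the function names and the default policy name GlueSparkTroubleshooting are hypothetical. It only manages IAM permissions and does not call the troubleshooting feature, which has no SDK API during preview.

```python
import json


def build_troubleshooting_policy() -> dict:
    """Return the policy document shown on this page as a Python dict."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["glue:StartCompletion", "glue:GetCompletion"],
                "Resource": ["arn:aws:glue:*:*:completion/*"],
            }
        ],
    }


def attach_inline_policy(
    role_name: str, policy_name: str = "GlueSparkTroubleshooting"
) -> None:
    """Attach the policy to an IAM role as an inline policy.

    Assumes boto3 is installed and AWS credentials are configured.
    policy_name is a hypothetical default chosen for illustration.
    """
    import boto3  # imported lazily so the builder works without boto3

    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(build_troubleshooting_policy()),
    )


if __name__ == "__main__":
    # Print the document for review before attaching it anywhere.
    print(json.dumps(build_troubleshooting_policy(), indent=2))
```

An inline role policy is only one option. Attaching a managed policy to a user or group works equally well if that matches how your account is organized.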

Assigning permissions

To provide access, add the permissions to your users, groups, or roles.

Running troubleshooting analysis from a failed job run

You can access the troubleshooting feature through multiple paths in the AWS Glue console. Here's how to get started:

Option 1: From the Jobs List page

  1. Open the AWS Glue console at https://console.aws.amazon.com/glue/.

  2. In the navigation pane, choose ETL Jobs.

  3. Locate your failed job in the jobs list.

  4. Select the Runs tab in the job details section.

  5. Choose the failed job run you want to analyze.

  6. Choose Troubleshoot with AI to start the analysis.

  7. When the troubleshooting analysis is complete, you can view the root-cause analysis and recommendations in the Troubleshooting analysis tab at the bottom of the screen.

[Animation: a failed job run being analyzed with the Troubleshoot with AI feature.]

Option 2: Using the Job Run Monitoring page

  1. Navigate to the Job run monitoring page.

  2. Locate your failed job run.

  3. Choose the Actions drop-down menu.

  4. Choose Troubleshoot with AI.

Option 3: From the Job Run Details page

  1. Navigate to your failed job run's details page by either choosing View details on a failed run from the Runs tab or selecting the job run from the Job run monitoring page.

  2. In the job run details page, find the Troubleshooting analysis tab.

Supported troubleshooting categories (preview)

This service focuses on three primary categories of issues that data engineers and developers frequently encounter in their Spark applications:

  • Resource setup and access errors: When running Spark applications in AWS Glue, resource setup and access errors are among the most common yet challenging issues to diagnose. These errors often occur when your Spark application attempts to interact with AWS resources but encounters permission issues, missing resources, or configuration problems.

  • Spark driver and executor memory issues: Memory-related errors in Apache Spark jobs can be complex to diagnose and resolve. These errors often manifest when your data processing requirements exceed the available memory resources, either on the driver node or executor nodes.

  • Spark disk capacity issues: Storage-related errors in AWS Glue Spark jobs often emerge during shuffle operations, data spilling, or when dealing with large-scale data transformations. These errors can be particularly tricky because they might not manifest until your job has been running for a while, potentially wasting valuable compute time and resources.
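To get a rough idea of which of the three categories above a failure might fall into, you can scan the error message for well-known strings. The following is an illustrative heuristic only and is not how the service performs its analysis. The category keys, the pattern list, and the function name guess_category are hypothetical, and the patterns cover only a few common AWS and Spark error strings.

```python
import re
from typing import Optional

# Hypothetical keyword patterns per category. These are common error
# strings, not an exhaustive or official list.
CATEGORY_PATTERNS = {
    "resource_setup_and_access": [
        r"Access\s?Denied",
        r"EntityNotFoundException",
        r"NoSuchBucket",
        r"NoSuchKey",
    ],
    "memory": [
        r"OutOfMemoryError",
        r"GC overhead limit exceeded",
    ],
    "disk_capacity": [
        r"No space left on device",
    ],
}


def guess_category(error_message: str) -> Optional[str]:
    """Return the first category whose patterns match, or None."""
    for category, patterns in CATEGORY_PATTERNS.items():
        if any(re.search(p, error_message, re.IGNORECASE) for p in patterns):
            return category
    return None
```

For example, guess_category("java.io.IOException: No space left on device") returns "disk_capacity". A result of None means only that none of the sample patterns matched.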

Note

Before implementing any suggested changes in your production environment, review the suggested changes thoroughly. The service provides recommendations based on patterns and best practices, but your specific use case might require additional considerations.