Harness the power of artificial intelligence (AI) and machine learning (ML) using Splunk and Amazon SageMaker Canvas
The blog post above explores how Amazon SageMaker Canvas, a no-code ML development service, can be used in conjunction with data collected in Splunk to drive actionable insights.
The adaptable approach detailed in the post starts with an automated data engineering pipeline that makes data stored in Splunk available to a wide range of personas, including business intelligence (BI) analysts, data scientists, and ML practitioners, through a structured query language (SQL) interface. The pipeline achieves this by transferring data from a Splunk index into an Amazon S3 bucket, where it is cataloged.
This aws-samples repository houses an AWS Serverless Application Model (AWS SAM) template and related Python code that demonstrate this pipeline in action.
For this walkthrough, you will need the following prerequisites in place:
- An AWS account
- AWS Identity and Access Management (IAM) permissions to deploy the AWS resources using AWS SAM
- Latest version of AWS SAM Command Line Interface (CLI) installed locally
- Local installation of Python 3.11 and pip
- The Lambda function in this solution uses a container image that is built locally at deployment time, so install Docker on your operating system if it is not already installed and running.
This solution uses AWS Secrets Manager to store the Splunk bearer authentication token. The bearer authentication token is retrieved and used by the AWS Lambda function when accessing Splunk's Search REST API endpoint. Follow Splunk's Create authentication tokens document for steps to create the bearer authentication token.
Important: If your Splunk instance has IP allow lists configured, confirm that IP restrictions are in place that allow you to access the Splunk Search REST API endpoint programmatically.
Once you have the token generated, follow the steps below to store it using Secrets Manager:
- In the AWS console, navigate to AWS Secrets Manager and select Store a new secret to create a new secret.
- Select Other type of secret and enter the following details as key/value pairs.

| Key | Value |
|---|---|
| `SplunkBearerToken` | Splunk bearer token retrieved from Splunk. This is used when authenticating against the Splunk search REST API endpoint. |
| `SplunkDeploymentName` | Name of your Splunk deployment. This is used when constructing the search REST API endpoint URL for Splunk Cloud. For example, if your Splunk Cloud deployment is `test.splunkcloud.com`, this value will be `test`. |
| `SplunkIndexName` | Name of the Splunk index used to verify connectivity. |

- Enter `SplunkDeployment` as the Secret name. Select Next. Complete the remaining configuration of the new secret with default settings.
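If you prefer to create the secret from code rather than through the console, a minimal boto3 sketch is shown below. It assumes your AWS credentials are already configured; all values are placeholders for your own deployment.

```python
import json

import boto3

# Create the secret that the Lambda function reads at runtime.
# The key names must match those in the table above; the values
# shown here are placeholders for your own deployment.
secretsmanager = boto3.client("secretsmanager")

secretsmanager.create_secret(
    Name="SplunkDeployment",
    SecretString=json.dumps(
        {
            "SplunkBearerToken": "<your-splunk-bearer-token>",
            "SplunkDeploymentName": "test",  # e.g. test.splunkcloud.com -> "test"
            "SplunkIndexName": "myindex",
        }
    ),
)
```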
The data that is exported from Splunk to the S3 bucket is controlled by the `configuration.json` file. This file includes the Splunk Search Processing Language (SPL) query used to retrieve the results.
Important: The file currently contains default values. Update the file as required before deployment.
- Replace the `mysource`, `myindex` and `mysourcetype` values as required by your SPL query. We recommend that you test the SPL query in Splunk with limited data first to ensure that it is returning the expected results.
- Replace `mypath` with the name of the sub-folder in which the exported data is stored. The S3 bucket name itself is auto-generated by the AWS SAM template deployment, and the top-level folder is determined by the `splunkDataExportTopLevelPath` CloudFormation parameter at deployment time.
- Replace `myid` with the field that you wish to use as the partition key (for example, `userid`). Some analytics tooling expects data stored in the field used for constructing the partitioned folder structure to also be duplicated in a non-partitioned column. Replace `myid_copy` with the name of a new column which duplicates this data (e.g. `userid_copy`).
{
"searches":
[
{
"spl": {
"search_query": "search source=mysource index=myindex sourcetype=mysourcetype | table *",
"search_export": {
"earliest_time": "0",
"latest_time": "now",
"enable_lookups": "true",
"parse_only": "false",
"count": "50000",
"output_mode": "json"
}
},
"s3": {
"path": "mypath/",
"partition_cols": ["myid"],
"partition_cols_duplicate": ["myid_copy"],
"engine": "pyarrow",
"compression": "snappy"
}
}
]
}
Visit the Splunk Developer Tools site for more information about the Splunk Enterprise SDK for Python.
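To illustrate how the values in `configuration.json` map onto a search export call, here is a minimal sketch using the Splunk Enterprise SDK for Python (`splunklib`). It is illustrative only and is not the exact code used by the Lambda function; the host, token, and the version-dependent `JSONResultsReader` usage are assumptions you should adapt to your environment.

```python
import splunklib.client as client
import splunklib.results as results

# Connection details are placeholders: the host follows the Splunk Cloud
# pattern <SplunkDeploymentName>.splunkcloud.com and the token is the value
# stored in Secrets Manager as SplunkBearerToken.
service = client.connect(
    host="test.splunkcloud.com",
    port=8089,
    splunkToken="<your-splunk-bearer-token>",  # bearer-token auth (recent SDK versions)
)

# Mirrors the "spl" section of configuration.json.
stream = service.jobs.export(
    "search source=mysource index=myindex sourcetype=mysourcetype | table *",
    earliest_time="0",
    latest_time="now",
    output_mode="json",
    count=50000,
)

# JSONResultsReader (splunk-sdk 1.6.16+) yields dicts for result rows and
# Message objects for diagnostic messages.
rows = [item for item in results.JSONResultsReader(stream) if isinstance(item, dict)]
print(f"Retrieved {len(rows)} events from Splunk")
```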
Before starting, confirm that the latest version of the AWS SAM CLI is installed by running `sam --version`.
Note: The AWS SAM CLI requires appropriate permissions to provision resources in the chosen AWS account. Ensure that an access key and secret access key have been created using IAM, and that `aws configure` has been used to register them locally on your machine.
- To download all required files to your local machine, run the following command.
git clone https://github.com/aws-samples/harness-the-power-of-ai-and-ml-using-splunk-and-amazon-sagemaker-canvas.git
Note: If you are unable to use the `git` command, simply download the source code from Code > Download source code. Unzip the zip file in your chosen directory.
- Navigate into the `harness-the-power-of-ai-and-ml-using-splunk-and-amazon-sagemaker-canvas` folder.
cd harness-the-power-of-ai-and-ml-using-splunk-and-amazon-sagemaker-canvas
- Build the SAM application.
sam build
- Confirm that the `Build Succeeded` message is displayed.
- Deploy the application.
sam deploy --guided
- When prompted, enter the unique details chosen for your CloudFormation stack. In this example, we have chosen the CloudFormation stack name `splunkDataExport` and kept the remainder of the options as defaults.
Stack Name [sam-app]: splunkDataExport
AWS Region [eu-west-1]: us-east-1
Parameter splunkSecretsManagerSecret [SplunkDeployment]:
Parameter splunkDataExportConfigurationFile [configuration.json]:
Parameter splunkDataExportTopLevelPath [splunk-data]:
- Confirm that the `Successfully created/updated stack` message is shown. Deployment will take approximately 5 minutes.
- You are now ready to test the solution.
The data export pipeline has been fully provisioned through the deployed CloudFormation stack. To run it, navigate to the AWS Step Functions console, select the state machine, and select Start execution. The pipeline takes a few minutes to complete and creates and populates tables in AWS Glue.
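If you prefer to start the pipeline programmatically rather than from the console, a minimal boto3 sketch is shown below; the state machine ARN is a placeholder that you can copy from the Step Functions console or from the resources of the deployed CloudFormation stack.

```python
import boto3

sfn = boto3.client("stepfunctions")

# Placeholder ARN: copy the real value from the Step Functions console or
# from the deployed CloudFormation stack's resources.
response = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:splunkDataExport"
)
print(response["executionArn"])
```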
The automation handles:
- API calls to the Splunk search REST API
- Conversion of retrieved data into Apache Parquet format with compression and partitioning enabled, and storage in the S3 bucket (see the sketch after this list)
- Execution of AWS Glue crawler to catalog files stored in S3 as tables in AWS Glue Data Catalog
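To illustrate what the conversion step produces, here is a rough sketch of writing partitioned, snappy-compressed Parquet to S3 with the AWS SDK for pandas (awswrangler). This is illustrative only and is not necessarily the library or code used by the Lambda function; the bucket name, column names, and sample rows are placeholders.

```python
import awswrangler as wr
import pandas as pd

# "rows" stands in for the events retrieved from Splunk; the column names
# are illustrative only.
rows = [
    {"userid": "u1", "userid_copy": "u1", "action": "login"},
    {"userid": "u2", "userid_copy": "u2", "action": "logout"},
]
df = pd.DataFrame(rows)

# Mirrors the "s3" section of configuration.json: data is partitioned by the
# chosen key and compressed with snappy. The bucket name is a placeholder;
# the real bucket is created by the AWS SAM deployment.
wr.s3.to_parquet(
    df=df,
    path="s3://<splunk-data-export-bucket>/splunk-data/mypath/",
    dataset=True,
    partition_cols=["userid"],
    compression="snappy",
)
```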
Progress can be monitored using the Graph view of the state machine execution.
Once complete, the data can be queried using Amazon Athena directly, or with other tools that integrate with Athena, such as Amazon SageMaker and Amazon QuickSight.
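As a quick check, the crawled table can also be queried from Python with the AWS SDK for pandas. The database and table names below are assumptions; use the names that the AWS Glue crawler actually creates in the Glue Data Catalog.

```python
import awswrangler as wr

# Placeholder database and table names: substitute the names created by the
# AWS Glue crawler in the Glue Data Catalog.
df = wr.athena.read_sql_query(
    sql="SELECT * FROM mypath LIMIT 10",
    database="splunk_data_export_db",
)
print(df.head())
```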
If you want additional data to be crawled alongside the data exported from Splunk, create another folder inside the top-level folder, and upload your file to it (e.g. a CSV file). This data will be automatically crawled during the next execution of the Step Functions state machine and will appear as an additional table in the AWS Glue Data Catalog.
- Top-level folder (e.g. `splunk-data/`) - set in the `splunkDataExportTopLevelPath` CloudFormation parameter
  - Splunk data export path(s) (e.g. `mypath/`) - set in `configuration.json`
  - Additional data files (e.g. `myadditionalpath/`) - manually created (if required)
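For example, a hypothetical boto3 snippet for uploading such an additional file is shown below; the bucket, folder, and file names are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Placeholders: the bucket is created by the AWS SAM deployment and the
# folder name is your choice; it just needs to sit under the top-level folder.
s3.upload_file(
    Filename="additional-data.csv",
    Bucket="<splunk-data-export-bucket>",
    Key="splunk-data/myadditionalpath/additional-data.csv",
)
```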
To avoid incurring future charges, delete the CloudFormation stacks that have been provisioned in the AWS account. This can be achieved using:
sam delete
Empty the S3 bucket used for storing files before the template is deleted. The S3 bucket used for the data can be found in the CloudFormation output `splunkDataExportS3BucketName`.
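A minimal boto3 sketch for emptying the bucket is shown below; the bucket name is a placeholder and should be replaced with the value of the `splunkDataExportS3BucketName` output.

```python
import boto3

# Replace the placeholder with the bucket name from the CloudFormation
# output splunkDataExportS3BucketName. This permanently deletes all objects
# in the bucket.
bucket = boto3.resource("s3").Bucket("<splunk-data-export-bucket>")
bucket.objects.all().delete()
```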