Data import - Amazon SageMaker AI

Data import

Amazon SageMaker Canvas supports importing tabular, image, and document data. You can import datasets from your local machine, Amazon services such as Amazon S3 and Amazon Redshift, and external data sources. When importing datasets from Amazon S3, you can bring a dataset of any size. Use the datasets that you import to build models and make predictions for other datasets.

Each use case for which you can build a custom model accepts different types of input. For example, if you want to build a single-label image classification model, then you should import image data. For more information about the different model types and the data they accept, see How custom models work. You can import data and build custom models in SageMaker Canvas for the following data types:

  • Tabular (CSV, Parquet, or tables)

    • Categorical – Use categorical data to build custom categorical prediction models, for either 2-category or 3+ category prediction.

    • Numeric – Use numeric data to build custom numeric prediction models.

    • Text – Use text data to build custom multi-category text prediction models.

    • Time series – Use time series data to build custom time series forecasting models.

  • Image (JPG or PNG) – Use image data to build custom single-label image prediction models.

  • Document (PDF, JPG, PNG, TIFF) – Document data is only supported for SageMaker Canvas Ready-to-use models. To learn more about Ready-to-use models that can make predictions for document data, see Ready-to-use models.
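As a rough illustration of the format list above, the following sketch maps a file name to the data types whose listed formats include its extension. The function and dictionary names are hypothetical and are not part of any Canvas API. Because JPG and PNG appear under both image and document data, a file can match more than one type.

```python
from pathlib import Path

# Extensions per data type, copied from the list above. Tables that come
# from databases have no file extension, so they are not represented here.
# CANVAS_FORMATS and candidate_data_types are illustrative names only.
CANVAS_FORMATS = {
    "tabular": {".csv", ".parquet"},
    "image": {".jpg", ".png"},
    "document": {".pdf", ".jpg", ".png", ".tiff"},
}


def candidate_data_types(filename: str) -> list:
    """Return the data types whose listed formats include this file's extension."""
    suffix = Path(filename).suffix.lower()
    return sorted(kind for kind, exts in CANVAS_FORMATS.items() if suffix in exts)


if __name__ == "__main__":
    for name in ("sales.csv", "scan.PNG", "invoice.pdf", "notes.txt"):
        print(name, candidate_data_types(name))
```

An empty result means that the extension is not in the list above. It does not check the actual file contents or the other input requirements described in Create a dataset.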

You can import data into Canvas from the following data sources:

  • Local files on your computer

  • Amazon S3 buckets

  • Amazon Redshift provisioned clusters (not Amazon Redshift Serverless)

  • AWS Glue Data Catalog through Amazon Athena

  • Amazon Aurora

  • Amazon Relational Database Service (Amazon RDS)

  • Salesforce Data Cloud

  • Snowflake

  • Databricks, SQLServer, MariaDB, and other popular databases through JDBC connectors

  • Over 40 external SaaS platforms, such as SAP OData

For a full list of data sources from which you can import, see the following table:

Source | Type | Supported data types
Local file upload | Local | Tabular, Image, Document
Amazon Aurora | Amazon internal | Tabular
Amazon S3 bucket | Amazon internal | Tabular, Image, Document
Amazon RDS | Amazon internal | Tabular
Amazon Redshift provisioned clusters (not Redshift Serverless) | Amazon internal | Tabular
AWS Glue Data Catalog (through Amazon Athena) | Amazon internal | Tabular
Databricks | External | Tabular
Snowflake | External | Tabular
Salesforce Data Cloud | External | Tabular
SQLServer | External | Tabular
MySQL | External | Tabular
PostgreSQL | External | Tabular
MariaDB | External | Tabular
Amplitude | External SaaS platform | Tabular
CircleCI | External SaaS platform | Tabular
DocuSign Monitor | External SaaS platform | Tabular
Domo | External SaaS platform | Tabular
Datadog | External SaaS platform | Tabular
Dynatrace | External SaaS platform | Tabular
Facebook Ads | External SaaS platform | Tabular
Facebook Page Insights | External SaaS platform | Tabular
Google Ads | External SaaS platform | Tabular
Google Analytics 4 | External SaaS platform | Tabular
Google Search Console | External SaaS platform | Tabular
GitHub | External SaaS platform | Tabular
GitLab | External SaaS platform | Tabular
Infor Nexus | External SaaS platform | Tabular
Instagram Ads | External SaaS platform | Tabular
Jira Cloud | External SaaS platform | Tabular
LinkedIn Ads | External SaaS platform | Tabular
Mailchimp | External SaaS platform | Tabular
Marketo | External SaaS platform | Tabular
Microsoft Teams | External SaaS platform | Tabular
Mixpanel | External SaaS platform | Tabular
Okta | External SaaS platform | Tabular
Salesforce | External SaaS platform | Tabular
Salesforce Marketing Cloud | External SaaS platform | Tabular
Salesforce Pardot | External SaaS platform | Tabular
SAP OData | External SaaS platform | Tabular
SendGrid | External SaaS platform | Tabular
ServiceNow | External SaaS platform | Tabular
Singular | External SaaS platform | Tabular
Slack | External SaaS platform | Tabular
Stripe | External SaaS platform | Tabular
Trend Micro | External SaaS platform | Tabular
Typeform | External SaaS platform | Tabular
Veeva | External SaaS platform | Tabular
Zendesk | External SaaS platform | Tabular
Zendesk Chat | External SaaS platform | Tabular
Zendesk Sell | External SaaS platform | Tabular
Zendesk Sunshine | External SaaS platform | Tabular
Zoom Meetings | External SaaS platform | Tabular

For instructions on how to import data and information regarding input data requirements, such as the maximum file size for images, see Create a dataset.

Canvas also provides several sample datasets in your application to help you get started. To learn more about the SageMaker AI-provided sample datasets you can experiment with, see Use sample datasets.

After you import a dataset into Canvas, you can update the dataset at any time. You can do a manual update or you can set up a schedule for automatic dataset updates. For more information, see Update a dataset.

For more information specific to each dataset type, see the following sections:

Tabular

To import data from an external data source (such as a Snowflake database or a SaaS platform), you must authenticate and connect to the data source in the Canvas application. For more information, see Connect to data sources.

If you want to import datasets larger than 5 GB from Amazon S3 into Canvas, you can achieve faster sampling by using Amazon Athena to query and sample the data from Amazon S3.
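To make the Athena sampling idea concrete, here is a minimal sketch that builds a sampling query string. It assumes the S3 data is already registered as a table in the AWS Glue Data Catalog and relies on the `TABLESAMPLE BERNOULLI` clause of Athena's `SELECT` syntax. The function name, the database and table names, and the sampling percentage are all hypothetical, and this is not necessarily the query that Canvas itself runs.

```python
import re

# Conservative identifier check so that names can be quoted safely.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_sample_query(database: str, table: str, percent: float, limit: int) -> str:
    """Build an Athena query that reads a random sample of a table.

    Illustrative helper only. BERNOULLI sampling includes each row with
    probability percent/100, so the number of rows returned is approximate;
    LIMIT caps it.
    """
    for name in (database, table):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsupported identifier: {name!r}")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        f'SELECT * FROM "{database}"."{table}" '
        f"TABLESAMPLE BERNOULLI ({percent:g}) "
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    # "sales_db" and "orders" are placeholder names.
    print(build_sample_query("sales_db", "orders", 5, 50000))
```

The resulting string could be used wherever you write SQL against Athena. Tune the percentage and limit to the size of your own dataset.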

After creating datasets in Canvas, you can prepare and transform your data using the data preparation functionality of Data Wrangler. You can use Data Wrangler to handle missing values, transform your features, join multiple datasets into a single dataset, and more. For more information, see Data preparation.

Tip

As long as your data is arranged into tables, you can join datasets from various sources, such as Amazon Redshift, Amazon Athena, or Snowflake.
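For readers unfamiliar with joins, the sketch below shows what joining two tables on a shared column means, using plain Python rows. It only illustrates the concept. In Canvas you perform joins through the data preparation interface, and the column names and values here are made up.

```python
def inner_join(left, right, key):
    """Inner-join two lists of row dicts on a shared key column.

    Only rows whose key value appears in both inputs are kept. A left row
    that matches several right rows produces one output row per match.
    """
    # Index the right-hand rows by key for fast lookup.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    joined = []
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            joined.append({**left_row, **right_row})
    return joined


if __name__ == "__main__":
    # Hypothetical rows, for example one table from Amazon Redshift
    # and one from Snowflake.
    customers = [
        {"customer_id": 1, "region": "EU"},
        {"customer_id": 2, "region": "US"},
    ]
    orders = [
        {"customer_id": 1, "total": 30},
        {"customer_id": 1, "total": 12},
        {"customer_id": 3, "total": 7},
    ]
    for row in inner_join(customers, orders, "customer_id"):
        print(row)
```

Customer 2 has no orders and order 3 has no matching customer, so neither appears in the result. That is the behavior of an inner join.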

Image

For information about how to edit an image dataset and perform tasks such as assigning or reassigning labels, adding images, or deleting images, see Edit an image dataset.