Data import
Amazon SageMaker Canvas supports importing tabular, image, and document data. You can import datasets from your local machine, from Amazon services such as Amazon S3 and Amazon Redshift, and from external data sources. When you import from Amazon S3, the dataset can be of any size. Use the datasets that you import to build models and to make predictions on other datasets.
Each use case for which you can build a custom model accepts different types of input. For example, if you want to build a single-label image classification model, then you should import image data. For more information about the different model types and the data they accept, see How custom models work. You can import data and build custom models in SageMaker Canvas for the following data types:
- Tabular (CSV, Parquet, or tables)
  - Categorical – Use categorical data to build custom categorical prediction models for 2 and 3+ category prediction.
  - Numeric – Use numeric data to build custom numeric prediction models.
  - Text – Use text data to build custom multi-category text prediction models.
  - Time series – Use time series data to build custom time series forecasting models.
- Image (JPG or PNG) – Use image data to build custom single-label image prediction models.
- Document (PDF, JPG, PNG, TIFF) – Document data is only supported for SageMaker Canvas Ready-to-use models. To learn more about Ready-to-use models that can make predictions for document data, see Ready-to-use models.
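As a rough illustration of the list above, the following Python sketch maps a file extension to the Canvas dataset types that accept it. The mapping covers only the extensions named in the list; the function name `candidate_dataset_types` and the constant name are hypothetical, and Canvas itself may accept or reject files based on checks not shown here.

```python
from pathlib import Path

# Extension-to-dataset-type mapping taken from the list above.
# Database tables have no file extension, so they are not covered here.
CANVAS_TYPES_BY_EXTENSION = {
    ".csv": {"tabular"},
    ".parquet": {"tabular"},
    ".jpg": {"image", "document"},
    ".png": {"image", "document"},
    ".pdf": {"document"},
    ".tiff": {"document"},
}


def candidate_dataset_types(filename: str) -> set[str]:
    """Return the dataset types a file could belong to, or an empty set."""
    extension = Path(filename).suffix.lower()
    # Copy the set so callers cannot mutate the shared mapping.
    return set(CANVAS_TYPES_BY_EXTENSION.get(extension, set()))


if __name__ == "__main__":
    for name in ("sales.csv", "scan.PNG", "notes.txt"):
        print(name, sorted(candidate_dataset_types(name)))
```

A JPG or PNG file maps to two types because the same file can be used either as training data for an image model or as input to a Ready-to-use document model.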
You can import data into Canvas from the following data sources:
- Local files on your computer
- Amazon S3 buckets
- Amazon Redshift provisioned clusters (not Amazon Redshift Serverless)
- AWS Glue Data Catalog through Amazon Athena
- Amazon Aurora
- Amazon Relational Database Service (Amazon RDS)
- Salesforce Data Cloud
- Snowflake
- Databricks, SQLServer, MariaDB, and other popular databases through JDBC connectors
- Over 40 external SaaS platforms, such as SAP OData
For a full list of data sources from which you can import, see the following table:
Source | Type | Supported data types |
---|---|---|
Local file upload | Local | Tabular, Image, Document |
Amazon Aurora | Amazon internal | Tabular |
Amazon S3 bucket | Amazon internal | Tabular, Image, Document |
Amazon RDS | Amazon internal | Tabular |
Amazon Redshift provisioned clusters (not Redshift Serverless) | Amazon internal | Tabular |
AWS Glue Data Catalog (through Amazon Athena) | Amazon internal | Tabular |
Databricks | External | Tabular |
Snowflake | External | Tabular |
Salesforce Data Cloud | External | Tabular |
SQLServer | External | Tabular |
MySQL | External | Tabular |
PostgreSQL | External | Tabular |
MariaDB | External | Tabular |
SaaS platforms, such as SAP OData (over 40 connectors) | External SaaS platform | Tabular |
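The table can also be read as a lookup from source to supported data types. The Python sketch below encodes a subset of the rows (only sources that are named in this section) and answers the question "which sources accept a given data type?". The dictionary and function names are hypothetical and are not part of any Canvas API.

```python
# A subset of the table above: source name -> supported data types.
SUPPORTED_TYPES_BY_SOURCE = {
    "Local file upload": {"Tabular", "Image", "Document"},
    "Amazon S3 bucket": {"Tabular", "Image", "Document"},
    "Amazon Aurora": {"Tabular"},
    "Amazon RDS": {"Tabular"},
    "Amazon Redshift provisioned clusters": {"Tabular"},
    "AWS Glue Data Catalog (through Amazon Athena)": {"Tabular"},
    "Snowflake": {"Tabular"},
}


def sources_supporting(data_type: str) -> list[str]:
    """Return the source names (sorted) that accept the given data type."""
    return sorted(
        source
        for source, types in SUPPORTED_TYPES_BY_SOURCE.items()
        if data_type in types
    )


if __name__ == "__main__":
    print(sources_supporting("Image"))
```

In this subset, only local file upload and Amazon S3 accept image and document data; every other source is tabular only.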
For instructions on how to import data and information regarding input data requirements, such as the maximum file size for images, see Create a dataset.
Canvas also provides several sample datasets in your application to help you get started. To learn more about the SageMaker AI-provided sample datasets you can experiment with, see Use sample datasets.
After you import a dataset into Canvas, you can update the dataset at any time. You can do a manual update or you can set up a schedule for automatic dataset updates. For more information, see Update a dataset.
For more information specific to each dataset type, see the following sections:
Tabular
To import data from an external data source (such as a Snowflake database or a SaaS platform), you must authenticate and connect to the data source in the Canvas application. For more information, see Connect to data sources.
If you want to import datasets larger than 5 GB from Amazon S3 into Canvas, you can achieve faster sampling by using Amazon Athena to query and sample the data from Amazon S3.
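To make the idea of sampling through Athena concrete, here is a minimal Python sketch that builds a sampling query using Athena's `TABLESAMPLE BERNOULLI` clause. This is an illustration under assumptions, not a description of how Canvas samples internally: the function name `build_sample_query` and the database and table names are hypothetical, and the table is assumed to be already registered in the AWS Glue Data Catalog over your Amazon S3 data.

```python
def build_sample_query(database: str, table: str, percent: float, max_rows: int) -> str:
    """Build an Athena SQL query that samples roughly `percent` of a table's rows.

    BERNOULLI sampling keeps each row independently with the given
    probability, so the number of rows returned varies between runs.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    for identifier in (database, table):
        # Reject double quotes so the quoted identifiers below stay well-formed.
        if '"' in identifier:
            raise ValueError(f"invalid identifier: {identifier!r}")
    return (
        f'SELECT * FROM "{database}"."{table}" '
        f"TABLESAMPLE BERNOULLI ({percent}) "
        f"LIMIT {max_rows}"
    )


if __name__ == "__main__":
    # Hypothetical names: replace with your own Glue database and table.
    print(build_sample_query("sales_db", "orders", 1, 50000))
```

You could run the resulting query in the Athena query editor, or submit it programmatically (for example, with the boto3 Athena client's `start_query_execution` call), and then import the query result into Canvas.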
After creating datasets in Canvas, you can prepare and transform your data using the data preparation functionality of Data Wrangler. You can use Data Wrangler to handle missing values, transform your features, join multiple datasets into a single dataset, and more. For more information, see Data preparation.
Tip
As long as your data is arranged into tables, you can join datasets from various sources, such as Amazon Redshift, Amazon Athena, or Snowflake.
Image
For information about how to edit an image dataset and perform tasks such as assigning or reassigning labels, adding images, or deleting images, see Edit an image dataset.