Different ways to get data into Amazon EMR

Amazon EMR provides several ways to get data onto a cluster. The most common is to upload the data to Amazon S3 and then use the built-in features of Amazon EMR to load it onto your cluster. You can also use the Hadoop DistributedCache feature to transfer files from a distributed file system to the local file system. The implementation of Hive provided by Amazon EMR (Hive version 0.7.1.1 and later) includes functionality for importing and exporting data between DynamoDB and an Amazon EMR cluster. If you have large amounts of on-premises data to process, the AWS Direct Connect service may be useful.

Topics

Upload data to Amazon S3
Upload data with AWS DataSync
Import files with distributed cache with Amazon EMR
Detecting and processing compressed files with Amazon EMR
Import DynamoDB data into Hive with Amazon EMR
Connect to data with AWS Direct Connect from Amazon EMR
Upload large amounts of data for Amazon EMR with AWS Snowball
