

MongoDB Connector for Apache Spark

Build new classes of sophisticated, real-time analytics by combining Apache Spark, the industry’s leading data processing engine, with MongoDB, the industry’s fastest-growing database. The MongoDB Connector for Apache Spark is generally available, certified, and supported for production use today.

Access insights now

We live in a world of “big data”. But it isn’t just the data itself that is valuable – it’s the insight it can generate. How quickly an organization can unlock and act on that insight has become a major source of competitive advantage. Collecting data in operational systems and then relying on nightly batch extract, transform, load (ETL) processes to update the enterprise data warehouse (EDW) is no longer sufficient.


Unlock the power of Apache Spark

The MongoDB Connector for Apache Spark exposes all of Spark’s libraries from Scala, Java, Python, and R. MongoDB data is materialized as DataFrames and Datasets for analysis with Spark’s machine learning, graph, streaming, and SQL APIs.
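As a rough illustration of what materializing MongoDB data as a DataFrame can look like from Python, the sketch below loads a collection and queries it with Spark SQL. It assumes the 10.x connector (data source name "mongodb", options "connection.uri", "database", and "collection"). The `customers` view and the `country` field are hypothetical and used only for illustration. The `spark` argument is expected to be an existing `SparkSession` with the connector on its classpath.

```python
def mongo_read_options(uri, database, collection):
    """Build the read options for the connector.

    Option names follow the 10.x connector (an assumption; older
    releases use different names).
    """
    return {
        "connection.uri": uri,
        "database": database,
        "collection": collection,
    }


def load_collection(spark, uri, database, collection):
    """Load a MongoDB collection as a Spark DataFrame.

    `spark` is an existing pyspark.sql.SparkSession.
    """
    options = mongo_read_options(uri, database, collection)
    return spark.read.format("mongodb").options(**options).load()


def customers_per_country(spark, customers_df):
    """Query the DataFrame with Spark SQL through a temporary view.

    The `country` field is hypothetical.
    """
    customers_df.createOrReplaceTempView("customers")
    return spark.sql(
        "SELECT country, COUNT(*) AS customers "
        "FROM customers GROUP BY country ORDER BY customers DESC"
    )
```

With a session in hand, `load_collection(spark, "mongodb://localhost:27017", "shop", "customers")` would return a DataFrame that the machine learning, graph, and streaming libraries can consume in the same way as any other Spark data source.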

Leverage the power of MongoDB

The MongoDB Connector for Apache Spark can take advantage of MongoDB’s aggregation pipeline and rich secondary indexes to extract, filter, and process only the data it needs – for example, analyzing all customers located in a specific geography. Many traditional NoSQL datastores offer neither secondary indexes nor in-database aggregations. In these cases, Spark would need to extract all data based on a simple primary key, even if only a subset of that data is required for the Spark process. The MongoDB Connector for Apache Spark co-locates Resilient Distributed Datasets (RDDs) with the source MongoDB node to minimize data movement across the cluster and reduce latency.
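The geography example can be sketched as an aggregation pipeline that MongoDB evaluates before any data reaches Spark. This is a minimal sketch, assuming the 10.x connector and its "aggregation.pipeline" read option. The `address.country` field and the helper names are hypothetical. An index on the matched field is what would let MongoDB avoid a full collection scan.

```python
import json


def region_pipeline(country, fields):
    """Build a pipeline that keeps one country's customers and a few fields.

    `address.country` is a hypothetical field name.
    """
    return [
        {"$match": {"address.country": country}},
        {"$project": {field: 1 for field in fields}},
    ]


def load_region(spark, uri, database, collection, country, fields):
    """Read only the matching documents into a Spark DataFrame.

    `spark` is an existing pyspark.sql.SparkSession; option names
    follow the 10.x connector (an assumption).
    """
    pipeline = region_pipeline(country, fields)
    return (
        spark.read.format("mongodb")
        .option("connection.uri", uri)
        .option("database", database)
        .option("collection", collection)
        # The pipeline is passed as a JSON string and runs inside MongoDB.
        .option("aggregation.pipeline", json.dumps(pipeline))
        .load()
    )
```

Because the `$match` and `$project` stages run in the database, only the selected customers and fields cross the network, rather than the whole collection.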

MongoDB and Apache Spark: Working for Data Science Teams Today

While MongoDB natively offers rich real-time analytics capabilities, there are use cases where integrating the Apache Spark engine can extend the processing of operational data managed by MongoDB. This allows users to operationalize results generated from Spark within real-time business processes supported by MongoDB.
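One way to operationalize Spark results is to write them back to a MongoDB collection that the application already reads. The sketch below assumes the 10.x connector (data source name "mongodb" with "connection.uri", "database", and "collection" write options) and that "append" and "overwrite" are the save modes in use. The helper name `save_results` is hypothetical.

```python
def save_results(results_df, uri, database, collection, mode="append"):
    """Write a Spark DataFrame of results to a MongoDB collection.

    `results_df` is a pyspark.sql.DataFrame. The accepted modes are an
    assumption based on the 10.x connector.
    """
    if mode not in ("append", "overwrite"):
        raise ValueError(f"unsupported save mode: {mode!r}")
    (
        results_df.write.format("mongodb")
        .mode(mode)
        .option("connection.uri", uri)
        .option("database", database)
        .option("collection", collection)
        .save()
    )
```

Once saved, the results are ordinary MongoDB documents, so the operational application can query and index them like any other collection.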

Next steps

Ready to get started?

Get the MongoDB Connector for Apache Spark.