Inefficient MongoDB Scanning with should_add_document_size_filter #12427

Open
Haebuk opened this issue Jan 22, 2025 · 1 comment
Labels
bug Bug report

Comments

Haebuk commented Jan 22, 2025

Describe the bug
The should_add_document_size_filter option causes inefficient scanning of MongoDB collections. It currently returns True for every MongoDB server at version 4.4 or higher that is not hosted on AWS DocumentDB:

def should_add_document_size_filter(self) -> bool:
    # The $bsonSize operator is only available in server version 4.4 and
    # above, and is not supported by AWS DocumentDB, so we only add it to
    # the aggregation for MongoDB deployments that are at least 4.4 and
    # not running on AWS DocumentDB.
    # https://docs.aws.amazon.com/documentdb/latest/developerguide/mongo-apis.html
    return (
        self.is_server_version_gte_4_4() and not self.is_hosted_on_aws_documentdb()
    )

When it returns True, construct_schema_pymongo executes an aggregation like the following:

aggregations = [
    # Compute each document's BSON size; this forces a full collection scan.
    {"$addFields": {doc_size_field: {"$bsonSize": "$$ROOT"}}},
    # Keep only documents under the size limit.
    {"$match": {doc_size_field: {"$lt": max_document_size}}},
    # Drop the temporary size field from the output.
    {"$project": {doc_size_field: 0}},
]

Because this aggregation cannot use indexes, scan time grows with the number of documents in the collection. In my case, there are dozens of collections with millions of documents each, and the MongoDB ingestion process takes 15 hours to complete.
No user wants an ingestion process that takes this long, and database administrators are unlikely to tolerate slow queries that run for several minutes.
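
For anyone who wants to confirm the behavior, here is a minimal sketch (the URI, database, and collection names are placeholders) that asks the server to explain the pipeline instead of running it; a COLLSCAN stage in the resulting plan confirms the full collection scan:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
db = client["mydb"]  # placeholder database name

pipeline = [
    {"$addFields": {"doc_size": {"$bsonSize": "$$ROOT"}}},
    {"$match": {"doc_size": {"$lt": 16 * 1024 * 1024}}},
    {"$project": {"doc_size": 0}},
]

# Request the execution plan rather than the results; look for a
# "COLLSCAN" stage in the winning plan.
plan = db.command("aggregate", "mycollection", pipeline=pipeline, explain=True)
print(plan)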

Request
I urge a prompt resolution to this issue. One potential solution would be to support sorting by a specific field (such as a timestamp) in descending order, combined with sample_size. This would retrieve the most recent N documents via an indexed field, replacing the inefficient full scan; a rough sketch follows below.
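
As a rough illustration only (the sort field created_at and the sample_size value are hypothetical, not existing config options), the pipeline could look like:

sample_size = 1000  # hypothetical sample size
sort_field = "created_at"  # hypothetical indexed timestamp field

aggregations = [
    # Walk an existing index in descending order instead of scanning
    # the whole collection.
    {"$sort": {sort_field: -1}},
    # Stop after the most recent N documents; $sort + $limit lets the
    # server do a bounded top-N read when the sort field is indexed.
    {"$limit": sample_size},
]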

If there are other suggestions or solutions, I would appreciate your input. Thank you.

Haebuk added the bug label on Jan 22, 2025

Haebuk commented Jan 22, 2025

I wrote the same request on this webpage as well.
