Inefficient MongoDB Scanning with should_add_document_size_filter #12427

Open
Haebuk opened this issue Jan 22, 2025 · 1 comment
Labels
bug Bug report

Comments

Haebuk commented Jan 22, 2025

Describe the bug
The should_add_document_size_filter option causes inefficient scanning of MongoDB collections. It currently returns True for every MongoDB server at version 4.4 or higher that is not hosted on AWS DocumentDB:

def should_add_document_size_filter(self) -> bool:
    # The $bsonSize operator is only available in server version 4.4 and
    # above, and is not supported by AWS DocumentDB, so we only add it to
    # the aggregation for MongoDB deployments that are at least 4.4 and
    # not running on AWS DocumentDB.
    # https://docs.aws.amazon.com/documentdb/latest/developerguide/mongo-apis.html
    return (
        self.is_server_version_gte_4_4() and not self.is_hosted_on_aws_documentdb()
    )

When it returns True, construct_schema_pymongo executes an aggregation like the following:

aggregations = [
    # Compute each document's BSON size; this forces a full collection scan.
    {"$addFields": {doc_size_field: {"$bsonSize": "$$ROOT"}}},
    # Keep only documents under the size limit.
    {"$match": {doc_size_field: {"$lt": max_document_size}}},
    # Drop the temporary size field from the output.
    {"$project": {doc_size_field: 0}},
]

Because this aggregation cannot use indexes, scan time grows with the number of documents in the collection. In my case, there are dozens of collections with millions of documents each, and the MongoDB ingestion process takes 15 hours to complete.
No user wants an ingestion process that takes this long, and database administrators are unlikely to tolerate slow queries that run for several minutes.
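
For anyone who wants to confirm the behavior, here is a minimal sketch (the URI, database, and collection names are placeholders) that asks the server to explain the pipeline instead of running it; a COLLSCAN stage in the resulting plan confirms the full collection scan:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
db = client["mydb"]  # placeholder database name

pipeline = [
    {"$addFields": {"doc_size": {"$bsonSize": "$$ROOT"}}},
    {"$match": {"doc_size": {"$lt": 16 * 1024 * 1024}}},
    {"$project": {"doc_size": 0}},
]

# Request the execution plan rather than the results; look for a
# "COLLSCAN" stage in the winning plan.
plan = db.command("aggregate", "mycollection", pipeline=pipeline, explain=True)
print(plan)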

Request
I urge a prompt resolution to this issue. One potential solution would be to support sorting by a specific field (such as a timestamp) in descending order, combined with sample_size. This would retrieve the most recent N documents via an indexed field, replacing the inefficient full scan; a rough sketch follows below.
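
As a rough illustration only (the sort field created_at and the sample_size value are hypothetical, not existing config options), the pipeline could look like:

sample_size = 1000  # hypothetical sample size
sort_field = "created_at"  # hypothetical indexed timestamp field

aggregations = [
    # Walk an existing index in descending order instead of scanning
    # the whole collection.
    {"$sort": {sort_field: -1}},
    # Stop after the most recent N documents; $sort + $limit lets the
    # server do a bounded top-N read when the sort field is indexed.
    {"$limit": sample_size},
]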

If there are other suggestions or solutions, I would appreciate your input. Thank you.

Haebuk added the bug label on Jan 22, 2025

Haebuk commented Jan 22, 2025

I wrote the same request on this webpage as well.
