Describe the bug
The should_add_document_size_filter option is causing inefficient scanning of MongoDB collections. Currently, this option evaluates to True for any MongoDB server at version 4.4 or higher that is not AWS DocumentDB.
def should_add_document_size_filter(self) -> bool:
    # The $bsonSize operator is only available in server versions 4.4 and above
    # and is not supported by AWS DocumentDB, so we should only add this
    # operation to the aggregation for MongoDB deployments that don't run on
    # AWS DocumentDB and whose version is 4.4 or greater.
    # https://docs.aws.amazon.com/documentdb/latest/developerguide/mongo-apis.html
    return (
        self.is_server_version_gte_4_4() and not self.is_hosted_on_aws_documentdb()
    )
This results in the execution of aggregations like the one shown in the construct_schema_pymongo function:
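A pipeline along the following lines is what ends up being executed (a minimal sketch reconstructed for illustration; the temporary field name, max_document_size, and sample_size values are placeholders, not necessarily the exact names used in the source):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    collection = client["mydb"]["mycollection"]
    max_document_size = 16 * 1024 * 1024  # hypothetical size cap, in bytes
    sample_size = 1000                    # hypothetical sample size

    aggregations = [
        # Compute the BSON size of every document. This stage is what forces a
        # full collection scan: $bsonSize must be evaluated per document and
        # cannot use any index.
        {"$addFields": {"doc_size": {"$bsonSize": "$$ROOT"}}},
        # Keep only documents below the configured size limit.
        {"$match": {"doc_size": {"$lt": max_document_size}}},
        # Strip the temporary size field before sampling.
        {"$project": {"doc_size": 0}},
        # Sample N documents from whatever survives the filter.
        {"$sample": {"size": sample_size}},
    ]
    documents = list(collection.aggregate(aggregations, allowDiskUse=True))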
Since this aggregation does not utilize indexes, the scan time increases significantly with the number of documents in the collection. In my case, there are dozens of collections with millions of documents each, resulting in a MongoDB ingestion process that takes 15 hours to complete.
No user wants an ingestion process that takes this long, and database administrators are unlikely to tolerate slow queries that take several minutes to execute.
Request
I urge a prompt resolution to this issue. One potential solution could be to support sorting by a specific field (such as a timestamp) in descending order, combined with the use of sample_size. This would allow for the retrieval of the most recent N documents based on an indexed field, thereby replacing the inefficient full scan.
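For illustration, the proposed sampling could look something like this (a minimal sketch, not a proposed implementation; the connection details, timestamp field name, and sample_size value are hypothetical, and it assumes an index exists on the sort field):

    from pymongo import MongoClient, DESCENDING

    client = MongoClient("mongodb://localhost:27017")
    collection = client["mydb"]["mycollection"]

    sample_size = 1000        # hypothetical sample size
    sort_field = "createdAt"  # hypothetical indexed timestamp field

    # $sort on an indexed field followed by $limit lets MongoDB walk the index
    # and stop after sample_size documents, instead of computing $bsonSize for
    # every document in the collection.
    documents = list(collection.aggregate([
        {"$sort": {sort_field: DESCENDING}},
        {"$limit": sample_size},
    ]))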
If there are other suggestions or solutions, I would appreciate your input. Thank you.