Writing Effective Queries in MongoDB

MongoDB is currently the most popular NoSQL database, with flexible schemas suitable for big data applications. However, MongoDB’s query language can be challenging to learn compared to SQL, and writing effective queries in MongoDB requires practice.

In this article, I’ll share a recent MongoDB query issue I faced and the lessons learned from my mistakes.

TL;DR

  • Avoid writing nested pipelines/aggregations—they can make code unreadable and, in some cases, inefficient.
  • Avoid using nested fields in collections unless absolutely necessary, as they make lookups more challenging and potentially expensive.
  • Use MongoDB’s `explain` functionality to understand the winning plan of your query—the most efficient way MongoDB will process it.
  • Leverage MongoDB’s search capabilities (Atlas Search) to optimise your queries.
The Problem

My task was to match each incoming message's phone number against the contacts collection, where the `Tel` field may contain the number as a substring. My first attempt used a nested pipeline inside `$lookup`, with `$indexOfCP` doing the substring test:

cursor = db.incoming_messages.aggregate([
    {
        '$lookup': {
            'from': "contacts",
            'let': { 'phone': "$phone" },
            'pipeline': [
                { '$match': { '$expr': { '$gt': [{ '$indexOfCP': ["$Tel", "$$phone"] }, -1] } } }
            ],
            'as': 'contact'
        }
    }
])

Although this aggregation worked, it took minutes to process just 50 documents, scanning the entire contacts collection for each one. That is woefully inefficient for larger datasets.

MongoDB has a helpful `explain` facility for inspecting a query's winning plan. Running it in Jupyter with Python 3, I received the following results:

The winning plan was COLLSCAN, indicating a full collection scan, despite both collections being indexed. The query took about 20 seconds on an input collection of just 10 documents, which is simply unacceptable for large datasets.
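The `explain` call itself isn't shown above; a minimal PyMongo sketch, assuming a `db` handle and the aggregation from earlier, wraps the `aggregate` command in an `explain` database command:

```python
# Sketch: profiling the aggregation via the explain database command.
# Assumptions: a PyMongo `db` handle; collection and field names as above.
pipeline = [
    {"$lookup": {
        "from": "contacts",
        "let": {"phone": "$phone"},
        "pipeline": [
            {"$match": {"$expr": {"$gt": [{"$indexOfCP": ["$Tel", "$$phone"]}, -1]}}}
        ],
        "as": "contact",
    }}
]

# The explain command wraps the aggregate command and returns the query plan
# instead of executing the pipeline.
explain_cmd = {
    "explain": {"aggregate": "incoming_messages", "pipeline": pipeline, "cursor": {}},
    "verbosity": "queryPlanner",
}

# plan = db.command(explain_cmd)                 # requires a live connection
# print(plan["queryPlanner"]["winningPlan"])     # a COLLSCAN stage in my case
```

The commented-out `db.command` call is where the command is actually sent; everything above it is just building the command document.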

The Solution

After some research, I discovered MongoDB's text search capabilities, including Atlas Search, a native engine that simplifies building fast, efficient searches on MongoDB. The fix below uses a classic text index, which works on any MongoDB deployment.

The first step was to create a text index on the target collection's search fields. According to MongoDB's documentation, a collection can have at most one text index; however, that index can include multiple fields and be part of a compound index. In Python, creating an index on the Tel field of the contacts collection looks like this:

import pymongo  # pymongo.TEXT is the string "text"

contacts_collection.create_index([("Tel", pymongo.TEXT)])
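Once the text index exists, the collection can be searched with the `$text` operator. This is a hedged sketch with a made-up phone number; it assumes numbers are stored as whole tokens that the text index can match:

```python
# Sketch: querying the text index with $text. The phone number here is
# hypothetical, and matching assumes it is tokenized as a single term.
text_filter = {"$text": {"$search": "07123456789"}}

# matches = contacts_collection.find(text_filter)  # requires a live connection
# To rank results by relevance, project the text score as well:
# contacts_collection.find(text_filter, {"score": {"$meta": "textScore"}})
```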

With this index in place, a lookup query on 4,500 documents against a 350,000-document collection completed in 3.8 seconds, much faster than before.

Profiling this query using MongoDB’s explain method showed the following result:

Now an IXSCAN (index scan) replaces the full collection scan (COLLSCAN), reducing the search footprint and enabling faster query response times.
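To confirm programmatically that the index is being used, you can walk the explain output down to its leaf stage. A small sketch, assuming the documented explain shape with the plan under `queryPlanner.winningPlan` (the sample document below is illustrative, not real output):

```python
# Sketch: descending through an explain document's inputStage links to find
# the leaf stage name. The sample plan is made up for illustration.
sample_plan = {
    "queryPlanner": {
        "winningPlan": {
            "stage": "FETCH",
            "inputStage": {"stage": "IXSCAN", "indexName": "Tel_text"},
        }
    }
}

def leaf_stage(plan: dict) -> str:
    """Follow inputStage links from the winning plan to its leaf stage."""
    stage = plan["queryPlanner"]["winningPlan"]
    while "inputStage" in stage:
        stage = stage["inputStage"]
    return stage["stage"]

# leaf_stage(sample_plan) -> "IXSCAN": the index is used.
# A "COLLSCAN" leaf would mean a full collection scan.
```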
