MongoDB Aggregation
- Summary
-
Discussion
- What are the ways to do aggregation in MongoDB?
- Could you describe some aggregation pipeline operators or stages?
- What are expression operators in MongoDB aggregation?
- How do I make use of variables in MongoDB aggregation?
- What's the Aggregation Pipeline Builder?
- How's the performance of MongoDB's aggregations?
- Could you share some useful tips when building MongoDB aggregation pipelines?
- How do aggregation operators map to concepts in SQL?
- Milestones
- References
- Further Reading
- Article Stats
- Cite As
Data can be processed in the application layer or in the database layer. Aggregation framework or pipeline is the manner in which MongoDB does data processing in the database layer. MongoDB's aggregation framework can do any kind of processing. Therefore it's really a Turing complete functional language.
Often just retrieving data is inadequate. We wish to perform some computations on the data or report on live data. Sometimes we don't want to copy data and process it elsewhere. These are good reasons for using MongoDB's aggregations.
While being flexible and intuitive for developers, MongoDB's aggregation is not as performant as some other databases or tools.
Discussion
-
What are the ways to do aggregation in MongoDB? There are three ways to do aggregation in MongoDB:
- Aggregation Pipeline: We can think of data as moving along a multi-stage pipeline. Each stage does specific processing on the data. Among the basic stages are filters that obtain a subset of matching documents and transforms that reshape data into a suitable output form. Other stages do sorting, grouping, string concatenation, array processing, and so on. The aggregation pipeline can operate on a sharded collection. Some stages can use indexes for better performance.
- Map-Reduce Function: This legacy approach was deprecated in MongoDB 5.0. Aggregation pipeline is now the preferred approach since it offers better performance and usability.
- Single Purpose Aggregation Methods: From the Collection class, the methods
estimatedDocumentCount()
,count()
anddistinct()
aggregate from a single collection. While handy, these lack the flexibility and capability of the aggregation pipeline.
-
Could you describe some aggregation pipeline operators or stages? An aggregation pipeline is constructed with the
aggregate()
method of theCollection
class and takes an array of stages:db.collection.aggregate( [ { <stage> }, ... ] )
. Except for$out
,$merge
and$geoNear
, other stages can appear multiple times in a pipeline.$addFields
: Add new fields to documents. Operator$set
is an alias.$count
: Count number of documents at this stage.$group
: Takes an identifier expression for grouping input documents. If specified, applies an accumulator expression for each group. Outputs one document per distinct group.$limit
: Passes only the first n documents to the next stage.$lookup
: Does a left outer join to another collection in the same database.$match
: Filters documents based on standard MongoDB queries.$merge
: Writes the output to a collection. Must be the last stage of the pipeline.$project
: Reshapes documents by adding or removing fields. Operator$unset
is an alias for removing fields.$sample
: Randomly selects n documents.$skip
: Skips first n documents.$sort
: Reorders document stream by a specified sort key.$unwind
: Given an array field, outputs one document per array element.
-
What are expression operators in MongoDB aggregation? An expression operator has a name and takes either an array of arguments or a single argument. MongoDB has plenty of such operators. By becoming familiar with these, developers can simplify their application code by doing most of the processing in MongoDB.
There are many categories of expression operators: arithmetic, Boolean, comparison, conditional, data size, date, literal, object, set, string, text, trigonometry, type, etc. Custom expression operators include
$accumulator
and$function
. Developers can write custom JavaScript functions (MongoDB 4.4) for these.Accumulators are special expression operators that compute totals, maxima, minima and other values. They maintain their state as documents pass through the pipeline. These can be used only in some stages:
$bucket
,$bucketAuto
,$group
and$setWindowFields
. However, some accumulators can be used in other stages but not as accumulators, that is, they don't maintain state. An example of this is shown in the figure: (a) averaging happens across documents within a group; (b) averaging happens over values within a document. -
How do I make use of variables in MongoDB aggregation? Let's consider a
sales
collection with numeric fieldsprice
andtax
, and Boolean fieldapplyDiscount
. We can use aggregation to compute the total for each sale. This is where the$let
operator becomes useful.The
$let
operator has two parts:vars
block where the variables are assigned andin
expression where the variables are used for computation. Thevars
block can access variables defined externally, including system variables. If externally defined variables are modified invars
, the changed values are seen only within thein
expression. Externally, old values are retained.System variables include
NOW
,CLUSTER_TIME
,ROOT
,CURRENT
,REMOVE
,DESCEND
,PRUNE
andKEEP
. -
What's the Aggregation Pipeline Builder? MongoDB Compass is a GUI for MongoDB. Aggregation Pipeline Builder is one of the tools within Compass. Developers can add/remove stages to the pipeline graphically. In Auto Preview mode, matching documents are automatically previewed. By default, first 20 documents are previewed. If Sample Mode is enabled, a default limit of 100,000 is applied. This affects
$group
,$bucket
, and$bucketAuto
stages.When we select a pipeline operators, Compass gives the full syntax. We need to fill in the placeholders with relevant fields and expressions. If we make a mistake, useful suggestions are given to correct the same. This can be a useful feature for beginners.
It's possible to save the pipeline and reload it later within Compass. With the export feature, the pipeline can be exported in JSON or in a format compatible with a language driver. Likewise, it's possible to import into Compass a pipeline written in MongoDB Query Language.
-
How's the performance of MongoDB's aggregations? Compared to the legacy map-reduce, aggregation framework was 6x faster in a test done back in 2015. Unless you're stuck with legacy code, it makes sense to migrate to the aggregation framework.
Comparing MySQL 8.0 and MongoDB 4.0.3, it was seen that MongoDB is typically faster on more complex queries. It's faster from disk when there are no indexes, whereas MySQL is faster from RAM. BI Connector is slower for simple queries and not as fast as hand-crafted aggregation.
In a separate test, MongoDB's
insertMany()
andupdateMany()
were faster compared to their MySQL equivalents. However, a simplefind()
query was slower than MySQL'sSELECT
.On large collections of millions of documents, MongoDB's aggregation was shown to be much worse than Elasticsearch. Performance worsens with collection size when MongoDB starts using the disk due to limited system RAM.
The
$lookup
stage used without indexes can be very slow.While there are many powerful external compute engines (Spark, Hadoop, R, Python, Java, C), data transfer may become the bottleneck due to cost, network bandwidth, hardware limitations and security considerations. Therefore, use MongoDB aggregations when you can.
-
Could you share some useful tips when building MongoDB aggregation pipelines? The Aggregation Pipeline Quick Reference is a useful reference for developers.
When complex processing is required, aggregation pipelines help us break down the problem into smaller parts. We can think of each stage as solving a subproblem before handing over the intermediate result to the next stage. Pipelines also aid debugging or prototyping because we can comment out some stages.
The
$project
stage can be verbose and non-intuitive. Instead use$set
and$unset
. This makes the pipeline more maintainable when new fields are added to documents later.While MongoDB automatically does optimization, use
db.coll.explain()
to see if further optimizations are possible.Stages
$sort
and$group
are blocking, that is, they can't send their output to the next stage until all input documents are processed. Blocking stages reduce concurrency and consume more memory. Some mitigation strategies include sorting on index, using$limit
with sorting, and sorting later on a smaller subset. Group summary data only. Avoid unnecessary grouping. Use array operators instead of unwinding and regrouping. Move match filters to earlier in the pipeline. -
How do aggregation operators map to concepts in SQL? The SQL to Aggregation Mapping Chart is a useful resource. From here, we note the following SQL-MongoDB equivalence:
- Filtering: SQL uses
WHERE
andHAVING
while MongDB uses the$match
aggregation operator. - Grouping: SQL uses
GROUP BY
while MongDB uses the$group
aggregation operator. - Selecting: SQL uses
SELECT
while MongDB uses the$project
aggregation operator. - Sorting: SQL uses
ORDER BY
while MongDB uses the$sort
aggregation operator. - Limiting: SQL uses
LIMIT
while MongDB uses the$limit
aggregation operator. - Joining: Joining in MongoDB is a left outer join performed using the
$lookup
aggregation operator. - Summing/Counting: SQL's
SUM()
andCOUNT()
are implemented in MongoDB with the$sum
aggregation operator. - Combining: SQL uses
UNION ALL
while MongDB uses the$unionWith
aggregation operator.
The figure gives some examples. The
orders
collection includes an array ofitems
with each item containing fieldssku
,qty
andprice
. - Filtering: SQL uses
Milestones
2009
2009
2012
MongoDB 2.2 is released. This introduces the Aggregation Framework. This is exposed via the aggregate
command and the aggregate()
helper in the shell. Compared to earlier implementation of aggregations, this is far more intuitive, powerful, efficient and scalable. It soon becomes the go-to tool for developers. However, MongoDB continues to support Map-Reduce for legacy reasons.
2015
2016
2017
MongoDB 3.6 is released. New stages include $currentOp
, $listSessions
and $listLocalSessions
. New aggregation operators include $arrayToObject
, $objectToArray
, $mergeObjects
, $dateFromString
, $dateFromParts
and $dateToParts
. For date operators, time zones are supported. Aggregation variable REMOVE
is added.
2018
MongoDB 4.0 is released. New aggregation operators for type conversion and trimming strings are added. MongoDB Compass 1.14 is released with Aggregation Pipeline Builder. In Compass 1.15 (Aug 2018), import/export feature is added to the builder. In Compass 1.16 (Nov 2018), collation support is added. In Compass 1.19 (Aug 2019), views can be created from pipeline results. Builder comes with new settings of sample size, number of documents to preview, and a maximum timeout. In Compass 1.20 (Dec 2019), $set
, $unset
and $replaceWith
pipeline operators are added. In Compass 1.26.1 (Apr 2021), functions are allowed in pipelines.
2019
MongoDB 4.2 is released. Stage $merge
is added. This helps create on-demand materialized views. Other new stages include $planCacheStats
, $replaceWith
, $set
and $unset
. Trigonometry expressions are added. New regex expressions include $regexFind
, $regexFindAll
, and $regexMatch
. In earlier versions, only $regex
query operator could be used in $match
stage. New variables include NOW
and CLUSTER_TIME
. Aggregation pipeline can be used findAndModify
and update
commands.
2020
MongoDB 4.4 is released. Stage $unionWith
can combine pipeline results from multiple collections. Operators $accumulator
and $function
allow custom aggregation expressions instead of using mapReduce
and $where
. Other new operators include $binarySize
, $bsonSize
, $first
(first array item), $last
, $isNumber
, $replaceOne
and $replaceAll
. From this version, $merge
can output to the same collection that's being aggregated; $out
can output to a collection in a different database.
2021
MongoDB 5.0 is released. Pipeline stage $setWindowFields
is added. Some date aggregation operators are added. Aggregation operators $rand
and $sampleRate
are added. For better performance, comparison operators with $expr
use indexes. Multiple input expressions are possible for $ifNull
. For better readability, let
can be used to define and use variables within the pipeline. Concise correlated subqueries are supported for $lookup
.
References
- Develop Paper. 2019. "Mongodb series — deep understanding of mongodb aggregation." Develop Paper, November 30. Accessed 2021-10-16.
- Done, Paul. 2021. "Practical MongoDB Aggregations." v3.00, MongoDB, Inc, October. Accessed 2021-10-10.
- Henry, Onyancha Brian. 2019. "MongoDB vs MySQL NoSQL - Why Mongo is Better." Blog, Severalnines, February 28. Accessed 2021-10-16.
- Kukic, Ado. 2020. "How to Use Custom Aggregation Expressions in MongoDB 4.4." How To, MongoDB, July 28. Accessed 2021-10-16.
- Marturana, Luca. 2015. "MongoDB showdown: Aggregate vs map-reduce." Blog, Sysdig, March 12. Accessed 2021-10-16.
- Maréchal, Le. 2015. "MongoDB vs. Elasticsearch: The Quest of the Holy Performances." Blog, Quickslab, March 23. Accessed 2021-10-16.
- Mastering JS Weekly. 2020. "Why You (Probably) Shouldn't Use the MongoDB Aggregation Framework." Issue#37, Mastering JS Weekly, July 16. Accessed 2021-10-16.
- Melnik, Grigori. 2018. "Introducing the Aggregation Pipeline Builder in MongoDB Compass." Blog, MongoDB, May 30. Updated 2019-03-11. Accessed 2021-10-17.
- MongoDB. 2009. "1.0 GA Released." Blog, MongoDB, August 27. Accessed 2021-10-16.
- MongoDB Docs. 2012. "Release Notes for MongoDB 2.2." August. Accessed 2021-10-16.
- MongoDB Docs. 2015. "Release Notes for MongoDB 3.2." December 8. Accessed 2021-10-16.
- MongoDB Docs. 2016. "Release Notes for MongoDB 3.4." November 29. Accessed 2021-10-16.
- MongoDB Docs. 2017. "Release Notes for MongoDB 3.6." November. Accessed 2021-10-16.
- MongoDB Docs. 2018. "Release Notes for MongoDB 4.0." August. Accessed 2021-10-16.
- MongoDB Docs. 2019. "Release Notes for MongoDB 4.2." August. Accessed 2021-10-16.
- MongoDB Docs. 2020. "Release Notes for MongoDB 4.4." June. Accessed 2021-10-16.
- MongoDB Docs. 2021a. "Aggregation." Documentation, MongoDB 5.0. Accessed 2021-10-16.
- MongoDB Docs. 2021b. "Aggregation Pipeline Quick Reference." Documentation, MongoDB 5.0. Accessed 2021-10-16.
- MongoDB Docs. 2021c. "SQL to Aggregation Mapping Chart." Documentation, MongoDB 5.0. Accessed 2021-10-16.
- MongoDB Docs. 2021d. "Aggregation Pipeline Builder." Documentation, MongoDB Compass 1.28.4. Accessed 2021-10-17.
- MongoDB Docs. 2021e. "Import Pipeline from Text." Documentation, MongoDB Compass 1.28.4. Accessed 2021-10-17.
- MongoDB Docs. 2021f. "MongoDB Compass: Release Notes." MongoDB Compass 1.28.4, August 30. Accessed 2021-10-17.
- MongoDB Docs. 2021g. "SQL to Aggregation Mapping Chart." Documentation, MongoDB 5.0. Accessed 2021-10-17.
- MongoDB Docs. 2021h. "$avg (aggregation)." Documentation, MongoDB 5.0. Accessed 2021-11-04.
- MongoDB Docs. 2021i. "$let (aggregation)." Documentation, MongoDB 5.0. Accessed 2021-11-04.
- MongoDB Docs. 2021j. "Release Notes for MongoDB 5.0." July 13. Accessed 2021-10-16.
- Page, John. 2018. "MongoDB Aggregation Performance." Slides, MongoDB Europe'18, via SlideShare, November 13. Accessed 2021-10-16.
- Weinberger, Claudius. 2018. "NoSQL Performance Benchmark 2018 – MongoDB, PostgreSQL, OrientDB, Neo4j and ArangoDB." ArangoDB, February 14. Accessed 2021-10-16.
Further Reading
- Paul, Rohan. 2019. "Aggregation in Mongodb." Medium, April 11. Accessed 2021-10-16.
- Develop Paper. 2019. "Mongodb series — deep understanding of mongodb aggregation." Develop Paper, November 30. Accessed 2021-10-16.
- MongoDB Docs. 2021a. "Aggregation." Documentation, MongoDB 5.0. Accessed 2021-10-16.
- MongoDB Docs. 2021b. "Aggregation Pipeline Quick Reference." Documentation, MongoDB 5.0. Accessed 2021-10-16.
- Done, Paul. 2021. "Practical MongoDB Aggregations." v3.00, MongoDB, Inc, October. Accessed 2021-10-10.
- Maréchal, Le. 2015. "MongoDB vs. Elasticsearch: The Quest of the Holy Performances." Blog, Quickslab, March 23. Accessed 2021-10-16.
Article Stats
Cite As
See Also
- MongoDB View
- MongoDB Query Language
- MongoDB Query Optimization
- Data Modelling with MongoDB
- Data Pipeline
- Data-Oriented Programming