MongoDB Sharding
Summary
Large datasets stored on a single server can hit performance limits. One way to improve performance is to upgrade the server with a faster CPU, more CPU cores, more RAM, more storage and faster I/O. This is called vertical scaling. It's expensive and has its limits. Sharding is a more scalable solution.
With sharding, large datasets are distributed across multiple machines. While each machine may not be very powerful, each one handles only a subset of the overall workload. Thus, high performance is achieved even on large datasets. This approach is called horizontal scaling. MongoDB does horizontal scaling via sharding.
A shard key is used to distribute data evenly across all shards. MongoDB can also reshard the data in case shards become unbalanced over time.
Discussion
What's the architecture of a sharded cluster in MongoDB?
Sharding is applied to a MongoDB collection, which consists of multiple documents. Documents of a collection are distributed across a cluster of machines or nodes. We call this a sharded cluster. It consists of the following components:
- Shard: Each shard holds a subset of the data, that is, some documents of the collection. Each shard can be deployed as a replica set that consists of mongod instances.
- Router: When queries are made on a collection, only relevant shards need to be queried. One or more routers, called mongos, perform this task. Applications should never connect directly to the shards. They should always connect via the routers.
- Config Server: Config servers store metadata and configuration settings for the cluster. Like the shards, config servers are deployed as a replica set.
Multiple collections can share a sharded cluster. Moreover, a sharded cluster can include collections that are not sharded. Each database has a primary shard, and its unsharded collections are stored on that shard. Even for unsharded collections, clients should connect via the router.
What's a chunk in the context of MongoDB sharding?
A chunk is a contiguous range of shard key values within a particular shard. Chunk ranges include the lower boundary and exclude the upper boundary. The default chunk size is 64 MB and can be configured in the range 1-1024 MB. If a chunk exceeds its configured size, MongoDB automatically splits it into smaller chunks. However, automatic splitting happens only when data is inserted or updated. If necessary, manual splitting can be performed.
If data distribution becomes uneven, MongoDB's balancer process migrates chunks from one shard to another. From this perspective, a chunk is the smallest unit of data that can be migrated. It's not possible to move just some documents of a chunk to another shard. However, a chunk's range can be as narrow as a single shard key value.
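The half-open chunk ranges and the split operation described above can be sketched in a few lines of Python. This is only a toy model, not MongoDB's implementation: the Chunk class, the find_chunk and split_chunk helpers, the shard names and the numeric shard key are all assumptions made for illustration, and infinities stand in for minKey and maxKey.

```python
from bisect import bisect_right
from dataclasses import dataclass

MIN_KEY = float("-inf")  # stand-in for MongoDB's minKey
MAX_KEY = float("inf")   # stand-in for MongoDB's maxKey


@dataclass
class Chunk:
    lower: float  # inclusive lower bound
    upper: float  # exclusive upper bound
    shard: str    # shard that currently owns the chunk


def find_chunk(chunks, key):
    """Return the chunk whose half-open range [lower, upper) contains key.

    Assumes chunks are sorted by lower bound and cover the key space
    without gaps.
    """
    lowers = [c.lower for c in chunks]
    return chunks[bisect_right(lowers, key) - 1]


def split_chunk(chunks, index, split_point):
    """Split chunks[index] at split_point; both halves stay on the same shard."""
    old = chunks[index]
    if not (old.lower < split_point < old.upper):
        raise ValueError("split point must fall strictly inside the chunk")
    left = Chunk(old.lower, split_point, old.shard)
    right = Chunk(split_point, old.upper, old.shard)
    return chunks[:index] + [left, right] + chunks[index + 1:]


chunks = [
    Chunk(MIN_KEY, 100, "shardA"),
    Chunk(100, 200, "shardB"),
    Chunk(200, MAX_KEY, "shardC"),
]

print(find_chunk(chunks, 100))        # key 100 sits on a lower boundary
print(split_chunk(chunks, 1, 150))    # the middle chunk becomes two chunks
```

A key equal to a boundary belongs to the chunk that starts at that boundary, which is what "includes the lower boundary and excludes the upper boundary" means in practice. A split changes only metadata in this model; no chunk changes shard until a migration moves it.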
It's possible to remove a shard. But before this happens, chunks from that shard are moved to other shards. This process is called draining.
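As a rough illustration of draining, the sketch below reassigns every chunk of a departing shard to whichever remaining shard currently holds the fewest chunks. The drain function, the chunk identifiers and the least-loaded policy are assumptions for this example; the real balancer applies its own rules.

```python
def drain(chunk_owner, shards, leaving):
    """Move every chunk off `leaving` onto the least-loaded remaining shard.

    chunk_owner maps chunk id -> shard name. A new mapping is returned;
    the input is left unchanged.
    """
    load = {s: 0 for s in shards if s != leaving}
    if not load:
        raise ValueError("no other shard available to receive chunks")
    for shard in chunk_owner.values():
        if shard != leaving:
            load[shard] += 1

    result = dict(chunk_owner)
    for chunk, shard in chunk_owner.items():
        if shard == leaving:
            target = min(load, key=load.get)  # shard with the fewest chunks
            result[chunk] = target
            load[target] += 1
    return result


owners = {"c1": "A", "c2": "A", "c3": "B", "c4": "C", "c5": "C", "c6": "C"}
print(drain(owners, ["A", "B", "C"], leaving="C"))
```

Once no chunk maps to the departing shard, it holds no sharded data and can be removed.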
If a migration fails or its cleanup is incomplete, a copy of a document can remain on a shard that no longer owns its chunk, so the document is present on multiple shards. Such a copy is called an orphaned document.
What are the different ways to shard data into MongoDB?
MongoDB partitions a collection based on a shard key, which is made up of one or more fields. A given shard key value belongs to exactly one chunk. To shard a collection, it must have an index that starts with the shard key. This is called the shard key index. A range of shard key values defines a chunk. Each chunk is associated with one shard. Multiple chunks can reside on a shard.
MongoDB has two sharding strategies:
- Hashed Sharding: A hash of the shard key value is computed, and data is partitioned by ranges of the hashed values. Particularly for a monotonically increasing or decreasing shard key, hashed sharding distributes data more evenly. MongoDB automatically computes the hashes when resolving queries that use hashed indexes. In a compound shard key, only one field can be hashed.
- Ranged Sharding: Data is partitioned based on a range of shard key values. Each chunk is assigned a range.
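The difference between the two strategies shows up clearly with a monotonically increasing key. The following sketch is a simplified model under stated assumptions: four hypothetical shards, fixed range boundaries, and MD5 used purely as a stand-in for the server's hash function.

```python
import hashlib
from collections import Counter

SHARDS = ["shardA", "shardB", "shardC", "shardD"]


def ranged_shard(key, boundaries):
    """Ranged placement: boundaries[i] is the exclusive upper bound of shard i.

    The last shard takes everything from the final boundary upward.
    """
    for i, upper in enumerate(boundaries):
        if key < upper:
            return SHARDS[i]
    return SHARDS[-1]


def hashed_shard(key):
    """Hashed placement: hash the key, then range-partition the hash space.

    MD5 is only an illustrative stand-in for the real hash function.
    """
    digest = hashlib.md5(str(key).encode()).digest()
    hashed = int.from_bytes(digest[:8], "big")   # 64-bit hash value
    return SHARDS[(hashed * len(SHARDS)) >> 64]  # equal slices of hash space


# Boundaries were chosen when keys 0..2999 existed; new keys keep increasing.
boundaries = [1000, 2000, 3000]
new_keys = range(3000, 4000)

ranged = Counter(ranged_shard(k, boundaries) for k in new_keys)
hashed = Counter(hashed_shard(k) for k in new_keys)
print("ranged:", dict(ranged))
print("hashed:", dict(hashed))
```

With ranged placement every new key lands on the last shard, while hashing scatters neighbouring keys across all shards. The trade-off is that range queries on a hashed key can no longer be served by a few adjacent chunks.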
A MongoDB router queries only shards relevant to a query. In some cases, it broadcasts the query to all shards. Selecting a suitable shard key and sharding strategy can minimize the number of shards involved in a query.
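To make targeted versus broadcast queries concrete, here is a minimal sketch of how a router could decide which shards to contact. The chunk map, the customer_id shard key and the target_shards function are hypothetical, and only equality and simple $gte/$lt range conditions are handled.

```python
from bisect import bisect_left, bisect_right

# Hypothetical chunk map as a router might cache it:
# sorted inclusive lower bounds and the shard that owns each chunk.
LOWER_BOUNDS = [float("-inf"), 100, 200, 300]
OWNERS = ["shardA", "shardB", "shardC", "shardA"]


def target_shards(query, shard_key="customer_id"):
    """Return the set of shards a query must visit.

    Handles equality ({"customer_id": 150}) and half-open ranges
    ({"customer_id": {"$gte": 120, "$lt": 250}}) on the shard key.
    A query that does not mention the shard key goes to every shard.
    """
    if shard_key not in query:
        return set(OWNERS)  # broadcast (scatter-gather)

    cond = query[shard_key]
    if isinstance(cond, dict):
        first = bisect_right(LOWER_BOUNDS, cond["$gte"]) - 1
        last = bisect_left(LOWER_BOUNDS, cond["$lt"]) - 1
    else:
        first = last = bisect_right(LOWER_BOUNDS, cond) - 1
    return set(OWNERS[first:last + 1])


print(target_shards({"customer_id": 150}))
print(target_shards({"customer_id": {"$gte": 120, "$lt": 250}}))
print(target_shards({"status": "open"}))
```

The last query filters on a field that is not part of the shard key, so the router cannot rule out any shard. Choosing a shard key that appears in the most common queries keeps most of them in the first two categories.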
How do I select a suitable shard key in MongoDB?
A suitable shard key has some desirable properties:
- High Cardinality: Low cardinality implies fewer unique values, hence fewer chunks and fewer shards that can be used effectively. For example, choosing the continent field as the shard key limits the cluster to a maximum of seven shards. Instead, use a compound shard key that combines the continent field with another field of high cardinality.
- Low Frequency: Frequency is how often a given shard key value occurs in the data. Suppose most documents share only a small subset of the possible values. The chunks holding those high-frequency values may grow faster, become indivisible and result in bottlenecks on some shards. Instead, use a compound shard key that includes a low-frequency field.
- Grows Non-Monotonically: If the shard key increases monotonically, inserts end up in the chunk that has maxKey as its upper bound. The shard containing this chunk becomes a write bottleneck. To solve this, use hashed sharding. Another technique is that when a chunk is split, the new chunk with the maxKey upper bound is located on another shard.
- Matches Common Query Patterns: Common queries should include the shard key so that each query hits only a few shards, while the overall query load stays evenly distributed across shards.
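Cardinality and frequency can be estimated from a sample of documents before committing to a shard key. The sketch below assumes a hypothetical sample of orders and a helper named key_stats; it simply counts distinct key values and measures how dominant the most common value is.

```python
from collections import Counter


def key_stats(documents, fields):
    """Summarise a candidate shard key over a sample of documents.

    Returns (cardinality, top_share): the number of distinct key values and
    the fraction of documents that carry the single most common value.
    """
    values = Counter(tuple(doc[f] for f in fields) for doc in documents)
    cardinality = len(values)
    top_share = values.most_common(1)[0][1] / len(documents)
    return cardinality, top_share


# Hypothetical sample: 1,000 orders, 70% of them from one continent.
sample = [
    {"continent": "Asia" if i % 10 < 7 else "Europe", "order_id": i}
    for i in range(1000)
]

print(key_stats(sample, ["continent"]))
print(key_stats(sample, ["continent", "order_id"]))
```

A low cardinality together with a high top share points to chunks that cannot be split further. Adding a high-cardinality field to the key improves both measures, although a monotonically increasing field such as order_id may still call for hashing.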
Given an uneven data distribution in a MongoDB sharded cluster, how can I correct this?
It's hard to predict how a collection will grow or how the application will evolve. The data distribution may start out even but become uneven over time. Therefore there's a need to continuously improve how we shard a collection.
MongoDB 4.2 introduced mutable shard key values. However, the shard key itself couldn't be changed.
MongoDB 4.4 introduced refinable shard keys. One or more fields can be suffixed to the current shard key. For example, if using customer_id as the shard key didn't result in an even distribution, the shard key could be refined by adding order_id (which has a higher cardinality). Thus, the shard key becomes {"customer_id" : 1, "order_id": 1}.
MongoDB 4.4 also introduced compound hashed shard keys, in which one of the fields can be hashed. A compound shard key uses more than one field. In the previous example, if order_id is monotonically increasing, then just adding it to the shard key will not solve our problem. Instead, the shard key can be refined to {"customer_id" : 1, "order_id": "hashed"}.
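The rule that a refined key only adds suffix fields can be checked before a command is issued. The sketch below is an illustration under assumptions: is_valid_refinement and refine_command are hypothetical helpers and shop.orders is a made-up namespace. The command document follows the shape of MongoDB's refineCollectionShardKey admin command, but nothing is sent to a server here.

```python
def is_valid_refinement(current_key, proposed_key):
    """True if proposed_key extends current_key with one or more suffix fields.

    Keys are dicts in field order, e.g. {"customer_id": 1}. The existing
    fields must stay first, in the same order and with the same values.
    """
    current = list(current_key.items())
    proposed = list(proposed_key.items())
    return len(proposed) > len(current) and proposed[:len(current)] == current


def refine_command(namespace, current_key, proposed_key):
    """Build the document for the refineCollectionShardKey admin command."""
    if not is_valid_refinement(current_key, proposed_key):
        raise ValueError("new key must keep the current key as its prefix")
    return {"refineCollectionShardKey": namespace, "key": proposed_key}


current = {"customer_id": 1}
print(refine_command("shop.orders", current, {"customer_id": 1, "order_id": 1}))
print(is_valid_refinement(current, {"order_id": 1, "customer_id": 1}))
```

With a driver such as PyMongo, a document like this would typically be passed to the admin database's command method on a connection to a mongos. That step needs a live cluster and is left out of the sketch.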
Could you explain zones and zone sharding in MongoDB?
Zones are abstractions defined based on the shard key. They help implement data locality. For instance, data can be stored on shards geographically closer to the application servers that use it. This could be for performance or regulatory reasons. We might want to isolate a subset of data on a specific set of shards. Another reason is to route data to match hardware capabilities.
A zone can be associated with one or more shards in the cluster. A shard can be associated with multiple zones. It's okay for a shard not to belong to any zone. When balancing data, MongoDB migrates chunks covered by a zone only to shards associated with that zone.
A zone is defined by a range of shard key values. If the shard key is hashed, the range is based on the hashed values. The range includes the lower bound and excludes the upper bound. If a compound shard key is used, the range definition must include the prefix field. For example, given the shard key { a : 1, b : 1, c : 1 }, the zone range must contain the a field.
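A zone lookup can be modelled as a range check on the prefix field of the shard key. In the sketch below the zone names, ranges and shard names are hypothetical, and keys not covered by any zone are assumed to be allowed on every shard.

```python
# Hypothetical zone ranges on the prefix field `a` of the shard key
# {a: 1, b: 1, c: 1}: (inclusive lower bound, exclusive upper bound, zone).
ZONE_RANGES = [
    (0, 100, "zoneEast"),
    (100, 200, "zoneWest"),
]
ZONE_SHARDS = {"zoneEast": {"shard0", "shard1"}, "zoneWest": {"shard2"}}
ALL_SHARDS = {"shard0", "shard1", "shard2", "shard3"}


def eligible_shards(a_value):
    """Shards allowed to hold a document whose shard key starts with a_value."""
    for lower, upper, zone in ZONE_RANGES:
        if lower <= a_value < upper:  # lower inclusive, upper exclusive
            return ZONE_SHARDS[zone]
    return ALL_SHARDS  # assumption: keys outside every zone may go anywhere


print(eligible_shards(99))
print(eligible_shards(100))
print(eligible_shards(250))
```

Here shard3 belongs to no zone, so it only ever receives data whose key falls outside both ranges.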
Could you share some tips for data sharding in MongoDB?
If you're not sure if a collection requires sharding, you can leave it unsharded. When more data arrives, you may get a better idea about how to define the shard key and how many shards to use. It's possible to convert a replica set to a sharded cluster.
Suppose you observe jumbo chunks, uneven load distribution or reduced query performance over time. These are indications that the shard key is suboptimal. Either refine the key or reshard the collection.
MongoDB splits chunks automatically as data grows, and the balancer process migrates them between shards as required. In some cases, to ease insertion and achieve high write throughput from the start, chunks can be pre-split and spread across shards before data is loaded.
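Pre-splitting amounts to choosing split points up front and assigning the resulting chunks to shards. The sketch below assumes a numeric shard key with a known range and uses the hypothetical helpers split_points and presplit_plan; it only computes a plan and does not talk to a cluster.

```python
def split_points(low, high, num_chunks):
    """Return num_chunks - 1 evenly spaced split points inside [low, high)."""
    if num_chunks < 2:
        return []
    step = (high - low) / num_chunks
    return [low + round(step * i) for i in range(1, num_chunks)]


def presplit_plan(low, high, shards, chunks_per_shard=2):
    """Plan chunks as (lower, upper, shard), handing them out round-robin."""
    points = split_points(low, high, len(shards) * chunks_per_shard)
    bounds = [low] + points + [high]
    return [
        (bounds[i], bounds[i + 1], shards[i % len(shards)])
        for i in range(len(bounds) - 1)
    ]


for chunk in presplit_plan(0, 1_000_000, ["shardA", "shardB"]):
    print(chunk)
```

In a real cluster the first and last chunks would extend to minKey and maxKey, and the plan would be carried out with the cluster's split and chunk-move operations. Even spacing works only if keys are spread evenly over the range; skewed data needs split points derived from the actual distribution.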
Multiple routers can reduce latency but can increase load on config servers. More servers and components also lead to complexity. MongoDB Atlas, Ops Manager, mongostat, and other tools can simplify management.
Milestones
2009
2010
2012
2013
2015
2015
MongoDB 3.2 is released. Config servers previously used three mirrored mongod instances. They can now be deployed as replica sets. This means that a sharded cluster can have more than 3 (maximum 50) config servers. Master-slave replication for components of a sharded cluster is deprecated. Master-slave replication is removed in MongoDB 4.0 (June 2018). Replica sets must be used instead.
2016
2017
2019
2020
MongoDB 4.4 is released. Two new features are refinable shard keys and compound hashed shard keys. The latter support zone sharding as well. These features help break up large chunks that are indivisible under the current shard key. Documents may lack the shard key fields. Missing fields are treated as null values when distributing the documents but not when routing queries. The previous shard key size limit of 512 bytes is removed in this version.
Further Reading
- Zola, William. 2015. "On Selecting a Shard Key for MongoDB." Blog, MongoDB, June 18. Updated 2019-09-03. Accessed 2021-10-12.
- MongoDB Docs. 2021c. "Sharding." Documentation, MongoDB 5.0. Accessed 2021-10-12.
- Wickramasinghe, Shanika. 2020. "MongoDB Sharding: Concepts, Examples & Tutorials." Blog, BMC Software, November 11. Accessed 2021-10-12.
- Chhabra, Manik. 2021. "Implementing MongoDB Sharding: 6 Easy Steps." Hevo Data, March 12. Accessed 2021-10-12.
- Alger, Ken W. 2017. "Choosing a good Shard Key in MongoDB." Blog, June 12. Accessed 2021-10-12.
- Kaplan, Moshe. 2018. "10 Things You Should Know About MongoDB Sharding." Database Zone, DZone, November 26. Accessed 2021-10-12.
See Also
- Database Scaling
- Database Partitioning
- Resharding Databases
- NoSQL Databases
- MongoDB
- MongoDB Query Language