MongoDB Sharding

Large datasets stored on a single server can hit performance limits. One way to improve performance on a single server is to upgrade the server with higher CPU, more CPU cores, more RAM, more storage and faster I/O. This is called vertical scaling. It's expensive and has its limits. Sharding is a better and more scalable solution.

With sharding, large datasets are distributed across multiple machines. While each machine may not be very powerful, each one handles only a subset of the overall workload. Thus, high performance is achieved even on large datasets. This approach is called horizontal scaling. MongoDB does horizontal scaling via sharding.

A shard key is used to distribute data evenly across all shards. MongoDB can also reshard the data in case shards become unbalanced over time.

Discussion

• What's the architecture of a sharded cluster in MongoDB?

Sharding is applied on a MongoDB collection, which basically consists of multiple documents. Documents of a collection are distributed across a cluster of machines or nodes. We call this a sharded cluster. It consists of the following components:

• Shard: Each shard holds a subset of the data, that is, some documents of the collection. Each shard can be deployed as a replica set that consists of mongod instances.
• Router: When queries are made on a collection, only relevant shards need to be queried. One or more routers, called mongos, perform this task. Applications should never connect directly to the shards. They should always connect via the routers.
• Config Server: One or more of these store metadata and configuration settings for the cluster. Similar to the shards, config servers are replica sets.

Multiple collections can share a sharded cluster. Moreover, a sharded cluster can include collections that are not sharded, that is, collections that are simply replica sets. Unsharded collections are stored on the primary shard. Each database has a primary shard. Even for unsharded collections, clients should connect via the router.

• What's a chunk in the context of MongoDB sharding?

A chunk is a contiguous range of shard key values within a particular shard. Chunk ranges include the lower boundary and exclude the upper boundary. A chunk size is by default 64 MB and can be configured in the range 1-1024 MB. If a chunk exceeds its configured size, MongoDB automatically splits it into smaller chunks. However, automatic splitting happens only when data is inserted or updated. If necessary, manual splitting can be performed.

If data distribution becomes uneven, MongoDB's balancer process migrates chunks from one shard to another. From this perspective, chunk is the smallest unit of data that can be migrated. It's not possible to move just some documents of a chunk to another shard. However, it's possible to configure a chunk with a single shard key value.

It's possible to remove a shard. But before this happens, chunks from that shard are moved to other shards. This process is called draining.

Sometimes during incomplete migrations, a document can be present on multiple shards. This is called an orphaned document.

• What are the different ways to shard data into MongoDB?

MongoDB partitions a collection based on a shard key, which is based on one or more fields. A unique shard key value can exist in only one chunk. To shard a collection, it must have an index that starts with the shard key. This is called shard key index. A range of shard key values define a chunk. Each chunk is associated with one shard. Multiple chunks can reside on a shard.

MongoDB has two sharding strategies:

• Hashed Sharding: Hash of the shard key value is computed. Particularly for monotonically increasing or decreasing shard key, hashed sharding distributes data more evenly. MongoDB automatically computes the hashes when querying with indexes. For a compound index, only one field can be hashed.
• Ranged Sharding: Data is partitioned based on a range of shard key values. Each chunk is assigned a range.

A MongoDB router queries only shards relevant to a query. In some cases, it broadcasts the query to all shards. Selecting a suitable shard key and sharding strategy can minimize the number of shards involved in a query.

• How do I select a suitable shard key in MongoDB?

A suitable shard key has some desirable properties:

• High Cardinality: Low cardinality implies fewer unique values and hence fewer shards. For example, choosing continent field as the shard key limits the cluster to a maximum of seven shards. Instead, use a compound index by combining continent field with another field of high cardinality.
• Low Frequency: Implies all documents are accessed equally. Suppose only 20% of documents are commonly accessed. This could lead to some shards getting overloaded. Chunks holding high frequency documents may grow faster, become indivisible and result in bottlenecks. Instead, use a compound index that includes a low frequency field.
• Grows Non-Monotonically: If shard key increases monotonically, inserts end up in a chunk that has the maxKey upper bound. The shard containing this chunk presents a write bottleneck. To solve this, use hashed sharding. Another technique is that when a chunk is split, the new chunk with maxKey is located on another shard.
• Handle Common Query Patterns: A query hits only a few shards. Common queries achieve even distribution.
• Given an uneven data distribution in a MongoDB sharded cluster, how can I correct this?

It's hard to predict how a collection will grow or how the application will evolve. The data distribution may start out evenly but may become uneven over time. Therefore there's a need to continuously improve how we shard a collection.

MongoDB 4.2 introduced mutable shard key values. However, the shard key itself couldn't be changed.

MongoDB 4.4 introduced refinable shard key. One or more fields can be suffixed to the current shard key. For example, if using customer_id as the shard key didn't result in an even distribution, the shard key could be refined by adding order_id (which has a higher cardinality). Thus, the shard key becomes {"customer_id" : 1, "order_id": 1}.

MongoDB 4.4 also introduced compound hashed shard key in which one of the fields can be hashed. A compound shard key uses more than one field. In the previous example, if order_id is monotonically increasing, then just adding it to the shard key will not solve our problem. Instead, the shard key can be refined to {"customer_id" : 1, "order_id": "hashed"}.

From MongoDB 5.0, we can simply reshard the collection.

• Could you explain zones and zone sharding in MongoDB?

Zones are abstractions defined based on the shard key. They help implement data locality. For instance, data is stored on shards geographically closer to application servers. This could be for performance or regulatory reasons. We might want to isolate a subset of data on a specific set of shards. Another reason is to route data to match hardware capabilities.

A zone can be associated with one or more shards in the cluster. A shard can associate with multiple zones. It's okay for a shard not belong to any zone. When balancing data, MongoDB migrates chunks to other shards within the zone.

A zone is defined by a range of shard key values. If hashed, range is based on the hashed values. Range includes the lower bound and excludes the upper bound. If a compound shard key is used, range definition must include the prefix field. For example, given the shard key { a : 1, b : 1, c : 1 }, zone range must contain a field.

• Could you share some tips for data sharding in MongoDB?

If you're not sure if a collection requires sharding, you can leave it unsharded. When more data arrives, you may get a better idea about how to define the shard key and how many shards to use. It's possible to migrate from a replica set to a sharded cluster.

Suppose you observe jumbo chunks, uneven load distribution or reduced query performance over time. These are indications that the shard key is suboptimal. Either refine the key or reshard the collection.

MongoDB's balancer process does auto-splitting as required. In some cases, to achieve easy insertion and high write throughput, pre-splitting can be done to balance the distribution quickly.

Multiple routers can reduce latency but can increase load on config servers. More servers and components also leads to complexity. MongoDB Atlas, Ops Manager, Mongostat, and other tools can simplify the management.

Milestones

Feb
2009

MongoDB 1.0 is released. By August, this version becomes generally available for production environments.

Aug
2010

MongoDB 1.6 is released. Sharding is now production ready, thus making MongoDB horizontally scalable. A mongod instance can be upgraded to a distributed cluster without any downtime.

Aug
2012

MongoDB 2.2 is released with support for tag-aware sharding. For example, this can be used to ensure data is closest to application servers that use them. An improvement is that if the shard key is the prefix of existing index, there's no need to maintain a separate index.

Mar
2013

MongoDB 2.4 is released. This release supports hashing of index. When a hashed index is used as the shard key, this enables more even distribution of documents across shards.

Mar
2015

MongoDB 3.0 is released. Replica sets and sharded clusters can have members with different storage engines but performance can vary with workload.

Dec
2015

MongoDB 3.2 is released. Config servers previously used three mirrored mongod instances. They can now be deployed as replica sets. This means that a sharded cluster can have more than 3 (maximum 50) config servers. Master-slave replication for components of sharded cluster is deprecated. Master-slave replication is removed in MongoDB 4.0 (September 2021). Replica sets must be used instead.

Nov
2016

MongoDB 3.4 is released. We can now associate shards with one or more zones. Zone sharding can be considered a renaming of tag-aware sharding.

Nov
2017

MongoDB 3.6 is released. From this release, a shard can't be a standalone instance. It has to be a replica set. This ensure redundancy and high availability.

Aug
2019

MongoDB 4.2 is released. This supports mutable shard key values. With the exception of a shard key based on immutable _id field, any other shard key value is mutable.

Jul
2020

MongoDB 4.4 is released. Two new features are refinable shard key and compound hashed shard key. The latter supports zone sharding as well. These help split huge indivisible chunks due to the current shard key. Some documents may not have the shard key fields. These are treated as null values when distributing the documents but not when routing queries. The previous shard key size limit of 512 bytes is removed in this version.

Jul
2021

MongoDB 5.0 is released with support for resharding of a collection. This can solve the problem of uneven data distribution caused by the current shard key.

Author
No. of Edits
No. of Chats
DevCoins
5
0
1259
1851
Words
1
Likes
1490
Hits

Cite As

Devopedia. 2021. "MongoDB Sharding." Version 5, October 16. Accessed 2022-10-10. https://devopedia.org/mongodb-sharding
Contributed by
1 author

Last updated on
2021-10-16 13:59:45
• Site Map