Database Sharding

Article Info

Contributed by
3 authors

Last updated on
2020-10-31 03:30:22

Resharding Databases
Database Scaling
Database Partitioning
Database Replication
Distributed Database
ACID Transactions

Article Versions

4 2020-10-31 03:30:22
2274,2273 4,2274

By arvindpdmn

Perfect. No changes during review. Publishing.
3 2020-10-30 17:45:19
2273,2272 3,2273

By lokesh.rawat

Added content and images.
2 2020-10-28 10:14:43
2272,947 2,2272

By lokesh.rawat

Adding some questions on the topic. Add extra tag.
1 2018-09-11 08:26:22
1,947

By tintin

First version

Chat Room

Submitting ...

You are editing an existing chat message.
2020-10-30 17:45:19
-

By arvindpdmn

[Draft Approval Comment]
Approving draft. Will do a full review before publishing.
2020-10-28 10:14:43
-

By arvindpdmn

[Draft Approval Comment]
Good set of questions.

An SQL database table split into three shards. Source: Oracle Docs 2020b, fig. 49-1.

In the age of Big Data, popular social media platforms and IoT sensor data, datasets are huge. If such a dataset is stored in a single database, queries become slow. Overall system performance suffers. This is when database sharding becomes useful.

A single logical dataset is split into multiple databases that are then distributed across multiple machines. When a query is made, only one or a few machines may get involved in processing the query. Sharding enables effective scaling and management of large datasets. There are many ways to split a dataset into shards.

Sharding is possible with both SQL and NoSQL databases. Some databases have out-of-the-box support for sharding. For others, tools and middleware are available to assist in sharding.

Database replication, partitioning and clustering are concepts related to sharding.

Discussion

When do I need to shard my database?
When to do database sharding. Source: Nasser 2020.
SQL query performance can be improved by simply indexing tables. But on large tables (millions of rows) searching even the indices can be slow.
One approach is to do partitioning. Large tables are split into smaller tables. Each partition has it's own index. Suppose tables are partitioned by rows, query is processed on a smaller index. All partitions reside on the same server.
A complementary approach is vertical scaling, which is a hardware upgrade: faster processor, more RAM, faster disk, etc. However, this is not scalable or cost effective in the long run.
Even with partitioning and vertical scaling, the server can get overwhelmed with too many requests. One solution is replication. The master database is replicated in other servers called slaves. Read query is processed by one of the slaves. This speeds up database reads but not writes, which happen only on the master server. Reads can also be improved with caching.
When all these have failed, use sharding as the last resort. Sharding enables horizontal scaling, which involves adding more servers that share the load and process requests in parallel.
How is sharding different from partitioning?
Some use the terms partitioning and sharding interchangeably. However, there are clear differences.
All partitions of a table reside on the same server whereas sharding involves multiple servers. Therefore, sharding implies a distributed architecture whereas partitioning does not.
Partitions can be horizontal (split by rows) or vertical (by columns). Shards are usually only horizontal. In other words, all shards share the same schema but contain different records of the original table. For this reason, sharding is sometimes called horizontal partitioning.
Due to its distributed nature, sharding is more complex than partitioning. Many databases including older ones offer built-in support for partitioning. Sharding is supported mainly in modern databases. Where databases lack support for sharding, and no relevant middleware is used, the logic of which shard to hit resides in application code. With partitions, the logic resides at the database level.
Sharding adopts a shared-nothing architecture. A shard is not aware of other shards. This is often not the case with database clustering or replication.
What are some strategies for database sharding?
Illustrating hash-based and range-based sharding. Source: Adapted from Kim 2014.
A sharding strategy is necessary to determine which record goes into which shard:
- Key-Based: One of the columns, called shard key, is put through a hash function. The result determines the shard for that row. Also called hash-based sharding or algorithmic sharding.
- Range-Based: For example, given a product-price database, prices in range 0-49 go into shard 1, 50-99 into shard 2, and so on. Price column is the shard key. If the store sells lot more low-value products, this will result in unbalanced shards and hotspots.
- Dictionary-Based: A lookup table is used. This is a flexible approach since there's no predetermined hash function or ranges. An example is to shard by customer's region such as UK, US, or EU.
- Hierarchical: A combination of row and column is used as the shard key. Previously mentioned sharding strategies can be used on the key.
- Entity Groups: To facilitate queries across multiple tables, this strategy places related tables within the same shard. This brings stronger consistency. For example, all data pertaining to a user reside in the same shard.
How should I select a suitable shard key?
Selecting a shard key is an important decision. Once selected, it's hard to change this in future. Consider how the choice will affect the current schema, queries and query performance. The choice should support current and future business use cases. Identify main parts of the data and clustering patterns.
For efficiency, the data type of the shard key is ideally an integer.
Sharding must balance two constraints: minimize cross-partition queries and distribute the load evenly by sharding at the right granularity.
Shard key must be on a column (SQL) or field (NoSQL) that meet the following:
- Stability: It almost never changes. Otherwise, we could incur expensive movement of data from one shard to another.
- High Cardinality: It has many more unique values relative to the number of shards.
- Low Frequency: Every value occurs with a low frequency.
- Non-Monotonic: The value doesn't increase or decrease monotonically.
Could you share some examples of shard keys?
In an eCommerce application, Inventory, Sales and Customers could be functional areas with shard keys productID, orderID and customerID respectively. Alternatively, customerID could be the shard key for Sales as well. Typically customerID is unique and of high cardinality. Hence, this could be the shard key rather than country.
In an IoT application, if all devices are storing data at similar intervals then deviceID is a suitable key. If some devices store lot more data than others, this would create hotspots and therefore deviceID may not suitable. For an application with a few status codes, using status code as shard key is not a good idea.
Writing across shards is to be avoided. As an example, an eCommerce order may insert buyer address and seller address. Though both are part of the same logical table, they could reside on different shards. Another example is when a transaction inserts an order entry and an inventory entry. These two could be on different shards. It's preferable to locate all data pertaining to a transaction on a single shard.
What's a typical database sharding architecture?
A MongoDB sharded cluster. Source: MongoDB Docs 2020a.
There's no requirement that a server must host a single shard. Multiple shards can be hosted on the same server. When load increases, they can be moved to separate servers. Original database is split by a shard key. Data that have the same key is called a logical shard. One or more logical shards when hosted on the same server is called physical shard.
In a typical architecture, a cluster proxy is connected to multiple shards. The proxy has access to a configuration service (such as Zookeeper) that keeps track of shards and their indices. The cluster proxy looks at a query and the configuration to know which shard to hit. If shards are large, each shard may also have its own proxy that can do caching and monitoring. With hierarchical sharding, each shard may be sharded further.
In Oracle Sharding, Shard Catalog contains the configuration. Routing is done by Shard Director. All shards are in the same region (data center). For high availability, replicated shards are in the same region as the primary shard. For disaster recovery, replicated shards are in another region.
How is sharding different between SQL and NoSQL databases?
Comparing SQL and NoSQL sharding architectures. Source: Adapted from Lim 2020.
SQL databases were designed with ACID transactions in mind, not scalability. NoSQL databases were designed with scalability in mind. Sharding in NoSQL databases happens almost transparent to the user. Availability is prioritized over consistency.
In SQL tables, shard key is based on one or more columns. In NoSQL records, one or more fields are used instead.
Since SQL databases combine sharding with replication, it's a master-slave architecture. In NoSQL world, shards are referred to as nodes. All nodes are equal, with one node connected to a few others in a graph-like structure. When a query comes in, a coordinator node is selected. If the coordinator can respond directly, it will. Otherwise, it forwards the query to another node that has the data. Nodes frequently share information about the data they store via what's called the gossip protocol.
What are the pros and cons of database sharding?
The main benefits of sharding include horizontal scaling, better performance, and higher availability. Processing and storage are distributed. An outage will most likely affect only one shard.
Sharding can lower cost by avoiding high-end hardware or expensive software. Commodity hardware and open-source software can be used.
Shards are suited for cloud deployments. Any upgrade can be tested on one shard before rolling it out to the rest. Shards can be geographically distributed to meet regulatory requirements or be closer to customers.
The complexity inherent with sharding is the main drawback. Improper implementation can lead to loss of data or data corruption. Maintaining multiple shards is not trivial and teams have to be trained in new workflows. Once sharded, it's hard to return to the unsharded architecture. Databases without native support for sharding will need custom solutions.
Since data is distributed across many servers, a JOIN operation easily done in a single SQL database becomes difficult. Such an operation may hit many shards and require merging the responses.
Shards may become unbalanced over time. In other words, some shards grow faster than others, thus becoming database hotspots. Resharding is the solution.
What resources or tools are available to facilitate database sharding?
Sharding in Amazon RDS for OLTP. Source: Zeng 2019.
Cassandra, HBase, HDFS, MongoDB and Redis are databases that support sharding. Sqlite, Memcached, Zookeeper, MySQL and PostgreSQL are databases that don't natively support sharding at the database layer.
For databases that don't offer built-in support, sharding logic has to reside in the application. To avoid this complexity in application code, middleware can help. Two open-source middleware are Apache ShardingSphere and Vitess. Plugins for ShardingSphere offer support for MySQL, PostgreSQL, SQLServer, and Oracle. Vitess supports MySQL and MariaDB.
MySQL itself doesn't offer sharding but MySQL NDB Cluster and MySQL Fabric can be used to achieve sharding. For sharding PostgreSQL, PL/Proxy, Postgres-XC/XL and Citus can be used.
On Google Cloud Platform, Cloud SQL and ProxySQL services can be used to shard PostgreSQL and MySQL databases. On AWS, Amazon RDS is a service that can implement a sharded database architecture. The DB engine can be MySQL, MariaDB, PostgreSQL, Oracle, SQL Server, and Amazon Aurora. For data analytics, Amazon Redshift can store data from all shards. Amazon RDS also facilitates resharding.
Could you share some best practices for database sharding?
Shard ID is suffixed to primary keys to keep them unique across shards. Source: Zeng 2019.
Consider sharding only when other approaches have failed. Main reasons for sharding include constraints on storage, processing, network bandwidth, regulations and geographic proximity.
Data analytics is usually performed on the entire dataset. Hence, sharding is more suited for Online Transaction Processing (OLTP) rather than for Online Analytical Processing (OLAP).
Tables with foreign key relationships can all share the same shard key. To keep primary keys unique across all shards for later OLAP, shard IDs can be suffixed to primary keys.
Sharding can be combined with replication. In fact, for high availability it's common to replicate shards. We could even duplicate some data across shards, such as duplicating chat messages in sender's and recipient's shards.
Monitor the performance of all shards in terms of CPU utilization, memory usage, and read/write performance. Consider resharding if there are hotspots.

Milestones

1983

It's predicted that distributed databases with specialized hardware will die. Multiprocessor systems are seen to be I/O limited. However, later years prove this prediction to be false. This is due to the growth of relational databases through the 1980s. These can be easily parallelized. High-speed networking to interconnect parallel processors become available. Teradata, Tandem and other startups successfully market parallel machines.

1986

DeWitt et al. present a parallelized database they call GAMMA. It consists of 20 VAX 11/750 processors, each with dedicated memory. Eight of them have disk drives for database storage. The machines are connected by 80Mbps token ring. They note four ways to split data: round-robin, hashed, range (user specified), and range (uniform distribution). In a later version of GAMMA (1990), each processor has its own disk storage, giving rise to the term shared-nothing architecture.

Jun
1992

Sharding strategies: (a) round-robin, (b) range, (c) list, and (d) hash. Source: Costa et al. 2015.

DeWitt and Gray apply dataflow programming paradigm to propose a distributed database architecture they call partitioned parallelism. Input data is partitioned and sent to multiple processors and memories. They propose four sharding strategies: round-robin, range, list and hash.

1996

The use of the term "shard" in a computing and storage context probably originates with the massively multiplayer online game called Ultima Online. The story is that the fictional world Sosaria was trapped in a crystal. When the crystal shattered, each shard contained a copy of Sosaria. Each copy of the world ran on its own server and database.

Aug
2005

In a presentation, Fitzpatrick explains how sharding helped LiveJournal scale their backend. The site launched in 1999 but grew to 7.9 million accounts by August 2005. With thousands of hits per second, sharding is adopted as the solution. The term "shard" is not used. Instead, the term user cluster is used. Within a cluster, replication is done.

2006

Google's Bigtable is a distributed storage that can handle petabytes of data across commodity servers. Web indexing, Google Earth, Google Analytics, Orkut, and Google Finance are some projects at Google using Bigtable.

2007

Comparing consistent hash sharding and range sharding. Source: Ranganathan 2020.

At Facebook, engineers start work on an NoSQL database called Cassandra. They're inspired by range sharding of Google's Bigtable and consistent hash sharding of Amazon's DynamoDB. They adopt linear hash sharding that is a hybrid of the two. It preserves the sort order but distribution is uneven. This is later changed to consistent hash sharding at the expense of range queries.

2010

The architecture of Vitess. Source: Constine 2018.

A team in YouTube creates Vitess to solve scalability challenges to its MySQL data storage. Before Vitess, YouTube tried replication to speed up reads, and then application-level sharding to speed up writes. With Vitess, routing logic can be removed from application code. Vitess sits between the application and the database. In February 2018, Vitess becomes a CNCF incubation project, graduating in November 2019.

References

Article Stats

2448

Words

Authors

Edits

Chats

Likes

20K

Hits

Cite As

Devopedia. 2020. "Database Sharding." Version 4, October 31. Accessed 2023-11-12. https://devopedia.org/database-sharding

Contributed by
3 authors

Last updated on
2020-10-31 03:30:22

design data optimization distributed storage

Resharding Databases
Database Scaling
Database Partitioning
Database Replication
Distributed Database
ACID Transactions

Database Sharding

Discussion

Milestones

References

Further Reading

Article Stats

Cite As

See Also

Database Sharding

Discussion

Milestones

References

Further Reading

Article Stats

Author-wise Stats for Article Edits

Cite As

See Also

Login