Database Compression

Compression that removes blank values (white) and encodes repetitive values. Source: Faust 2013.

Database compression is a set of techniques that reorganizes database content to save physical storage space and improve performance. Compression can be achieved in two primary ways:

  • Lossless: Original data can be fully reconstructed from the compressed data.
  • Lossy: Data size is reduced by deliberately sacrificing some quality; the original cannot be exactly reconstructed.

Depending on the nature and extent of redundancy in the data patterns, the compression method might work within a data field, across a row or column, across an entire data page, or in hierarchical data structures. Compression methods applied to text/primitive data differ from those for audio-video data.

While most compression algorithms are open standards, companies like Oracle employ proprietary algorithms too. Run-length encoding, prefix encoding, compression using clustering, sparse matrices and dictionary encoding are all popular standard compression techniques used in database applications.

Discussion

  • What are the benefits of database compression?
    • Size - The most obvious reason for database compression is reduction in the organization's overall storage footprint. It applies to relational data (tables), unstructured data (files), indices, data transferred over the network and backup data. Depending on data cardinality (the extent of repetition in data values), compression can shrink data to 60% or even as little as 20% of its original size. Sparsely populated tables (lots of zeros and spaces in the data) compress much better.
    • Speed - DB read operations become much faster (up to 10X) because a smaller amount of physical data needs to move from disk to memory. Write operations, however, are marginally slower because of the extra compression step involved.
    • Resource Utilization - More data fits into a page on disk, in memory or in the buffer pool, thereby minimizing I/O costs. Because I/O occurs at the page level, a single I/O retrieves more data. This increases the likelihood that data resides in the cache, since more rows fit on a physical page. Compression also greatly reduces backup/restore times.
  • What are the disadvantages of database compression? When should we avoid it?
    Local symbol table is a necessary overhead. Source: Adapted from Shearer 2013, slide 4.

    Most compression algorithms build an internal encoding dictionary to manage the compression keywords. When the database is small, compressed files might be larger than the uncompressed files due to the additional dictionary that must be created.
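
    This effect is easy to demonstrate. Below is a minimal sketch in Python, using the standard zlib module (an LZ77-based compressor) as a stand-in for a database's internal encoder:

    import zlib

    tiny = b"42"              # a tiny field value
    big = b"ABCD" * 10000     # 40,000 bytes of highly repetitive data

    for label, data in [("tiny", tiny), ("repetitive", big)]:
        packed = zlib.compress(data)
        print(label, len(data), "->", len(packed))

    # tiny: 2 bytes grow to ~10 bytes (header/dictionary overhead dominates)
    # repetitive: 40,000 bytes shrink to well under 200 bytes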

    For any DB, compression and decompression is a task overhead on top of its regular DML/DDL operations, consuming additional CPU and memory. So compression should be attempted only when the gain in CPU/memory from optimised page reads is much larger than the compression overhead.

    While compression might happen in parallel as a background task, decompression introduces client-side latency as it happens in the foreground after a client query. However, commercial databases like Oracle support advanced compression techniques where DB read can happen without decompression.

    Compression is not recommended when the cardinality of data is poor, that is, when values rarely repeat. Don't compress numerical data or non-repetitive strings. For data types like BLOBs (images, audio), storage size can either shrink or grow depending on the compression algorithm.

    In summary, it's important to estimate the likely storage and performance benefits of any algorithm before implementing it on live databases.

  • What's the effect of compression on common data types?

    Compression is a DDL function that can be selectively applied to tables, indexes or partitions with CREATE, ALTER and BACKUP commands.

    Data compression applies to these database objects: heaps, clustered indexes, non-clustered indexes, partitions and indexed views.

    Row-level compression converts fixed-length data types into variable-length types. Fields declared as fixed-length types, such as char(100), might not fill all 100 characters for every record. So this works well for fixed-length text and numeric fields (char, int, float). For example, storing 23 in an int column requires just 1 byte when compressed, instead of the entire 4 bytes allocated. No space is consumed for NULL or 0 values.
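
    A minimal sketch of this idea in Python (illustrative only, not any vendor's actual storage format): store an integer in only as many bytes as its value needs.

    def row_compress_int(value):
        """Store a non-negative int in the fewest bytes needed,
        rather than its declared 4-byte width."""
        if value == 0:
            return b""  # NULL/0 consumes no space at all
        needed = (value.bit_length() + 7) // 8
        return value.to_bytes(needed, "little")

    print(len(row_compress_int(23)))      # 1 byte instead of 4
    print(len(row_compress_int(70000)))   # 3 bytes instead of 4
    print(len(row_compress_int(0)))       # 0 bytes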

    Page compression is more advanced and internally invokes row compression. Commonality in page data is extracted and encoded, one column at a time or all columns together. Actual data is then replaced by the codes. Large objects (LOB or BLOB types) are not directly compressed; instead, they are stored in a separate page for direct retrieval.

    With Unicode compression, characters that would otherwise occupy a fixed two bytes are stored in a single byte wherever the locale permits. This results in nearly 50% savings for most locales.
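
    The intuition can be sketched in Python (a simplification of SCSU-style schemes, not SQL Server's actual algorithm, which needs escape bytes for mode switching):

    def unicode_compress(text):
        """Store code points below 256 in 1 byte instead of the
        fixed 2 bytes per character of UCS-2."""
        out = bytearray()
        for ch in text:
            cp = ord(ch)
            if cp < 256:
                out.append(cp)                    # 1 byte
            else:
                out += cp.to_bytes(2, "little")   # 2 bytes
        return bytes(out)

    plain = "compression".encode("utf-16-le")     # 2 bytes per character
    packed = unicode_compress("compression")
    print(len(plain), "->", len(packed))          # 22 -> 11, a 50% saving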

  • What are some popular database compression techniques?
    Prefix encoding. Source: Popov 2012.

    Databases avoid lossy compression at the backend. However, applications might accept lower quality via lossy techniques, especially for images, audio and videos.

    Among lossless techniques, all compressed data is stored in binary. Run-Length Encoding (RLE) is the simplest. Sequential data is scanned for repeated symbols, such as pixels in an image, which are replaced by short codes called "runs". For a grayscale image, a run-length code is represented by {Si, Ci}, where Si is the symbol (intensity of pixels) and Ci is the count (number of repetitions of Si).
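
    A minimal run-length encoder in Python, following the {Si, Ci} representation above:

    def rle_encode(pixels):
        """Encode a sequence as (symbol, count) pairs: {Si, Ci}."""
        runs = []
        for p in pixels:
            if runs and runs[-1][0] == p:
                runs[-1][1] += 1      # extend the current run
            else:
                runs.append([p, 1])   # start a new run
        return [tuple(r) for r in runs]

    def rle_decode(runs):
        return [s for s, c in runs for _ in range(c)]

    row = [255, 255, 255, 0, 0, 255]
    codes = rle_encode(row)
    print(codes)                      # [(255, 3), (0, 2), (255, 1)]
    assert rle_decode(codes) == row   # lossless round trip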

    Prefix encoding is another basic technique that works well only on homogeneous data. Only the first part of each data field is matched for commonality, not the entire field. Hence the name 'prefix' encoding.
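
    A sketch of the idea in Python (illustrative only; real implementations store the prefix once in the page and keep per-row suffixes):

    import os

    def prefix_encode(values):
        """Split column values into a shared prefix plus per-row suffixes."""
        prefix = os.path.commonprefix(values)
        return prefix, [v[len(prefix):] for v in values]

    cities = ["frankfurt", "frankfurt-oder", "franklin"]
    prefix, suffixes = prefix_encode(cities)
    print(prefix, suffixes)   # frank ['furt', 'furt-oder', 'lin']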

    Dictionary compression is another page-level technique where common values are consolidated into a dictionary and stored under the header row. The algorithm calculates the number of bits (X) needed to encode a single attribute of the column from the count of unique attribute values. It then calculates how many of these X-bit encoded values fit into 1, 2, 3 or 4 bytes.
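
    The bit-width calculation can be sketched in Python (hypothetical helper, not any vendor's actual algorithm):

    import math

    def dictionary_encode(column):
        """Build a symbol dictionary and compute the bits (X) needed
        per encoded value from the count of unique values."""
        symbols = sorted(set(column))
        codes = {v: i for i, v in enumerate(symbols)}
        bits = max(1, math.ceil(math.log2(len(symbols))))
        return symbols, [codes[v] for v in column], bits

    col = ["red", "green", "red", "blue", "red", "green"]
    symbols, encoded, bits = dictionary_encode(col)
    print(symbols)   # ['blue', 'green', 'red'], the dictionary
    print(encoded)   # [2, 1, 2, 0, 2, 1]
    print(bits)      # 2 bits per value, so 4 such values fit in 1 byte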

  • How does database compression vary from generic data compression techniques?
    Storage Size Compression. Source: Jayaram 2018.

    Generic data compression works within a field: it minimises the size of a single data item. Here both lossless and lossy techniques are popular. Database compression, on the other hand, is an aggregate technique. It aims to compress data across values in a row, column or page. Along with compression at the DB level, it's quite common to apply data compression at the field level too.

    In image compression, for example, a lossy BMP-to-JPG conversion is done using sampling and quantisation at the field level. The resulting BLOB data fields might then be compressed at the DB level using lossless methods.

    While a NULL value cannot be further compressed individually, when a series of NULL values occurs in a DB, techniques such as the sparse column can optimise storage of the NULLs collectively, as sketched below.
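
    Conceptually (a simplified sketch, not SQL Server's actual SPARSE storage format), a sparse column stores only the non-NULL entries along with their row positions:

    def to_sparse(column):
        """Keep only (row_index, value) pairs for non-NULL entries."""
        return {i: v for i, v in enumerate(column) if v is not None}

    col = [None, None, 7, None, None, None, 42, None]
    sparse = to_sparse(col)
    print(sparse)          # {2: 7, 6: 42}, 2 entries instead of 8
    print(sparse.get(3))   # reading back a NULL: None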

    Database compression techniques not only compress actual data, they also act on derived entities like indices, views, clusters and heaps. However, for derived entities like indices, there's a small CPU overhead to reconstruct key column values during index lookups or scans. This can be minimized by keeping the prefixes locally in the block.

  • How do compression techniques compare between relational and hierarchical databases?

    There is no major variation in the compression algorithms applied to relational and hierarchical DBs; it's mostly a difference in nomenclature. Hierarchical DBs have segments, similar to tables in an RDBMS, and fields in a segment compare with columns.

    However, in hierarchical databases, segments are implicitly joined with each other. Because of this, the I/O paging process varies, so algorithms must take page sizes into account.

  • How do large-scale software applications handle database compression?

    Facebook uses Zstandard, a real-time compression algorithm with very fast decoders, plus dictionary compression for small data.
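
    For example, with the python-zstandard bindings (assuming the zstandard package is installed), a dictionary can be trained on sample records and reused to compress many small values:

    import zstandard as zstd

    # Train a shared dictionary on small, similar sample payloads.
    samples = [b'{"user": %d, "status": "active"}' % i for i in range(1000)]
    dict_data = zstd.train_dictionary(8192, samples)

    compressor = zstd.ZstdCompressor(dict_data=dict_data)
    decompressor = zstd.ZstdDecompressor(dict_data=dict_data)

    record = b'{"user": 42, "status": "active"}'
    packed = compressor.compress(record)
    assert decompressor.decompress(packed) == record
    print(len(record), "->", len(packed))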

    Google uses Snappy, a high-speed compressor, in their BigTable and MapReduce systems.

    Commercial DBMSs like SQL Server use row, column and page level compression. SQL Server uses the proprietary XPRESS algorithm for backup compression.

    Oracle uses an internal compression technique called Oracle Advanced Compression.

    Open-source MySQL uses the LZ77 algorithm for its InnoDB table compression.

Milestones

1940

As early as the 1940s, Information Theory establishes the theoretical background behind lossless and lossy compression.

1970

In the 1970s, major relational databases such as Ingres enter the commercial market, with Sybase and MS SQL Server following in the 1980s. They support primitive keyword-based compression techniques built into their systems.

1977

The LZ77 lossless compression algorithm is published; as the name suggests, it dates from 1977. It forms the basis of compression schemes for the PNG and ZIP formats. Implementations of the algorithm spread across most commercial databases. Compression in MySQL uses LZ77 via the zlib library.

2007

Oracle Database 11g Release 1 introduces OLTP Table Compression, now called Advanced Row Compression, which maintains compression during all types of data manipulation operations. Oracle has been implementing compression techniques in its DBs from much earlier, but after the acquisition of Sun, the techniques become more robust and hardware integrated.

2013
In Oracle Database 12c, the compression method can be selected based on data usage. Source: Breysse 2014, slide 11.

Oracle releases Oracle Database 12c, designed for the cloud. It includes the compression methods of the earlier Oracle Database 11g release and adds new ones: heat map, automatic data optimization, temporal optimization and hybrid columnar compression. The heat map method monitors data usage and selects the most suitable compression method.

Sample Code

  • -- Row-level compression sample for SQL Server
    -- Source : https://docs.microsoft.com
    -- Accessed : 2019-03-23

    USE AdventureWorks2012;
    GO

    -- Estimate the space savings before enabling compression
    EXEC sp_estimate_data_compression_savings 'Production', 'TransactionHistory', NULL, NULL, 'ROW';

    -- Rebuild all partitions of the table with row-level compression
    ALTER TABLE Production.TransactionHistory REBUILD PARTITION = ALL
    WITH (DATA_COMPRESSION = ROW);
    GO

References

  1. Aftab, O., P. Cheung, A. Kim, S. Thakkar and N. Yeddanapudi. 2001. "Information Theory: Information Theory and the Digital Age." 6.933—Final Paper, The Structure of Engineering Revolutions, Massachusetts Institute of Technology, Cambridge. Accessed 2019-03-20.
  2. Bhatt, Deep. 2016. "Data compression techniques." LinkedIn Slideshare, December 04. Accessed 2019-03-20.
  3. Breysse, Jean-Philippe. 2014. "Heat Map and Automatic Data Optimization with Oracle Database 12c." Digicomp Academy Suisse Romande SA, on SlideShare, February 05. Accessed 2019-03-28.
  4. Dietrich, Erik. 2017. "A Look at the History of RDBMS." Monitis.com, October 31. Accessed 2019-03-20.
  5. ETHW. 2019. "History of Lossless Data Compression Algorithms." Engineering and Technology History Wiki. Accessed 2019-03-20.
  6. Facebook GitHub IO. 2019. "Zstandard." Facebook GitHub IO. Accessed 2019-03-20.
  7. Faust, David. 2013. "Shrink Your Storage Needs with c‑treeACE Data Compression." Faircom Corporation, September 30. Updated 2015-11-17. Accessed 2019-03-20.
  8. Haas, Mike. 2015. "PART 1 OF 3: Oracle Database Compression." Mythics, February 10. Accessed 2019-03-20.
  9. IBM. 2014a. "DB2 10.5 for Linux, UNIX, and Windows." IBM Knowledge Center, October 24. Accessed 2019-03-20.
  10. IBM. 2014b. "Comparison of hierarchical and relational databases." IBM Knowledge Center, October 24. Accessed 2019-03-20.
  11. Jayaram, Prashanth. 2018. "Overview of Data Compression in SQL Server." codingSight, Devart, December 06. Accessed 2019-03-20.
  12. Microsoft Docs. 2017a. "Unicode Compression Implementation." SQL Server 2017, March 14. Accessed 2019-03-20.
  13. Microsoft Docs. 2017b. "Data Compression." SQL Server 2017, August 31. Accessed 2019-03-20.
  14. Mullins, Craig S. 2002. "Optimizing database performance, part 3: Compression, page size, and more." TechTarget, November. Accessed 2019-03-20.
  15. MySQL Docs. 2019. "How Compression Works for InnoDB Tables." Section 14.12.5, MySQL 5.5 Reference Manual, Oracle Corporation. Accessed 2019-03-20.
  16. Oracle. 2016. "Compressing your Indexes: Index Key Compression (PART 1)." Oracle Blog, March 15. Accessed 2019-03-20.
  17. Oracle. 2019a. "Oracle Advanced Compression." Integrated Cloud Applications & Platform Services, Oracle. Accessed 2019-03-20.
  18. Oracle. 2019b. "Oracle Database 12c." Accessed 2019-03-28.
  19. Patel, Dhananjay, Vinayak Bhogan, and Alan Janson. 2013. "Simulation and Comparison of Various Lossless Data Compression Techniques based on Compression Ratio and Processing Delay." International Journal of Computer Applications, vol. 81, no. 14, pp. 31-35, November. Accessed 2019-03-20.
  20. Popov, Stoimen. 2012. "Algorithm of the Week: Data Compression with Prefix Encoding." DZone, February 07. Accessed 2019-03-20.
  21. Raichand, Priyanka and Rinkle Rani Aggarwal. 2013. "A Short Survey Of Data Compression Techniques For Column Oriented Databases." Journal of Global Research in Computer Science. Accessed 2019-03-14.
  22. Shearer, Paul. 2013. "EnterpriseOne and Database Compression." LinkedIn Slideshare, December 2013. Accessed 2019-03-14.
  23. Snappy. 2019. "Snappy." Google GitHub IO. Accessed 2019-03-20.
  24. Tkachenko, Vadim. 2016. "Evaluating Database Compression Methods." Blog, Percona, March 09. Accessed 2019-03-20.
  25. Trivedi, Jignesh. 2013. "SPARSE Column in SQL Server." C# Corner, February 07. Accessed 2019-03-20.
  26. UKEssays. 2018. "Types Of Data Compression Computer Science Essay." UKEssays, All Answers Ltd., November. Updated 2018-12-05. Accessed 2019-03-14.
  27. Zaitsev, Peter. 2008. "MySQL Blob Compression performance benefits." Blog, Percona, January 11. Accessed 2019-03-20.

Further Reading

  1. ETHW. 2019. "History of Lossless Data Compression Algorithms." Engineering and Technology History Wiki. Accessed 2019-03-20.
  2. Pavlo, Andy. 2017. "Lecture #11 – Database Compression." 15-721 Advanced Database Systems, Carnegie Mellon University Spring. Accessed 2019-03-28.
  3. Oracle Advanced Compression
  4. LaRock, Thomas. 2018. "When to Use Row or Page Compression in SQL Server." DZone, February 02. Accessed 2019-03-28.
  5. Westmann, Till, Donald Kossmann, Sven Helmer, and Guido Moerkotte. 2000. "The Implementation and Performance of Compressed Databases." SIGMOD Rec., ACM, vol. 29, no. 3, pp. 55-67, September. http://doi.acm.org/10.1145/362084.362137. Accessed 2019-03-28.
  6. Roth, Mark A. and Scott J. Van Horn. 1993. "Database Compression." SIGMOD Rec., ACM, vol. 22, no. 3, pp. 31-39, September. Accessed 2019-03-28.
