[Figures omitted. Captions: compression that removes blank (white) values and encodes repetitive values (Faust 2013); in Oracle Database 12c, the compression method can be selected based on data usage (Breysse 2014, slide 11); local symbol table as a necessary overhead (adapted from Shearer 2013, slide 4); prefix encoding (Popov 2012); storage size after compression (Jayaram 2018).]

# Database Compression

Last updated by arvindpdmn on 2019-03-28 10:21:45

## Summary

Database compression is a set of techniques that reorganizes database content to save physical storage space and improve performance. Compression can be achieved in two primary ways:

• Lossless: Original data can be fully reconstructed from the compressed data.
• Lossy: Reduction in data size due to deliberate compromise in quality.

Depending on the nature and extent of redundancy in the data patterns, the compression method might work within a data field, across a row/column of data, across an entire data page or in hierarchical data structures. Compression methods applied for text/primitive data are different from those for audio-video data.

While most compression algorithms are open standards, companies like Oracle employ proprietary algorithms too. Run-length encoding, prefix encoding, compression using clustering, sparse matrices and dictionary encoding are all popular standard compression techniques used in database applications.

## Milestones

1940

As early as the 1940s, Information Theory establishes the theoretical background behind lossless and lossy compression.

1970

In the 1970s, major relational databases (Ingres, MS SQL Server and Sybase) enter the commercial market. They support primitive keyword-based compression techniques built into their systems.

1977

The LZ77 lossless compression algorithm, named for its year of publication, is released by Lempel and Ziv. It forms the basis of compression schemes for the PNG and ZIP formats, and implementations of the algorithm spread across most commercial databases. Compression in MySQL uses LZ77 via the zlib library.

2007

Oracle Database 11g Release 1 introduces OLTP Table compression, now called Advanced Row Compression, which maintains compression during all types of data manipulation operations. Oracle had implemented compression techniques in its databases much earlier, but after the acquisition of Sun, the techniques become more robust and hardware integrated.

2013

Designed for the cloud, Oracle Database 12c is released. It includes the compression methods of the earlier Oracle Database 11g release and adds new ones: heat map, automatic data optimization, temporal optimization and hybrid columnar compression. The heat map feature monitors data usage so that the most suitable compression method can be selected.

## Discussion

• What are the benefits of database compression?
• Size - The most obvious reason for database compression is reduction in the organization's overall database storage footprint. It applies to relational data (tables), unstructured data (files), indices, data transferred over the network, and backups. Depending on data cardinality (the extent of repetition in data values), compression can reduce storage consumption to between 60% and as low as 20% of the original space. Sparsely populated tables (many zeros and spaces in the data) compress much better.
• Speed - DB read operations become much faster (up to 10x) as less physical data needs to be moved from disk to memory. However, write performance is marginally affected as there is an extra compression step involved.
• Resource Utilization - More data will fit into a page on disk, in memory or in the buffer pool, thereby minimizing I/O costs. Because I/O occurs at the page level, a single I/O will retrieve more data. This increases the likelihood that data resides in the cache because more rows fit on a physical page. Compression also results in massive reduction in backup/restore times.
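The effect of cardinality on these benefits is easy to demonstrate. The sketch below uses Python's zlib (the same DEFLATE/LZ77 library MySQL relies on) to compare repetitive, low-cardinality data against incompressible random bytes; the sample data is invented for illustration.

```python
# How data repetition ("cardinality") drives compression ratio,
# using zlib, the DEFLATE/LZ77 library also used by MySQL InnoDB.
import os
import zlib

repetitive = b"STATUS=OK;" * 1000    # highly repetitive values
random_bytes = os.urandom(10_000)    # no repetition at all

for label, data in [("repetitive", repetitive), ("random", random_bytes)]:
    compressed = zlib.compress(data)
    print(f"{label}: {len(data)} -> {len(compressed)} bytes "
          f"({100 * len(compressed) // len(data)}% of original)")
```

The repetitive input shrinks to a tiny fraction of its size, while the random input actually grows slightly, mirroring the cardinality caveat discussed later.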
• What are the disadvantages of database compression? When should we avoid it?

Most compression algorithms build an internal encoding dictionary to manage the compression keywords. When database size is small, compressed files might be larger than uncompressed files due to additional dictionary creation.
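The small-data penalty can be seen directly: for a tiny value, the compressor's fixed overhead (header and checksum bytes, analogous to the encoding dictionary) exceeds any savings. A minimal check with zlib:

```python
# For very small inputs, fixed compression overhead (header and
# checksum bytes) exceeds the savings: output is larger than input.
import zlib

tiny = b"id=7"
compressed = zlib.compress(tiny)
print(len(tiny), len(compressed))  # compressed output is larger
```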

For any DB, compression and decompression are an overhead on top of its regular DML/DDL operations, consuming additional CPU and memory. So compression should be attempted only when the gain from optimised page reads is much larger than this overhead.

While compression might happen in parallel as a background task, decompression introduces client-side latency as it happens in the foreground after a client query. However, commercial databases like Oracle support advanced compression techniques where DB read can happen without decompression.

Compression is not recommended when data cardinality is poor, that is, when values rarely repeat. Numerical data and non-repetitive strings therefore compress badly. For data types like BLOBs (images, audio), storage size can either reduce or increase depending on the compression algorithm.

In summary, it's important to estimate likely storage and performance benefits for any algorithm before implementing on live databases.

• What's the effect of compression on common data types?

Compression is a DDL function that can be selectively applied to tables, indexes or partitions with CREATE, ALTER and BACKUP commands.

Data compression applies to these database objects – heaps, clustered indexes, non-clustered indexes, partitions, indexed views.

Row-level compression converts fixed-length data types into variable-length types. Fields declared with fixed-length types, such as char(100), might not fill all 100 characters for every record. This works well for fixed-length text and numeric fields (char, int, float). For example, storing 23 in an int column requires just 1 byte when compressed instead of the 4 bytes allocated. No space is consumed for NULL or 0 values.
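The idea behind the int example above can be sketched as follows. This is illustrative only, not the actual storage format of any engine: it encodes a non-negative integer in the fewest bytes that hold its value, with zero taking no storage at all.

```python
# Sketch of row-level compression for integers: store each value in
# the fewest bytes that hold it, instead of a fixed 4-byte slot.

def compress_int(value: int) -> bytes:
    """Encode a non-negative int in the minimum number of bytes."""
    if value == 0:
        return b""  # zero consumes no storage, like NULL
    return value.to_bytes((value.bit_length() + 7) // 8, "big")

def decompress_int(data: bytes) -> int:
    return int.from_bytes(data, "big")  # empty bytes decode to 0

print(len(compress_int(23)))  # 1 byte instead of a fixed 4
print(len(compress_int(0)))   # 0 bytes
```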

Page compression is more advanced. It internally invokes row compression. Commonality in page data is extracted and encoded, one column at a time or all columns together. Actual data is then replaced by the codes. Large objects (LOB or BLOB types) are not directly compressed. Instead they are stored in a separate page for direct retrieval.

Unicode compression stores many characters in 1 byte instead of the 2 bytes of their Unicode encoding. This results in nearly 50% savings for most locales.

• What are some popular database compression techniques?

Databases avoid lossy compression at the backend. However, applications might accept a lower quality using lossy techniques, especially for images, audio, videos.

Among lossless techniques, all compressed data is stored in binary. Run-Length Encoding is the simplest. In this technique, sequential data is scanned for repeated symbols, such as pixels in an image, which are replaced by short codes called runs. For a grayscale image, a run-length code is represented by {Si, Ci}, where Si is the symbol or intensity of pixels and Ci is the count or the number of repetitions of Si.

Prefix encoding is another basic technique that works well only on homogeneous data. The first part of the data fields are matched for commonality, not the entire field. Hence, the name 'prefix' encoding.
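A minimal sketch of the prefix idea, assuming a column of homogeneous string codes (the sample values are hypothetical): the shared leading portion is stored once and only the differing suffixes remain.

```python
# Prefix encoding sketch: factor out the longest common prefix of a
# column and store only the differing suffixes.
import os

def prefix_encode(values):
    prefix = os.path.commonprefix(values)  # works on string lists
    return prefix, [v[len(prefix):] for v in values]

codes = ["EMP-2019-001", "EMP-2019-002", "EMP-2019-115"]
print(prefix_encode(codes))  # ('EMP-2019-', ['001', '002', '115'])
```

Real engines apply this per column within a page, but the principle of matching only the first part of each field is the same.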

Dictionary compression is another page-level technique where common codes are consolidated into a dictionary and stored under the header row. The algorithm calculates the number of bits (X) needed to encode a single attribute of the column from unique attribute values. It then calculates how many of these X-bit encoded values can fit into 1, 2, 3, or 4 bytes.
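The dictionary step above can be sketched as follows, under the assumption of a simple per-column dictionary: each unique value gets an integer code, and X is the number of bits needed per code.

```python
# Dictionary encoding sketch: map each unique column value to an
# integer code and compute the bits (X) needed to store one code.
import math

def dict_encode(column):
    dictionary = {v: i for i, v in enumerate(dict.fromkeys(column))}
    bits = max(1, math.ceil(math.log2(len(dictionary))))
    codes = [dictionary[v] for v in column]
    return dictionary, bits, codes

column = ["red", "green", "red", "blue", "red", "green"]
dictionary, bits, codes = dict_encode(column)
print(dictionary)  # {'red': 0, 'green': 1, 'blue': 2}
print(bits)        # 2 bits per value instead of a full string
print(codes)       # [0, 1, 0, 2, 0, 1]
```

From `bits`, the engine can then work out how many encoded values pack into 1, 2, 3 or 4 bytes.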

• How does database compression vary from generic data compression techniques?

Data compression applies within a field: it minimises the size of a single data item. Here both lossless and lossy techniques are popular. Database compression, on the other hand, is an aggregate technique: it compresses data across values in a row, column or page. Along with compression at DB level, it's quite common to apply data compression at field level too.

In image compression, for example, a lossy BMP-to-JPG conversion is done using sampling and quantisation at field level. Then, BLOB data fields might be compressed at DB level using lossless methods.

While a NULL value cannot be compressed further individually, when a series of NULL values occurs in a DB, techniques such as sparse columns can optimise storage of the NULLs collectively.
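The sparse-column idea can be sketched simply (illustrative only, not any vendor's storage format): instead of storing every NULL, keep only the non-NULL values together with their row positions.

```python
# Sparse-column sketch: store only non-NULL values with their row
# positions; runs of NULLs then consume no space.

def to_sparse(column):
    return len(column), {i: v for i, v in enumerate(column) if v is not None}

def from_sparse(length, entries):
    return [entries.get(i) for i in range(length)]  # missing keys -> None

column = [None, None, 42, None, None, None, 7, None]
length, entries = to_sparse(column)
print(entries)                                 # {2: 42, 6: 7}
print(from_sparse(length, entries) == column)  # True
```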

Database compression techniques not only compress actual data, they also act on derived entities like indices, views, clusters and heaps. However, for derived entities like indices, there's a small CPU overhead to reconstruct the key column values during index lookup or scans. This can be minimized by keeping the prefixes locally in the block.

• How do compression techniques compare between relational and hierarchical databases?

There is no major variation in the compression algorithms applied on relational and hierarchical DBs. It's just a difference in nomenclature. Hierarchical DBs have segments similar to tables in RDBMS. And fields in the segment compare with columns.

However, in hierarchical databases, segments are implicitly joined with each other. Because of this, I/O paging process would vary, so algorithms must consider the page sizes.

• How do large-scale software applications handle database compression?

Facebook uses Zstandard, a real-time compression algorithm with very fast decoders, plus dictionary compression for small data.

Google uses Snappy, a high-speed compression algorithm, in its BigTable and MapReduce systems.

Commercial DBMS like SQL Server use row, column and page level compression. SQL Server uses a proprietary XPRESS algorithm for backup compression.

Oracle uses an internal compression technique called Oracle Advanced Compression.

Open-source MySQL uses the LZ77 algorithm (via zlib) for its InnoDB table compression.

## Sample Code

```sql
-- Row-level compression sample for SQL Server
-- Source: https://docs.microsoft.com
-- Accessed: 2019-03-23

-- Estimate the space savings before enabling compression
EXEC sp_estimate_data_compression_savings
    'Production', 'TransactionHistory', NULL, NULL, 'ROW';
GO

-- Rebuild the table with row compression enabled
ALTER TABLE Production.TransactionHistory REBUILD PARTITION = ALL
WITH (DATA_COMPRESSION = ROW);
GO
```

## Tags

• Data Compression
• Video Compression
• Audio Compression
• Image Compression
• Run-Length Encoding
• Lempel-Ziv-Welch Algorithm


## Cite As

Devopedia. 2019. "Database Compression." Version 7, March 28. Accessed 2019-09-23. https://devopedia.org/database-compression