Database Compression

Compression that removes blank values (white) and encodes repetitive values. Source: Faust 2013.

Database compression is a set of techniques that reorganizes database content to save physical storage space and improve performance. Compression can be achieved in two primary ways:

  • Lossless: Original data can be fully reconstructed from the compressed data.
  • Lossy: Data size is reduced by deliberately compromising quality; the original can't be exactly reconstructed.

Depending on the nature and extent of redundancy in the data, the compression method might work within a data field, across a row or column, across an entire data page, or in hierarchical data structures. Compression methods for text/primitive data differ from those for audio-video data.

While most compression algorithms are open standards, companies like Oracle employ proprietary algorithms too. Run-length encoding, prefix encoding, compression using clustering, sparse matrices and dictionary encoding are all popular standard compression techniques used in database applications.

Discussion

  • What are the benefits of database compression?
    • Size - The most obvious reason for database compression is reduction in the organization's overall database storage footprint. It applies to relational data (tables), unstructured data (files), indices, data transferred over the network and backups. Depending on data cardinality (the extent of repetition in data values), compression can shrink storage to between 60% and as low as 20% of the original space. Sparsely populated tables (lots of zeros and spaces in the data) compress much better.
    • Speed – DB read operations become much faster (up to 10X) as smaller amounts of physical data need to be moved from disk to memory. However, write operations are marginally slower because of the extra compression step involved.
    • Resource Utilization - More data fits into a page on disk, in memory or in the buffer pool, thereby minimizing I/O costs. Because I/O occurs at the page level, a single I/O retrieves more data. This increases the likelihood that data resides in the cache, since more rows fit on a physical page. Compression also results in a massive reduction in backup/restore times.
  • What are the disadvantages of database compression? When should we avoid it?
    Local symbol table is a necessary overhead. Source: Adapted from Shearer 2013, slide 4.

    Most compression algorithms build an internal encoding dictionary to manage the compression keywords. When the database is small, compressed files might turn out larger than uncompressed ones because of this added dictionary.

    For any DB, compression and decompression are overheads on top of its regular DML/DDL operations. They consume additional CPU/memory. So compression must be attempted only when the gain in CPU/memory from optimised page reads is much larger than the compression overhead.

    While compression might happen in parallel as a background task, decompression introduces client-side latency as it happens in the foreground after a client query. However, commercial databases like Oracle support advanced compression techniques where DB reads can happen without decompression.

    Compression is not recommended when data cardinality is poor, that is, when values rarely repeat. Don't compress numerical data and non-repetitive strings. For data types like BLOBs (images, audio), storage size can either shrink or grow depending on the compression algorithm.

    In summary, it's important to estimate likely storage and performance benefits for any algorithm before implementing on live databases.

  • What's the effect of compression on common data types?

    Compression is a DDL function that can be selectively applied to tables, indexes or partitions with CREATE, ALTER and BACKUP commands.

    Data compression applies to these database objects: heaps, clustered indexes, non-clustered indexes, partitions and indexed views.
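
    As an illustration, in SQL Server compression can be requested directly in the DDL. A minimal sketch with hypothetical object names:

      -- Enable row compression when the table is created.
      CREATE TABLE Sales (id INT, amount MONEY)
      WITH (DATA_COMPRESSION = ROW);

      -- Compress a database backup.
      BACKUP DATABASE AdventureWorks2012
      TO DISK = 'D:\Backups\AdventureWorks2012.bak'
      WITH COMPRESSION;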

    Row-level compression converts fixed-length data types into variable-length types. Fields created as fixed-length types, such as CHAR(100), might not fill all 100 characters for every record. This works well for fixed-length text and numeric fields (char, int, float). For example, storing the value 23 in an int column requires just 1 byte when compressed instead of the 4 bytes allocated. No space is consumed for NULL or 0 values.

    Page compression is more advanced. It internally invokes row compression. Commonality in page data is extracted and encoded, one column at a time or all columns together. Actual data is then replaced by the codes. Large objects are not directly compressed. Instead they are stored in a separate page for direct retrieval (LOB or BLOB types).
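
    A minimal SQL Server sketch (hypothetical table and index names) that applies page compression to a table and one of its indexes:

      -- Rebuild the table with page compression; row compression is
      -- applied implicitly as part of page compression.
      ALTER TABLE dbo.OrderHistory REBUILD
      WITH (DATA_COMPRESSION = PAGE);

      -- Non-clustered indexes must be compressed separately.
      ALTER INDEX IX_OrderHistory_CustomerId ON dbo.OrderHistory
      REBUILD WITH (DATA_COMPRESSION = PAGE);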

    Unicode compression stores most characters in 1 byte instead of the usual 2, resulting in nearly 50% savings for most locales.

  • What are some popular database compression techniques?
    Prefix encoding. Source: Popov 2012.

    Databases avoid lossy compression at the backend. However, applications might accept lower quality using lossy techniques, especially for images, audio and video.

    Among lossless techniques, all compressed data is stored in binary. Run-Length Encoding is the simplest. Sequential data is scanned for repeated symbols, such as pixels in an image, which are replaced by a short code called a "run". For a grayscale image, the run-length code is represented by {Si, Ci}, where Si is the symbol (pixel intensity) and Ci is the count, the number of repetitions of Si.
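
    A toy run-length encoder in T-SQL, matching the sample code's language (illustrative only; real engines implement RLE in native code on binary data):

      DECLARE @input VARCHAR(100) = 'AAAABBBCCD';
      DECLARE @output VARCHAR(200) = '';
      DECLARE @i INT = 1, @count INT = 1;

      WHILE @i < LEN(@input)
      BEGIN
          IF SUBSTRING(@input, @i + 1, 1) = SUBSTRING(@input, @i, 1)
              SET @count = @count + 1;  -- extend the current run
          ELSE
          BEGIN
              -- close the run: emit {symbol, count}
              SET @output = @output + SUBSTRING(@input, @i, 1)
                          + CAST(@count AS VARCHAR(10));
              SET @count = 1;
          END
          SET @i = @i + 1;
      END
      -- emit the final run
      SET @output = @output + SUBSTRING(@input, @i, 1)
                  + CAST(@count AS VARCHAR(10));

      SELECT @output AS RleEncoded;  -- returns 'A4B3C2D1'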

    Prefix encoding is another basic technique that works well only on homogeneous data. Only the first part of each data field is matched for commonality, not the entire field. Hence the name 'prefix' encoding.
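
    Oracle's index key compression applies this idea to indexes: repeating leading (prefix) columns of the key are stored once per block. A sketch with hypothetical names:

      -- COMPRESS 1: the first key column (department_id) is the shared
      -- prefix, stored once per index block.
      CREATE INDEX emp_dept_name_idx
      ON employees (department_id, last_name) COMPRESS 1;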

    Dictionary compression is another page-level technique where common values are consolidated into a dictionary stored below the header row. The algorithm calculates the number of bits (X) needed to encode a single attribute of a column from its unique attribute values. It then calculates how many of these X-bit encoded values can fit into 1, 2, 3 or 4 bytes.
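
    The effect can be mimicked manually in SQL (hypothetical tables): repetitive strings are replaced by small integer codes held in a dictionary table.

      -- Dictionary of unique values: 3 values fit in a 1-byte TINYINT code.
      CREATE TABLE ColorDict (code TINYINT PRIMARY KEY, value VARCHAR(20));
      INSERT INTO ColorDict VALUES (1, 'red'), (2, 'green'), (3, 'blue');

      -- Data rows store the 1-byte code instead of the full string.
      CREATE TABLE Pixels (
          x INT, y INT,
          color_code TINYINT REFERENCES ColorDict(code)
      );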

  • How does database compression vary from generic data compression techniques?
    Storage Size Compression. Source: Jayaram 2018.

    Data compression applies within a field: it minimises the size of a single data item. Here both lossless and lossy techniques are popular. Database compression, on the other hand, is an aggregate technique that compresses data across values in a row, column or page. Along with compression at the DB level, it's quite common to apply data compression at the field level too.

    In image compression, for example, a lossy BMP-to-JPG conversion is done using sampling and quantisation at field level. Then, BLOB data fields might be compressed at DB level using lossless methods.
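
    In MySQL, for instance, this field-level step can be done explicitly with the built-in zlib-based functions (table and column names hypothetical):

      -- Compress a BLOB on write, decompress on read.
      INSERT INTO photos (id, img) VALUES (1, COMPRESS(@raw_image));
      SELECT UNCOMPRESS(img) FROM photos WHERE id = 1;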

    While a NULL value cannot be compressed further individually, when a series of NULL values occurs in a DB, techniques such as sparse columns can optimise storage of the NULLs collectively.
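
    A SQL Server sketch (hypothetical table): a column declared SPARSE consumes no storage at all for NULLs, at a small extra cost for non-NULL values.

      -- NULLs in a SPARSE column take no space; good when most rows are NULL.
      CREATE TABLE Sensors (
          id INT PRIMARY KEY,
          rare_reading FLOAT SPARSE NULL
      );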

    Database compression techniques not only compress actual data, they also act on derived entities like indices, views, clusters and heaps. However, for derived entities like indices, there's a small CPU overhead to reconstruct the key column values during index lookup or scans. This can be minimized by keeping the prefixes locally in the block.

  • How do compression techniques compare between relational and hierarchical databases?

    There is no major variation in the compression algorithms applied to relational and hierarchical DBs. The difference is mostly in nomenclature: hierarchical DBs have segments similar to tables in an RDBMS, and fields in a segment correspond to columns.

    However, in hierarchical databases, segments are implicitly joined with each other. Because of this, the I/O paging process varies, so algorithms must take page sizes into account.

  • How do large-scale software applications handle database compression?

    Facebook uses Zstandard, a real-time compression algorithm with very fast decoders, plus dictionary compression for small data.

    Google uses Snappy, a high-speed compression algorithm, in its BigTable and MapReduce systems.

    Commercial DBMSs like SQL Server use row-, column- and page-level compression. SQL Server uses the proprietary XPRESS algorithm for backup compression.

    Oracle uses an internal compression technique called Oracle Advanced Compression.

    Open source MySQL uses the LZ77 algorithm for InnoDB table compression.
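
    A typical way to enable this (hypothetical table; requires innodb_file_per_table and, on older MySQL versions, the Barracuda file format):

      -- MySQL/InnoDB: store the table in compressed 8 KB pages.
      CREATE TABLE orders (
          id INT PRIMARY KEY,
          notes TEXT
      ) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;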

Milestones

1940

As early as the 1940s, Information Theory establishes the theoretical background behind lossless and lossy compression.

1970

In the 1970s, major relational databases (Ingres, MS SQL Server and Sybase) enter the commercial market. They support primitive keyword-based compression techniques built into their systems.

1977

The LZ77 lossless compression algorithm, as the name suggests, is published this year. It forms the basis of compression schemes for the PNG and ZIP formats. Implementations of the algorithm spread across most commercial databases. Compression in MySQL uses LZ77 via the zlib library.

2007

Oracle Database 11g Release 1 introduces OLTP Table Compression, now called Advanced Row Compression, which maintains compression during all types of data manipulation operations. Oracle has been implementing compression techniques in its DBs from much earlier. But after the acquisition of Sun, the techniques become more robust and hardware-integrated.

2013
In Oracle Database 12c, the compression method can be selected based on data usage. Source: Breysse 2014, slide 11.

Designed for the cloud, Oracle releases Oracle Database 12c. It includes the compression methods of the earlier Oracle Database 11g release and adds new ones: heat map, automatic data optimization, temporal optimization and hybrid columnar compression. The heat map feature monitors data usage so that the most suitable compression method can be selected.

Sample Code

  • -- Row-level compression sample for SQL Server
    -- Source: https://docs.microsoft.com
    -- Accessed: 2019-03-23

    USE AdventureWorks2012;
    GO

    -- Estimate the space saved by ROW compression before applying it.
    EXEC sp_estimate_data_compression_savings 'Production', 'TransactionHistory', NULL, NULL, 'ROW';

    -- Rebuild all partitions of the table with row compression enabled.
    ALTER TABLE Production.TransactionHistory REBUILD PARTITION = ALL
    WITH (DATA_COMPRESSION = ROW);
    GO

References

  1. Aftab, O., P. Cheung, A. Kim, S. Thakkar and N. Yeddanapudi. 2001. "Information Theory: Information Theory and the Digital Age." 6.933—Final Paper, The Structure of Engineering Revolutions, Massachusetts Institute of Technology, Cambridge. Accessed 2019-03-20.
  2. Bhatt, Deep. 2016. "Data compression techniques." LinkedIn Slideshare, December 04. Accessed 2019-03-20.
  3. Breysse, Jean-Philippe. 2014. "Heat Map and Automatic Data Optimization with Oracle Database 12c." Digicomp Academy Suisse Romande SA, on SlideShare, February 05. Accessed 2019-03-28.
  4. Dietrich, Erik. 2017. "A Look at the History of RDBMS." Monitis.com, October 31. Accessed 2019-03-20.
  5. ETHW. 2019. "History of Lossless Data Compression Algorithms." Engineering and Technology History Wiki. Accessed 2019-03-20.
  6. Facebook GitHub IO. 2019. "Zstandard." Facebook GitHub IO. Accessed 2019-03-20.
  7. Faust, David. 2013. "Shrink Your Storage Needs with c‑treeACE Data Compression." Faircom Corporation, September 30. Updated 2015-11-17. Accessed 2019-03-20.
  8. Haas, Mike. 2015. "PART 1 OF 3: Oracle Database Compression." Mythics, February 10. Accessed 2019-03-20.
  9. IBM. 2014a. "DB2 10.5 for Linux, UNIX, and Windows." IBM Knowledge Center, October 24. Accessed 2019-03-20.
  10. IBM. 2014b. "Comparison of hierarchical and relational databases." IBM Knowledge Center, October 24. Accessed 2019-03-20.
  11. Jayaram, Prashanth. 2018. "Overview of Data Compression in SQL Server." codingSight, Devart, December 06. Accessed 2019-03-20.
  12. Microsoft Docs. 2017a. "Unicode Compression Implementation." SQL Server 2017, March 14. Accessed 2019-03-20.
  13. Microsoft Docs. 2017b. "Data Compression." SQL Server 2017, August 31. Accessed 2019-03-20.
  14. Mullins, Craig S. 2002. "Optimizing database performance, part 3: Compression, page size, and more." TechTarget, November. Accessed 2019-03-20.
  15. MySQL Docs. 2019. "How Compression Works for InnoDB Tables." Section 14.12.5, MySQL 5.5 Reference Manual, Oracle Corporation. Accessed 2019-03-20.
  16. Oracle. 2016. "Compressing your Indexes: Index Key Compression (PART 1)." Oracle Blog, March 15. Accessed 2019-03-20.
  17. Oracle. 2019a. "Oracle Advanced Compression." Integrated Cloud Applications & Platform Services, Oracle. Accessed 2019-03-20.
  18. Oracle. 2019b. "Oracle Database 12c." Accessed 2019-03-28.
  19. Patel, Dhananjay, Vinayak Bhogan, and Alan Janson. 2013. "Simulation and Comparison of Various Lossless Data Compression Techniques based on Compression Ratio and Processing Delay." International Journal of Computer Applications, vol. 81, no. 14, pp. 31-35, November. Accessed 2019-03-20.
  20. Popov, Stoimen. 2012. "Algorithm of the Week: Data Compression with Prefix Encoding." DZone, February 07. Accessed 2019-03-20.
  21. Raichand, Priyanka and Rinkle Rani Aggarwal. 2013. "A Short Survey Of Data Compression Techniques For Column Oriented Databases." Journal of Global Research in Computer Science. Accessed 2019-03-14.
  22. Shearer, Paul. 2013. "EnterpriseOne and Database Compression." LinkedIn Slideshare, December 2013. Accessed 2019-03-14.
  23. Snappy. 2019. "Snappy." Google GitHub IO. Accessed 2019-03-20.
  24. Tkachenko, Vadim. 2016. "Evaluating Database Compression Methods." Blog, Percona, March 09. Accessed 2019-03-20.
  25. Trivedi, Jignesh. 2013. "SPARSE Column in SQL Server." C# Corner, February 07. Accessed 2019-03-20.
  26. UKEssays. 2018. "Types Of Data Compression Computer Science Essay." UKEssays, All Answers Ltd., November. Updated 2018-12-05. Accessed 2019-03-14.
  27. Zaitsev, Peter. 2008. "MySQL Blob Compression performance benefits." Blog, Percona, January 11. Accessed 2019-03-20.

Further Reading

  1. ETHW. 2019. "History of Lossless Data Compression Algorithms." Engineering and Technology History Wiki. Accessed 2019-03-20.
  2. Pavlo, Andy. 2017. "Lecture #11 – Database Compression." 15-721 Advanced Database Systems, Carnegie Mellon University Spring. Accessed 2019-03-28.
  3. Oracle Advanced Compression
  4. LaRock, Thomas. 2018. "When to Use Row or Page Compression in SQL Server." DZone, February 02. Accessed 2019-03-28.
  5. Westmann, Till, Donald Kossmann, Sven Helmer, and Guido Moerkotte. 2000. "The Implementation and Performance of Compressed Databases." SIGMOD Rec., ACM, vol. 29, no. 3, pp. 55-67, September. http://doi.acm.org/10.1145/362084.362137. Accessed 2019-03-28.
  6. Roth, Mark A. and Scott J. Van Horn. 1993. "Database Compression." SIGMOD Rec., ACM, vol. 22, no. 3, pp. 31-39, September. Accessed 2019-03-28.
