Sunday, April 2, 2017

Amazon Redshift: Key highlights on "Compression"


This post highlights important points on Compression in Redshift.

  • Compression is at column-level in Redshift.
  • By default, Amazon Redshift stores data in raw and uncompressed format.
  • The compression encodings supported by Redshift are listed below; a sample table definition using several of them follows the list.

  • Byte Dictionary encoding stores a dictionary of unique values for each block of column values. With Redshift's 1 MB disk block size, the dictionary can hold up to 256 unique values in a single block; any distinct values beyond 256 are stored in raw, uncompressed form. This encoding is effective when a column has fewer than 256 distinct values.
  • Delta encoding is useful mostly for date and time columns. It compresses data by storing the difference between consecutive values in the column. DELTA records differences as 1-byte values; DELTA32K records differences as 2-byte values.
  • LZO encoding works best with character strings. LZO provides a high compression ratio, so compression is a little slower, but it supports extremely fast decompression.
  • Mostly encodings (MOSTLY8, MOSTLY16, MOSTLY32) are useful when the declared data type of a column has a larger capacity than the majority of the values actually stored require.
  • Run Length encoding is useful when a column has consecutively repeated data values. It replaces each run of repeated values with a token that contains the value and the number of consecutive occurrences.
  • Text255 and Text32k encodings are useful for compressing VARCHAR columns only. A separate dictionary of the unique words in the column values is created, and an index value is associated with each unique word.
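
As a concrete illustration of column-level compression, below is a minimal sketch of a table definition that assigns several of the encodings above explicitly. The table and column names are invented for this example; each ENCODE choice follows the guidance in the corresponding bullet.

    create table web_events (
        event_id    bigint       encode delta,      -- ids that grow in small steps
        event_date  date         encode raw,        -- sort key column, left uncompressed
        country     char(2)      encode bytedict,   -- far fewer than 256 distinct values
        status      varchar(10)  encode runlength,  -- long runs of the same value
        page_url    varchar(300) encode lzo,        -- free-form character strings
        clicks      integer      encode mostly8     -- declared integer, values mostly small
    )
    sortkey (event_date);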
  • The ANALYZE COMPRESSION command performs compression analysis and suggests an encoding for each column; example invocations follow the notes below. To implement the suggestions, the table must be recreated (a deep-copy sketch appears at the end of this post).
    • analyze compression : analyzes all the tables in the current database.
    • analyze compression table_name : analyzes only the specified table; more than one table can't be specified in this command.
    • analyze compression table_name (column_name, ...) : analyzes the specified columns; more than one column can be specified.
    • analyze compression table_name comprows numrows : numrows is the number of rows to be used as the sample size for the compression analysis.
  • ANALYZE COMPRESSION acquires an exclusive table lock while it runs, preventing concurrent reads and writes against the table.
  • ANALYZE COMPRESSION doesn't consider Run Length encoding for sort key columns.
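
For reference, a few example invocations of the command variants above, using the hypothetical web_events table from the earlier sketch:

    -- analyze every table in the current database
    analyze compression;

    -- analyze a single table (only one table per command)
    analyze compression web_events;

    -- analyze specific columns of a table
    analyze compression web_events (country, status);

    -- sample 200,000 rows instead of the default sample size
    analyze compression web_events comprows 200000;

Because an existing column's encoding can't be changed in place, applying the suggested encodings means recreating the table. One common approach is a deep copy, sketched below under the assumption that the new DDL carries the suggested encodings:

    -- 1. create a new table whose columns use the suggested encodings
    create table web_events_new (
        event_id bigint encode delta,
        country  char(2) encode bytedict
        -- ...remaining columns with their suggested encodings...
    );

    -- 2. copy the data, then swap the tables
    insert into web_events_new select * from web_events;
    drop table web_events;
    alter table web_events_new rename to web_events;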
