Sunday, April 2, 2017

Amazon Redshift: Key Highlights on "Introduction" and "Best Practices for Designing a Table"

Amazon Redshift is an MPP (massively parallel processing), petabyte-scale data warehouse service hosted on Amazon Web Services. Here I have provided some key highlights and important points to remember about Amazon Redshift. The information below is also useful if you simply want to refresh your knowledge of Amazon Redshift.

This post contains highlights on the Amazon Redshift introduction and best practices for designing a table.


  • Amazon Redshift is a columnar database.
  • Tables can be distributed in the following ways:
    • KEY distribution on a unique key
    • KEY distribution on a non-unique key
    • Distribution style ALL
    • Distribution style EVEN
    • Matching distribution keys for co-location of joins
    • Big table/small table joins (distribute the small table with ALL)
  • Defining a sort key improves performance; data is stored on disk in sorted order.
  • Columns that are good candidates to be defined as the sort key:
    • Timestamp or date columns through which the most recent data is frequently accessed
    • Columns on which we perform range or equality filtering
    • Columns frequently used in ORDER BY, GROUP BY, or window functions
    • The join column on which two tables are frequently joined
  • Goals when choosing a distribution key:
    • Distribute data evenly among the nodes and slices in a cluster
    • Collocate data for joins and aggregations
  • When two tables are frequently joined on a column, specify that join column as both the sort key and the distribution key.
  • Redshift stores columnar data in 1 MB disk blocks.
  • Each block carries metadata, including the minimum and maximum values of the data stored in it. This lets range or equality queries quickly identify the blocks that match the filter condition and skip the rest.
  • The ANALYZE command collects statistics (see the maintenance sketch after this list for ANALYZE and VACUUM examples):
    • ANALYZE; -> analyzes all tables in the current database
    • ANALYZE VERBOSE; -> analyzes all tables in the current database and reports progress
    • ANALYZE table; -> analyzes the specified table
    • ANALYZE table(col1, col2); -> analyzes only columns col1 and col2 of the specified table
  • Redshift automatically analyzes some tables created with the following commands:
    • CREATE TABLE AS
    • CREATE TEMP TABLE AS
    • SELECT INTO
  • Vacuum:
    • Redshift does not automatically reclaim or reuse space freed by DELETE and UPDATE operations; the affected rows are logically deleted but remain on disk. VACUUM reclaims that space.
    • During an UPDATE, Redshift marks the old row as deleted and inserts a new row, so every update is effectively a delete followed by an insert. Updates can therefore disturb the sort order.
    • When a DELETE is performed, the row is marked for deletion but is not physically removed until VACUUM runs.
  • The VACUUM operation is time consuming, so it is preferable to run it during a maintenance window.
  • VACUUM commands:
    • VACUUM; -> reclaims space and resorts rows for all tables in the current database
    • VACUUM table; -> reclaims space and resorts rows for the specified table
    • VACUUM SORT ONLY table; -> resorts rows for the specified table without reclaiming space
    • VACUUM DELETE ONLY table; -> reclaims space from deletes and updates without resorting
  • Best practices for designing tables (see the CREATE TABLE sketch after this list):
    • Choose the best sort key
    • Choose the best distribution key
    • Consider defining primary key and foreign key constraints
    • Use the smallest possible column size
    • Use date/time data types for date columns
    • Specify redundant predicates on the sort column
  • Primary key, foreign key, and unique constraints are not enforced by Redshift. They are informational only and are used by the query planner to generate better query plans.
  • Redshift does enforce NOT NULL constraints.
  • Redshift stores DATE and TIMESTAMP data more efficiently than character types, so use date/time data types for date columns.
  • While performing joins, specify redundant predicates on the sort columns so unwanted blocks are skipped (see the join example after this list).
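
To tie several of these points together, here is a minimal CREATE TABLE sketch. The table and column names (dim_customer, fact_orders, and their columns) are hypothetical; the sketch only illustrates how distribution style, distribution key, sort key, informational constraints, compact column sizes, and date/time types fit together.

  -- Small lookup table (hypothetical): DISTSTYLE ALL keeps a full copy on every
  -- node, which helps big table/small table joins avoid redistribution.
  CREATE TABLE dim_customer (
      customer_id   INTEGER      NOT NULL,   -- NOT NULL is enforced
      customer_name VARCHAR(100),            -- smallest practical column size
      signup_date   DATE,                    -- date/time type instead of VARCHAR
      PRIMARY KEY (customer_id)              -- informational only, not enforced
  )
  DISTSTYLE ALL;

  -- Large fact table (hypothetical): KEY distribution on the join column
  -- co-locates matching rows, and the sort key supports range filters on dates.
  CREATE TABLE fact_orders (
      order_id     BIGINT        NOT NULL,
      customer_id  INTEGER       NOT NULL REFERENCES dim_customer (customer_id),
      order_date   DATE          NOT NULL,
      order_total  DECIMAL(12,2)
  )
  DISTKEY (customer_id)
  SORTKEY (order_date);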
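
The ANALYZE and VACUUM bullets above translate into statements like the following maintenance sketch, shown here against the hypothetical fact_orders table from the previous example:

  -- Collect statistics so the query planner has up-to-date information.
  ANALYZE;                                        -- all tables in the current database
  ANALYZE VERBOSE;                                -- all tables, reporting progress
  ANALYZE fact_orders;                            -- a single table
  ANALYZE fact_orders (customer_id, order_date);  -- only the listed columns

  -- Reclaim space and restore sort order, preferably during a maintenance window.
  VACUUM;                                         -- full vacuum of all tables
  VACUUM fact_orders;                             -- full vacuum of one table
  VACUUM SORT ONLY fact_orders;                   -- resort rows without reclaiming space
  VACUUM DELETE ONLY fact_orders;                 -- reclaim space without resorting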
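
Finally, a join example for the redundant-predicate advice. fact_deliveries is another hypothetical table, assumed to be sorted on delivered_at; because a delivery can never happen before its order in this model, the second date filter is logically redundant, but it lets Redshift use the per-block min/max metadata to skip blocks in both tables:

  SELECT o.order_id, d.delivered_at
  FROM   fact_orders     o
  JOIN   fact_deliveries d ON d.order_id = o.order_id
  WHERE  o.order_date   >= '2017-01-01'
  AND    d.delivered_at >= '2017-01-01';  -- redundant predicate on the sort column of fact_deliveries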

To continue, go to my next post on System Tables.
