Amazon Redshift is a MPP (massive parallel processing) peta-byte scale data warehouse service hosted on Amazon Web Services. Here I have provided some key highlights or important points to remember about Amazon RedShift. To refresh your knowledge on Amazon Redshift the below provided information is wealth.
This post contains highlights on Amazon RedShift Introduction and Best Practices For Designing a Table.
This post contains highlights on Amazon RedShift Introduction and Best Practices For Designing a Table.
- Amazon Redshift is a columnar database.
- Tables can be distributed as
- Distribution on unique key
- Distribution on non-unique key
- Distribution Style is ALL
- Distribution Style is EVEN
- Matching Distribution Keys for co-location of Joins
- Big table/Small table joins
- Defining a Sort Key helps in performance improvement. Data is stored on disk in sorted order.
- Columns that are good candidates to be defined as sort key are
- Timestamp or date columns on which we frequently access the recent data
- Columns on which we perform range or equality filtering
- Columns frequently used for order by, group by or window functions
- The joining column, on which two tables can be joined
- Columns that are good candidates to be defined as distributed key:
- To distribute data evenly among the nodes and slices in a cluster
- To collocate data for joins and aggregations
- When two tables are frequently joined on a join column, specify that join column as both sort key and distribution key.
- RedShift stores columnar data in 1MB disk blocks.
- Each block comes with metadata. The min and max values of keys for each block are stored as part of metadata. This helps range or equality queries to quickly identify the block which matches the filter condition.
- Analyze command collects statistics
- Analyze; -> analyzes all tables in current database
- Analyze verboze; -> Analyzes all tables in current database and report progress
- analyze table -> analyzes the table specified
- analyze table(col1, col2) -> analyzes the columns col1 and col2 in table
- Redshift automatically analyzes some tables created with the following commands.
- Create table as
- create temp table as
- select into
- Vacuum:
- Redshift doesn't automatically reclaim or reuse space that is freed by delete and update operations. These rows are logically deleted but not physically. Vacuum will reclaim the space.
- Redshift during update operation marks the old row as delete and inserts a new row. So every update is a delete followed by insert. The sort order might be disturbed by updates.
- When delete operation is performed, the row is marked for delete but not removed until Vacuum.
- Vacuum operation is time consuming, so its preferred to run it during maintenance window.
- Vacuum commands:
- vacuum; -> reclaims space of all tables in current database
- vacuum table -> reclaims space of specified table.
- vacuum sort only table -> resorts rows for table
- vacuum delete only table -> claims space from deletes and updates for table
- Best Practices for designing tables
- Choose the best sort key
- Choose the great distribution key
- Consider defining primary key and foreign key constraints
- Use the smallest possible column size
- Use date/time data types for date columns
- Specify redundant predicates on the sort column
- Primary key, foreign key, unique key constraints are not enforced by Redshift. They are for informational purpose and is used by query planner to generate a better query plan.
- RedShift does enforce NOT NULL constraints.
- Redshift stores Date and Timestamp data more efficiently than other datatypes.
- While performing joins, specify redundant predicates so unwanted blocks are skipped.
To continue, go to my next post on System Tables.
No comments:
Post a Comment