Monday, April 3, 2017

Amazon RedShift: Key highlights on "Explain"


The terminology used in EXPLAIN plans in Amazon Redshift is briefly explained in this post.

  • To get the explain plan of a query, prefix the query with EXPLAIN.
  • Collect table statistics with ANALYZE so that the estimates in the explain plan are accurate.
  • Terminology used in EXPLAIN:
  • STEP: Every individual operation is a step in explain plan.
  • SEGMENT: A segment is a series of steps that can be performed by a single process.
  • STREAM: A collection of segments that always begins with a scan or reading of data and ends with materialization or blocking.
  • LAST SEGMENT: The term last segment means the query returns the data. If the result set is aggregated or sorted, the intermediate data is sent to the leader node from all compute nodes. The leader node merges the data and sends it back to the requesting client.
  • SEQUENTIAL SCAN: Also termed as scan. Data is read sequentially from beginning to end.
  • MERGE JOIN: Also termed as mjoin. This is the fastest Redshift join. This is used for inner joins and outer joins that are both distributed and sorted on join keys.
  • HASH JOIN: Also termed as hjoin. This is based on hashing the joining columns. It's faster than a nested loop join.
  • NESTED LOOP JOIN: Also termed as nloop. It's used for cross joins and joins with inequality conditions. It's the slowest join of all.
  • AGGREGATE: Also termed as aggr. This keyword is used for scalar aggregate functions, which return only one row and one column.
  • HASHAGGREGATE: Also termed as aggr. This is used for unsorted grouped aggregate functions. 
  • GROUPAGGREGATE: Also termed as aggr. This is used for sorted grouped aggregate functions.
  • SORT: Also termed as sort. ORDER BY controls this sort. 
  • MERGE: Also termed as merge. This produces the final results based on intermediate sorted results derived from parallel operations.
  • SetOp Except: Also termed as hjoin. This is only used for EXCEPT queries.
  • HASH Intersect: Also termed as hjoin. This is used for INTERSECT queries.
  • Append: Also termed as save. This is the append used with subquery scan to implement UNION and UNION ALL queries.
  • LIMIT: Also termed as limit. This is used with LIMIT clause.
  • MATERIALIZE: Also termed as save. 
  • UNIQUE: Also termed as unique. Used mostly when DISTINCT keyword is used.
  • WINDOW: Also termed as window. This term means to compute window functions. 
  • Network (Broadcast): Also termed as bcast. Broadcast is an attribute of the join operators and steps in the plan.
    • DS_BCAST_INNER: In joins, it means we are broadcasting the entire inner table to all the compute nodes.
  • Network (Distribute) : Also termed as dist. This is used to distribute rows to compute nodes for parallel processing.
    • DS_DIST_NONE: No data moves. Since the joining tables are distributed on the join keys, the matching rows already reside on the same node.
  • Network (Send to Leader) - Also termed as return. Sends the results to the leader node for further processing.
  • INSERT(Using Results) - Also termed as insert. Inserts data.
  • DELETE(Scan and Filter) - Also termed as delete. Deletes data.
  • UPDATE(Scan and Filter) - Also termed as delete,insert. Update is actually both delete and then insert.
  • COST 0.00..0.09: Costs are cumulative as we read up the plan. 0.00 is the relative cost of returning the first row and 0.09 is the cost of returning all rows.
  • ROWS: The expected number of rows to return.
  • WIDTH: The estimated average row size (width) of all the rows, in bytes.
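Putting the pieces above together, here is a sketch of running EXPLAIN on a simple join. The employee and department tables and the plan shown in comments are hypothetical; actual costs, row estimates, and operators will vary with your data and distribution keys.

```sql
-- Collect statistics first so the planner's estimates are accurate.
ANALYZE employee;
ANALYZE department;

-- Prefix the query with EXPLAIN to see the plan instead of running it.
EXPLAIN
SELECT d.deptname, COUNT(*)
FROM employee e
JOIN department d ON e.deptid = d.deptid
GROUP BY d.deptname;

-- A plan similar to the following is typical (illustrative only):
-- XN HashAggregate  (cost=0.12..0.15 rows=4 width=12)
--   ->  XN Hash Join DS_DIST_NONE  (cost=0.04..0.10 rows=8 width=12)
--         Hash Cond: ("outer".deptid = "inner".deptid)
--         ->  XN Seq Scan on employee e  (cost=0.00..0.08 rows=8 width=8)
--         ->  XN Hash  (cost=0.04..0.04 rows=4 width=12)
--               ->  XN Seq Scan on department d  (cost=0.00..0.04 rows=4 width=12)
```

Note how the terminology maps onto the plan: the sequential scans feed a hash join, DS_DIST_NONE indicates no data movement, and the costs accumulate as you read up the plan.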

Visit my post on "Key highlights on SQL Functions, Commands, Aggregate Functions, OLAP functions,  Date/Time functions, Queries" here.

Amazon RedShift: Key highlights on "Temporary Tables" and "Derived Tables"

Important points to remember about Temporary Tables are

  • Temporary tables are visible only within the current session. They are automatically dropped at the end of the session.
  • If we begin the table name with #, Redshift automatically creates it as a temporary table.
  • Operations on temporary tables
    • create table #employee_temp ...  
      • This sql automatically creates employee_temp as temporary table due to # at the front of the table name.
    • insert into #employee_temp select * from employee; 
      • # is required at the front of the temporary table name while inserting.
    • create temp table employee_temp like employee;
      • This command creates a temporary table employee_temp. It inherits its columns, distkey, sortkey and NOT NULL attributes from the employee table.
    • create temp table employee_temp as select * from employee;
      • This command creates a temporary table employee_temp. The temporary table only inherits the column names.
    • create temp table employee_temp as select firstname,lastname from employee;
      • Creates temporary table employee_temp that has firstname and lastname from employee.
    • create temp table emp_temp distkey(emp_id) as select * from employee;
    • create temp table emp_temp diststyle even as select * from employee;
    • create temp table emp_temp diststyle even sortkey(emp_name) as select * from employee;
  • A derived table lasts for the life of a query, but a temporary table lasts for the life of a session.
  • A temporary table can have the same name as a permanent table, but this is not recommended.
  • All users have the temporary table creation privilege by default. To restrict it, revoke the TEMP privilege from PUBLIC and grant the TEMP privilege to specific users/groups.
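A minimal sketch of restricting the TEMP privilege as described above. The database name mydb and group name etl_users are hypothetical:

```sql
-- Remove the default temporary-table privilege from everyone,
-- then grant it back only to a specific group.
REVOKE TEMP ON DATABASE mydb FROM PUBLIC;
GRANT TEMP ON DATABASE mydb TO GROUP etl_users;
```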
  • Deep Copy:
    • Deep copy operation recreates and repopulates the table using bulk insert  and also automatically sorts the table.
    • If a table has large unsorted region, deep copy is a preferred method compared to vacuum to sort the table.
    • 4 Methods to perform deep copy:
      • Use original DDL command - CREATE TABLE command. We can specify all PK, FK, distkey, sortkey in DDL command.
      • CREATE TABLE LIKE - The new table doesn't inherit PK and FK. It only inherits distkey, sortkey and not null attributes from parent table.
      • CTAS - The new table will not inherit PK, FK, not null, distkey, sortkey from parent table.
      • Create temporary table - If we need to retain all the attributes of the original table, then we have to create temporary table using CTAS command. Then, truncate parent table and then insert into parent table from temporary table.
    • Concurrent updates are not allowed during a deep copy operation, whereas they are allowed during vacuum.
    • v_generate_tbl_ddl script is used to generate the table DDL. 
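The fourth deep copy method above can be sketched as follows, for a hypothetical employee table. Note that in Redshift TRUNCATE commits implicitly, so this sequence cannot be wrapped in a single transaction:

```sql
-- 1. Copy all rows into a temporary table (retains no constraints,
--    but the data survives the truncate).
CREATE TEMP TABLE employee_copy AS SELECT * FROM employee;

-- 2. Empty the original table; this commits immediately.
TRUNCATE employee;

-- 3. Bulk-insert the rows back; the insert writes them in sorted order.
INSERT INTO employee SELECT * FROM employee_copy;

-- 4. Clean up (the temp table would also vanish at session end).
DROP TABLE employee_copy;
```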
  • Derived Tables:
    • Exists only within query.
    • Analogous to subquery concept in RDBMS world.
    • Derived table lives in memory.
    • A derived table must be given a name.
    • All columns in the derived table must be given names.
    • A derived table is materialized by a select statement inside the query and exists between the open and close parentheses.
    • Different ways to define derived tables:
      • select * from (select dept,avg(salary) from employee) as salavg(dept, avgsal);
        • salavg is the name given to derived table. dept, avgsal are aliases defined externally to dept, avg(salary) columns.
      • select * from (select dept,avg(salary) avgsal from employee) as salavg;
        • Salavg is the name given to the derived table. The dept column is not aliased because it defaults to the underlying column name; the avgsal alias is defined within the derived table query.
      • WITH salavg(avgsal) as (select avg(salary) from employee) select * from salavg;
        • Using with command we create derived table with column alias before running the main query.
      • WITH emp as (select * from employee) select * from emp;
        • Derived table emp selects all rows and columns from main table employee. 
      • The following query creates two derived tables using WITH command and then performs a join.
                              WITH emp(deptid,avgsal) as (select deptid, avg(sal) from employee) ,
                                      dept as (select deptid,deptname from department)
                              select emp.deptid, dept.deptname, avgsal from emp inner join dept
                              on emp.deptid = dept.deptid;          
    • The derived table is built first while executing a query.

Check my post on Key highlights on Explain.

Sunday, April 2, 2017

Amazon RedShift: Key highlights on "Compression"


This post highlights important points on Compression in Redshift.

  • Compression is at column-level in Redshift.
  • By default, Amazon Redshift stores data in raw and uncompressed format.
  • Types of compression supported by Redshift are

  • Byte Dictionary encoding stores a dictionary of unique values for each block of column values. Due to the 1MB disk block size, the dictionary can hold up to 256 unique values in a single block. Any distinct values beyond 256 are stored in raw, uncompressed form. This encoding is effective when a column has fewer than 256 distinct values.
  • Delta encoding is useful mostly for date and time columns. This encoding compresses data by storing the difference between consecutive values in the column. DELTA records the difference as 1-byte values; DELTA32K records the difference as 2-byte values.
  • LZO encoding works best with character strings. LZO provides a high compression ratio, so compression is a little slower, but it supports extremely fast decompression.
  • Mostly encoding is useful when the declared data type for a column has a larger capacity than the majority of the stored values require.
  • Run Length encoding is useful when a column has consecutively repetitive data values. This encoding replaces the consecutively repetitive value in a column with a token that contains the value and number of consecutive occurrences.
  • Text255 and Text32k encodings are useful for compressing VARCHAR columns only.  A separate dictionary is created with unique words in the column value and an index value is associated with each unique word. 
  • The ANALYZE COMPRESSION command performs compression analysis and makes encoding suggestions. To implement the suggestions, we must recreate the table.
    • analyze compression; : analyzes all the tables in the current DB.
    • analyze compression table_name; : analyzes the specified table. More than one table can't be specified in this command.
    • analyze compression table_name column_name; : analyzes the specified column. More than one column can be specified.
    • analyze compression table_name comprows numrows; : numrows is the number of rows to be used as the sample size for the compression analysis.
  • Analyze compression acquires an exclusive table lock.
  • Analyze compression doesn't consider Run Length encoding for sort key columns.
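A short sketch of the workflow above: ask for suggestions, then recreate the table with explicit encodings. The employee table and the encodings chosen are hypothetical; in practice you would pick the encodings ANALYZE COMPRESSION suggests for your data.

```sql
-- Sample 100000 rows for the analysis; note this takes an
-- exclusive table lock while it runs.
ANALYZE COMPRESSION employee COMPROWS 100000;

-- Suggestions can't be applied in place; recreate the table
-- with ENCODE clauses and reload the data.
CREATE TABLE employee_new (
    emp_id    INTEGER      ENCODE delta,
    emp_name  VARCHAR(100) ENCODE lzo,
    dept      VARCHAR(30)  ENCODE bytedict,
    hired     DATE         ENCODE delta
);
```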

Amazon RedShift: Key highlights on "System Tables"

This post has key points on System Tables used in Redshift.


  • Redshift has the following system tables:
    • STL (System Table Logging) - These system tables are generated from log files. Logging tables have the STL prefix.
    • STV (System Table Virtual) - These virtual system tables contain snapshots of the current system data and have the STV prefix.
    • System views - A subset of the data found in the STL and STV tables is available in the system views, SVL (System View Logging) and SVV (System View Virtual) respectively.
    • System catalog tables - These tables store schema metadata, such as information on tables and columns. These tables have the PG prefix.
  • For system tables to return a table's metadata, the schema of the table should be added to the search_path. The command below adds the sql_class schema to the search_path. Once that is done, we can retrieve the metadata of any table in the sql_class schema.
    • set search_path to '$user','public','sql_class';
  • All system tables exist in the pg_catalog schema in Redshift.
  • Some of the Redshift system tables:
    • pg_table_def: Contains table information like columns, datatypes etc.
    • pg_aggregate 
    • svv_diskusage : Is used to find the data distribution skew in each table like number of rows stored in each slice for a given table.
    • svl_statementtext : Can be used to find all statements that ran the ANALYZE command.
    • stv_blocklist : Is used to find how many 1MB blocks of disk space are used for each table.
    • stl_load_commits: Can be used to check the details of COPY operations.
    • stl_query: Can be used to find out when a table was last analyzed.
    • stv_tbl_perm: Has table IDs.
    • svl_qlog: Contains the elapsed time of queries.
    • svl_query_summary: We can determine if a query is writing to disk. If the is_diskbased field is ("t") for any step, then that step wrote data to disk.
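Two example queries against the system tables listed above. The table name employee is hypothetical; remember that pg_table_def only shows tables in schemas on your search_path:

```sql
-- Column definitions, encodings, and key attributes for one table.
SELECT "column", type, encoding, distkey, sortkey
FROM pg_table_def
WHERE tablename = 'employee';

-- Data-distribution skew: rows stored per slice for the first
-- column of the table. Large differences between slices indicate
-- a poorly chosen distribution key.
SELECT slice, col, num_values
FROM svv_diskusage
WHERE name = 'employee' AND col = 0
ORDER BY slice;
```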
View my next post on Key Highlights on Compression.

Amazon RedShift: Key highlights on "Introduction" and "Best Practices For Designing a Table"

Amazon Redshift is an MPP (massively parallel processing), petabyte-scale data warehouse service hosted on Amazon Web Services. Here I have provided some key highlights, or important points to remember, about Amazon RedShift. The information below is a good way to refresh your knowledge of Amazon Redshift.

This post contains highlights on Amazon RedShift Introduction and Best Practices For Designing a Table.


  • Amazon Redshift is a columnar database.
  • Tables can be distributed as
    • Distribution on unique key
    • Distribution on non-unique key 
    • Distribution Style is ALL
    • Distribution Style is EVEN
    • Matching Distribution Keys for co-location of Joins
    • Big table/Small table joins
  • Defining a Sort Key helps in performance improvement. Data is stored on disk in sorted order.
  • Columns that are good candidates to be defined as sort key are
    • Timestamp or date columns on which we frequently access the recent data 
    • Columns on which we perform range or equality filtering 
    • Columns frequently used for order by, group by or window functions
    • The joining column, on which two tables can be joined 
  • Columns that are good candidates to be defined as distributed key:
    • To distribute data evenly among the nodes and slices in a cluster
    • To collocate data for joins and aggregations
  • When two tables are frequently joined on a join column, specify that join column as both sort key and distribution key.
  • RedShift stores columnar data in 1MB disk blocks.
  • Each block comes with metadata. The min and max values of keys for each block are stored as part of metadata. This helps range or equality queries to quickly identify the block which matches the filter condition.
  • Analyze command collects statistics
    • analyze; -> analyzes all tables in the current database
    • analyze verbose; -> analyzes all tables in the current database and reports progress
    • analyze table_name; -> analyzes the table specified
    • analyze table_name(col1, col2); -> analyzes the columns col1 and col2 in the table
  • Redshift automatically analyzes some tables created with the following commands.
    • Create table as
    • create temp table as
    • select into 
  • Vacuum:
    • Redshift doesn't automatically reclaim or reuse space that is freed by delete and update operations. These rows are logically deleted but not physically. Vacuum will reclaim the space.
    • During an update operation, Redshift marks the old row as deleted and inserts a new row. So every update is a delete followed by an insert. The sort order might be disturbed by updates.
    • When delete operation is performed, the row is marked for delete but not removed until Vacuum.
  • The vacuum operation is time consuming, so it's preferred to run it during a maintenance window.
  • Vacuum commands:
    • vacuum; -> reclaims space and resorts rows in all tables in the current database
    • vacuum table_name; -> reclaims space and resorts rows in the specified table
    • vacuum sort only table_name; -> resorts rows of the table without reclaiming space
    • vacuum delete only table_name; -> reclaims space from deletes and updates without resorting
  • Best Practices for designing tables
    • Choose the best sort key
    • Choose the best distribution key
    • Consider defining primary key and foreign key constraints
    • Use the smallest possible column size
    • Use date/time data types for date columns
    • Specify redundant predicates on the sort column
  • Primary key, foreign key, and unique key constraints are not enforced by Redshift. They are informational only and are used by the query planner to generate a better query plan.
  • RedShift does enforce NOT NULL constraints.
  • Redshift stores Date and Timestamp data more efficiently than other datatypes.
  • While performing joins, specify redundant predicates so unwanted blocks are skipped.
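The design advice above can be sketched in DDL. The schema here is hypothetical: a small dimension table replicated to every node with DISTSTYLE ALL, and a fact table distributed on the join column and sorted on the date column used for range filters:

```sql
-- Small, rarely changing dimension: replicate to all nodes so
-- joins against it never require data movement.
CREATE TABLE department (
    deptid   INTEGER NOT NULL,
    deptname VARCHAR(50)
) DISTSTYLE ALL;

-- Large fact table: distribute on the join key, sort on the
-- date column that range queries filter on.
CREATE TABLE employee (
    emp_id   INTEGER NOT NULL,
    deptid   INTEGER NOT NULL,
    hired    DATE    NOT NULL
)
DISTKEY (deptid)
SORTKEY (hired);

-- The predicate on the sort column lets Redshift use the per-block
-- min/max metadata to skip blocks outside the date range.
SELECT COUNT(*)
FROM employee
WHERE deptid = 10
  AND hired >= '2017-01-01';
```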

To continue, go to my next post on System Tables.

Monday, May 16, 2016

Install Apache Spark on Mac/Linux using prebuilt package

If you do not want to run Apache Spark on Hadoop, then standalone mode is what you are looking for. Here are the steps to install and run Apache Spark on MAC/Linux in standalone mode.

1. Java is a prerequisite for running Apache Spark. Install Java 7 or later. If not present, download Java from here.
If Java is already installed, try the following command to verify Java version

$ java -version

2. Download Scala. Choose the first option of "Download Scala x.y.z. binaries for your system".
Untar the Scala tar file using the following command.

$ tar xvf scala-2.11.8.tgz

3. Use the following command to move the scala directory to the /usr/local/scala directory.

$ sudo mv scala-2.11.8 /usr/local/scala
Password:

4. Set PATH for Scala.

$ export PATH=$PATH:/usr/local/scala/bin

5. To check if Scala is working, run the following command.

$ scala -version

Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL

6. Apache Spark can be installed in two ways.
  • Building Spark using SBT 
  • Use prebuilt Spark package
Let's choose a Spark prebuilt package for Hadoop from here. Here we are trying to download spark-1.6.1-bin-hadoop2.6 version. After downloading, spark tar file will be in download folder.
Untar the downloaded tar file using the following command.

$ tar xvf spark-1.6.1-bin-hadoop2.6.tgz

7. Move Spark software files to /usr/local/spark directory 

$ sudo mv spark-1.6.1-bin-hadoop2.6 /usr/local/spark
Password:

Set PATH variable to the downloaded spark folder.

$ export PATH=$PATH:/usr/local/spark/bin

8. To test whether Spark is working, run the following command.

$ spark-shell

If Spark is installed successfully, you will see output similar to the following.


Spark assembly has been built with Hive, including Datanucleus jars on classpath 
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop 
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled;
   ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop) 
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server 
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292. 
Welcome to 
      ____              __ 
     / __/__  ___ _____/ /__ 
    _\ \/ _ \/ _ `/ __/  '_/ 
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/  
  
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_79) 
Type in expressions to have them evaluated. 
Spark context available as sc.  
scala> 

Saturday, May 14, 2016

Install Apache Spark on Windows 10 using prebuilt package

If you do not want to run Apache Spark on Hadoop, then standalone mode is what you are looking for. Here are the steps to install and run Apache Spark on Windows in standalone mode.

1. Java is a prerequisite for running Apache Spark. Install Java 7 or later. If not present, download   Java from here.

2. Set JAVA_HOME and PATH variables as environment variables.

3. Download Scala. Then execute the installer. Choose the first option of "Download Scala x.y.z. binaries for your system".

4. Set the SCALA_HOME environment variable to the downloaded Scala directory and add its bin folder to the PATH. For example, if you have downloaded Scala to the C:\scala directory, then set

SCALA_HOME=C:\scala
PATH=C:\scala\bin

Also, you can set the _JAVA_OPTIONS environment variable to the value mentioned below to avoid any Java heap memory problems you may encounter.
_JAVA_OPTIONS=-Xmx512M -Xms512M

5. To check if Scala is working, run the following command.

>scala -version
Picked up _JAVA_OPTIONS: -Xmx512M -Xms512M
Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL

6. Spark can be installed in two ways.
  • Building Spark using SBT 
  • Use prebuilt Spark package
Let's choose a Spark prebuilt package for Hadoop from here. Download and extract it to any drive.
Set SPARK_HOME and PATH environment variable to the downloaded spark folder. For example, if you have downloaded Spark to C:\Spark\spark-1.6.1-bin-hadoop2.6 directory then  set

SPARK_HOME=C:\spark\spark-1.6.1-bin-hadoop2.6
PATH=C:\spark\spark-1.6.1-bin-hadoop2.6\bin

7. Although we aren't using Hadoop with Spark, Spark still checks for the HADOOP_HOME variable in its configuration. To overcome this error, download winutils.exe and place it in any location.
Download winutils.exe for 64 bit.

8. Set HADOOP_HOME to the folder containing winutils.exe. For example, if you place winutils.exe at D:\winutils\bin\winutils.exe, then set
HADOOP_HOME=D:\winutils

Set PATH environment variable to include %HADOOP_HOME%\bin as follows
PATH=D:\winutils\bin 

9. Grant permissions on the \tmp\hive folder if you get a permissions error; the \tmp\hive directory must be writable. Use the below command to grant the privileges.

D:\spark>D:\winutils\winutils.exe chmod 777 D:\tmp\hive

10. To test whether Spark is working, run the example from the bin folder of Spark.
> bin\run-example SparkPi 10

It should execute the program and return "Pi is roughly 3.14".
