Learning Big Data : Amazon Redshift: Key highlights on "SQL Functions" and "Queries"

A very high level overview of Amazon Redshift SQL functions and commands is explained below. This post can be used mainly for refreshing the SQL concepts.

A column alias can be referenced in the where clause.
current_schema() function is a leader_node only function.
SEARCH_PATH:

show search_path; -> shows the search path
set search_path to '$user','public','sql_class','sql_views'; -> adds databases sql_class and sql_views to search path
when an object is created without the schema, it's placed in the first schema listed in the search path. If search path is empty, it returns error.
When identical objects exist in different schemas, the one found first in the search path is the one that will be used.
Objects that are not in any search path schema can only be referenced by using a fully qualified name that also includes the schema name.
pg_catalog system catalog schema is always searched. If it is mentioned in the path, it is searched in the specified order. If pg_catalog is not specified in search path then it is searched before any of the path items.
The current session's temporary schema (pg_temp_nnn) is searched first when referencing temporary table even before referencing pg_catalog schema. Temporary schema is only searched for tables and view names, not for any function names.

Neither IN nor NOT IN can search for NULLS.
Order of logical operators precedence: (), NOT, AND, OR.
BETWEEN and NOT BETWEEN statements are inclusive.
LIKE command works differently on CHAR and VARCHAR.

CHAR data type pads spaces behind them until they reach the size specified while defining. For example, CHAR(10) will pad 10 spaces until it reach 10 characters.

last_name like '%n' will look for values where 10th character is n. All values less than 10 characters will have spaces padded at end and any such rows will not be returned.
to get the results, we can trim spaces and then search for pattern.

trim(last_name) like '%n' will return the expected results

ILIKE command is case insensitive, whereas LIKE command is case sensitive.
All integers are right justified. All integers will start from the right and move left.
All character data is left justified. They will start from left and move to the right.
Default Escape character is \\. '\\_' means _ is no longer a wildcard, but an actual literal underscore.
We can set any character to escape character.

where name like 'ch@_%' escape '@' -> This considers _ as actual literal and @ as escape command.

LIKE command can be replaced with SIMILAR TO. Similar To is computationally expensive compared to LIKE.
DISTINCT vs GROUP BY

Many Duplicates - use group by
Few Duplicates - use distinct
Space exceeded - use group by

TOP command returns top rows from the result set. LIMIT and TOP returns the same result set.
Aggregate commands ignore null values.
count(*) considers null values, count(<col_name>) ignores null values.
Redshift supports non-ansi syntax for joins.
Once a table has been aliased, columns should be referenced using alias not actual table name.
Redshift supports Oracle joins for outer joins i.e. (+) operator. Both these statements are valid

customer as c left outer join orders as o on (c.cust_no = o.cust_no)
cutomer as c , orders as o where c.cust_no = o.cust_no(+)

where clause is performed first to reduce data before performing inner joins.
where clause is performed last after performing outer joins.
ANSI joins should have ON clause. With out ON clause they wont perform cartesian product, instead will error out.
In ANSI, CROSS JOIN performs cartesian product.
In INNER JOIN, WHERE condition defined and the same condition defined in AND clause of ON returns the same result.

custumer cust inner join orders ord ON cust.cust_no = ord.cust_no where cust_name like '%Bi%'; returns same result set as
custumer cust inner join orders ord ON cust.cust_no = ord.cust_no and cust_name like '%Bi%'

In OUTER JOINS, WHERE is performed after the join condition. The additional AND is performed in conjunction with ON statement. So WHERE condition and AND clause in ON doesn't return same results.

custumer cust left outer join orders ord ON cust.cust_no = ord.cust_no where cust_name like '%Bi%'; doesn't return same result set as
custumer cust left outer join orders ord ON cust.cust_no = ord.cust_no and cust_name like '%Bi%'

This query returns all rows from customer table because its left outer on customer. It includes the values from orders table if cust_name matches the condition, else they will be null.

In multi-table joins all ON clauses can be written at the end. But the only way it works is to place the ON clauses backwards. The first ON clause represents the last JOIN and then move backwards.
The output of different Date Functions are:

CURRENT_DATE : Returns in YYYY-MM-DD format
TIMEOFDAY() : Returns weekday, date and time with microseconds.
SYSDATE : Date and time with microseconds.
TRUNC(SYSDATE) : YYYY-MM-DD
GETDATE() : Date and time , no microseconds included
TRUNC(GETDATE()) : YYYY-MM-DD
date_col + 60 : Adds 60 Days to date_col field.
date_col - 60 : subtracts 60 days from date_col.
ADD_MONTHS('2014-04-30',1) : Adds 1 month to date_col and returns last day of month. Returns '2014-05-31 00:00:00'. Add months returns timestamp, by default 00:00:00.
DATEADD(month,1,'2014-04-30') : Returns the corresponding day in the resulting month, not the last day of month. Returns '2014-05-30 00:00:00'. This also returns timestamp.
EXTRACT(hour from current_time) - Extracts hour part from the current_time. Can be used to extract DAY, MONTH, YEAR, HOUR, MINUTE, SECOND.
To use EXTRACT with literal values, use

EXTRACT(hour from TIME '10:02:34') returns 10.
EXTRACT(DAY from DATE '2017-05-04') returns 04.

DATEDIFF(Day, order_date, current_date) : Returns number of days between order_date and current_date. This function uses the datepart (Day, Week, Month, Year etc).
DATE_PART(dow,order_date) : DOW stands for Day Of Week. Returns values between 0 and 6 where 0 stands for Sunday and 6 for Saturday.
DATE_PART(minute,order_date): Returns minutes from order_date. If this expression is used in select statement and an alias is not defined, a default alias "pgdate_part" is assigned.
Date1 - Date2 : Direct subtraction of two date variables results in number of days between them.
MONTHS_BETWEEN(date1,date2) - returns number of months between two date variables.
CURRENT_TIME: Returns currrent time in HH:MI:SS
CURRENT_TIME - 55 - Subtracts 55 seconds from current_time
CURRENT_TIMESTAMP - Returns date and time.
TIMESTAMP(date_var) - TIMESTAMP function can convert date variable or combination of date and time to timestamp.

TIMESTAMP(DATE '2005-01-01', TIME '4:30:33') - 2005-01-01 4:30:33:204331

TO_TIMESTAMP - converts character string to timestamp.
NOW() - Returns current YYYY-MM-DD HH:MI:SS. Same as CURRENT_TIMESTAMP.
AGE(start_date,end_date) - Returns the interval in days, months and years and along with time.
current_date + Interval '1' Day - Adds 1 day to current_date
current_date + Interval '3' month, current_date + Interval '5' Year - adds month and year specified to current_date.
The default for all intervals is 2 digits.
OVERLAPS - Is a boolean function that returns true if two different date/time ranges have common days/time, else it returns false.

Different forms of OLAP functions:

ROWS UNBOUNDED PRECEDING - start calculating from first row and continue till the last row.

Examples:
SUM(sales) over (order by sale_date rows unbounded preceding)
SUM(sales) over (order by product_id, sale_date rows unbounded preceding)
SUM(sales) over (partition by product_id order by product_id, sale_date rows unbounded preceding)

For each partition identified in the dataset, SUM of sales starts over. Calculating the sum of sales starts from first row and continues till the last row in each partition.

ROWS 2 PRECEDING - Gets the moving sum of 3 rows (Moving window of 3 rows). Calculate the Current Row and 2 rows preceding

Examples:
SUM(sales) over (order by product_id,sale_date rows 2 preceding)

RANK() over (order by sales) - Rank() defaults to Ascending order. RANK doesn't have anything in parenthesis.

RANK() over (order by sales desc) - RANK() in descending order. This query orders sales by descending values and assigns rank from 1.
If two values get same rank for example, rank 2 , then the next values skips the 3 and assigns 4.
If where condition is added to sql with rank() OLAP function, the rows are filtered based on where condition after assigning the rank values.

PERCENT_ROWS() over (order by daily_sales) - Gives rank in percentages
COUNT(*) over (partition by product_id order by order_date rows unbounded preceding)
MAX(sales) over (order by product_id,sale_date rows unbounded preceding)
MIN(sales) over (order by product_id,sale_date rows unbounded preceding)
ROW_NUMBER() over (order by product_id,sale_date) - Generates sequence number and increases sequentially.
FIRST_VALUE and LAST_VALUE : These functions allow to specify sorted aggregate groups and return the first and last values in each group.
LEAD and LAG - LAG and LEAD allows to compare different rows of a table by specifying an offset from the current row.

If there is a subquery inside a top query, then subquery runs first.
The IN list ignores duplicate values.
A subquery in IN clause is similar to LEFT SEMI JOIN in Apache Hive.
If we need columns only from one table use subquery in IN clause. If we need columns from both tables then perform a join. Both subquery and join operation join two tables on a common column.
Correlated Subqueries:

The top-query is correlated to bottom query.
The tables in top and bottom queries are given different aliases.
The top query runs first.
For each distinct value from the top query, the bottom query is run.

A correlated subquery can be rewritten as a join with derived table.

When using NOT IN clause, make sure to exclude NULL values else query would return zero rows.
NOT EXISTS is unaffected by a NULL value. Replace NOT IN with NOT EXISTS whereever possible to deal with NULL values.
TRIM(name) and TRIM(BOTH from name) - Both commands trims trailing and leading spaces.
TRIM(TRAILING 'y' from name) - Removes y letter from the end of name, if exists.
TRIM command is case sensitivity.
SUBSTRING(firstname from 2 for 3) - start from 2nd position and go for 3 positions. This is 1-based.
SUBSTRING(firstname from 3) - start from 3rd character till the end.
SUBSTRING(firstname from 0 for 6) - returns from 1st character to 5th character.
SUBSTRNG(firstname from -1 for 4) - returns 2 characters before because the first position starts 2 spaces before.
POSITION('e' in last_name) - returns the position of e letter (1-based) from last name.
CURSOR:

begin; -> This keyword begins a transaction.
declare c_tom cursor for select * from sales; -> This declares a cursor c_tom.
fetch forward 5 from c_tom; -> This fetches first 5 rows from c_tom;
close c_tom; -> Closes the cursor;

NULLIFZERO(quantity) - If quantity is zero, return null. Else return the actual value.
NULLIF(val1,val2) - If val1 and val2 are equal return null. Else return val1.
ZEROIFNULL(col) - If col is null then return zero, else return actual value.
When case statement is not given alias, system assigns <CASE expression> as alias.
CASE statements can be nested.
ORDER BY statements are allowed in the CREATE VIEW statements.
When Order by is used in a select statement on View (view is already created with order by ), the order by defined in select statement is applied. The order by defined in VIEW definition is not considered in this case.
create view allview as select * from baseTable;

The view only includes the columns present when the view was created. Any new column added to base table after view creation is not included in view results though select * is present.
Same applies to deleting a column. If a column exists at the time of view creation and its dropped later, then the view will no longer work though select * is present in view creation.

Redshift allows the user to update the base table though the views.
SET Operators precedence is INTERSECT, UNION and EXCEPT/MINUS.
EXCEPT and MINUS behave similarly in SET operators.
When using SET operators top query will define alias whereas bottom query defines ORDER BY. Here ORDER BY supports only positional notation. Name notion if used will fail.
Statistical Aggregate Functions:

STDDEV
STDDEV_POP
STDDEV_SAMP
SQRT
VAR_SAMP
VAR_POP

Learning Big Data

Wednesday, April 5, 2017

Amazon Redshift: Key highlights on "SQL Functions" and "Queries"

2 comments:

Amazon S3: Basic Concepts

Total Pageviews

Search This Blog