Learning Big Data : Comparison of commands between Apache Hive, Amazon RedShift

Saturday, April 15, 2017

Comparison of commands between Apache Hive, Amazon RedShift

Below is the comparison of SQL commands between Apache Hive and Amazon RedShift.

Create a database sqlcompare:

Hive: Create database/schema [if not exists] sqlcompare;

RedShift: Create database sqlcompare [with owner <ownername>];

Drop the database sqlcompare:

Hive: Drop database/schema [if exists] sqlcompare [cascade];

RedShift: Drop database sqlcompare;

Command to rename column name salary to sal in table employee :

Hive: Alter table employee change salary sal number;
RedShift: Alter table employee rename salary to sal;

Adding a column mgr to a table employee:

Hive: Alter table employee add columns (mgr int);
RedShift: Alter table employee add column mgr int;

Dropping a column mgr from table employee:

Hive: Dropping a column is not directly supported by Hive. We can use replace columns to get the desired result. To drop mgr column from employee table, below replace command includes all columns except mgr.

Alter table employee replace columns (empId Int, empName string, dept string, salary int);

RedShift: Alter table employee drop column mgr;

Renaming a table employee to emp;

Hive: Alter table employee rename to emp;

RedShift: Alter table employee rename to emp;

Inserting a row into table employees:

Hive: Insert into employee values(1,'John','Finance',25000);

RedShift: Insert into employee values(1,'John','Finance',25000);

Insert into new table from parent table:

Hive: Insert into employee_bkp select * from employee;
RedShift: Insert into employee_bkp (select * from employee);

Create table Like:

Hive: Create table emp_bkp like employee;
RedShift: Create table emp_bkp ( like employee);

Get Current schema:

Hive: set hive.cli.print.current.db = true;
Redshift: select current_schema();

Order the result set:

Hive: Hive orders result set in different ways. ORDER BY orders the result set across all reducers. SORT BY orders the result set in each reducer. DISTRIBUTE BY partitions the data and then data is sorted by SORT BY column in each partition. CLUSTER BY partitions the data and orders the result set in ascending order on same column.

select * from employee order by empname;

select * from employee sort by empname;

select * from employee distribute by empname sort by empname;

select * from employee cluster by empname;

Redshift: select * from employee order by empname;

NULL values sort first in ASC mode and last in DESC mode both in Redshift and Hive.

Referencing columns by positional notation:

Hive: To use positional notation for Hive 0.11.0 through 2.1.x, set hive.groupby.orderby.position.alias to true. From Hive 2.2.0 and later, hive.orderby.position.alias is true by default.

select * from employee order by 2 desc;

RedShift: select * from employee order by 2 desc;

Column aliases in where clause:

Hive: Column name aliases cant be referenced in where clause.
RedShift: Column name aliases can be referenced in where clause.
select deptid as dept, avg(salary) avg_sal from employee where dept in (2,4,5);

IN Clause sub-query:

Hive: In Hive, IN-clause is implemented using LEFT SEMI JOIN. In left semi join, columns from left side (employee) of the join can only be referenced in select clause of sql command.

select e.* from employee e left semi join departments d where e.dept_no = d.dept_no;

Redshift: select e.* from employee e where e.dept_no in (select d.dept_no from departments d);

Learning Big Data

Saturday, April 15, 2017

Comparison of commands between Apache Hive, Amazon RedShift

No comments:

Post a Comment

Amazon S3: Basic Concepts

Total Pageviews

Search This Blog