Monday, May 16, 2016

Install Apache Spark on Mac/Linux using prebuilt package

If you do not want to run Apache Spark on Hadoop, then standalone mode is what you are looking for. Here are the steps to install and run Apache Spark on Mac/Linux in standalone mode.

1. Java is a prerequisite for running Apache Spark. Install Java 7 or later; if it is not present, download Java from here.
If Java is already installed, run the following command to verify the Java version.

$ java -version
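With a Java 7 installation, for instance, the output looks something like the lines below (the exact version and build numbers will depend on the JDK you have installed).

java version "1.7.0_79"
Java(TM) SE Runtime Environment (build 1.7.0_79-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)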

2. Download Scala. Choose the first option, "Download Scala x.y.z binaries for your system".
Untar the Scala tar file using the following command.

$ tar xvf scala-2.11.8.tgz

3. Use the following command to move the Scala directory to /usr/local/scala.

$ sudo mv scala-2.11.8 /usr/local/scala
Password:

4. Set PATH for Scala.

$ export PATH=$PATH:/usr/local/scala/bin

5. To check that Scala is working, run the following command.

$ scala -version

Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL

6. Apache Spark can be installed in two ways.
  • Building Spark using SBT 
  • Use prebuilt Spark package
Let's choose a Spark prebuilt package for Hadoop from here. In this example we download the spark-1.6.1-bin-hadoop2.6 package. After downloading, the Spark tar file will be in your downloads folder.
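If you prefer the command line to a browser, you can also fetch the archive directly. The URL below points at the Apache archive and is one option; the exact mirror URL may differ.

$ wget https://archive.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz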
Untar the downloaded tar file using the following command.

$ tar xvf spark-1.6.1-bin-hadoop2.6.tgz

7. Move the Spark files to the /usr/local/spark directory.

$ sudo mv spark-1.6.1-bin-hadoop2.6 /usr/local/spark
Password:

Set the PATH variable to include the Spark bin directory.

$ export PATH=$PATH:/usr/local/spark/bin
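Note that the export commands above only last for the current terminal session. To make the Scala and Spark paths permanent, you can append them to your shell startup file (a sketch assuming bash; the file is typically ~/.bashrc on Linux or ~/.bash_profile on Mac).

$ echo 'export PATH=$PATH:/usr/local/scala/bin:/usr/local/spark/bin' >> ~/.bash_profile
$ source ~/.bash_profile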

8. To test whether Spark is working, run the following command.

$ spark-shell

If Spark is installed successfully, you will see output similar to the following.


Spark assembly has been built with Hive, including Datanucleus jars on classpath 
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop 
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled;
   ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop) 
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server 
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292. 
Welcome to 
      ____              __ 
     / __/__  ___ _____/ /__ 
    _\ \/ _ \/ _ `/ __/  '_/ 
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/  
  
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_79) 
Type in expressions to have them evaluated. 
Spark context available as sc.  
scala> 
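To confirm that the shell is functional, you can evaluate a small expression at the scala> prompt using the pre-created SparkContext sc. For example, summing the numbers 1 to 100 should report 5050.0, with output similar to the lines below.

scala> sc.parallelize(1 to 100).sum()
res0: Double = 5050.0

To leave the shell, type :quit or press Ctrl+D.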

Saturday, May 14, 2016

Install Apache Spark on Windows 10 using prebuilt package

If you do not want to run Apache Spark on Hadoop, then standalone mode is what you are looking for. Here are the steps to install and run Apache Spark on Windows in standalone mode.

1. Java is a prerequisite for running Apache Spark. Install Java 7 or later; if it is not present, download Java from here.

2. Set JAVA_HOME and PATH variables as environment variables.

3. Download Scala. Choose the first option, "Download Scala x.y.z binaries for your system", and run the installer.

4. Set the SCALA_HOME environment variable to the downloaded Scala directory and add its bin folder to PATH. For example, if you have extracted Scala to C:\scala, then set

SCALA_HOME=C:\scala
PATH=C:\scala\bin

You can also set the _JAVA_OPTIONS environment variable to the value below to avoid Java heap memory problems, if you encounter any.
_JAVA_OPTIONS=-Xmx512M -Xms512M
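If you prefer the command line to the System Properties dialog, these variables can also be set persistently with setx (a sketch assuming Scala was extracted to C:\scala; setx only takes effect in newly opened Command Prompt windows, and appending to PATH this way merges your user and system PATH).

>setx SCALA_HOME "C:\scala"
>setx PATH "%PATH%;C:\scala\bin"
>setx _JAVA_OPTIONS "-Xmx512M -Xms512M"

The same approach works for SPARK_HOME and HADOOP_HOME in the later steps.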

5. To check that Scala is working, run the following command.

>scala -version
Picked up _JAVA_OPTIONS: -Xmx512M -Xms512M
Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL

6. Spark can be installed in two ways.
  • Building Spark using SBT 
  • Use prebuilt Spark package
Let's choose a Spark prebuilt package for Hadoop from here. Download and extract it to any drive.
Set the SPARK_HOME and PATH environment variables to the extracted Spark folder. For example, if you have extracted Spark to C:\spark\spark-1.6.1-bin-hadoop2.6, then set

SPARK_HOME=C:\spark\spark-1.6.1-bin-hadoop2.6
PATH=C:\spark\spark-1.6.1-bin-hadoop2.6\bin

7. Although we are not using Hadoop with Spark, Spark still checks for the HADOOP_HOME variable in its configuration and reports an error if it is missing. To avoid this, download winutils.exe and place it in any location (for example, D:\winutils\bin).
Download the 64-bit version of winutils.exe.

8. Set HADOOP_HOME to the winutils installation directory (the parent of the bin folder). For example, if winutils.exe is at D:\winutils\bin\winutils.exe, then set
HADOOP_HOME=D:\winutils

Set the PATH environment variable to include %HADOOP_HOME%\bin as follows
PATH=D:\winutils\bin 

9. If you get a permissions error, grant permissions on the \tmp\hive folder on the drive from which you run Spark; this directory must be writable. Use the command below to grant the privileges.

D:\spark>D:\winutils\winutils.exe chmod 777 D:\tmp\hive
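If the \tmp\hive directory does not exist yet, create it first (adjusting the drive letter to wherever you run Spark from) and then run the chmod command above.

D:\spark>mkdir D:\tmp\hive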

10. To test whether Spark is working, run the SparkPi example from the Spark home directory:
> bin\run-example SparkPi 10

It should execute the program and print a line like "Pi is roughly 3.14".
