Monday, May 16, 2016

Install Apache Spark on Mac/Linux using prebuilt package

If you do not want to run Apache Spark on Hadoop, then standalone mode is what you are looking for. Here are the steps to install and run Apache Spark on Mac/Linux in standalone mode.

1. Java is a prerequisite for running Apache Spark. Install Java 7 or later; if it is not present, download Java from here.
If Java is already installed, run the following command to verify the Java version.

$ java -version

2. Download Scala. Choose the first option, "Download Scala x.y.z binaries for your system".
Untar the Scala tar file using the following command.

$ tar xvf scala-2.11.8.tgz

3. Use the following command to move the Scala directory to /usr/local/scala.

$ sudo mv scala-2.11.8 /usr/local/scala
Password:

4. Set PATH for Scala.

$ export PATH=$PATH:/usr/local/scala/bin
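
This export only applies to the current terminal session. To make it permanent, you can append the same line to your shell profile (a minimal sketch, assuming the bash shell and a ~/.bash_profile; use ~/.zshrc if your shell is zsh):

$ echo 'export PATH=$PATH:/usr/local/scala/bin' >> ~/.bash_profile
$ source ~/.bash_profile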

5. To check whether Scala is working, run the following command.

$ scala -version

Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL

6. Apache Spark can be installed in two ways:
  • Building Spark using SBT
  • Using a prebuilt Spark package
Let's choose a Spark prebuilt package for Hadoop from here. In this example we download spark-1.6.1-bin-hadoop2.6. After downloading, the Spark tar file will be in your Downloads folder.
Untar the downloaded tar file using the following command.

$ tar xvf spark-1.6.1-bin-hadoop2.6.tgz

7. Move the Spark software files to the /usr/local/spark directory.

$ sudo mv spark-1.6.1-bin-hadoop2.6 /usr/local/spark
Password:

Set the PATH variable to include the Spark bin directory.

$ export PATH=$PATH:/usr/local/spark/bin
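
As with Scala, you can persist this by appending the export line to ~/.bash_profile (or ~/.zshrc). Optionally, you can also set SPARK_HOME to /usr/local/spark; it is not required for the steps here, just a common convention (a sketch, not part of the steps above):

$ echo 'export SPARK_HOME=/usr/local/spark' >> ~/.bash_profile
$ echo 'export PATH=$PATH:/usr/local/spark/bin' >> ~/.bash_profile
$ source ~/.bash_profile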

8. To test whether Spark is working, run the following command.

$ spark-shell

If Spark is installed successfully, you will see output similar to the following.


Spark assembly has been built with Hive, including Datanucleus jars on classpath 
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop 
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled;
   ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop) 
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server 
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292. 
Welcome to 
      ____              __ 
     / __/__  ___ _____/ /__ 
    _\ \/ _ \/ _ `/ __/  '_/ 
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/  
  
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_79) 
Type in expressions to have them evaluated. 
Spark context available as sc.  
scala> 
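
Besides the interactive shell, you can run one of the bundled examples as a quick smoke test (the same SparkPi example used in the Windows post below); the exact value printed after "Pi is roughly" varies slightly from run to run.

$ run-example SparkPi 10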

Saturday, May 14, 2016

Install Apache Spark on Windows 10 using prebuilt package

If you do not want to run Apache Spark on Hadoop, then standalone mode is what you are looking for. Here are the steps to install and run Apache Spark on Windows in standalone mode.

1. Java is a prerequisite for running Apache Spark. Install Java 7 or later; if it is not present, download Java from here.

2. Set JAVA_HOME and PATH variables as environment variables.

3. Download Scala: choose the first option, "Download Scala x.y.z binaries for your system", and then run the installer.

4. Set the SCALA_HOME environment variable to the downloaded Scala directory and add its bin folder to PATH. For example, if you have downloaded Scala to the C:\scala directory, then set

SCALA_HOME=C:\scala
PATH=C:\scala\bin

Optionally, set the _JAVA_OPTIONS environment variable to the value below to avoid Java heap memory problems, if you encounter any.
_JAVA_OPTIONS=-Xmx512M -Xms512M
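
These variables are normally set through Control Panel > System > Advanced system settings > Environment Variables. If you prefer the command line, setx is one way to set them (a sketch; setx only takes effect in newly opened Command Prompt windows, and PATH is usually safer to edit through the Environment Variables dialog):

> setx SCALA_HOME "C:\scala"
> setx _JAVA_OPTIONS "-Xmx512M -Xms512M"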

5. To check whether Scala is working, run the following command.

>scala -version
Picked up _JAVA_OPTIONS: -Xmx512M -Xms512M
Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL

6. Spark can be installed in two ways:
  • Building Spark using SBT
  • Using a prebuilt Spark package
Let's choose a Spark prebuilt package for Hadoop from here. Download and extract it to any drive.
Set the SPARK_HOME and PATH environment variables to point to the extracted Spark folder. For example, if you have extracted Spark to the C:\spark\spark-1.6.1-bin-hadoop2.6 directory, then set

SPARK_HOME=C:\spark\spark-1.6.1-bin-hadoop2.6
PATH=C:\spark\spark-1.6.1-bin-hadoop2.6\bin
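
If you are using setx as sketched above, the same pattern applies here:

> setx SPARK_HOME "C:\spark\spark-1.6.1-bin-hadoop2.6"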

7. Though we aren't using Hadoop with Spark, the configuration still checks for the HADOOP_HOME variable. To overcome this, download winutils.exe and place it in any location (for example, D:\winutils\bin, as used in the next step).
Download winutils.exe for 64 bit.

8. Set HADOOP_HOME to the winutils directory (the parent of the bin folder containing winutils.exe). For example, if you placed winutils.exe at D:\winutils\bin\winutils.exe, then set
HADOOP_HOME=D:\winutils

Set the PATH environment variable to include %HADOOP_HOME%\bin as follows
PATH=D:\winutils\bin 

9. Grant permissions on the \tmp\hive folder if you get any permissions error. The \tmp\hive directory on the drive you run Spark from (D:\tmp\hive in this example) should be writable. Use the command below to grant the privileges.

D:\spark>D:\winutils\winutils.exe chmod 777 D:\tmp\hive
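
If the \tmp\hive directory does not exist yet, create it first before changing its permissions (adjust the drive letter to wherever you run Spark from):

> mkdir D:\tmp\hive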

10. To test whether Spark is working, run the bundled example from the Spark directory.
> bin\run-example SparkPi 10

It should execute the program and print a line like "Pi is roughly 3.14".
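
You can also start the interactive shell, as in the Mac/Linux post above:

> spark-shell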

Sunday, April 24, 2016

ERROR: brew could not symlink, /usr/local/sbin is not writable.

If you encounter the following error while installing with brew, it is probably due to a lack of write permissions on /usr/local.

Error encountered during installation of Hadoop on Mac:
=======================================================================
$ brew install hadoop
==> Downloading https://www.apache.org/dyn/closer.cgi?path=hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
==> Best Mirror http://mirror.symnds.com/software/Apache/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
######################################################################## 100.0%

Error: The `brew link` step did not complete successfully
The formula built, but is not symlinked into /usr/local
Could not symlink sbin/distribute-exclude.sh
/usr/local/sbin is not writable.

You can try again using:
  brew link hadoop
==> Caveats
In Hadoop's config file:
  /usr/local/Cellar/hadoop/2.7.2/libexec/etc/hadoop/hadoop-env.sh,
  /usr/local/Cellar/hadoop/2.7.2/libexec/etc/hadoop/mapred-env.sh and
  /usr/local/Cellar/hadoop/2.7.2/libexec/etc/hadoop/yarn-env.sh
$JAVA_HOME has been set to be the output of:
  /usr/libexec/java_home
==> Summary
 /usr/local/Cellar/hadoop/2.7.2: 6,304 files, 309.8M, built in 13 minutes 35 seconds
======================================================================

Run the following commands to grant the privileges.

$ sudo chown -R $(whoami) /usr/local
$ sudo chown -R $(whoami) /Library/Caches/Homebrew

Once the privileges are granted, Hadoop just has to be linked, since it is already installed.

$ brew link hadoop

Linking /usr/local/Cellar/hadoop/2.7.2... 27 symlinks created
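
To confirm that the link worked, you can check that the hadoop command resolves and reports the installed version (2.7.2 here):

$ hadoop version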

Friday, April 15, 2016

MongoDB Questions and Answers - Operators



1. $cmp comparison operator returns ______ as output.
a. Boolean
b. Number


2. All comparison expressions return a boolean except for $cmp.
a. True
b. False


3. The number of arguments required by the comparison aggregation operators is
a. 1
b. 2
c. 3
d. 4


4. Comparison operators compare _____
a. Only value
b. Only type
c. Both value and type
d. None of the above


5. Choose the comparison operators below. Select all that apply.
a. $ne
b. $cmp
c. $in
d. $all


6. Arithmetic aggregation operators perform mathematical operations only on numbers.
a. True
b. False


7. String expressions, with the exception of _____, have a well-defined behavior for strings of ASCII characters.
a. $strcasecmp
b. $toUpper
c. $toLower
d. $concat


8. Aggregation operator to access the text search metadata is
a. $text
b. $meta
c. $let
d. $slice


9. The operator that returns the milliseconds portion of a date is
a. $millis
b. $milliseconds
c. $millisecond
d. $milli


10. Choose the ternary operator.
a. $ifNull
b. $multiply
c. $cond
d. $sum


Answers
--------------

1. b
2. a
3. b
4. c
5. a,b
6. b
7. d
8. b
9. c
10. c


Tuesday, April 12, 2016

MongoDB Sample Questions and Answers - Packages


1. mongodump doesn't dump the contents of which of the following databases?
a) system
b) test
c) local
d) All of the above

2. ___________ manipulates files stored in your MongoDB instance in GridFS.
a) mongorestore
b) mongofiles
c) mongosupport
d) None of the mentioned


3. __________ is a command-line utility to import content from a JSON, CSV, or TSV file.
a) mongorestore
b) mongofiles
c) mongosupport
d) mongoimport


4. Which of the following is used for creating binary export of the contents in MongoDB?
a) mongodump
b) mongofiles
c) mongosupport
d) mongoimport


5. Which command line utility works on GridFS?
a) mongodump
b) mongofiles
c) mongosupport
d) mongoimport


6. _____________ is a native OS X application for MongoDB management.
a) Apricot
b) MongoHub
c) Mongo
d) 3T MongoChef

7. ___________ is a routing service for MongoDB shard configurations that processes queries from the application layer.
a) mongod
b) mongos
c) mongo
d) None of the mentioned

8. ___________ is the primary daemon process for the MongoDB system.
a) mongos
b) mongod
c) logpath
d) syspathlog


9. Mongo Shell is an interactive ______ interface.
a) C++
b) Python
c) JavaScript
d) Java


10. _____ provides statistics at a per-collection level.
a) mongostat
b) mongotop
c) mongofiles
d) mongoperf


11. ________ provides the overall status of currently running mongod and mongos instances.
a) mongostat
b) mongotop
c) mongofiles
d) mongoperf


Answers
-------------
1. c
2. b
3. d
4. a
5. b
6. b
7. b
8. b
9. c
10. b
11. a




Wednesday, April 6, 2016

MongoDB Sample Questions and Answers - Indexes

Here are some sample questions and answers on indexes and performance in MongoDB.

1. How many indexes are allowed per collection in MongoDB?
a. 31
b. 32
c. 64
d. 60


2. Which of the below operators are bad candidates for indexes? Choose all that apply.
a. $ne
b. $nin
c. $eq
d. $not


3. If an index is created on an existing collection, it doesn't validate the already existing data.
a. True
b. False


4. Text indexes are sparse by default.
a. True
b. False


5.  system.profile collection is a Capped collection.
a. True
b. False


6. Text indexes don't support compound indexes. They are always single field indexes.
a. True
b. False


7. Can we build multiple background indexes in parallel?
a. Yes
b. No


8. Can we have multiple foreground indexes building in parallel?
a. Yes
b. No


9. Point out the wrong statement
a. Query selectivity refers to how well the query predicate excludes or filters out documents in a collection
b. Query selectivity can determine whether or not queries can use indexes effectively or even use indexes at all
c. More selective queries match a larger percentage of documents
d. All of the mentioned


10. Point out the wrong statement
a. Multikey indexes are always single field indexes.
b. TTL indexes are always single field indexes.
c. Hashed indexes don't support multikey indexes.
d. Multikey indexes can't be used for sharding.

Answers
-----------------
1. c
2. a,b,d
3. b
4. a
5. a
6. b
7. a
8. b
9. c
10. a



