Installing Apache Spark and submitting jobs on Linux (Ubuntu) and Windows 7

Download Apache Spark from http://spark.apache.org/downloads.html. Choose the latest release (1.6.2 at the time of writing), package type "Pre-built for Hadoop 2.6", and the direct download spark-1.6.2-bin-hadoop2.6.tgz.
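On Linux (Ubuntu) the installation is nothing more than unpacking the same archive and exporting SPARK_HOME; a minimal sketch (the ~/dev/env target directory mirrors my Windows layout and is just a choice):

mkdir -p ~/dev/env
tar -xzf spark-1.6.2-bin-hadoop2.6.tgz -C ~/dev/env
export SPARK_HOME=~/dev/env/spark-1.6.2-bin-hadoop2.6
export PATH=$SPARK_HOME/bin:$PATH

Apart from a JDK on the PATH, nothing else needs to be installed.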

I have JDK 1.8 and Spark extracted locally to c:\dev\env. I also have a sample Spark project on my GitHub, https://github.com/bmwieczorek/my-projects.git, cloned locally to c:\dev\my-projects\my-apache-spark, with target\spark-example-0.3-SNAPSHOT.jar built using Maven.
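For context, a minimal Spark 1.6 Java word count along the lines of the WordCount class used below looks roughly like this (a sketch, not the project's exact source; I assume the input path comes in as args[0] and the output goes to a relative output directory, which matches the HDFS listings further down):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            sc.textFile(args[0])
              // Spark 1.6's FlatMapFunction returns an Iterable (an Iterator from 2.0 on).
              // Splitting on a single space turns leading indentation into empty tokens,
              // which presumably explains the (,888) entry in the results below.
              .flatMap(line -> Arrays.asList(line.split(" ")))
              .mapToPair(word -> new Tuple2<>(word, 1))
              .reduceByKey((a, b) -> a + b)
              // A relative path resolves against the user's HDFS home directory,
              // e.g. /user/SG0212148/output.
              .saveAsTextFile("output");
        }
    }
}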

In order to avoid uploading the 180 MB spark-1.6.2-bin-hadoop2.6\lib\spark-assembly-1.6.2-hadoop2.6.0.jar to HDFS every time I call spark-submit, I have set the variable SPARK_JAR=hdfs:///user/bdaldr/apps/spark/spark-assembly-1.6.1-hadoop2.6.0.jar, pointing to the Spark assembly that I manually uploaded in advance.
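Note that SPARK_JAR is deprecated (the submit log below warns about it); the supported equivalent is the spark.yarn.jar property, passed either on the command line:

--conf spark.yarn.jar=hdfs:///user/bdaldr/apps/spark/spark-assembly-1.6.1-hadoop2.6.0.jar

or set once in $SPARK_HOME/conf/spark-defaults.conf.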

As I am setting the master to yarn in spark-submit, I need to specify HADOOP_CONF_DIR or YARN_CONF_DIR. I point it to a directory with the cluster's config files (c:\dev\env\hadoop01-hadoop-conf below), containing core-site.xml with:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://hadoop01:8020</value>
</property>

(alternatively, specify the host and port explicitly, as in hdfs://hadoop01:8020/user…, in the spark-submit parameters)
and yarn-site.xml with:

<property>
  <name>yarn.resourcemanager.address</name>
  <value>hadoop01:8032</value>
</property>

and other YARN properties.
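By "other YARN properties" I mean at least the addresses the client and ApplicationMaster need to talk to; as a sketch, with the stock Hadoop default ports this would look like (your cluster's ports may differ):

<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>hadoop01:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>hadoop01:8031</value>
</property>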
For Windows only: in order to get rid of the null\bin\winutils.exe warning, I added winutils.exe to my-projects\my-apache-spark\bin and set HADOOP_HOME to the parent folder of bin (here: my-apache-spark).

Windows output (from the IntelliJ terminal, but the same works from CMD):

Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.

C:\dev>echo %JAVA_HOME%
C:\dev\env\jdk1.8.0_92

C:\dev>java -version
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)

C:\dev>set HADOOP_HOME=C:\dev\my-projects\my-apache-spark

C:\dev>set SPARK_HOME=c:\dev\env\spark-1.6.2-bin-hadoop2.6

C:\dev>set YARN_CONF_DIR=c:\dev\env\hadoop01-hadoop-conf

C:\dev>set SPARK_JAR=hdfs:///user/bdaldr/apps/spark/spark-assembly-1.6.1-hadoop2.6.0.jar

C:\dev>c:\dev\env\spark-1.6.2-bin-hadoop2.6\bin\spark-submit --class com.geekcap.javaworld.sparkexample.WordCount --master yarn --deploy-mode cluster c:\dev\my-projects\my-apache-spark\target\spark-example-0.3-SNAPSHOT.jar hdfs:///user/bdaldr/bartek/pom.xml
16/07/14 12:27:14 INFO client.RMProxy: Connecting to ResourceManager at hadoop01/10.14.236.209:8032
16/07/14 12:27:15 INFO yarn.Client: Requesting a new application from cluster with 4 NodeManagers
16/07/14 12:27:15 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
16/07/14 12:27:15 INFO yarn.Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
16/07/14 12:27:15 INFO yarn.Client: Setting up container launch context for our AM
16/07/14 12:27:15 INFO yarn.Client: Setting up the launch environment for our AM container
16/07/14 12:27:15 WARN yarn.Client: SPARK_JAR detected in the system environment. This variable has been deprecated in favor of the spark.yarn.jar configuration variable.
16/07/14 12:27:15 INFO yarn.Client: Preparing resources for our AM container
16/07/14 12:27:17 WARN yarn.Client: SPARK_JAR detected in the system environment. This variable has been deprecated in favor of the spark.yarn.jar configuration variable.
16/07/14 12:27:17 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs:/user/bdaldr/apps/spark/spark-assembly-1.6.1-hadoop2.6.0.jar
16/07/14 12:27:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/07/14 12:27:19 INFO yarn.Client: Uploading resource file:/c:/dev/my-projects/my-apache-spark/target/spark-example-0.3-SNAPSHOT.jar -> hdfs://hadoop01:8020/user/SG0212148/.sparkStaging/application_1466685646767_2830/spark-example-0.3-SNAPSHOT.jar
16/07/14 12:27:22 INFO yarn.Client: Uploading resource file:/C:/Users/sg0212148/AppData/Local/Temp/spark-07500b5c-eebd-4f46-ab53-b0dec0c009b1/__spark_conf__6038219904436967197.zip -> hdfs://hadoop01:8020/user/SG0212148/.sparkStaging/application_1466685646767_2830/__spark_conf__6038219904436967197.zip
16/07/14 12:27:25 INFO spark.SecurityManager: Changing view acls to: SG0212148
16/07/14 12:27:25 INFO spark.SecurityManager: Changing modify acls to: SG0212148
16/07/14 12:27:25 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(SG0212148); users with modify permissions: Set(SG0212148)
16/07/14 12:27:25 INFO yarn.Client: Submitting application 2830 to ResourceManager
16/07/14 12:27:26 INFO impl.YarnClientImpl: Submitted application application_1466685646767_2830
16/07/14 12:27:27 INFO yarn.Client: Application report for application_1466685646767_2830 (state: ACCEPTED)
16/07/14 12:27:27 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.SG0212148
start time: 1468492046168
final status: UNDEFINED
tracking URL: http://hadoop01:8088/proxy/application_1466685646767_2830/
user: SG0212148
16/07/14 12:27:28 INFO yarn.Client: Application report for application_1466685646767_2830 (state: ACCEPTED)
16/07/14 12:27:30 INFO yarn.Client: Application report for application_1466685646767_2830 (state: RUNNING)
16/07/14 12:27:30 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.14.236.212
ApplicationMaster RPC port: 0
queue: root.SG0212148
start time: 1468492046168
final status: UNDEFINED
tracking URL: http://hadoop01:8088/proxy/application_1466685646767_2830/
user: SG0212148
16/07/14 12:27:31 INFO yarn.Client: Application report for application_1466685646767_2830 (state: RUNNING)
16/07/14 12:27:32 INFO yarn.Client: Application report for application_1466685646767_2830 (state: RUNNING)
16/07/14 12:27:33 INFO yarn.Client: Application report for application_1466685646767_2830 (state: RUNNING)
16/07/14 12:27:34 INFO yarn.Client: Application report for application_1466685646767_2830 (state: RUNNING)
16/07/14 12:27:36 INFO yarn.Client: Application report for application_1466685646767_2830 (state: RUNNING)
16/07/14 12:27:37 INFO yarn.Client: Application report for application_1466685646767_2830 (state: FINISHED)
16/07/14 12:27:37 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.14.236.212
ApplicationMaster RPC port: 0
queue: root.SG0212148
start time: 1468492046168
final status: SUCCEEDED
tracking URL: http://hadoop01:8088/proxy/application_1466685646767_2830/A
user: SG0212148
16/07/14 12:27:37 INFO util.ShutdownHookManager: Shutdown hook called
16/07/14 12:27:37 INFO util.ShutdownHookManager: Deleting directory C:\Users\sg0212148\AppData\Local\Temp\spark-07500b5c-eebd-4f46-ab53-b0dec0c009b1
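Note that with --deploy-mode cluster the driver runs inside the YARN ApplicationMaster, so only the application reports above show up locally; the container logs can be fetched afterwards with the standard YARN CLI (assuming log aggregation is enabled on the cluster):

ssh bdaldr@hadoop01 "yarn logs -applicationId application_1466685646767_2830 -appOwner SG0212148"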

Here is the resulting output on HDFS:

$ ssh bdaldr@hadoop01 "hadoop fs -ls /user/SG0212148"
Found 2 items
drwxr-xr-x - SG0212148 supergroup 0 2016-07-14 07:11 /user/SG0212148/.sparkStaging
drwxr-xr-x - SG0212148 supergroup 0 2016-07-14 07:11 /user/SG0212148/output

$ ssh bdaldr@hadoop01 "hadoop fs -ls /user/SG0212148/output"
Found 3 items
-rw-r--r-- 3 SG0212148 supergroup 0 2016-07-14 07:11 /user/SG0212148/output/_SUCCESS
-rw-r--r-- 3 SG0212148 supergroup 978 2016-07-14 07:11 /user/SG0212148/output/part-00000
-rw-r--r-- 3 SG0212148 supergroup 669 2016-07-14 07:11 /user/SG0212148/output/part-00001

$ ssh bdaldr@hadoop01 "hadoop fs -cat /user/SG0212148/output/part-00000"
(<build>,1)
(<id>copy</id>,1)
(<mainClass>com.geekcap.javaworld.sparkexample.WordCount</mainClass>,1)
(</executions>,1)
(-->,1)
(<groupId>org.apache.spark</groupId>,1)
(</dependency>,1)
(<plugins>,1)
(<url>http://maven.apache.org</url>,1)
(xmlns="http://maven.apache.org/POM/4.0.0",1)
(<version>1.6.2</version>,1)
(</manifest>,1)
(</project>,1)
(<version>2.10</version>,1)
(,888)
(xsi:schemaLocation="http://maven.apache.org/POM/4.0.0,1)
(<archive>,1)
(<version>3.5.1</version>,1)
(<groupId>com.geekcap.javaworld</groupId>,1)
(<source>1.8</source>,1)
(</dependencies>,1)
(<goals>,1)
(<artifactId>maven-compiler-plugin</artifactId>,1)
(<target>1.8</target>,1)
(<artifactId>maven-dependency-plugin</artifactId>,1)
(<version>3.0.2</version>,1)
(<goal>copy-dependencies</goal>,1)
(</plugin>,3)
(<groupId>org.apache.maven.plugins</groupId>,3)
(<packaging>jar</packaging>,1)
(<configuration>,3)
(<phase>install</phase>,1)
(<artifactId>maven-jar-plugin</artifactId>,1)
(<execution>,1)

$ ssh bdaldr@hadoop01 "hadoop fs -cat /user/SG0212148/output/part-00001"
(xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance",1)
(</archive>,1)
(Spark,1)
(http://maven.apache.org/maven-v4_0_0.xsd">,1)
(<addClasspath>true</addClasspath>,1)
(<version>1.0-SNAPSHOT</version>,1)
(<modelVersion>4.0.0</modelVersion>,1)
(</goals>,1)
(</configuration>,3)
(<plugin>,3)
(<manifest>,1)
(<classpathPrefix>lib/</classpathPrefix>,1)
(Import,1)
(<dependency>,1)
(<executions>,1)
(</plugins>,1)
(<artifactId>spark-example</artifactId>,1)
(<dependencies>,1)
(<project,1)
(</build>,1)
(<name>spark-example</name>,1)
(<!--,1)
(<artifactId>spark-core_2.11</artifactId>,1)
(<outputDirectory>${project.build.directory}/lib</outputDirectory>,1)
(</execution>,1)

At the end of each run I need to clear the output, since saveAsTextFile refuses to write to an existing directory:

ssh bdaldr@hadoop01 "hadoop fs -rm -r /user/SG0212148"
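On Linux (Ubuntu) the whole flow is identical; only the environment setup changes from set to export, for example (paths are illustrative, mirroring the Windows setup above):

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export SPARK_HOME=~/dev/env/spark-1.6.2-bin-hadoop2.6
export YARN_CONF_DIR=~/dev/env/hadoop01-hadoop-conf
export SPARK_JAR=hdfs:///user/bdaldr/apps/spark/spark-assembly-1.6.1-hadoop2.6.0.jar
$SPARK_HOME/bin/spark-submit --class com.geekcap.javaworld.sparkexample.WordCount --master yarn --deploy-mode cluster ~/dev/my-projects/my-apache-spark/target/spark-example-0.3-SNAPSHOT.jar hdfs:///user/bdaldr/bartek/pom.xml

winutils.exe and the HADOOP_HOME trick are not needed on Linux.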

 
