Trying to figure out how to install Spark on Windows is a bit of a pain, so here are some basic instructions:
- Download one of the prebuilt Spark distributions for Hadoop. For me, the build for Hadoop version 2.6 worked perfectly.
- Download winutils for Hadoop. Google should bring up several results; at the time of writing, this one worked fine.
Now we have to make sure all our environment variables are set up correctly. We need to have the following things installed with the appropriate variables set:
- HADOOP_HOME: for winutils
- PYTHON_PATH: for Python
- SPARK_HOME: for Spark
- R_HOME: for R
Once all of these are added to our PATH, we can simply run PySpark/SparkR in the shell in the normal manner.
For example, your environment variables may look something like this (JAVA_HOME should already point at your Java install, since Spark needs Java):
set HADOOP_HOME=C:\Users\chapm\HADOOP
set PYTHON_PATH=C:\Users\chapm\Python
set SPARK_HOME=C:\Users\chapm\Spark
set R_HOME=C:\Users\chapm\R-3.X.X\bin
set PATH=%PATH%;%JAVA_HOME%\bin;%PYTHON_PATH%;%HADOOP_HOME%;%SPARK_HOME%\bin;%R_HOME%;%PYTHON_PATH%\Scripts
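Before launching anything, you can optionally sanity-check those variables with a short Python script. This is just a sketch of my own; the relative paths it looks for (for example bin\winutils.exe) are assumptions, so adjust them to wherever you actually unpacked things:
import os

def check(var, *candidates):
    # Report whether any of the candidate files exists under the variable's folder.
    root = os.environ.get(var)
    if root is None:
        print(var + " is not set")
        return
    hits = [c for c in candidates if os.path.exists(os.path.join(root, c))]
    print("%s = %s : %s" % (var, root, "OK" if hits else "nothing found"))

check("HADOOP_HOME", "winutils.exe", "bin\\winutils.exe")
check("SPARK_HOME", "bin\\pyspark.cmd", "bin\\pyspark")
check("PYTHON_PATH", "python.exe")
check("R_HOME", "R.exe", "bin\\R.exe")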
If you run pyspark or sparkr in the command line, the corresponding shell should now work.
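As a quick smoke test, you can run a tiny job from inside the PySpark shell. This is a minimal sketch assuming a Spark 1.x pyspark shell, where sc (the SparkContext) is already defined for you:
# Inside the pyspark shell; sc is created by the shell itself.
print(sc.version)                                   # the Spark version you downloaded
rdd = sc.parallelize(range(100))
print(rdd.filter(lambda x: x % 2 == 0).count())     # should print 50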
Getting RStudio working with Spark
To get RStudio working with Spark, we only need to load the SparkR library that ships inside the Spark installation from our R script.
If we simply run the following code, SparkR should work with RStudio:
Sys.setenv(SPARK_HOME="C:/Users/chapm/Spark")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"), .libPaths()))
# load sparkR
library(SparkR)
# Initialize SparkContext and SQLContext
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)
Getting IPython working with Spark
The easiest way is to create a new IPython profile with the correct startup options.
ipython profile create pyspark
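If you are not sure where that profile's startup folder ended up, IPython can tell you. The snippet below is a small sketch that assumes a pre-4.0 IPython, where locate_profile lives in IPython.utils.path:
import os
from IPython.utils.path import locate_profile

# Print the startup folder of the pyspark profile we just created.
print(os.path.join(locate_profile("pyspark"), "startup"))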
Within the startup folder, we can create a script (00-XX.py) with the following:
import sys
spark_home = "C:/Users/chapm/Spark"
sys.path.insert(0, spark_home + "/python")
# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, spark_home + '/python/lib/py4j-0.8.2.1-src.zip')
# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(spark_home + '/python/pyspark/shell.py')
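If you would rather not hardcode the path, a variant of the same script can read SPARK_HOME from the environment variable we set earlier. This is just a sketch; the py4j filename is still something you need to match to your own install, and execfile is Python 2 only (on Python 3 you would exec the file's contents instead):
import os
import sys

# Reuse the SPARK_HOME environment variable instead of a hardcoded path.
spark_home = os.environ.get("SPARK_HOME")
if not spark_home:
    raise ValueError("SPARK_HOME is not set")

sys.path.insert(0, os.path.join(spark_home, "python"))
# Adjust the py4j version to whatever ships in your Spark's python/lib folder.
sys.path.insert(0, os.path.join(spark_home, "python", "lib", "py4j-0.8.2.1-src.zip"))

# Start PySpark, which predefines the SparkContext variable 'sc'.
execfile(os.path.join(spark_home, "python", "pyspark", "shell.py"))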
Now running ipython notebook --profile=pyspark should launch your notebook with Spark enabled.
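Once the notebook is up, a short cell like the one below should confirm everything is wired together. It is a minimal sketch assuming a Spark 1.3+ shell.py, which predefines both sc and sqlContext:
# sc and sqlContext are created by pyspark/shell.py when the kernel starts.
df = sqlContext.createDataFrame([(1, "spark"), (2, "windows")], ["id", "word"])
df.show()
print(sc.parallelize(range(10)).sum())  # should print 45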