Getting started with Spark
The easiest way to play with Spark is to check out the Developer Resources over at Databricks; it was the only Windows solution that worked “out-of-the-box” without installing a VM.
A quick look at Spark
Many of the normal functions in Spark are similar to what you expect. We have our typical map and reduce functions, shown in the quick sketch below, but Spark introduces an aggregate function.
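For the familiar operations, a minimal sketch (the nums RDD here is just for illustration) that doubles every element and then sums the results:

scala> val nums = sc.parallelize(List(1, 2, 3, 4))
scala> nums.map(x => x * 2).reduce((a, b) => a + b)
res0: Int = 20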
How does aggregate work? Consider the example below, which calculates the sum of all the values and keeps track of how many values have been processed:
scala> val input = sc.parallelize(List(1,2,3,4))
input: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:12
scala> input.aggregate((0,0)) (
| (acc, value) => (acc._1 + value, acc._2 + 1),
| (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
res0: (Int, Int) = (10,4)
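A typical follow-up, though not part of the original example, is to turn this pair into an average; a small sketch reusing the res0 result from the session above:

scala> val (sum, count) = res0
sum: Int = 10
count: Int = 4
scala> sum.toDouble / count
res1: Double = 2.5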
The aggregate function takes three arguments:

zeroValue: the initial value of the accumulator
seqOp: combines the accumulator with an element of the RDD (resilient distributed dataset); the first argument is the accumulator and the second is an element of the RDD
combOp: merges two accumulators; the first and second arguments are both accumulators
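To make these three roles explicit, the same call can be written with the two function arguments as named values; a sketch reusing the input RDD from above:

scala> val seqOp = (acc: (Int, Int), value: Int) => (acc._1 + value, acc._2 + 1)
scala> val combOp = (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
scala> input.aggregate((0, 0))(seqOp, combOp)
res2: (Int, Int) = (10,4)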
This function differs from reduce because the input and output can be of different types. As you can see from the example above, the inputs are single integers, whilst the output is a tuple.
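To get the same (sum, count) pair with reduce alone, you would first have to map every element into a tuple so that the input and output types match; a sketch, again reusing the input RDD:

scala> input.map(x => (x, 1)).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
res3: (Int, Int) = (10,4)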