This post is more of a stream of consciousness about big data technology and the things I do every day.

Big data is great. We all talk about it, and the fancy tutorials come with nicely configured virtual machines, complete with point-and-click GUIs, to get you acquainted with big data.

On the practical side, however, companies often don’t have nice vendor solutions for you, and you find yourself staring at a blank terminal screen…

Welcome to big data!

Life isn’t always full of nice tools such as Jupyter or RStudio (I found Jupyter to be a huge hassle to get working with Spark). Maybe you’re fortunate enough to have a nicely configured Zeppelin instance to work from. If you have all these tools and they’ve been configured well… then great! You don’t need to read this.

Enter the Console

As an example of what working in big data feels like for me, try installing Vagrant and VirtualBox and trying out one of the Spark boxes:

vagrant init paulovn/spark-base64; vagrant up

Now you can ssh into this:

vagrant ssh

For most developers this isn’t a problem, but for someone who grew up on GUIs it is a massive one, and learning to navigate this realm is probably the first thing you should do!

Playing with Spark

Without a doubt, the best way to learn is simply by keying things into the console. If you came via Vagrant, I would recommend learning how to connect PuTTY, WinSCP or similar tools so that you can copy files in and keep multiple sessions open: one screen to run the code and another with vim open. This is the usual pattern.
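
If you have a command-line scp on the host, copying a file in can be sketched roughly like this. It is only a sketch under a few assumptions: copy_to_guest is a name I made up, you run it from the directory holding the Vagrantfile while the VM is up, and the machine is the single unnamed one that Vagrant calls "default".

```shell
# Sketch: copy a file into the Vagrant guest without typing the SSH details by hand.
# copy_to_guest is a hypothetical helper name, not part of Vagrant.
copy_to_guest() {
  file="$1"
  conf="$(mktemp)"
  # "vagrant ssh-config" prints an OpenSSH config entry (HostName, Port, IdentityFile)
  # for the running machine; save it so scp can read it.
  vagrant ssh-config > "$conf" || { rm -f "$conf"; return 1; }
  # Assumption: the host alias is "default", the name of a single unnamed machine.
  # "default:" with nothing after the colon means the remote home directory.
  scp -F "$conf" "$file" default:
  status=$?
  rm -f "$conf"
  return "$status"
}
```

Calling copy_to_guest myjob.py should leave myjob.py in the vagrant user's home directory. The values that vagrant ssh-config prints are also the ones a GUI client such as PuTTY or WinSCP will ask for.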

The best way to begin is from the beginning, using:

spark-shell

or

pyspark

Welcome to the REPL.

Closing Thoughts

There are many ways to make your life easier. For example, installing Jupyter and using Jupyter notebooks (remember to port forward) will give you something…more interactive!

My favourite way is to run Sparkling Water, which can then be accessed from your local machine (remember to port forward).
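
Since port forwarding keeps coming up, here is one way it might look from the host side. Treat it as a sketch: forward_port is a hypothetical helper name, and 8888 is only Jupyter's usual default, so use whatever port your tool reports when it starts.

```shell
# Sketch: tunnel a port from the Vagrant guest to the host over SSH.
# forward_port is a hypothetical helper name, not part of Vagrant.
forward_port() {
  # Assumption: the service in the guest listens on localhost:<port>.
  port="${1:-8888}"
  # Everything after "--" is handed to the underlying ssh client.
  # -N: open no remote shell, just hold the tunnel; -L: local port forward.
  vagrant ssh -- -N -L "${port}:localhost:${port}"
}
```

Run forward_port 8888 in a spare terminal, leave it open, and point your browser at localhost:8888 on the host. If you want the forward to survive restarts, a forwarded_port entry in the Vagrantfile is the more permanent alternative.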

After that you should be on your way to learning about big data.

Set your expectations low, and enjoy yourself. No one said it was easy.