Yet another journey to Cloudera Spark and Hadoop Developer Certification - CCA 175

#spark #cloudera #certification #cca175

Is it worth it?

If you are reading this page, you have certainly asked yourself this question. You can approach it from two sides: business and personal.

Firstly, Cloudera itself had a turbulent year, to say the least. There was a merger with Hortonworks, its stock volatility shot through the roof, and investor sentiment was unfavorable for a while. Hadoop is also more than a decade old and its novelty has started to wear off, putting further pressure on investors and company executives alike. However, Big Data, Artificial Intelligence and Machine Learning are still top priorities for Fortune 500 companies, and Cloudera products now have deep roots (tentacles?) in the financial industry, telecommunications, social media and whatnot. The article "Hadoop is Dead. Long Live Hadoop" by Cloudera chief product officer Arun Murthy outlines a new, re-architected, cloud-based vision of its Big Data platform. And Apache Spark, the product I am singularly interested in, is as relevant as it gets.

Secondly, would it be beneficial for your career to have a Spark certification on your resume? I think the answer is unequivocally yes. Good resources in this area are still scarce, and preparing for this exam will make you a better developer: it gives you an understanding of both Apache Spark and Hadoop, plus hands-on practice with aspects that you may not encounter in a routine development project.
Having said that, nothing will replace actual experience of solving real-life problems, and the exam (for better or worse) doesn't go beyond basic transformations and doesn't even approach ML. Still, it is a good first step in that direction. A step that just might land you a job in this fascinating field.

Exam environment

I intended this article to be very practical and down to earth. So, after a somewhat longish introduction, let's dive into the actual content.

Of course, the latest information is always on the Cloudera website. I hope that the information I provide here will stay relevant for a while.

The exam for the Cloudera CCA175 certification is very different from other tests that you'd take in pursuit of Microsoft or Java credentials. First of all, it can be taken from home, which is in some ways more convenient than driving to an accredited center. However, you need a decent machine, a stable internet connection (and preferably a backup hotspot via your phone), and a camera. The exam is also very hands-on. There are no multiple-choice questions and no architectural diagrams. It is essentially a lab. You have a task, and you can take any steps to achieve the goal, preferably with Cloudera tools :)
Even though other companies have started to incorporate this type of question or task into their certifications, it is still unusual to rely solely on such scenario-based questions.
When you register for the exam, you also need to install a Chrome plug-in and enable third-party cookies (please disable them after the exam!). During the exam, you will be connected to a live proctor, whose sole purpose is to observe you via your camera to ensure that you are not cheating. The proctor will ask you to show your desk and surroundings. Be sure to remove all papers and gadgets, and be alone in your room. I was asked several times to remove my hand from my face and to adjust the camera back and forth.
You will be connected to a remote machine in the Cloudera lab. The terminal window is very small, and the default font is tiny. I used an external monitor for the exam, as it would have been very inconvenient to read on a 13-inch laptop. The good thing is that the font size can be increased using Ctrl+ or via the terminal's menu. A window with the first question will be opened for you. You can skip questions and jump forward and back. Most importantly, you can copy and paste (Ctrl-Shift-C/Ctrl-Shift-V) server names, locations etc. That is what I advise you to do, as it is easy to misspell something, and then the whole question will not be counted. Grading is automated and very sensitive to any deviation.
Now open a new terminal, enter "spark2-shell --master yarn", and start!
I used Spark 2 commands, as they make it so much easier to achieve each question's objective, and we are trying to be prepared for the future, right? I also had two terminals open, one for Sqoop/Hadoop commands, and one for Spark. But that is up to you.
Do not hurry. In my experience, there is enough time to solve every task and to do a sanity check and validation afterwards. I finished with 40 minutes to spare. Pay attention to details though, especially to the sample output which is provided for some of the questions. For example, one of the expectations was to have 4 digits after the decimal point, and initially I used a float type that had more precision.
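To make the precision point concrete, here is a minimal sketch for spark2-shell, assuming a question that asks for an average with 4 digits after the decimal point. The data and the column names (`category`, `price`, `avg_price`) are made up for illustration and are not from the exam.

```scala
// Paste into spark2-shell, where `spark` (a SparkSession) is already defined.
import org.apache.spark.sql.functions.{avg, col, round}
import spark.implicits._

// Hypothetical sample data: (product category, price).
val orders = Seq(
  ("books", 10.0),
  ("books", 20.0),
  ("books", 25.0),
  ("toys", 7.5)
).toDF("category", "price")

// avg returns a double with full precision (18.3333... for "books").
val raw = orders.groupBy("category").agg(avg("price").as("avg_price"))

// round(column, 4) keeps at most four digits after the decimal point.
val rounded = raw.withColumn("avg_price", round(col("avg_price"), 4))

rounded.orderBy("category").show()
```

Note that `round` does not pad with zeros, so 7.5 stays 7.5. If the sample output always shows exactly four digits, casting the column with `cast("decimal(10,4)")` is one option to try instead.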
Let's finally go to the questions that you can expect during the exam, and how to prepare for them.

Exam questions and preparation

For exam preparation I used the Cloudera QuickStart VM. It comes without Spark 2 support though, only with Spark 1.6, so at the end of the article there is a link to my GitHub repo, where you can find a document describing the step-by-step installation of Spark 2.3 on this VM. You will need to dedicate 8 GB of memory to this VM on your machine, or use an Azure/Amazon VM.
Based on all evidence that I could gather, there will be 9-10 questions.
Of these, exactly two will be about Sqoop import and export. The rest will be predominantly about Spark. Questions about Hive, Flume, and Impala have most likely been retired by now. You'll need some knowledge of HDFS commands to navigate Hadoop and verify your results.
In the attached repo you can find a Word document with a Spark and Scala cheatsheet, and sample questions and answers with both RDDs and DataFrames. As I mentioned, I took the DataFrame route, and do not regret it.
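Results can also be verified without leaving the Spark shell. The following is only a sketch of one way to do it: the output path and the sample data are hypothetical, and on the exam the location and file format are given in the question.

```scala
// Paste into spark2-shell, where `spark` (a SparkSession) is already defined.
import spark.implicits._

// Hypothetical output location; with --master yarn this resolves to HDFS.
val outputPath = "/tmp/cca175_demo/orders_parquet"

// Hypothetical result of some transformation.
val orders = Seq((1, "CLOSED"), (2, "PENDING"), (3, "CLOSED")).toDF("order_id", "status")

// Write the result, replacing anything left over from an earlier attempt.
orders.write.mode("overwrite").parquet(outputPath)

// Read it back and compare the schema and row count with what was written.
val written = spark.read.parquet(outputPath)
written.printSchema()
val rowsMatch = written.count() == orders.count()
println(s"row counts match: $rowsMatch")
```

Reading the output back catches the most common slips, such as a wrong path, a wrong format or a missing column, before the automated grader does.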
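To give a feel for the difference between the two routes, here is a small sketch of the same task, counting orders per status, done with an RDD and then with a DataFrame. It is not taken from the cheatsheet; the data and column names are invented for illustration.

```scala
// Paste into spark2-shell, where `spark` (a SparkSession) is already defined.
import spark.implicits._

// Hypothetical input: lines of "order_id,status" as they might appear in a text file.
val lines = Seq("1,CLOSED", "2,PENDING", "3,CLOSED", "4,COMPLETE")

// RDD route: parse each line by hand and count with reduceByKey.
val rddCounts = spark.sparkContext.parallelize(lines)
  .map(_.split(","))
  .map(fields => (fields(1), 1))
  .reduceByKey(_ + _)
  .collect()
  .toMap

// DataFrame route: name the columns once, then use groupBy and count.
val orders = lines
  .map(_.split(","))
  .map(f => (f(0).toInt, f(1)))
  .toDF("order_id", "status")
val dfCounts = orders.groupBy("status").count()

println(rddCounts)
dfCounts.orderBy("status").show()
```

Both give the same counts. The DataFrame version refers to columns by name rather than by position, which tends to be easier to get right under time pressure.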

Arun's blog is an absolute killer and a go-to resource for your planning. Some points, like Flume, you can safely ignore now, but overall it is very detailed and motivating. He provides several sample scenarios, even though I don't quite agree with some of the approaches to solving them. That's why I've provided my own solutions linked below, as well as some other exercises.

The Udemy course "Master Big Data: Hadoop & Spark - CCA 175 Preparation" by Navdeep Kaur was a great last-minute refresher, even though it is targeted more at Spark 1.6. It comes with some practice tests, where I used DataFrames to solve the scenarios.
A similar course from ITVersity, although hugely popular, takes a much longer route to get to the point, and skips all over the place. If you are an absolute beginner, it might be of some benefit to you; otherwise it takes too much time, and might be overkill. In my opinion, of course.

Finally, a decent book, like "Spark in Action", can provide a high-level overview, and give you some theory and background about using Spark to handle batch and streaming data. Theory and practice should come together.

Final thoughts

If you are working or planning to work in the field of Big Data, where tools like Sqoop and Spark are immensely popular, and the Cloudera platform has a huge market share, you would benefit from the CCA 175 Spark and Hadoop Developer Certification.
It takes around 40 to 50 hours (1 to 2 months, depending on your schedule) to prepare for this exam from scratch, so it is not overly difficult. We went through the exam environment, preparation, and possible questions. Hopefully I've given you enough pointers to get through this test yourself without too much trouble. Take the challenge, and in 2 months you could be one step above the average developer.

Link to cheatsheet and machine setup

Thanks for reading, here is a cat for you: