Accomplishments – Sept 2016

As a reminder, a one-page summary of all the courses, books & videos
I’ve reviewed in the past year can be found on my Journey Roadmap page.

South Carolina is hot. We went to a “fall festival” in late September where the autumn temperature peaked at 100 degrees with nary a breeze to be found. For a few moments, I questioned my decision to relocate South, but I’m sure the payoff will occur in the months ahead.

I hit a big milestone this month in that my sabbatical from employment has come to an end and I have begun a job search in earnest. Although I’ve submitted my resume to a number of job postings, I’ve not had any bites yet – even for interviews. I’ve observed that Data Science job openings are in abundance around big cities (NYC, Boston, Washington DC, Chicago, San Francisco). I’m not seeing a ton of opportunity here in South Carolina (Greenville area), but I’m looking!

So here’s how I passed the time in September:

edX – Berkeley U – CS110x Big Data Analysis with Apache Spark

I’ve been pursuing 3 main areas of study:

  • R
  • Python
  • Apache Spark

I learned early on that the areas of Data Science and Big Data are very different, although they are often used interchangeably in mass-media articles. Big Data is the technical feat of handling massive amounts of data (storing and retrieving), in amounts larger than any single computer or disk can handle. Data Science is the set of techniques for deriving useful and actionable information from data (of any size – not just large amounts).

Apache Spark is a very interesting bridge between the two. Like Hadoop, it is an architecture for handling data spread across a cluster of many machines (Big Data), and it comes with a rich library of algorithms (Data Science) optimized to run in parallel across those machines and combine the results. You can work in Apache Spark using Java, Scala (a Java-like language) or Python (pyspark).
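To make the “run in parallel, then combine” idea concrete, here is a minimal sketch in plain Python. It is not Spark code and uses no Spark API: threads on one machine stand in for the machines of a cluster, and the function names (`split_into_partitions`, `summarize_partition`, `combine`, `distributed_mean`) and the sample numbers are made up for illustration. Each pretend worker computes a partial result on its own chunk of the data, and the partial results are then merged into a single answer.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_partitions(values, num_partitions):
    """Deal the values out into chunks, one per pretend worker."""
    return [values[i::num_partitions] for i in range(num_partitions)]


def summarize_partition(partition):
    """Work done locally on one chunk: return a (sum, count) pair."""
    return (sum(partition), len(partition))


def combine(left, right):
    """Merge two partial (sum, count) results into one."""
    return (left[0] + right[0], left[1] + right[1])


def distributed_mean(values, num_partitions=4):
    """Average the values by summarizing chunks in parallel, then combining."""
    partitions = split_into_partitions(values, num_partitions)
    # Threads stand in for cluster machines (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(summarize_partition, partitions))
    total, count = reduce(combine, partials, (0, 0))
    return total / count


if __name__ == "__main__":
    # Made-up sample values, purely for illustration.
    sample = [100, 98, 95, 91, 88, 84, 79, 73]
    print(distributed_mean(sample))
```

The point of splitting the work into a per-chunk step and a combine step is that no single worker ever needs to see all of the data, which is what lets the same approach scale past what one computer can hold.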

This second course, offered by UC Berkeley on the edX platform, dives into Machine Learning algorithms using Apache Spark. As in the first course (CS105x Introduction to Apache Spark), the lecture videos were brief and not very in-depth. However, the meat & potatoes were the interactive labs, where 90% of the learning occurred. These labs were extraordinary. They provided links to the Spark API documentation where appropriate, walked you through simple examples before sending you off to code on your own, and provided many checkpoints to make sure you were doing things correctly.
