Accomplishments – May 2016

As a reminder, a one-page summary of all the courses, books & videos I’ve reviewed in the past year can be found on my Journey Roadmap page.

The highlight of May was attending ODSC East (Open Data Science Conference) in Boston. I wrote extensively about it in my previous blog post, so I won’t repeat anything about it here.

May provided a few more steps forward in my Data Science knowledge. I have also spent a good portion of June emptying my house in preparation for a sale, for which I now wait. And wait. And wait…

I’ll be increasing the frequency of this blog to weekly, with some topics of interest like:

  • Note-taking methods for MOOCs
  • Using Mnemosyne to stay sharp on techniques
  • The power of multi-core processing in R
  • Blogs I follow
  • … and whatever else strikes my fancy

But for now, I’m well overdue for my May recap, so without further ado:

Completed Items

Coursera – Machine Learning: Classification

This was the third course in the 6-part Machine Learning series developed by the University of Washington. Once again, the team of Emily Fox (professor at UoW) and Carlos Guestrin (founder & CEO of Dato) hit a home run with this Python-based course.

The topics covered were standard in the realm of Classification:

  • Linear & Logistic Classifiers
  • Overfitting
  • Decision Trees
  • Handling missing data
  • Boosting & AdaBoost
  • Precision & Recall
  • Huge datasets and Stochastic Gradient Descent

As with the previous courses, there were relevant online quizzes to test your comprehension of the material. There were also engaging IPython notebooks that reviewed the lecture material and walked you through Python code to implement the algorithms discussed, as well as to explore how those algorithms handle various extreme situations using real data.


I’m not an experienced Python developer, so I did find these assignments challenging. However, they provided a good amount of skeleton code as well as checkpoints to test the code, and I was able to successfully complete all assignments. More importantly, getting my hands dirty with the algorithms truly solidified the theory covered in the lectures.

The fourth course in the series, Clustering & Retrieval, begins next week and I cannot wait!

Getting Started with Python Data Analysis


by Phuong Vo.T.H, Martin Czygan

Back in January, I purchased nearly a dozen books from Packt Publishing during a post-holiday sale where you could get unlimited electronic books (various formats, including DRM-free PDF) for $5 each. Let me tell you, I went to town.

Well, I can see why these books were so inexpensive. They’re unreviewed, non-proofread whitepapers written by authors looking to pad their resumes as being “published authors”.

“Getting Started with Python Data Analysis” had a promising synopsis:

With this book, we will get you started with Python data analysis and show you what its advantages are.

Topics covered included:

  • pandas
  • numpy
  • matplotlib
  • various data retrieval libraries
  • scikit-learn

Unfortunately, the book was riddled with code samples that would not run. For example:

from matplotlib import pyplot as plt
import numpy as np
x = np.linspace(0, 3, 6)
y = np.power(x,2)
line = plt.plot(y, color='red', linewidth=2.0)
plt.setp(line, marker='o')

generated an error with the line.set_linestyle function. Instead, after reviewing a different Python book, I was able to get the snippet to run with the following:

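from matplotlib import pyplot as plt
import numpy as np
# six evenly spaced points between 0 and 3, and their squares
x = np.linspace(0, 3, 6)
y = np.power(x,2)
# set the line style up front; y is plotted against its index since x is not passed
line = plt.plot(y, color='red', linewidth=2.0, linestyle='--')
# add circular markers to the existing line
plt.setp(line, marker='o')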

The book shows little awareness of which topics it has already covered, and you will see concepts dropped in out of the blue without explanation. For example, the book’s contour plot code introduces np.newaxis, something not referenced before or after that snippet.
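
For anyone else puzzled by it: np.newaxis is NumPy’s way of adding a length-1 axis to an array, which lets broadcasting build the 2-D grid of values a contour plot needs. The sketch below is my own illustration with a made-up grid and made-up variable names, not the book’s code:

```python
import numpy as np

# Hypothetical grid (not the book's data): 5 x-values and 4 y-values.
x = np.linspace(-1, 1, 5)   # shape (5,)
y = np.linspace(-1, 1, 4)   # shape (4,)

# np.newaxis inserts a length-1 axis, turning y into a column of shape (4, 1).
y_col = y[:, np.newaxis]

# Broadcasting the (5,) row against the (4, 1) column yields a (4, 5) array:
# one height value for every (x, y) pair on the grid.
z = x ** 2 + y_col ** 2

print(y_col.shape, z.shape)
```

An array like z could then be handed to a contour function such as plt.contour(x, y, z), assuming matplotlib is imported as in the snippets above.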


Finally, the spelling mistakes were really distracting. Below, the spelling of “unequal” in the graph is bad enough – but the fact that it’s also misspelled differently in the code caused me a bit of head-scratching.



Even setting aside the typos and code errors, the material is light on each subject. Give this book a pass. I’ll let you know if I uncover something better.

Taming Big Data with Apache Spark and Python – Hands On!

At ODSC, I heard quite a bit about Spark. I’m only passingly familiar with Big Data topics like Hadoop and MapReduce, but my understanding is that Apache Spark is an exciting new platform that bridges the world of big data storage with data science analysis. It brings to bear many of the techniques I’ve been learning without necessarily requiring you to master the world of clustered data storage and retrieval.

For $25, I tried out the Udemy Course “Taming Big Data with Apache Spark and Python” and I truly believe I got more than my money’s worth out of this course.

For a hands-on introduction to Spark, I don’t see how this can be beat. You are walked through how to install Spark locally, you’re given some sample data to work with and you have your first Python-based Spark script up and running in no time. You will need to have had some experience with Python (certainly not expert-level) to get the most out of the course.

Throughout the 46 videos, the instructor (Frank Kane) does an excellent job of introducing the topic, providing theory (“here’s what we’re gonna do”), stepping through the code (“here’s how we’re going to do it”) and recapping (“here’s what we just did”).

There weren’t any quizzes or formal graded assignments like you would find in your typical Coursera or edX course, but that was just fine here. You were certainly given “try it yourself” exercises whose solutions were covered in a subsequent video.

I found the material easily accessible for the most part, although I did struggle with the advanced topic of network graphs (finding friend connections iteratively). At the end, you attempt to set up an Amazon Web Services (AWS) instance for trying out Spark on a much larger data set (as opposed to the small samples you practice with on your laptop). However, despite setting everything up successfully, I kept running into memory errors with the AWS instance and couldn’t get the script to run. Debugging would have required a more advanced understanding of Spark than this course provided.

All in all, I recommend this course as a great starting point.

Looking ahead to June

Well, technically, there’s only a week left in June as I write this. And I did take a 2-week hiatus from data science learning to deal with prepping my house for sale. However, I am in the midst of the following:

  • Continuing to pore over An Introduction to Statistical Learning with Applications in R, which has become my “bible” for machine learning techniques.
  • I’ve been interested enough in Spark to pursue a second course in the subject. On the edX platform, “BerkeleyX: CS105x Introduction to Apache Spark” kicked off last week and it looks very promising. Through the generosity of Databricks, all students are given a free account to set up a 6-GB Spark instance to be used for the course.
  • I’m taking a step back to review some old material. Next week, the fourth course in UoW’s 6-part Machine Learning series starts up, and I’m looking back at the notes & code for the 2nd & 3rd courses.
  • I’m continuing to putter around with the JHU Data Science Capstone on the subject of Natural Language Processing. I keep stalling out in Week #2 as other items constantly take higher priority. I’m most likely going to have to re-register in July and finish this off for good.
  • I’m also spending more time honing my skills in R and Python through the use of Mnemosyne. I’ll have a full blog post on that subject in a couple of weeks.

As always, I look forward to your comments!