As a reminder, a one-page summary of all the courses, books & videos
I’ve reviewed in the past year can be found on my Journey Roadmap page.
It’s been a summer of incredible transition for me as I’ve made a permanent move from the relatively chilly climate of New York (old house shown to the right) to the equatorial heat misery of South Carolina. I can only hope that this investment pays off in the winter when I’m enjoying a balmy 50-degree day while the Northeast shovels out of a blizzard.
I’ve not posted an “Accomplishments” blog since May, but that certainly shouldn’t indicate that I’ve not been pursuing Data Science over the summer. Far from it! Although I hadn’t completed any new courses or books in June and July, when I wasn’t busy packing up or tossing out all of my life’s possessions, I took advantage of the time to revisit a lot of the topics I’d covered in the past year. I began creating hundreds of Mnemosyne flashcards to sharpen my skillset. I retook the UoW Machine Learning: Regression Course, going over all code examples in painstaking detail. I also re-read every word of “An Introduction to Statistical Learning with Applications in R”, working through all of the R labs and exercises and incorporating sample code into my Mnemosyne card set. It was an absolutely necessary activity, and I feel much stronger as a result. Consider revisiting some old courses you’ve taken – you’d be surprised how much you can still get out of them on a second pass.
In August, however, with the move complete, a number of endeavors came to a successful close.
Coursera – Machine Learning: Clustering and Retrieval
This is the fourth course in the University of Washington Machine Learning Specialization on Coursera. Grouping and association were the theme here. Diving into large datasets of Wikipedia article entries, we found commonality between groups of articles, implemented various measures of “alikeness”, assigned articles to topics based on word groupings and made predictions on new articles based on models built from large training sets.
By far, this was the most challenging course of the series to date. It covered a number of topics I’ve seen before, such as Nearest-Neighbor searches, k-means Clustering and dendrograms. But it also went into depth on a number of topics that I’d not seen before. These included:
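To give a flavor of what a measure of “alikeness” between articles can look like, here is a minimal sketch using cosine similarity over raw word counts. It is plain Python, not the GraphLab Create code from the course, and the tiny `articles` dictionary and the helper names are made up purely for illustration.

```python
import math
from collections import Counter


def word_counts(text):
    """Turn raw text into a bag-of-words vector (word -> count)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def nearest_neighbor(query, corpus):
    """Return the title of the corpus article most similar to the query."""
    query_vec = word_counts(query)
    return max(
        corpus,
        key=lambda title: cosine_similarity(query_vec, word_counts(corpus[title])),
    )


# Hypothetical toy corpus standing in for Wikipedia article text.
articles = {
    "guitar": "the guitar is a string instrument played with fingers or a pick",
    "piano": "the piano is a keyboard instrument with hammers striking strings",
    "volcano": "a volcano is a rupture in the crust that lets lava and ash escape",
}

best_match = nearest_neighbor("lava and ash erupt from the crust", articles)
```

A brute-force search like this compares the query against every article, which is exactly why approximate techniques become attractive once the corpus gets large.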
- Locality sensitive hashing for approximate NN search
- KD trees
- Gaussian mixture models
- The EM algorithm
- Latent Dirichlet allocation and Gibbs sampling
There was a lot of ground covered and the message boards did reflect some frustration with how much content was packed into each week. From my point of view, for $79, the more content the better! I honestly feel like I grasped about 80% of some of the advanced concepts, but I do feel like I was left with a solid foundation from which to pursue further inquiry.
The assignments, as usual, were self-guided iPython notebooks that walked you through the process of implementing a number of the mathematical algorithms with Python code. If taken slowly enough and with constant reference to notes taken during the lectures, they were achievable and added to the educational experience.
I will observe that questions posted in the forums went unanswered far more frequently than in previous courses. While there were fellow students who were incredibly helpful, you could find yourself on your own if you were seeking assistance on a question.
Finally, there was a bit of drama during this course in that Turi (formerly Dato, formerly GraphLab), the company founded by one of the course creators, was acquired by Apple. It was questioned whether the primary product used in this course, GraphLab Create, would remain available to students or whether the professors themselves would even deliver on the remaining two courses (Recommender Systems & Dimension Reduction as well as the Capstone Project). If I sold a company off to Apple, I might just take my bags of cash to Tahiti and live out my life sipping Banana Daiquiris.
edX – Berkeley U – CS105x Introduction to Apache Spark
My studies in the past year have followed 4 different tracks: R, Python, Mathematics and the newcomer, Apache Spark. I initially was going to stay focused on the data analysis and algorithm techniques and not dive into the Big Data world (which is its own complementary specialty). However, out of curiosity, I took the short Udemy course, Taming Big Data with Apache Spark and Python back in May and found myself pleasantly surprised at how much I was intrigued by this technology.
The University of California, Berkeley has developed their own edX trilogy around this platform called Data Science and Engineering with Apache Spark. They’ve partnered with Databricks to bring students free access to Spark servers hosted on Amazon Web Services for a year.
The first course I completed was CS105x Introduction to Apache Spark. There are two parts to this 3-week course (although you are given 6 weeks to complete all the material at your own pace):
- Short video lectures are given by Dr. Anthony D. Joseph, who has the most awkward teleprompter reading style. He pretty much reads the slides to you and doesn’t add much from his personal appearance. The content is very basic, and served as a nice counterbalance to the brain bruiser UoW Machine Learning course I was taking at the same time. You have to answer quiz questions based on the lectures, and they seemed more based on trivia contained in the lecture than on any real core concept.
- Fortunately, these videos were only a small part of what was offered. The true value was in the self-paced iPython Notebook labs. These labs are where the real learning took place.
They truly guide you from basic Spark commands (for which you needed some familiarity with Python), through small working examples processing the entire works of Shakespeare, to the final lab where you analyzed a month’s worth of NASA web logs.
I found these labs to be very clear with a straightforward progression of skillsets. By the end, you are cut free to perform your own analysis on a large dataset.
This was one of the few MOOCs I’ve encountered where there was very active participation from the course staff in the message boards. Most questions appeared to get answered in a timely manner, which is something I just haven’t seen elsewhere.
The biggest complaint about this course was the lab grading mechanism. While you worked through the steps, the platform had checks along the way to let you know whether you had done it correctly – and that part was fantastic! But at the end, you will need some incredible patience to jump through the hoops of the final submission process. You have to copy & paste various codes from one website to another website to yet another website. Aside from this test of patience, I felt I got more than my $49 worth on this course and immediately registered for the second course in this series, CS110x Big Data Analysis with Apache Spark, which promises to move beyond the basics and dive into the Machine Learning libraries and other tools built into Spark.
The Signal and the Noise: Why So Many Predictions Fail – but Some Don’t (Nate Silver)
This is the book about data analysis that’s meant for the general public. Written by Nate Silver (founder of fivethirtyeight.com and famous for his accurate calling of the 2012 presidential election), it’s a whirlwind tour of a variety of topics that all have a common theme: Bayesian thinking is the way to go. In this book, he covers:
- Online poker
- Political television pundits
- Earthquake prediction
- Computerized chess
- Climate change
- Weather forecasting
It’s one of those books you just feel smarter for having read. Sometimes, while deep in equations and algorithms, you lose sight of why it is we use these techniques and have developed the technology to handle the volume of data we can now collect. This book brings the forest right back into focus, and paints it a beautiful green in the process. I highly recommend this book.
On the deck for September
- edX – Berkeley U – CS110x Big Data Analysis with Apache Spark
- UoW Machine Learning: Recommender Systems & Dimensionality Reduction
- Python Machine Learning
- Johns Hopkins Data Science Capstone – and I will complete it this time around!
- …and become active in Kaggle. Time to get cracking at some datasets!
My next post will be of a somewhat personal nature. Now that I’ve relocated to South Carolina, I’m ready to begin pursuing the next phase of my career. I’ll share with you a little of my history, as well as my fears and hopes for the future.