Note to future self: A month with major holidays in it (e.g. Thanksgiving) is not the time to load up on four online courses simultaneously along with whatever self-learning plans you decide to make. I had to apologize for bringing my laptop to the Turkey Table, but I needed every second to get through this month! Lesson learned, and I’m going to ease up a bit in December.
So – here’s how I progressed in November:
Coursera – Data Analysis and Statistical Inference
Let’s start with the absolute best online education experience I’ve had so far: the 9-week-long, soup-to-nuts introduction to Probability and Statistics given in Coursera’s Data Analysis and Statistical Inference course, put together by Duke University. Take a look at the Syllabus on their course page to see the list of topics covered.
This is where I’d recommend anyone who is serious about Data Science to begin. You just can’t pursue this without mathematics and this course will give you a very solid foundation.
My undergraduate degree (C’93) was Mathematics and Physics. But I learned more about statistical methods through this course than I ever did in my undergraduate and graduate studies. These 9 weeks were intense. There were 7 units’ worth of material, quizzes, labs, optional practice homework assignments, a mid-term exam, a final exam and a course-long project (you can read mine – Inventory Damage Analysis) to demonstrate your new-found knowledge and analyze actual data. Fortunately, there were built-in breaks (7 units over 9 weeks) that allowed you to get caught up if needed or have extra time to review the material.
The lecture videos were professionally produced (hats off to Dr. Mine Çetinkaya-Rundel and her team), the logical progression of topics felt natural, the labs & quizzes enhanced my understanding (rather than feeling trivial or unrelated) and the exams were challenging without being impossible. All in all, this is a course you definitely should take if you’re reading this blog and heading on the same journey.
Coursera – Statistical Inference
This is the 6th of 9 courses in Johns Hopkins’ Data Science Specialization. So far, I’ve been very happy with the courses in this series, with the previous course on Reproducible Research being the highlight to date.
However, there is a bit of a stumble with the Statistical Inference class. It covered a lot of the same material as found in the Duke University course (described above) but felt very rushed and was a lot less polished.
Topics included Probability, Bayes’ Theorem, Normal & Poisson distributions, Confidence Intervals, Hypothesis tests, Power and Bootstrap methodology. Although R was used as the primary demonstration tool, make no mistake: this is a mathematics course. If I hadn’t already been deep into the Duke course, I would’ve struggled mightily with this offering.
I found the practice homework assignments with explanations very helpful for the quizzes and the assigned project did help gel a number of the concepts (you can see my submissions Exploration of the Exponential Distribution and Analysis of Tooth Growth Data).
My biggest complaint (aside from the rushed pace) was the sloppiness of some aspects of this course. The lecture videos did not perfectly align with the provided notes. In fact, one of the provided PDFs had several slides devoted to a topic (the Jackknife) for which there was no associated video, and I ended up discovering it on YouTube. The slides themselves were riddled with spelling errors. While I understand that scientists, mathematicians and technologists are notoriously bad spellers, it was distracting and I just wanted to offer the instructor personal tutoring on how to use SpellCheck.
Coursera – Data Manipulation at Scale: Systems and Algorithms
I wrote an early review last month of the Data Science at Scale Specialization offering at Coursera given by University of Washington. I’ll repeat what I wrote as nothing has changed after having completed this 1st offering in their specialization.
In looking at the subject matter from the Data Science at Scale specialization (MapReduce, Hadoop, Parallelization, etc.) I thought it would be a nice complement to get a “big picture” view of the current state of technology.
What you get instead is a very sloppy, typo-ridden, outdated (videos are from 2013), disjointed set of video lectures from a professor who is obviously smart & well versed in the subject, but is a lousy educator. He represents the type of professor whose class you skip because you can learn more from the book than you can from him.
The exercises have minimal direct relevance to the lecture material and assume you already have proficiency in R, SQL and Python (which, fortunately, I do – but I really empathized with those who didn’t).
In one of the online exercises, you are asked to randomize a set of data and then provide the mean of a particular variable in that data set. Well, common sense would tell you that everyone is going to get a slightly different answer. But whoever set up the online assignment thought that only HIS answer could be correct, and every student on the discussion forums reported they could not get this marked correct. Unforgivable is the fact that there was no “community TA” patrolling the message boards looking for reported issues like I saw in all of the other Coursera offerings. This entire program seems… abandoned.
Don’t waste your money on this one.
Update: Well after the deadline passed on the online exercise, the professor did get on the message boards, apologized for being too overwhelmed to pay attention to what was happening in the course and corrected the quiz to allow for a range of means resulting from a random sampling.
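To make the grading problem concrete, here is a minimal Python sketch of why an exact-match answer key can’t work for this kind of exercise. Everything in it is an assumption for illustration: the synthetic population, the sample size, the tolerance and the helper names `sample_mean` and `is_accepted` are all hypothetical, and this is not the course’s actual data or grader.

```python
import random
import statistics


def sample_mean(values, sample_size, seed):
    """Mean of a random sample drawn without replacement, using the given seed."""
    rng = random.Random(seed)
    return statistics.mean(rng.sample(values, sample_size))


def is_accepted(answer, reference, tolerance):
    """Accept any answer within +/- tolerance of the reference value."""
    return abs(answer - reference) <= tolerance


# Synthetic population for illustration only (not the course's data set).
population = list(range(1, 1001))
population_mean = statistics.mean(population)  # 500.5

# Five hypothetical students, each with a different random seed.
student_means = [sample_mean(population, 200, seed) for seed in range(5)]

# Exact-match grading: compare every student with the first student's answer.
exact_matches = [m == student_means[0] for m in student_means]

# Range-based grading: accept anything close to the population mean.
range_matches = [is_accepted(m, population_mean, tolerance=100)
                 for m in student_means]

print("student means:  ", student_means)
print("exact matches:  ", exact_matches)
print("within the range:", range_matches)
```

Each seed stands in for a different student’s random shuffle, so the means differ while still clustering around the population mean. That is why grading against an accepted range, as the corrected quiz did, makes more sense than grading against a single value.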
He attempted to cover a lot of different technologies and techniques in this course (25 different lecture series, each with 4 – 5 videos) but ultimately failed to instill anything substantial. And the constant correcting of the slides’ typos, right in the middle of his lecture, was so damn… unprofessional. How hard would it be to pause the recording, correct the slides, and continue?
Hands-on Programming with R
I watched a free online webinar about R given by Garrett Grolemund and was impressed by it enough to acquire his book Hands-On Programming with R: Write Your Own Functions and Simulations. It’s a relatively short book and is meant not only for those new to R, but also those new to programming in general.
There are plenty of introduction-to-programming books out there that use Java, Python or some other language as the foundation. One doesn’t really think of R as a “language” but more as a tool to do some math stuff. But it is indeed a fully developed language with a rich syntax, and this book fills a gap by focusing on the programming aspects, whereas all of the other books in my collection are about libraries, algorithms, graphics and other techniques in R.
So while the beginning of the book may be too basic for an experienced programmer who knows what variables are and has written a conditional statement or two in their day, there were two sections that I found were more than worth the cost of this book:
- An in-depth discussion of how R does its variable scoping
- Performance tuning with a focus on vectorized processes – a concept that’s just not built into Java and many other languages.
You get hands-on experience creating a couple of large, intricate programs (a deck of cards shuffler and a slot machine with complex payout rules) that make use of the concepts. Well done, Garrett!
SAP HANA – As much as I love this technology and hoped to be able to swim in the world of this in-memory database, I learned that it was no longer on my company’s priority list for 2016. So I’ve abandoned the books I was reading and the hands-on exploration of SAP HANA One (the HANA offering on Amazon Web Services). Maybe I will pick it up again someday, but for now I dropped it to allow time for other topics.
Coursera – Practical Predictive Analytics: Models and Methods – This is the second offering in the “Data Science at Scale Specialization” that I discussed at length above. To my surprise, it started one week after the first course and I accidentally enrolled in it (rather than completing the other first and then attempting this one). It turns out the two had nothing to do with each other, with this course focusing more on Data Science techniques (Random Forests, Clustering, etc.).
However, like the first course, it provided very little practical education and was more of a confusing overview. At the end of the course, the final project was to work on a Kaggle dataset and submit your analysis as a PDF for review. That’s like spending hours on typo-ridden slides discussing wildlife and then sending you out on an African Safari to hunt and bring home a lion.
The only positive that came out of this was that it did introduce me to Kaggle and I’ve been having some fun there, which I will discuss next month.
Doing Data Science: Straight Talk from the Frontline – This book has the potential to become one of my favorites. However, after Chapter 4, I realized I was not technically or mathematically ready for it.
Its audience is someone who’s gone through extensive studying on the subject already and has some hands-on experience with R and various analytic techniques (even if just at an academic level). It’s not a tutorial or a reference book. It’s a frank discussion about what it’s like to do Data Science in the real world, with practical (and somewhat advanced) discussions of the methodology used.
I loved the blunt style of the authors (Cathy O’Neil, Rachel Schutt), which makes you feel like you’re in a coffee shop with them and talking plain dirt about how your day went. What sold me on this book was the following sidebar on page 56 that I landed on while flipping through the pages:
WTF: So Is It an Algorithm or a Model?
Sold. I will be picking up this book again next year when I’m better prepared to do so.