It was a good month. A very good month. A month where I’m actually sitting down on the last day of the month to summarize my activities.
9 months. Bite-sized chunks. And I’m amazed at how far I’ve come from just knowing about Data Science and Big Data as buzzwords that a technology executive should know something about.
But what a deep, rich topic this is! And you couldn’t ask for a more open community, seemingly bending over backwards to bring more people into the fold. I can name a large number of influences in my journey – but I’ll save that for a later post in March.
Now, onto the Good, the OK and the Ugly for this month:
Coursera – Developing Data Products
After a couple of weak courses, this 9th and final formal course in the Johns Hopkins University Data Science Specialization was one of the most fun. It was a welcome, creative break from the heavily mathematical and theoretical Regression and Machine Learning courses.
In this course, you learn how to take all of the analytics you developed in R and share it with the universe in an interactive, browser-based format using the Shiny development platform. Go ahead and take a look at my project. It’s a simple, interactive graph, constructed in a few hours (and as a novice, at that). Here’s an absolutely amazing Shiny application developed by some guy in New Zealand, if you want to see an example of the full potential of this platform.
You also learn how to create presentations using Slidify and RStudio Presenter, where you transform your R code into PowerPoint-like slides which can be easily regenerated if you alter your R analysis. My very brief “pitch” presentation for the small app I created can be found on Rpubs.com.
Content-wise, this was a much lighter course. I learned more from the tutorials on the various application and package websites than from the course lectures. But this course did expose me to a large number of tools that can be used to share whatever data insights I might be lucky to find in the future. And the final course project to get my feet wet in those technologies was worth the price of admission!
Now… onto the final step of this JHU Specialization – the Capstone Project!
Statistics Done Wrong
What a fun, little book this was. It’s like having a big brother who, after you’ve taken many semesters of hard-core statistics, pulls you aside, gives you a cold beer and tells it to you like it is. Alex Reinhart should be given a lot of credit for surveying the field of science and analysis, pointing out the traps that the professionals fall into and what you should do to avoid them.
Remember that p-value thing you spent a lot of time worrying about? This book gives you the straight scoop on this and other subjects that all scientists and statisticians should know before publicizing results. It’s light on the math (the text assumes you’ve already done the time with the equations elsewhere) but it is heavy on the real-world pitfalls of interpreting the math incorrectly and shares stories from the field where it’s bitten some research groups in a tender spot.
The chapters are short and bite-sized, but the value-per-page ratio is among the best of any book I’ve read so far. You should grab and digest this work – but only after you’ve put your time in with other statistics books and courses elsewhere. This is something that needs to be read after you’ve “learned” statistics, to help iron out the wrinkles in your knowledge.
Stanford’s Statistical Learning
The Stanford Coursera course on Machine Learning, by Andrew Ng, is famous for launching the Coursera platform and was one of the first courses I took. But there’s these other guys at Stanford who may have one-upped Mr. Ng with their own edX
Starring the “Click and Clack” of Machine Learning (Trevor Hastie and Rob Tibshirani), these loveable guys take you on a guided tour of topics like regression, classification methods, cross-validation, trees and random forests, support vector machines and the like.
This 9-week course comes with an accompanying textbook (also free), An Introduction to Statistical Learning with Applications in R (6th printing, 2015), which is rich in examples, with lab exercises in R and exercises in each chapter.
You don’t need to register for the course to see the videos. A full list of lectures is available online already. However, the course offers challenging quizzes at the end of each lecture (1 or 2 questions each). The goal is to pass with 50% (which is more challenging than you think), but the questions require thought to extend what was learned in that lesson and you will learn much from getting a question wrong. The course also has a very active discussion forum that is far more enriching than any forum in courses I’ve taken to date.
I’ve made it through Week 7 (with two more weeks to go in March). While I’ve attempted to read the textbook along with the lecture videos, I plan to go back through the text a second time once this course is over and work through all of the labs and exercises again. This is one course I’ll be referring to time and time again.
Learning Python
After 6 months of exclusively working in R, I decided I wanted to be bilingual and gain knowledge about Python. I’ll refer you to the Programming section of my Journey’s Roadmap to refresh you on the little I’ve done in Python so far. But with an Amazon gift card I got for Christmas, I took the plunge on the 1500 page tome, Learning Python, by Mark Lutz. I’m proud to say that I’ve made it through 500 pages in February and am on track to finish it sometime in April (assuming I skip over the index pages at the end).
This definitely shouldn’t be the first thing you dive into to learn Python. Take a gentle tutorial or two first. Read Python for Informatics to get a handle on the basics. Maybe write a program or two on your own to get a feel for the language. Then, definitely acquire this book.
It does not take the “learn by doing projects” approach that Hello! Python failed miserably at. It really does read like a textbook, covering very narrow topics in each of its many, many chapters, punctuated with a large number of small, demonstrative examples. But you will need a book like this to fill in the gaps that the other books leave out. See my complaint about how **kwargs was introduced in my Hello! Python diatribe. Well, after Chapter 18 of “Learning Python” I think I understand what it’s all about. I certainly won’t remember **kwargs six months from now, but when I encounter it again in code I may be reviewing, I’ll know exactly where I’ll be going for a refresher.
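For my own future reference, here is a minimal sketch of what **kwargs does, as I currently understand it. The function name describe_course and the way it formats its output are made up for illustration; they are not taken from the book.

```python
def describe_course(title, *args, **kwargs):
    """Build a one-line summary from a title plus any extra arguments.

    Hypothetical helper, for illustration only.
    """
    # *args collects extra positional arguments into a tuple.
    # **kwargs collects extra keyword arguments into a dict.
    parts = [title]
    parts.extend(str(a) for a in args)
    # Sort by key so the output order does not depend on call order.
    parts.extend(f"{key}={value}" for key, value in sorted(kwargs.items()))
    return " | ".join(parts)


# Keyword arguments the function never declared end up in kwargs.
summary = describe_course("Learning Python", "Lutz", pages=1500, chapters=41)
print(summary)  # Learning Python | Lutz | chapters=41 | pages=1500

# The same ** syntax works in reverse at the call site:
# it unpacks a dict into keyword arguments.
options = {"pages": 1500, "chapters": 41}
assert describe_course("Learning Python", "Lutz", **options) == summary
```

The point I kept missing is that ** means two related things: in a function definition it gathers keywords into a dict, and in a function call it spreads a dict back out into keywords.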
The book may seem slow-paced and repetitive at first, but that was fine with me. Despite claims I read about Python’s simplicity being its strength, I’ve come to learn that it is indeed a very rich, complex, powerful language. This text, with on-topic Q&A’s at the end of each chapter, appears to guide you through this language better than most books out there.
I’ll follow up next month, as I plan to be up through Chapter 30 (out of 41) of this 10 lb weapon of knowledge.
Nothing to report in the OK category this month. The activities I pursued were either excellent – or clunkers. See below.
edX & Columbia University – Machine Learning for Data Science and Analytics
Columbia University announced a three-course “X-Series” in Data Science several months back. I took the first course, “Statistical Thinking for Data Science and Analytics” and reviewed it negatively last month.
I took the second course in the series “Machine Learning for Data Science and Analytics” hoping for a better outcome. Instead, it slid even further backwards in terms of meeting expectations.
Here’s what I wrote on a message board in a thread where students were expressing how baffled they were by Quiz questions that were completely irrelevant to the covered material (which was itself irrelevant to Machine Learning):
I appreciate all of the comments in this thread. At the end of the day, you have to ask yourself “Are you learning anything?”. And after having completed “Statistical Thinking for Data Science and Analytics” and made it through the end of week 3 of “Machine Learning for Data Science and Analytics” I can honestly say “no” and will be discontinuing this series from Columbia University.
I appreciate the hard work they put into this, but it’s apparent they do not have a comprehensive Data Science curriculum. This series is a mashup of lectures where they get some people from the math department, some people from the computer science department and some people from the applied sciences (e.g. biology), string together some unrelated lectures and call it a “Data Science” program.
This is not coming from someone who is bitter about the homework grades (I’ve scored very high on the practice problems), but after watching meandering lectures on NP-hard problems and then struggling with the language and bugs of the assignments, I’m sitting here realizing that I’m not moving forward in my interest in Data Science.
Aside from the material, which was a patchwork of irrelevant subjects, there were too many bugs in the quiz assignments (e.g. you had to type in 4.63, as 4.6 would be marked wrong without any indication of the expected precision in the answer). In fact, I just checked back after having abandoned this course two weeks ago and see this announcement:
We have removed homework 4e, Ensemble Classifiers, from grading. It will not count towards your overall grade in the course. In addition, we have changed the passing grade for this course to be 50% or higher.
Thank you for all your comments and feedback,
Sadly, similar announcements were posted every week. Stay away from this program.
Mathematical Monk – Machine Learning
The 19-yr-old Data Science wunderkind, Safia Abdalla, turned me on to the Mathematical Monk, an enigmatic post-doc at Duke University who posted 160 (!) lecture videos on YouTube on various topics in Machine Learning.
Using a style reminiscent of Khan Academy, each lecture is between 10 and 15 minutes in length, given in a smooth voice and utilizing a colorful digital pen. Unfortunately, those are its only strengths.
I watched and took notes on the first 15 videos and found them to be unnecessarily confusing, far too heavy on the math theory (no practical examples covered), and very inconsistent in the variable notation. As a former Physics teacher (who also has a degree in Mathematics), I consider inconsistent variables the biggest sin you can commit in teaching math. What is X vs x? Or X’ vs X vs x’ vs x? Does x represent a single value or a vector of values? It changes from lecture to lecture and sometimes within the lecture itself.
I know this guy is very smart, and far more of an expert in the field than I. But even so, early on he muffed the clichéd Cancer Test table, stating each square represented whether a test should be given, rather than the rates at which the test gives false positives and negatives.
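To make the distinction concrete, here is a small sketch of how the error rates in such a table are computed. The counts are invented purely for illustration (they do not come from the video or from any real test), and the helper name error_rates is hypothetical.

```python
def error_rates(true_pos, false_neg, false_pos, true_neg):
    """Return (false positive rate, false negative rate) for a 2x2 test table."""
    # False positive rate: share of healthy people the test wrongly flags.
    fpr = false_pos / (false_pos + true_neg)
    # False negative rate: share of sick people the test wrongly clears.
    fnr = false_neg / (false_neg + true_pos)
    return fpr, fnr


# Hypothetical counts for 1,000 people: 100 have the disease, 900 do not.
fpr, fnr = error_rates(true_pos=90, false_neg=10, false_pos=45, true_neg=855)

print(f"False positive rate: {fpr:.1%}")  # False positive rate: 5.0%
print(f"False negative rate: {fnr:.1%}")  # False negative rate: 10.0%
```

Each cell of the table is a count of outcomes, and the rates fall out of dividing within each true condition. None of it says anything about whether the test should be given.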
Based on the quality of those first 15 videos, I did not proceed with the remaining 145. Maybe they got better over time?
Coursera – University of Washington – Data Science at Scale
Coursera offers various “Specializations” – groupings of courses given by the same University on a given topic (for example, the Johns Hopkins Data Science Specialization). You pre-pay up front at a discount and you have one year to complete all of the courses in the specialization, with the option to retake courses if your schedule forces you to do so. That flexibility and discount really work in favor of those students who are committed to completing it.
I’ve already expounded at length on how disappointing the three courses in the Data Science at Scale Specialization were. But I did watch all of the videos and completed the assignments. And I may have learned a thing or two – even if it fell below my expectations.
However, there is one final course in this specialization – the Capstone Project. So far, it’s been delayed for 4 consecutive months and there’s no indication that it will ever be released. I’ve asked for a refund and so far Coursera has refused.
I’ll write more about these at the end of March!