I was just about to sit down and begin work on a Machine Learning project but then I realized, with today being Valentine’s Day and all, I need to write about my most recent love (my wife is still my first) – Data Science.
I’m only two weeks overdue on my January summary – but with this, I’m caught up & vow to write more than just my monthly summary. Continuing last week’s theme, here are the Good, the OK and the Ugly…
Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die
This book was INCREDIBLE! It reads like Freakonomics for Data Science.
I found myself rather bogged down in the details of machine learning algorithms, statistical theory and programming techniques. This book, meant for the non-technical person, helped me once again see the forest for the trees.
Each chapter was an intriguing page-turner, describing various data science concepts at a very high level along with real-world, behind-the-scenes stories of where these techniques were used.
Sure, every data science book & course makes reference to the Netflix prize. But here an entire chapter is devoted to the story, describing some of the intrigue of teams vs lone wolves, mergers, and the down-to-the-last-minute race to submit the winning entry.
No – you won’t be a data scientist after reading this book. But you may want to become one or, like me, be reminded why you’re going down this path.
Statistics II for Dummies
It covered more advanced topics like Linear Regression and statistical techniques like ANOVA and chi-square tests in a very clear manner, without involving a lot of heavy math.
I was reviewing my notes from the Coursera Regression Models course and this book helped fill the practical gaps in the material. It’s like I had a good buddy who would pull me aside and say, “Hey man, here’s what this really means”.
With so many statistical tools and methods to choose from, it can get confusing to know exactly where and when, let alone HOW, to use them. This book did a particularly great job at that. This will be sure to become a dog-eared tome in my personal library.
Python for Informatics
When I first started this journey back in June, I truly didn’t know what I didn’t know. So I attempted to know everything. That meant trying to learn machine learning at the same time as statistics at the same time as R at the same time as Python at the same time as Big Data at the same time as…
Soon, it became clear that you could only focus on so many things at once, so even though I had successfully completed University of Michigan’s Programming for Everybody (Python), I ended up putting my Python learning on hold in favor of R, mainly because that was the primary tool being used in the Johns Hopkins University Data Science Specialization in all of its 9 courses.
So in January, I felt ready to take the plunge back into Python and worked through all of the pages of Python for Informatics by Charles Severance. Unlike his distractingly goofy videos, this book is a quick no-nonsense tour of Python basics. If you’re an experienced programmer, you’ll go very quickly through the first few chapters, but then you’ll find very clear & concise explanations of Python-specific elements like dictionaries, lists and tuples. He provides plenty of clear examples (that WORK, unlike Hello! Python) and touches on some advanced concepts like working with databases (using the freely available SQLite db), web scraping with BeautifulSoup and other topics.
The final chapter on data visualization is the only mark against this book. It’s very rushed, far too complex given the material that preceded it and doesn’t do the topic justice. Fortunately, there are many other books on that particular subject and I still find this book useful as a quick reference whenever I sit down to code in Python.
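For anyone who hasn’t met those Python-specific pieces yet, here is a minimal sketch (not taken from the book) that touches dictionaries, lists, tuples and an in-memory SQLite database using only the standard library. The sample text, the function name `top_words` and the table name `Counts` are made up purely for illustration.

```python
import sqlite3


def top_words(text, n=3):
    """Count words with a dictionary; return the n most common as (word, count) tuples."""
    counts = {}  # dictionary: word -> number of occurrences
    for word in text.lower().split():  # split() gives a list of words
        counts[word] = counts.get(word, 0) + 1
    # Sort by count (descending), then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:n]


sample = "the quick brown fox jumps over the lazy dog the fox"  # made-up text
top = top_words(sample)

# Store the tuples in an in-memory SQLite database and read them back.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Counts (word TEXT, total INTEGER)")
conn.executemany("INSERT INTO Counts (word, total) VALUES (?, ?)", top)
conn.commit()
rows = conn.execute(
    "SELECT word, total FROM Counts ORDER BY total DESC, word"
).fetchall()
conn.close()

print(rows)
```

Swapping `":memory:"` for a file name would keep the table on disk between runs.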
Coursera – Practical Machine Learning
Although, I must say that “practical” is not the right word to use here, as I walked away from this class not having practically learned very much.
Sure – a look at the syllabus shows all of the requisite topics are there: training vs testing data, ROC curves, cross-validation, principal component analysis, random forests, etc.
But how much useful information could possibly be conveyed in the 4 hours of lectures? (Yes, that’s 4 hours total over the 4 weeks). Answer: Not much. You get a lot of rushed-through slides, example after example listed but not well explained, and continuous references to the book “An Introduction to Statistical Learning” if you want to learn more (i.e. anything). I did, in fact, refer to that book quite a bit and will write about it with my February summary.
However, I did end up learning quite a bit in the discussion forums from students (past and present) who filled in a lot of the gaps in the material. The most exciting thing I’ve taken away from this course was learning how to put all 8 CPU cores of my laptop to work to make a Random Forest learning algorithm fly. First time I ran it, I waited 30 minutes for it to complete. But again, thanks to fellow students in the forums, I learned a quick code tweak and watched my machine go into hyperdrive, completing the same algorithm in under 4 minutes!
That will be a topic for a blog in the near future, I promise.
R and Data Mining – Examples and Case Studies
I came across this book (available as a free pdf) a while back. Published back in 2013, it’s a no-frills tour-de-force of various Machine Learning algorithms in R with examples. It’s sparse on education and explanation, but if you need to quickly look up how to run a k-Means clustering routine without a lot of fuss, you’ll want to have this book handy in your digital library.
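The book’s examples are in R, but if you’re curious what a k-Means routine actually does under the hood, here is a rough standard-library Python sketch of the basic assign-and-update loop (Lloyd’s algorithm). It is not the book’s code: the function name `kmeans`, the toy points and the choice of the first two points as starting centroids are all assumptions made for illustration.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm on 2-D points; returns (centroids, labels)."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, labels


# Made-up data: two obvious groups; the first two points seed the centroids.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = kmeans(points, centroids=points[:2])

print(labels)
print(centroids)
```

Real libraries add smarter initialization and a convergence check; this version just runs a fixed number of passes to keep the idea visible.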
Communicating Data Science Results
I’ve never been mad at a MOOC before. Sure, some courses are better than others. But this entire Data Science at Scale series, developed by University of Washington on the Coursera platform, is a hands-down ripoff. Do not take it. Do not let your friends take it.
Caught up in the fun of the JHU Data Science Specialization series, I went ahead and pre-paid for this 4-course Specialization. The first two courses were really bad and I’ve written about them before (Data Manipulation at Scale: Systems and Algorithms and Practical Predictive Analytics: Models and Methods).
This third course in the series at least introduces some new professors into the mix, covering soft-core topics like Data Visualization Theory (what colors to use, how to lay out a graph, information conveyance). They were much better produced videos than what Prof. Bill Howe had given in the two previous courses, but they were rather light in substance.
Unfortunately, Prof. Howe makes an appearance again to share some anecdotes about his days sailing the ocean blue on various projects in the context of Cloud-based big data systems. While the topic of Data in the Cloud and the ethics behind data privacy are certainly good topics, you did not walk away with any practical knowledge.
Which brings me to the biggest sin of this course. NO PRACTICAL KNOWLEDGE. The final project required you to create an Amazon Web Services instance (at your own expense), set up a Hadoop instance, run Pig commands to analyze some sort of data and produce some result.
This project had NOTHING to do with the entire course. The instructions given DID NOT MATCH what was on Amazon Web Services. Due dates were incorrect (pointing to dates that were before the course even started). There was NO SUPPORT from the instructor or any Coursera staff member. In essence, I ended up spending an additional $50 fumfering (sp?) around with AWS getting no practical skills out of it whatsoever.
Stay away from this cash-grab.
Statistical Thinking for Data Science and Analytics
I had read an online announcement that Columbia University was jumping into the MOOC ring with their own 3-course offering on Data Science.
I took their first course, Statistical Thinking for Data Science and Analytics, expecting to reinforce some topics I was already familiar with and gain some additional insights.
The course covered topics on statistics, data modeling (i.e. regression & clustering), visualization theory, practical applications in the Health industry and a week on Bayes modeling. Initially, I was impressed with the professional presentation of the introductory videos and initial lectures.
Soon thereafter, I began to count my blessings that I did not send my children to Columbia University. While the instructors may have been very knowledgeable and accomplished researchers, their lecturing skills leave a lot to be desired. From thick accents that made it difficult to follow to physical tics that distracted from the lesson, I found this entire series a struggle to watch.
The lessons seemed mashed together without an overall pedagogy in mind. In addition, the practice problems at the end of various units were often trivial with the harder questions being hard only because of very poor wording.
The only bright spot in this 5-week course came at the end of Week 2, where Prof. David Madigan gave a number of talks about the utilization (and misutilization) of data science in the Health industry, focusing on observational data and experiments.
Otherwise, take a pass. I’ll write up my review of the second course in this series (Machine Learning for Data Science and Analytics) in my February review.
Onward to February
- Statistics Done Wrong (Reinhart)
- Stanford University’s MOOC – Statistical Learning
- Coursera – Developing Data Products
- Learning Python – 5th edition
- University of Washington’s Machine Learning Foundations
- Datacamp’s Python for Data Science
- Columbia’s Machine Learning for Data Science and Analytics