Well, here it is, February 2016, and I’m just sitting down to jot down how I moved forward in my journey during December. To be fair, it was a busy month: I gave notice at the company where I had worked for the past 15 years, began the process of moving out of two different houses simultaneously (in NY and SC), and truly started working to control my destiny for the remainder of my professional career! But more on that later.
So what happened in December (month 6 of this endeavor)?
I’ll start with the good items:
Kaggle & the Titanic Data Tutorial
Soon after you start dipping your toes into Data Science, you learn about Kaggle, the online Data Science competition site. Companies post their data sets for you to chew on; you develop a predictive model and then apply it to a secret test data set to see how accurate your model was. It’s not a place for true beginners – there are some advanced people competing for very real prizes out there. But it is a place I intend to visit with increasing frequency as I start to migrate away from courses on theory and want to see how people are actually doing it out there. It’s a very cooperative group, and the various solutions people have come up with are often shared once the competition is over.
However, I did work through a very useful “how to play in the Kaggle space” tutorial that utilized a data set of passengers on the Titanic. The goal was to create a predictive model that could help determine a passenger classification (survive or die) based on a number of known characteristics. This offering by DataCamp was one of the most clear, straightforward, hands-on, practical tutorials I’ve ever come across. I can’t recommend it enough. In addition to walking you through the creation of a couple of models, it also guides you through the mechanisms of working in Kaggle: grabbing a training data set, applying your model to the test data and uploading your predictions to get a score. It was the most fun I’ve had so far.
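To make that train, predict and upload loop concrete, here is a minimal Python sketch. It is not the DataCamp tutorial's code: the rows are invented toy data, the "model" is just the most common outcome for each sex, and the helper names are hypothetical. The column names (PassengerId, Sex, Survived) and the two-column submission layout are assumed to match what Kaggle expects for this competition.

```python
import csv
import io
from collections import Counter, defaultdict

# Toy stand-ins for the training and test files (invented rows, not real passengers).
TRAIN_CSV = """PassengerId,Sex,Survived
1,male,0
2,female,1
3,female,1
4,male,0
5,male,1
6,female,0
7,female,1
8,male,0
"""
TEST_CSV = """PassengerId,Sex
101,female
102,male
103,male
"""


def fit_majority_by_sex(rows):
    """Learn the most common outcome (0 or 1) for each value of Sex."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["Sex"]][int(row["Survived"])] += 1
    return {sex: c.most_common(1)[0][0] for sex, c in counts.items()}


def predict(model, rows, default=0):
    """Return (PassengerId, predicted Survived) pairs for unlabeled rows."""
    return [(row["PassengerId"], model.get(row["Sex"], default)) for row in rows]


def to_submission_csv(predictions):
    """Render predictions in a PassengerId,Survived layout."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    writer.writerows(predictions)
    return out.getvalue()


train_rows = list(csv.DictReader(io.StringIO(TRAIN_CSV)))
test_rows = list(csv.DictReader(io.StringIO(TEST_CSV)))

model = fit_majority_by_sex(train_rows)
submission = to_submission_csv(predict(model, test_rows))
print(submission)
```

In a real entry the two strings would be the downloaded training and test files, and the printed text would be saved to a file and uploaded to get a score.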
Data Smart
I reviewed this book back in August, but at the time I had only done a read-through of the material. I found it so good that I went back and manually worked through all of the chapters, which painstakingly detail how to perform the techniques in Excel.
I gained more understanding of various Data Science techniques and algorithms (like K-nearest-neighbors, ROC curves, time series forecasting, etc.) than I had from any other book or course I had taken up to this point.
The punchline is delivered in the last chapter where you’re introduced to R and you find that the 10 hours you spent back in Chapter 7 constructing an Ensemble Model in Excel, utilizing bagging and boosting techniques, could be done in 10 minutes through a few lines of R code and the proper packages imported.
But the experience of getting dirty with the algorithms, keying in formulas and applying them to hundreds of cells and seeing the impact on intermediate and final results… Well, that was just priceless. I don’t want the R libraries to be magical black boxes for me. And without diving into advanced mathematics, Data Smart gave me a very good understanding of what’s happening behind the scenes.
Statistics for Dummies
I do hold a Bachelor’s degree in Mathematics, but that was over 20 years ago and my Statistics had gotten a bit rusty.
I’m not ashamed to have purchased Statistics for Dummies. I actually enjoyed the read and made myself a cheat sheet of various key equations used in Probability and Statistics.
While not quite as good as Comprehending Behavioral Statistics (which I reviewed back in August), it was solid enough that I’ve gone back to it a couple of times to remind myself how to determine z-values, t-values and p-values. Speaking of p-values, this book had a really good explanation of Hypothesis Testing, which can be a bit confusing to the beginner.
The first half of the book is extremely basic (mean, median, histograms, etc.). But by Chapter 8, when you start looking at Binomial & Normal distributions, the Central Limit Theorem and correlation, you’ll find the value in this book.
There are a couple of throwaway chapters at the end about polling and experiments, but otherwise I can recommend this book.
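As a reminder of how a z-value turns into a p-value, here is a small Python sketch of a two-sided one-sample z-test. It is not taken from the book. The function name and all of the numbers (a sample mean of 103 against a hypothesized mean of 100, a known standard deviation of 15, and 100 observations) are made up for illustration, and it assumes Python 3.8 or later for statistics.NormalDist.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu0, sigma, n):
    """Return (z, two-sided p-value) for a test of H0: mean == mu0.

    Assumes the population standard deviation sigma is known.
    """
    standard_error = sigma / sqrt(n)
    z = (sample_mean - mu0) / standard_error
    # Probability of a result at least this extreme in either tail.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
z, p = one_sample_z_test(sample_mean=103.0, mu0=100.0, sigma=15.0, n=100)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Here the standard error is 15 / 10 = 1.5, so z = 3 / 1.5 = 2. The resulting p-value falls just under 0.05, so at the usual 5% level the null hypothesis would be rejected.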
Now for a couple of clunkers
So far, I’ve spent most of my time in R. But I do realize that I don’t want to be a one-trick data science pony, so to round out my skillset I decided to begin to learn Python a bit more in December.
At my local Barnes & Noble, I picked up Hello! Python by Anthony Briggs. Briefly flipping through the book, it had small cartoons on every other page (fun!), hands-on activities that get you coding right away (practical!) and it covered a wide variety of subjects from basic Python to the Django web app framework to object oriented programming (diverse!).
I wasn’t too concerned that it was published in 2012. It used Python 2.7 as the base, but as this was to be an introduction to the language, I thought it would be just fine as the core language shouldn’t have changed and I could pick up 3.x nuances later.
This book isn’t an introduction to programming. Even though I’m an experienced Java programmer, I found many sections baffling. Concepts were introduced out of nowhere without explanation. For example, on pg. 222:
Use **kwargs, avoid using plain arguments, and always pass all the arguments you receive to any parent methods
Um – what are “**kwargs”? This was the first mention of that and no explanation was provided. Not even an acknowledgement that it’s a new concept whose definition was deferred to a later chapter. Just dropped on you out of nowhere. Unfortunately, this wasn’t the only example.
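For anyone else left wondering: in a function signature, **kwargs collects any keyword arguments that were not named explicitly into a dictionary. The sketch below shows what the quoted advice seems to be getting at, with each class taking the keywords it knows about and passing the rest up to its parent. The class and function names are hypothetical, and it is written in Python 3 syntax rather than the book's 2.7.

```python
def describe(**kwargs):
    """Any keyword arguments land in a dict called kwargs."""
    return sorted(kwargs.items())


class Base:
    def __init__(self, **kwargs):
        # End of the chain: every keyword should have been consumed by now.
        super().__init__(**kwargs)


class Named(Base):
    def __init__(self, name="anonymous", **kwargs):
        # Take the keyword this class understands, pass the rest along.
        super().__init__(**kwargs)
        self.name = name


class Aged(Base):
    def __init__(self, age=0, **kwargs):
        super().__init__(**kwargs)
        self.age = age


class Person(Named, Aged):
    """Combines both parents; each __init__ picks out its own keyword."""


person = Person(name="Ada", age=36)
print(describe(color="red", size=3))
print(person.name, person.age)
```

Because every __init__ forwards the keywords it does not use, Person can combine Named and Aged without either parent needing to know about the other.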
In a chapter about iterators and generators (powerful Python concepts) he dropped in a meager two-page discussion on regular expressions so that he could demonstrate their application on Apache logs. Sorry – but an anemic discussion on regex isn’t going to help interpret something like:
The provided code was riddled with typos throughout the book. I’m flummoxed as to how it made it to print. Several segments would have generated errors if the author had actually run them before committing them to the page.
Anyhow, I could go on and on about the difficulties with installing packages (which could not be done using the book’s instructions), using live, ever-changing websites to demonstrate code (Yahoo! Finance?) and a general failure to build a strong foundation in Python.
And bless you if you can ever get Django to work in Chapter 8. Every step of the way was a fight & struggle – progress made only by NOT following the instructions given in the book (for example, how to configure settings.py to find index.tmpl; hint: it’s not in TEMPLATE_DIRS like the book says).
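For reference, here is a sketch of how template directories are typically declared in newer Django releases. It assumes Django 1.8 or later, where the TEMPLATES setting took over from TEMPLATE_DIRS. The "templates" folder name is a hypothetical choice, and this is not necessarily the exact fix that got the book's example running.

```python
# settings.py (fragment) - a sketch, assuming Django 1.8 or later
import os

# Project root: two levels up from this settings file.
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

TEMPLATES = [
    {
        "BACKEND": "django.template.backends.django.DjangoTemplates",
        # Folders searched for templates such as index.tmpl
        # ("templates" is a hypothetical folder name).
        "DIRS": [os.path.join(BASE_DIR, "templates")],
        # Also look in each installed app's own templates/ folder.
        "APP_DIRS": True,
    },
]
```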
Worst of all, the little cartoons were not in the least bit amusing nor did they contribute anything but a distraction to the subject. Now I love cartoons as a form of learning. Take a look at Larry Gonick’s series on Statistics or Physics. Classics! Even the “for Dummies” 5th Wave cartoons are on-topic and humorous. Not these self-drawn, self-serving doodles.
Coursera – Regression Models
So far, I’ve been very complimentary of this Coursera Specialization. Unfortunately, we hit a low point with this lesson.
I actually ended up taking this course again in January, even though I had completed everything but the final project (again, due Christmas weekend at the end of Week 3, which also had the largest amount of material and a difficult quiz).
I’m not afraid of the math (I do have a degree in it after all), but the context was lost in a rush to get through the equations. The big picture was never really there (a picture that didn’t improve all that much from retaking the course in January). Fortunately, I’ve come across additional resources which helped fill in a number of the gaps.
The 4th week covered a large number of topics (GLMs, Logistic Regression, Poisson regression and “knot” points) and did each of them a disservice. For example, here is an actual notes segment from one of the videos:
After this, I still don’t know the difference between “knot points” and “garlic knots”. This course would have been better served by covering fewer topics at a slower pace, or by being split into two separate courses where these topics could have been given their due. I don’t think I was up to following any “Prove to yourself” recommendations.
One final complaint – the course project involved a very open-ended linear regression analysis of the standard mtcars data set (data on a number of cars from the 1970s). You were to hit a number of rubric evaluation points – but you had to do it within an artificially established limit of 2 pages of text with a 3-page appendix of charts and tables. I spent more time playing with fonts & margins to get my analysis within this limit than I did on the actual subject matter at hand. I understand the desire to be concise, but this was the number one complaint on the course discussion boards – everyone was struggling with this imposition that did nothing to enhance the learning experience.
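For what it's worth, the basic idea behind a knot point can be shown in a few lines. A knot is an x-value where a fitted line is allowed to change slope, which is done by adding a "hinge" term, max(0, x - knot), to the regression. The Python sketch below is not from the course. The function names are hypothetical and the coefficients are made up rather than fitted to data; it only shows how the extra term bends the line.

```python
def hinge(x, knot):
    """The spline basis term (x - knot)+ : zero before the knot, linear after."""
    return max(0.0, x - knot)


def piecewise_linear(x, intercept, slope, knot, slope_change):
    """A line whose slope changes by slope_change once x passes the knot."""
    return intercept + slope * x + slope_change * hinge(x, knot)


# Made-up coefficients: slope 2.0 up to x = 3, then 2.0 - 1.5 = 0.5 afterwards.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [piecewise_linear(x, intercept=1.0, slope=2.0, knot=3.0, slope_change=-1.5)
      for x in xs]
print(ys)
```

The values rise by 2 per step up to the knot at x = 3 and by 0.5 per step after it. In a real analysis the intercept, slope and slope change would all be estimated by the regression, with the hinge term simply added as one more predictor.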
Other Endeavors Initiated
The Christmas & New Year’s season is NOT the best time to be taking courses with deadlines – especially when those deadlines include projects and quizzes due Christmas weekend (don’t those people at Coursera have any Holiday Spirit?).
Nonetheless, I started on the following programs and will have more to write about them with my January update:
- edX & Columbia University’s Statistical Thinking for Data Science and Analytics
- Book: Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die
- Book: R and Data Mining – Examples and Case Studies
- Book: Python for Informatics
- Book: Statistics II for Dummies