April was a light month for me. It marked the end of my 19-year career with my previous employer as well as the beginning of preparation to sell my house in NY (and an eventual relocation to South Carolina – and hopefully, a new career in data science).
But my personal story is for another blog, another week. Despite the “real world” distractions, I was able to move forward a little bit in my self-driven DS education.
Coursera – Machine Learning: Regression
This was the second course in the 6-part Machine Learning series developed by the University of Washington. I gave the first class (Machine Learning Foundations) a very positive review last month. Where that class was a hands-on, high-level overview of all the topics covered in the specialization, this course gets very detailed on various Regression topics including:
- Simple and multiple linear regression
- Gradient descent vs closed form solutions to minimizing residual sum of squares
- Feature selection using ridge regression and lasso
- Model selection using validation sets and cross-validation
- Nonparametric approaches like nearest neighbors and kernel regression
This was far from a walk in the park. The lectures, given exclusively by Emily Fox, were some of the clearest I’ve seen on the subject. Her explanation of lasso and how it drives coefficients all the way to zero was just about the best I’ve seen. She provided video simulations, optional math derivations (hint: don’t let them be optional for you), and annotated lecture notes for each week.
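For readers who want to see why lasso zeroes out coefficients while ridge only shrinks them, here is a minimal sketch for a single feature. It rests on several assumptions that are not from the course materials: the feature is normalized so its squared values sum to 1, `rho` is the ordinary least-squares coefficient for that feature, and the penalties are added directly to the residual sum of squares. The function names are made up for illustration.

```python
def lasso_coefficient(rho, l1_penalty):
    """Minimizer of (rho - w)**2 + l1_penalty * abs(w).

    Assumes a single feature normalized to unit length, so rho is the
    unpenalized least-squares coefficient. This is the soft-thresholding rule.
    """
    half = l1_penalty / 2.0
    if rho < -half:
        return rho + half
    if rho > half:
        return rho - half
    return 0.0  # anything inside [-half, half] is set exactly to zero


def ridge_coefficient(rho, l2_penalty):
    """Minimizer of (rho - w)**2 + l2_penalty * w**2 (same assumptions)."""
    return rho / (1.0 + l2_penalty)  # shrinks toward zero but never reaches it


if __name__ == "__main__":
    for rho in [-2.0, -0.3, 0.3, 2.0]:
        print(rho, lasso_coefficient(rho, 1.0), ridge_coefficient(rho, 1.0))
```

Small coefficients land in the flat region of the lasso rule and become exactly zero, which is what makes lasso usable for feature selection. Ridge divides by a constant, so a nonzero coefficient stays nonzero.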
I thought the quiz questions were non-trivial but very relevant to the material covered. The self-guided programming assignments were delivered as IPython Notebooks and required you to develop your own functions that implemented the algorithms discussed in the lectures. While challenging, they were not impossible to create.
The only drawback to this otherwise perfect course was its default use of GraphLab Create, a commercial ML library that’s an alternative to scikit-learn and pandas (you get to use it for free for up to 1 year). While you are free to use standard, open libraries for the programming assignments, if you’re not familiar with them (and I wasn’t), you’re somewhat forced to use GraphLab Create. It didn’t detract from what I learned about the ML techniques, but in hindsight I wish I could’ve gained experience with the standard Python libraries in the process.
I’ve moved on to the third course in this series, Machine Learning: Classification, and now with a little more Python confidence under my belt, I’m going to attempt to do the assignments using standard Python libraries.
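To give a flavor of the "gradient descent vs closed form" topic from the list above, here is a small sketch for simple linear regression using only the Python standard library. It is not taken from the course assignments. The data, step size, and iteration count are made-up values chosen so that gradient descent converges, and the gradient is averaged over the data points, which does not change where the minimum is.

```python
def closed_form(xs, ys):
    """Solve for intercept and slope directly from the normal equations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def gradient_descent(xs, ys, step=0.01, iters=20000):
    """Minimize the (averaged) residual sum of squares by repeated small steps."""
    n = len(xs)
    intercept, slope = 0.0, 0.0
    for _ in range(iters):
        errors = [intercept + slope * x - y for x, y in zip(xs, ys)]
        grad_intercept = 2 * sum(errors) / n
        grad_slope = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        intercept -= step * grad_intercept
        slope -= step * grad_slope
    return intercept, slope


# Hypothetical data, roughly y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

exact_fit = closed_form(xs, ys)
gd_fit = gradient_descent(xs, ys)
print("closed form:     ", exact_fit)
print("gradient descent:", gd_fit)
```

Both approaches should agree to several decimal places here. The closed form is a one-shot calculation, while gradient descent trades that for an iterative procedure that still works when a direct solution is too expensive or does not exist.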
Natural Language Processing YouTube videos
If you’ve got a little free time on your hands, you can digest a series of 102 videos on the topic of Natural Language Processing given by Stanford professors Dan Jurafsky & Chris Manning. This very specific offshoot of Machine Learning and statistical techniques is the subject of the JHU Data Science Specialization Capstone Project that I still need to complete. So while I wait for that session to open at the end of May, I spent a little time here familiarizing myself with the topic.
Wow – what a deep and rich subject area this is. The things we take for granted: Google searches, spellcheck, next-word anticipation, Siri question translation… All of these have roots here in NLP, and it was a fascinating trip from basic techniques like word-frequency measurements through n-grams and smoothing techniques. I only made it through video #32, but I do feel a lot better versed in this subject and I’ve bookmarked this YouTube playlist as a go-to source of information on NLP.
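As a small illustration of n-grams and smoothing, here is a sketch of a bigram model with add-one (Laplace) smoothing, one of the simplest smoothing schemes. It is not taken from the videos. The toy sentence and function names are invented for the example, and real systems use far larger corpora and more refined smoothing.

```python
from collections import Counter


def train_bigrams(tokens):
    """Count how often each word appears as a context and each word pair occurs."""
    contexts = Counter(tokens[:-1])          # words that have a following word
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = set(tokens)
    return contexts, bigrams, vocab


def smoothed_prob(prev, word, contexts, bigrams, vocab):
    """P(word | prev) with add-one smoothing, so unseen pairs get a small nonzero probability."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


# Hypothetical toy corpus
tokens = "the cat sat on the mat the cat ran".split()
contexts, bigrams, vocab = train_bigrams(tokens)

p_seen = smoothed_prob("the", "cat", contexts, bigrams, vocab)
p_unseen = smoothed_prob("the", "ran", contexts, bigrams, vocab)
print(p_seen, p_unseen)
```

Without the added one, "the ran" would get probability zero simply because it never appeared in the training text. Smoothing moves a little probability mass from seen pairs to unseen ones, which is the idea behind next-word prediction that does not break on new phrases.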
The Theory That Would Not Die
This was an interesting read I picked up with an Amazon Kindle gift card. I was on the hunt for a good description of how Bayesian inference is performed – and I’m still on that hunt after completing this book.
This is not a mathematics book; it’s a history book – and a fairly interesting one at that. It was a welcome break from dealing with program libraries and mathematical derivations. Instead, you see the evolution of a mathematical idea over the course of centuries and how that idea had some very real-world implications.
You will need to get past the first two chapters before it becomes a page-turner, but the wait is well worth it. You see where Bayesian concepts competed with other ideas (and lost out just as often as they won). You also read how progress is made, as well as impeded, by the personalities of the talents involved. Based on this book, you’d think the great mathematicians were brilliant jerks.
In all, you won’t find any but the lightest of mathematical equations, but you will get an appreciation of why we need data science in the world.
Data Science from Scratch: First Principles with Python
Data Science from Scratch was one of the first books I picked up and reviewed back in August, 2015. However, being so new to the subject and having had zero exposure to Python, I had to shelve this book for a later date.
Well, April 2016 was that later date and while I no longer felt the subject matter or choice of programming language to be an impediment, I still abandoned this book halfway through it.
For me, it failed on two levels:
- Expanding my knowledge of Python
- Expanding my knowledge of Data Science
Given that those two items are contained in the title of the book, that’s a serious issue.
From a Python perspective, I understand that this wasn’t a Python book. I’ve got other books for that. But when providing code examples for data analysis concepts, the author chose to implement methods from scratch and ignore any well-established libraries that are used in the real world (like scikit-learn, numpy or pandas). Yes, “from scratch” is also in the title, and on that point the book held true. But take for example the chapter on Statistics: instead of using the well-established functions for the common statistical calculations, you create your own methods to calculate mean, variance, correlation, etc. You build new methods on top of old methods and before you know it, you’re creating your own library of functions. And only in the “further references” section at the end of the chapter do you see a suggestion to go look up numpy documentation.
The chapters were very short, and as such they could not convey the data science concepts very well. And with half of each chapter spent on developing your own code for algorithms already established elsewhere, there is not a lot of room to properly discuss topics like Gradient Descent, Bayes’ Theorem or Hypothesis Testing. You are better served on each of these topics elsewhere.
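To make the contrast concrete, here is a sketch of the two styles side by side. The hand-written functions are my own illustration of the "from scratch" approach, not code copied from the book, and the library side uses Python's built-in `statistics` module rather than numpy so the example runs without installing anything. The data values are made up.

```python
import statistics


def mean(xs):
    """Arithmetic mean, written by hand."""
    return sum(xs) / len(xs)


def variance(xs):
    """Sample variance (divides by n - 1), built on top of mean()."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


# Hypothetical data
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

scratch_mean, scratch_var = mean(data), variance(data)
library_mean, library_var = statistics.mean(data), statistics.variance(data)

print(scratch_mean, library_mean)
print(scratch_var, library_var)
```

The two approaches give the same numbers. Writing the functions yourself shows what is being computed, while the library calls are what you would normally reach for in day-to-day work.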
Overall, I did not see a lot of value in what this book offers.
Looking ahead to May
My primary goal for May is to put my house up for sale. There’s a lot of work to be done in order to make it sell-worthy, but somewhere I hope to be working on the following tracks: