Accomplishments – Jan 2016

I was just about to sit down and begin work on a Machine Learning project but then I realized, with today being Valentine’s Day and all, I need to write about my most recent love (my wife is still my first) – Data Science.

I’m only two weeks overdue on my January summary – but with this, I’m caught up and vow to write more than just my monthly summaries. Continuing last week’s theme, here are the Good, the OK and the Ugly…

The Good

Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die

This book was INCREDIBLE! It reads like Freakonomics for Data Science.

I found myself rather bogged down in the details of machine learning algorithms, statistical theory and programming techniques. This book, meant for the non-technical person, helped me once again see the forest for the trees.

Each chapter was an intriguing page-turner, describing various data science concepts at a very high level along with real-world, behind-the-scenes stories of where these techniques were used.

Sure, every data science book & course makes reference to the Netflix Prize. But here an entire chapter is devoted to that story, describing the intrigue of teams vs. lone wolves, mergers, and the down-to-the-last-minute race to submit the winning entry.

No – you won’t be a data scientist after reading this book. But you may want to become one or, like me, be reminded why you’re going down this path.

Statistics II for Dummies

I found Statistics for Dummies to be a worthwhile read. The followup, cleverly named Statistics II for Dummies, is a worthy successor.

It covered more advanced topics like Linear Regression and statistical techniques like ANOVA and chi-square tests in a very clear manner, without involving a lot of heavy math.

I was reviewing my notes from the Coursera Regression Models course and this book helped fill the practical gaps in the material. It’s like I had a good buddy who would pull me aside and say, “Hey man, here’s what this really means”.

With so many statistical tools and methods to choose from, it can get confusing to know exactly where and when, let alone HOW, to use them. This book did a particularly great job at that. It’s sure to become one dog-eared tome in my personal library.

Python for Informatics

When I first started this journey back in June, I truly didn’t know what I didn’t know. So I attempted to know everything. That meant trying to learn machine learning at the same time as statistics at the same time as R at the same time as Python at the same time as Big Data at the same time as…

Soon, it became clear that you can only focus on so many things at once. So even though I had successfully completed University of Michigan’s Programming for Everybody (Python), I ended up putting my Python learning on hold in favor of R, mainly because R is the primary tool used across all 9 courses of the Johns Hopkins University Data Science Specialization.

So in January, I felt ready to take the plunge back into Python and worked through all of the pages of Python for Informatics by Charles Severance. Unlike his distractingly goofy videos, this book is a quick, no-nonsense tour of Python basics. If you’re an experienced programmer, you’ll move very quickly through the first few chapters, but then you’ll find very clear & concise explanations of Python-specific elements like dictionaries, lists and tuples. He provides plenty of clear examples (that WORK, unlike Hello! Python) and touches on some advanced concepts like working with databases (using the freely available SQLite db), web scraping with BeautifulSoup and other topics.

The final chapter on data visualization is the only mark against this book. It’s very rushed, far too complex given the material that preceded it and doesn’t do the topic justice. Fortunately, there are many other books on that particular subject and I still find this book useful as a quick reference whenever I sit down to code in Python.

The OK

Coursera – Practical Machine Learning

This is Course #8 in the Johns Hopkins University Data Science Specialization.

I must say, though, that “practical” is not the right word to use here, as I walked away from this class having learned very little of practical value.

Sure – a look at the syllabus shows all of the requisite topics are there: training vs testing data, ROC curves, cross-validation, principal component analysis, random forests, etc.

But how much useful information could possibly be conveyed in 4 hours of lectures? (Yes, that’s 4 hours total over the 4 weeks.) Answer: Not much. You get a lot of rushed-through slides, example after example listed but not well explained, and continual references to the book “An Introduction to Statistical Learning” if you want to learn more (i.e. anything). I did, in fact, refer to that book quite a bit and will write about it with my February summary.

However, I did end up learning quite a bit in the discussion forums from students (past and present) who filled in a lot of the gaps in the material. The most exciting thing I’ve taken away from this course was learning how to put all 8 CPU cores of my laptop to work to make a Random Forest learning algorithm fly. The first time I ran it, I waited 30 minutes for it to complete. But again, thanks to fellow students in the forums, I learned a quick code tweak and watched my machine go into hyperdrive, completing the same training run in under 4 minutes!
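The tweak itself was in R (registering a parallel backend so caret could farm trees out to all cores). The reason it works so well is that every tree in a random forest is trained independently, so they map cleanly onto a pool of workers. Here’s a minimal sketch of the same idea in Python – the `fit_one_tree` stand-in and the forest size are purely illustrative, not the course’s actual code:

```python
import os
from concurrent.futures import ProcessPoolExecutor
from random import Random

def fit_one_tree(seed):
    # Stand-in for fitting a single tree on a bootstrap sample;
    # a real version would call into your ML library of choice.
    rng = Random(seed)
    return sum(rng.random() for _ in range(1000)) / 1000

def fit_forest(n_trees, max_workers=None):
    # The trees are independent, so they spread across a pool of
    # worker processes -- one per CPU core by default.
    max_workers = max_workers or os.cpu_count()
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fit_one_tree, range(n_trees)))
```

With 8 cores chewing on trees at once instead of 1, a 30-minute run collapsing to a few minutes is exactly what you’d expect.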

That will be a topic for a blog in the near future, I promise.

R and Data Mining – Examples and Case Studies

I came across this book (available as a free pdf) a while back. Published back in 2013, it’s a no-frills tour-de-force of various Machine Learning algorithms in R with examples. It’s sparse on education and explanation, but if you need to quickly look up how to run a k-Means clustering routine without a lot of fuss, you’ll want to have this book handy in your digital library.
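The book’s recipes are all R one-liners (kmeans() and friends), but it’s worth remembering what such a call is actually doing. Here’s a minimal sketch of Lloyd’s k-means algorithm on 1-D data – my own illustration, not the book’s code:

```python
from statistics import mean

def kmeans_1d(points, k, iters=20):
    # Lloyd's algorithm on 1-D data (k >= 2): repeatedly assign each
    # point to its nearest centroid, then move each centroid to the
    # mean of the points assigned to it.
    pts = sorted(points)
    # Seed centroids at evenly spaced positions in the sorted data.
    centroids = [pts[i * (len(pts) - 1) // (k - 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters
```

Run on two obvious clumps like [1, 2, 3, 10, 11, 12] with k=2, the centroids settle at the two cluster means – the same result the R one-liner hands you.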


The Ugly

Communicating Data Science Results

I’ve never been mad at a MOOC before. Sure, some courses are better than others. But this entire Data Science at Scale series, developed by University of Washington on the Coursera platform, is a hands-down ripoff. Do not take it. Do not let your friends take it.

Caught up in the fun of the JHU Data Science Specialization series, I went ahead and pre-paid for this 4-course Specialization. The first two courses were really bad, and I’ve written about them before (Data Manipulation at Scale: Systems and Algorithms and Practical Predictive Analytics: Models and Methods).

This third course in the series at least introduces some new professors into the mix, covering soft-core topics like data visualization theory (what colors to use, how to lay out a graph, information conveyance). The videos were much better produced than what Prof. Bill Howe had given in the two previous courses, but they were rather light on substance.

Unfortunately, Prof. Howe makes an appearance again to share some anecdotes about his days sailing the ocean blue on various projects, in the context of Cloud-based big data systems. While Data in the Cloud and the ethics behind data privacy are certainly good topics, you do not walk away with any practical knowledge.

Which brings me to the biggest sin of this course: NO PRACTICAL KNOWLEDGE. The final project required you to create an Amazon Web Services instance (at your own expense), set up a Hadoop instance, run Pig commands to analyze some sort of data and produce some result.

This project had NOTHING to do with the entire course. The instructions given DID NOT MATCH what was on Amazon Web Services. Due dates were incorrect (pointing to dates that were before the course even started). There was NO SUPPORT from the instructor or any Coursera staff member. In essence, I ended up spending an additional $50 fumfering around with AWS, getting no practical skills out of it whatsoever.

Stay away from this cash-grab.

Statistical Thinking for Data Science and Analytics

I had read an online announcement that Columbia University was jumping into the MOOC ring with their own 3-course offering on Data Science.
I took their first course, Statistical Thinking for Data Science and Analytics, expecting to reinforce some topics I was already familiar with and gain some additional insights.

The course covered topics in statistics, data modeling (i.e. regression & clustering), visualization theory, practical applications in the health industry, and a week on Bayes modeling. Initially, I was impressed with the professional presentation of the introductory videos and initial lectures.

Soon thereafter, I began to count my blessings that I did not send my children to Columbia University. While the instructors may be very knowledgeable and accomplished researchers, their lecturing skills leave a lot to be desired. From thick accents that made it difficult to follow along to physical tics that distracted from the lesson, I found this entire series a struggle to watch.

The lessons seemed mashed together without an overall pedagogy in mind. In addition, the practice problems at the end of various units were often trivial, with the harder questions being hard only because of very poor wording.

The only bright spot in this 5-week course came at the end of Week 2, where Prof. David Madigan gave a number of talks about the utilization (and misutilization) of data science in the health industry, focusing on observational data and experiments.

Otherwise, take a pass. I’ll write up my review of the second course in this series (Machine Learning for Data Science and Analytics) in my February review.

Onward to February

 

Accomplishments – Dec 2015

Well here it is, Feb 2016, and I’m just sitting down to jot down how I moved forward in my journey during December. To be fair, it was a busy month: I gave notice at my place of employment for the past 15 years, began the process of moving out of two different houses simultaneously (in NY and SC), and started truly working to control my destiny for the remainder of my professional career! But more on that later.

So what happened in December (month 6 of this endeavor)?

I’ll start with the good items:

Kaggle & the Titanic Data Tutorial

Soon after you start dipping your toes into Data Science, you learn about Kaggle, the online Data Science competition site. Companies post their data sets for you to chew on; you develop a predictive model and then apply it to a secret test data set to see how accurate your model is. It’s not a place for true beginners – there are some advanced people competing for very real prizes out there. But it is a place I intend to visit with increasing frequency as I start to migrate away from courses on theory and want to see how people are actually doing it out there. It’s a very cooperative group, and the various solutions people have come up with are often shared once a competition is over.

I did, however, work through a very useful “how to play in the Kaggle space” tutorial that utilized a data set of passengers on the Titanic. The goal was to create a predictive model that could classify a passenger (survive or die) based on a number of known characteristics. This offering by DataCamp was one of the most clear, straightforward, hands-on, practical tutorials I’ve ever come across. I can’t recommend it enough. In addition to walking you through the creation of a couple of models, it also guides you through the mechanics of working in Kaggle: grabbing a training data set, applying your model to the test data and uploading your predictions to get a score. It was the most fun I’ve had so far.
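To give a flavor of how simple a first Kaggle submission can be: a classic opening baseline on the Titanic data is a one-rule model keyed off the passenger’s sex. Here’s an illustrative Python sketch (the toy records and labels below are invented for the example; the real data set has many more columns):

```python
def predict_survival(passenger):
    # A toy one-rule baseline: predict that female passengers
    # survived (1) and male passengers did not (0).
    return 1 if passenger["Sex"] == "female" else 0

def accuracy(passengers, labels):
    # Fraction of passengers whose predicted outcome matches the
    # known outcome -- the score Kaggle computes on its hidden test set.
    predictions = [predict_survival(p) for p in passengers]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)
```

From there, the tutorial’s decision-tree models are refinements of exactly this loop: predict, score, improve, resubmit.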

DataSmart

I reviewed this book back in August, but at the time I had just done a readthrough of the material. I found it so good that I actually went back and manually worked through all of the chapters, which painstakingly detail how to perform the techniques in Excel.

I gained more understanding of various Data Science techniques and algorithms (like k-nearest-neighbors, ROC curves, time series forecasting, etc.) than I had from any other book or course I had taken up to this point.

The punchline is delivered in the last chapter, where you’re introduced to R and find that the 10 hours you spent back in Chapter 7 constructing an ensemble model in Excel, utilizing bagging and boosting techniques, could be done in 10 minutes with a few lines of R code and the proper packages imported.
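The bagging half of that chapter condenses to a remarkably small idea: resample the training data with replacement, fit a weak model to each resample, and let the ensemble vote. Here’s a minimal sketch in Python – the “decision stump” base learner and the toy 1-D data are my own illustration, not the book’s spreadsheet model:

```python
from random import Random

def train_stump(xs, ys):
    # Toy base learner: a threshold halfway between the class means.
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    if not pos or not neg:
        t = sum(xs) / len(xs)  # degenerate resample: fall back to overall mean
    else:
        t = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > t else 0

def bagged_model(xs, ys, n_models=25, seed=0):
    # Bagging: train each model on a bootstrap resample of the data,
    # then predict by majority vote across the ensemble.
    rng = Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(xs)) for _ in range(len(xs))]
        models.append(train_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: 1 if sum(m(x) for m in models) > n_models / 2 else 0
```

Keying those resamples and votes into hundreds of Excel cells by hand is exactly why the R (or Python) version feels like magic afterward.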


But the experience of getting dirty with the algorithms, keying in formulas and applying them to hundreds of cells, and seeing the impact on intermediate and final results… Well, that was just priceless. I don’t want the R libraries to be magical black boxes for me. And without diving into advanced mathematics, DataSmart gave me a very good understanding of what’s happening behind the scenes.

Statistics for Dummies

I do hold a Bachelor’s degree in Mathematics, but that was over 20 years ago and my statistics had gotten a bit rusty.

I’m not ashamed to have purchased Statistics for Dummies. I actually enjoyed the read and made myself a cheat sheet of various key equations used in Probability and Statistics.

While not quite as good as Comprehending Behavioral Statistics (which I reviewed back in August), it was a solid enough book that I’ve gone back to it a couple of times to remind myself how to determine z-values, t-values and p-values. Speaking of p-values, this book had a really good explanation of Hypothesis Testing, which can be a bit confusing to the beginner.
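The mechanics behind those values are compact enough to sketch in a few lines of Python: standardize the observed result into a z-score, then convert it to a two-sided p-value with the standard normal CDF. The numbers in the usage note below are invented for illustration, not from the book:

```python
from math import erf, sqrt

def normal_cdf(z):
    # Standard normal CDF, computed via the stdlib error function.
    return (1 + erf(z / sqrt(2))) / 2

def one_sample_z_test(sample_mean, mu0, sigma, n):
    # H0: the true mean is mu0. Standardize the observed sample mean,
    # then ask how likely a result at least this extreme would be.
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    p_two_sided = 2 * (1 - normal_cdf(abs(z)))
    return z, p_two_sided
```

For example, a sample mean of 103 against a hypothesized mean of 100 (sigma 10, n = 36) gives z = 1.8 and a two-sided p-value of roughly 0.07 – not enough to reject at the usual 0.05 level.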

The first half of the book is extremely basic (mean, median, histograms, etc.). But by Chapter 8, when you start looking at Binomial & Normal distributions, the Central Limit Theorem and correlation, you’ll find the value in this book.

There are a couple of throwaway chapters at the end about polling and experiments, but otherwise I can recommend this book.


Now for a couple of clunkers


Hello! Python

So far, I’ve spent most of my time in R. But I do realize that I don’t want to be a one-trick data science pony, so to round out my skillset I decided to begin to learn Python a bit more in December.

At my local Barnes & Noble, I picked up Hello! Python by Anthony Briggs. Briefly flipping through the book, I saw small cartoons on every other page (fun!), hands-on activities that get you coding right away (practical!), and coverage of a wide variety of subjects from basic Python to the Django web app framework to object-oriented programming (diverse!).

I wasn’t too concerned that it was published in 2012. It used Python 2.7 as the base, but as this was to be an introduction to the language, I thought it would be just fine as the core language shouldn’t have changed and I could pick up 3.x nuances later.

This book isn’t an introduction to programming. But I’m an experienced Java programmer, and even I found many sections to be baffling. Concepts were introduced out of nowhere without explanation. For example, on pg. 222:

Use **kwargs, avoid using plain arguments, and always pass all the arguments you receive to any parent methods

Um – what are “**kwargs”? This was the first mention of them, and no explanation was provided. Not even an acknowledgement that this is a new concept whose definition was deferred to a later chapter. Just dropped on you out of nowhere. Unfortunately, this wasn’t the only example.
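For the record (written here in modern Python 3 syntax rather than the book’s Python 2.7): **kwargs collects any keyword arguments not named explicitly in a function’s signature into a plain dict, which is what makes the pass-everything-to-the-parent pattern from that pg. 222 quote work. The class names below are my own toy illustration:

```python
class Base:
    def __init__(self, name, **kwargs):
        # **kwargs scoops up any keyword arguments that were not
        # named explicitly in the signature, as a plain dict.
        self.name = name
        self.extras = kwargs

class Child(Base):
    def __init__(self, name, colour="red", **kwargs):
        # Handle what we know, forward the rest to the parent --
        # the pattern the quoted advice is describing.
        self.colour = colour
        super().__init__(name, **kwargs)
```

One sentence and a five-line example – that’s all it would have taken.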

In a chapter about iterators and generators (powerful Python concepts), he dropped a meager two-page discussion of regular expressions so that he could demonstrate their application to Apache logs. Sorry – but an anemic discussion of regex isn’t going to help you interpret the dense log-parsing pattern he presents.

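To give a sense of what such a pattern involves, here’s an illustrative regex (my own, not the book’s) for the Apache common log format – exactly the kind of thing two pages of background won’t prepare you to read:

```python
import re

# Named groups pick apart one line of the Apache common log format.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

line = ('127.0.0.1 - - [10/Oct/2015:13:55:36 -0700] '
        '"GET /index.html HTTP/1.0" 200 2326')
fields = LOG_PATTERN.match(line)
```

Each `(?P<name>…)` group captures one field (host, timestamp, request, status, size) so the match can be queried by name rather than by position.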

The provided code was riddled with typos throughout the book. I’m flummoxed as to how it made it to print. Several segments would have generated errors had the author actually run them before committing them to the page.

Anyhow, I could go on and on about the difficulties with installing packages (which could not be done using the book’s instructions), the use of live, ever-changing websites to demonstrate code (Yahoo! Finance?) and a general failure to build a strong foundation in Python.

And bless you if you can ever get Django to work in Chapter 8. Every step of the way was a fight and a struggle – progress made only by NOT following the instructions given in the book (for example, when configuring settings.py to find index.tmpl – hint: it’s not in TEMPLATE_DIRS like the book says).

Worst of all, the little cartoons were not in the least bit amusing, nor did they contribute anything but distraction to the subject. Now, I love cartoons as a form of learning. Take a look at Larry Gonick’s series on Statistics or Physics. Classics! Even the “for Dummies” 5th Wave cartoons are on-topic and humorous. Not these self-drawn, self-serving doodles.

Not recommended.

Coursera – Regression Models

This is Course #7 in Johns Hopkins University’s Data Science Specialization.

So far, I’ve been very complimentary of this Coursera Specialization. Unfortunately, we hit a low point with this lesson.

I actually ended up taking this course again in January, even though the first time through I completed everything but the final project (which was, again, due Christmas weekend at the end of Week 3 – the week with the largest amount of material and a difficult quiz).

But, setting aside the Grinch who scheduled this course, it was a highly mathematical, extremely rushed and unfocused review of linear regression modeling.

I’m not afraid of the math (I do have a degree in it after all), but the context was lost in a rush to get through the equations. The big picture was never really there (a picture that didn’t improve all that much from retaking the course in January). Fortunately, I’ve come across additional resources which helped fill in a number of the gaps.

The 4th week covered a large number of topics (GLMs, logistic regression, Poisson regression and “knot” points) and did each of them a disservice. For example, here is an actual notes segment from one of the videos:
[screenshot of lecture notes on knot points]

After this, I still don’t know the difference between “knot points” and “garlic knots”. This course would have been better served by covering fewer topics at a slower pace, or by splitting into two separate courses where these topics could have been given their due. And I was in no shape to follow any of the “prove this to yourself” recommendations.
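For what it’s worth, the piece I later pieced together from other resources (my own sketch, not the course’s notation) is that a knot point is simply a location where the regression line is allowed to change slope, implemented by adding a “hinge” term (x − knot)₊ to the model:

```python
def hinge(x, knot):
    # The spline basis term (x - knot)+ : zero to the left of the
    # knot, rising linearly to the right of it.
    return max(0.0, x - knot)

def piecewise_line(x, b0, b1, b2, knot):
    # y = b0 + b1*x + b2*(x - knot)+ : the fitted line has slope b1
    # before the knot and slope b1 + b2 after it.
    return b0 + b1 * x + b2 * hinge(x, knot)
```

That’s the whole trick: one extra regression term, and ordinary least squares does the rest – which a slower-paced lecture could have said in a single slide.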

One final complaint – the course project involved a very open-ended linear regression analysis of the standard mtcars data set (data on a number of cars from the 1970’s). You were to hit a number of rubric evaluation points – but you had to do it within an artificially established limit of 2 pages of text plus a 3-page appendix of charts and tables. I spent more time playing with fonts & margins to get my analysis within this limit than I did on the actual subject matter at hand. I understand the desire to be concise, but this was the number one complaint on the course discussion boards – everyone was struggling with an imposition that did nothing to enhance the learning experience.

Other Endeavors Initiated

The Christmas & New Year’s season is NOT the best time to be taking courses with deadlines – especially when those deadlines include projects and quizzes due Christmas weekend (don’t those people at Coursera have any Holiday Spirit?).

Nonetheless, I started on the following programs and will have more to write about them with my January update: