Accomplishments – Mar 2016

Here I am in my 10th month of a self-paced, self-designed data science curriculum. The gears in my mind are turning, and this expansive subject area, with its diverse and overlapping topics, is starting to gel into something cohesive.

The mental a-ha! light bulbs are appearing more frequently, and the head-scratching WTFs are becoming scarcer.

I still have a long way to go, but the more I wander around this random forest (see what I did there?) the clearer the map is becoming.

Exciting news! I signed up for the Open Data Science Conference in Boston at the end of May. It'll be my first one and I have no idea what to expect. If you're planning to go as well, drop me a note and we'll connect!

So here’s what transpired last month:

The Good

Stanford’s Statistical Learning

I wrote about Stanford's Statistical Learning course last month, so you can read my review about it there. It was a multi-month course that kicked off in January and wrapped up in March. By far, it has been one of my favorite learning experiences. This course was a perfect mix of challenge, insight, practical tips and relevant quiz questions.

I can't recommend this course highly enough. If it's offered again, I suggest you take it.

An Introduction to Statistical Learning with Applications in R

This is the textbook that accompanied the above course. It's a college/graduate-level work that's freely downloadable in PDF format.

Each chapter follows a similar format: theory, R lab exercises that apply the theory, and end-of-chapter exercises that mix conceptual questions with problems to be solved using your newfound R skills.

I enjoyed the course and videos so much that I'm reliving the experience by going slowly through the text again, refreshing familiar concepts and lingering on the more difficult ones. Without the deadlines, I'm able to take my time with the R labs and am learning more about this tool than with any other course I've taken so far. I try to work through all the exercises on my own, but if I do get stuck, Amir Sadoughi has published his unofficial solution set to the book's exercises. Thanks, Amir!

This book covers the following topics:

  • Statistical Learning
  • Linear Regression
  • Classification
  • Resampling Methods
  • Linear Model Selection and Regularization
  • Non-linear Modeling
  • Tree-Based Methods
  • Support Vector Machines
  • Unsupervised Learning

Coursera – Machine Learning Foundations: A Case Study Approach

It's a good thing Coursera offered the first week of this six-week Machine Learning Foundations course as a free preview. Otherwise, I might have passed on this new offering from faculty at the University of Washington (see my previous bad experience with the Data Science at Scale Specialization).

Right away, I could see this was an entirely different beast. Professionally produced and well organized, this first course of a 10-course series provides a great mixture of theory and hands-on exercises, using IPython Notebook and the GraphLab Create machine learning framework to explore machine learning concepts from simple to advanced.

This first course was an overview of the entire specialization, touching briefly on linear regression, classification, recommendation engines and deep learning for image recognition. Subsequent courses promise to attack these topics in more detail.

I suggest you take an introductory course in Python before diving into this course, though. The DataCamp offerings (see below) would be a great place to start. While the Python used in this course isn't advanced (and the instructor does a good job of explaining what he's typing), the focus is machine learning, not Python, so you should already have a good foundation in variables, lists, dictionaries, loops, functions and other fundamentals.
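For a rough sense of the baseline I have in mind, here's a tiny sketch of my own (an illustration, not course material) that touches each of those fundamentals:

```python
# A quick self-check: if all of this reads naturally, you know enough
# Python to focus on the machine learning content instead of the syntax.

def average_rating(ratings):
    """Return the mean of a list of numeric ratings."""
    return sum(ratings) / len(ratings)

# Variables and a dictionary mapping product names to lists of ratings
reviews = {
    "laptop": [4, 5, 3, 4],
    "headphones": [5, 5, 4],
}

# Loop over the dictionary and call the function on each list
for product, ratings in reviews.items():
    print(product, round(average_rating(ratings), 2))
```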

You’ll also want to give yourself a little time to get the special GraphLab Create Python package installed. I’m relatively new to the world of Python and it took me a while to get everything installed properly (hint: don’t install into “Program Files” on a Windows machine – folders with spaces cause bad things).

But by the end of this course, I felt that covering some familiar topics from a different angle (Python vs. R) was very productive. The hands-on image recognition coding was also really cool. I enjoyed every minute of this course and eagerly await the remaining nine!

The OK

DataCamp – Intro to Python for Data Science & Intermediate Python for Data Science


If you’re an absolute beginner in Python, you could benefit from the DataCamp series. The first course, Intro to Python for Data Science, assumes you’ve never programmed in another language in your life, slowly introducing you to variables and program logic.

You don’t need to have Python installed – the nice thing about the DataCamp series is that a Python IDE is built into the web page, so you can start practicing the code right away.

The mini-tutorials are very brief, and the first course could be completed in one (multi-hour) sitting. But if you have experience programming in another language (as I do with Java and R), you might find the slow pace maddening if you were just hoping to see what's different about Python. It does cover Python-specific aspects like lists, and a whole section is devoted to the NumPy package, but it felt overly simple at times. If you are new to programming, though, this would be the perfect place to start.
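To give a flavor of what that NumPy section is about, here's a sketch in the spirit of its exercises (my own illustration, not DataCamp's code): plain lists don't do element-wise math, but NumPy arrays do.

```python
import numpy as np

# Plain Python lists: something like heights ** 2 isn't defined
heights = [1.73, 1.68, 1.71, 1.89]   # meters
weights = [65.4, 59.2, 63.6, 88.4]   # kilograms

# NumPy arrays make arithmetic element-wise
np_heights = np.array(heights)
np_weights = np.array(weights)

bmi = np_weights / np_heights ** 2   # divide and square, element by element
print(bmi)
print(bmi[bmi > 21])                 # boolean indexing to filter values
```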

The second course, Intermediate Python for Data Science, offers only its first section for free, with the remaining sections unlocked by a monthly subscription. The free section covers the graphing package Matplotlib and does a decent job of walking through the creation of scatterplots, lines and labels. Subsequent sections cover dictionaries, the Pandas package, control logic and loops.
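The Matplotlib material boils down to calls like these (a minimal sketch with made-up numbers, not the course's exact plots):

```python
import matplotlib.pyplot as plt

# Made-up data purely for illustration
years = [2012, 2013, 2014, 2015, 2016]
users = [1.0, 1.2, 1.6, 2.1, 2.8]         # millions of users (hypothetical)

plt.scatter(years, users)                 # the scatterplot
plt.plot(years, users, linestyle="--")    # a line through the same points
plt.xlabel("Year")                        # axis labels and a title
plt.ylabel("Users (millions)")
plt.title("Hypothetical user growth")
plt.show()
```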

qwikLABS Amazon Web Services Tutorials

I actively follow a number of data scientists on Twitter. One of them shared that qwikLABS, in honor of Amazon Web Services' 10th anniversary, was offering its full collection of AWS tutorials for free throughout the month of March.

Well, free was just the right price for me. I had a little bit of experience with AWS, having installed an SAP HANA One instance on that platform, but I was curious to learn about the dozens of different platforms and offerings available.

The most attractive thing about the qwikLABS offering was the ready-built AWS instances that you could launch without needing your own account or paying out-of-pocket fees for time spent learning (the Data Science at Scale course set me back an additional $50 on AWS while I struggled with outdated instructions).

The downside was the labs themselves. They were very clearly written, step-by-step guides that took you through a variety of AWS services (Creating Amazon EC2 Instances, Amazon Virtual Private Clouds, Amazon Elastic MapReduce, Google Ngrams with Amazon EMR). However, they were also… so very bland. They read like the chemistry labs you had to trudge through in high school. And while you're busy clicking away, entering this and that, you lose the big picture of what you're doing and why you're doing it. The labs feel like something you just want to get to the end of rather than something that leaves you with a solid foundation of competency in AWS services.

Another thing that's lacking is any kind of discussion forum. Have a question or encounter a problem? Tough. I don't know what I would have done without the vibrant communities of students on other platforms, helping each other over hurdles and offering advice for further learning. I was 40 minutes into one of the labs when I encountered an error. Aside from redoing the lab (oh – did I mention the labs have a time limit, after which the AWS instance you're working on shuts down and you lose the work you've done up to that point?), you are on your own.

The labs are also copy-protected – you can't download, print or even screenshot them (can you see the watermark in the image above?) – so you'll either have to type up your own notes if you want a lasting guide or turn to OCR software (Capture2Text is my favorite), but that's a pain. Believe me – these labs aren't so precious that they'd spread like weeds on BitTorrent.

Still, using PuTTY to connect to a Linux instance on AWS and navigating a number of those services was a worthwhile experience.

The Ugly

Learning Bayesian Models with R


Confession: the whole Bayesian statistics thing confuses me. In any text or tutorial, I'm right there up to the point where Bayes' Theorem is introduced, and I'm fine with that. It's a neat trick, and I can see how it's used in the cancer-test examples that often accompany these lessons. And then blammo – a deep dive off the cliffs of advanced mathematics, and I've lost the thread completely.
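For what it's worth, the part I do follow, the screening-test example, fits in a few lines of arithmetic. The numbers below are made up purely for illustration:

```python
# Bayes' Theorem: P(disease | positive) =
#   P(positive | disease) * P(disease) / P(positive)

p_disease = 0.01             # prior: 1% of people have the disease (made up)
p_pos_given_disease = 0.90   # test sensitivity (made up)
p_pos_given_healthy = 0.05   # false-positive rate (made up)

# Total probability of a positive test, over sick and healthy people
p_positive = (p_pos_given_disease * p_disease
              + p_pos_given_healthy * (1 - p_disease))

# Posterior: chance you actually have the disease after a positive test
p_disease_given_pos = p_pos_given_disease * p_disease / p_positive
print(round(p_disease_given_pos, 3))   # ~0.154 -- lower than intuition suggests
```

That much I can follow; it's everything after this step that sends me off the cliffs.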

I picked up this digital book from Packt Publishing. The title's promise to use R to help understand the theory was attractive.

The first few pages were good, but it doesn't take long (page 13) to see what this book expects of its readers.

(Screenshot: a sample problem from page 13 of the book.)

Make no mistake – this is a math book. Yes, there are R discussions later (using the arm package), but if you make it to Chapter 5, where they appear, you are made of tougher stuff than I.

There are exercises at the end, but sadly no solutions, tips or discussion boards are available anywhere. If you get stuck on the problem above, you'll have to drive to the closest university and talk with its mathematics department. For now, I'm still in search of a good Bayesian resource.

The Postponed

Coursera & JHU – Capstone Project

This final step of the Johns Hopkins Data Science Specialization promised to be an interesting project. Developed in partnership with SwiftKey, it's a whirlwind of Natural Language Processing and related techniques, culminating in a predictive engine that anticipates the next word given a sequence of preceding words.
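The capstone itself is done in R with packages I haven't touched yet, but the core idea behind that predictive engine (count which word tends to follow which, then suggest the most frequent continuation) can be sketched in a few lines. Here's a toy illustration in Python with a made-up corpus, not the project's actual approach:

```python
from collections import defaultdict, Counter

# Toy corpus standing in for the SwiftKey text data
corpus = "i want to eat pizza i want to sleep i want to eat tacos".split()

# Build bigram counts: for each word, tally the words observed right after it
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def predict_next(word):
    """Return the most frequently observed next word, or None if unseen."""
    counts = following.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("want"))   # -> 'to'
print(predict_next("to"))     # -> 'eat'
```

A real engine would need much more than this (larger n-grams, handling unseen words, cleaning messy text), which is exactly the territory I need to read up on.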

A complaint about this course is that you are left pretty much on your own to learn about Natural Language Processing; the recommended R packages and whitepapers were not covered in any of the previous nine courses. But I believe that's OK. I've come to appreciate that data science is such a vast world that, to be successful, you HAVE to be able to tackle a subject on your own, letting Google and YouTube be your guide. You have to bang your head on new packages and enjoy the struggle.

Sadly, I ran out of time to do this effectively. So I took a step back, registered for the next round of the course in April, and am taking some time to learn about NLP to improve my chances of success (see Looking ahead to April below).

Learning Python

I reviewed this book last month after having read the first 18 chapters. I still have a great impression of it, but I started hitting advanced topics like operator overloading, package development and so on. I don't yet have enough hands-on experience with Python to be that heavily invested in these kinds of topics, so I will set this book aside and return to it in a later month.


Looking ahead to April

  • Dan Jurafsky & Chris Manning: Natural Language Processing – a series of YouTube videos on the subject from Stanford University faculty
  • An Introduction to Statistical Learning with Applications in R – hit Chapters 4–8
  • Udemy – Taming Big Data with Apache Spark
  • Coursera – Machine Learning – 02 – Regression – the second course in the 10-course series (discussed above)
  • Coursera JHU Data Science Specialization Capstone – Second attempt!
  • Data Science from Scratch: First Principles with Python – My second attempt with this book now that I’ve had some Python experience.

It'll be a busy month, but I have a bit more time on my hands nowadays (more on that in a blog post coming soon).