How I Take Online Courses for Data Science (Part 2 – Self-quizzing)

The Struggle

In my last blog, I shared how I take notes while engaged in a MOOC for Data Science. I proceeded happily for months in this fashion, ringing each bell as I completed course after course in various specializations.

Around February of this year, however, a sinking feeling was starting to settle in: I just wasn’t retaining a lot of the information I was learning. Sure, I was scoring 100’s on all the quizzes and completing the assignments on time without any issues, but I felt uneasy.

A month after I worked on the code for a gradient descent algorithm for a lab assignment, you think I had any clue what gradient descent was?

Two months after I learned how to create a support vector machine model in python, you think I recalled what library to import to even start?

Three months after I learned to separate data into training & test groups in R, you think I could remember a single command to do so?

NO – I found myself constantly having to go back to older notes for the most basic commands. I was spending all of my time on StackOverflow looking for solutions to the most basic questions (like how to reverse a Python list, how to come up with 10 random integers in R, etc). If I was to seriously work in the Data Science realm, I knew I needed to have a solid, fundamental level of proficiency with the tools and techniques I was expecting to use.

Enter Mnemosyne

Quite a while ago, I had gotten it into my mind to learn Japanese. My sole motivation: in my academic career the only thing I absolutely sucked at was foreign languages, and it wasn’t for lack of effort. In my 30’s, I wanted to wipe that blemish off my record by tackling one of the hardest languages for native English speakers to pursue.

I ran through all 3 levels of Rosetta Stone. I listened to every minute of Pimsleur’s entire Japanese collection. I had more books & videos than I knew what to do with. The most valuable tool I used, however, was the open-source flashcard system, Mnemosyne.

I confess, I didn’t try different flashcard programs and settle upon this one as the best. But what I did want was a tool to help me identify the concepts I was struggling with and beat me over the head with them until they became second nature.

From their website:

Mnemosyne uses a sophisticated algorithm to schedule the best time for a card to come up for review. Difficult cards that you tend to forget quickly will be scheduled more often, while Mnemosyne won’t waste your time on things you remember well.
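
To make the idea concrete, here is a minimal sketch of how a spaced-repetition scheduler might work. This is not Mnemosyne’s actual algorithm; the `Card` class, the `review` function, the 0 to 5 grading scale and all of the constants are assumptions made up for illustration. It only shows the general principle from the quote: forgotten cards come back soon, well-remembered cards get pushed further out.

```python
from dataclasses import dataclass


@dataclass
class Card:
    """A hypothetical flashcard with simple scheduling state."""
    question: str
    answer: str
    interval_days: int = 1  # days to wait before the next review
    ease: float = 2.5       # how quickly the interval grows (assumed starting value)


def review(card: Card, grade: int) -> Card:
    """Update a card after a review. grade is 0-5; below 3 means 'forgot it'."""
    if not 0 <= grade <= 5:
        raise ValueError("grade must be between 0 and 5")
    if grade < 3:
        # Forgotten: start over tomorrow and slow down future growth.
        card.interval_days = 1
        card.ease = max(1.3, card.ease - 0.2)
    else:
        # Remembered: wait longer next time; better grades grow the ease a little.
        card.interval_days = max(1, round(card.interval_days * card.ease))
        card.ease = min(3.0, card.ease + 0.1 * (grade - 3))
    return card


# Two cards reviewed three times each: one always remembered, one always forgotten.
easy = Card("reverse a Python list", "my_list[::-1]")
hard = Card("what is gradient descent?", "iteratively step against the gradient")
for _ in range(3):
    review(easy, 5)
    review(hard, 1)

print(easy.interval_days, hard.interval_days)
```

After three reviews the easy card is scheduled well into the future while the hard card is still due the next day, which is the behavior the quote describes.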


How I Take Online Courses for Data Science (Part 1- Note-taking)

Since June of last year, I have completed 27 online courses on various Data Science topics (see my roadmap) and am in the midst of 3 others. I thought I’d share some of the techniques I’ve refined during this time and hope you find some nuggets that will improve your own self-paced education experience.

Time planning

Once you decide to start learning about Data Science, you soon discover there are vast amounts of resources available. I learned in my first couple of months of this journey that my appetite for knowledge far exceeded the number of hours in a day (let alone the time I had available). So at the beginning of each month, I set aside some time to plan. I create a spreadsheet grid of the courses, books and other activities I intend to pursue that month and spread the work out day-by-day.

Sometimes a task is just to spend x number of minutes on a topic. For specific courses with deadlines, I can be more precise about which lessons to watch and the estimated time to spend, which keeps me on track with the syllabus. I also block out the days I know I will be traveling (I was commuting 800 miles between NY and SC!). When a specific task is accomplished, I color that cell green.


Yes, this may be more OCD than you’d be willing to sign up for. But in addition to keeping me from getting overloaded, there’s a certain amount of positive reinforcement in watching the green start to fill, giving me that thrill of accomplishment. And it also forces me to be realistic with what can be done that month. I’ve had to pass on courses I would have normally signed up for because I saw it just couldn’t fit into a daily schedule.

Accomplishments – Nov 2015

Note to future self: A month with major holidays in it (i.e. Thanksgiving) is not the time to load up on four online courses simultaneously along with whatever self-learning plans you decide to make. I had to apologize for bringing my laptop to the Turkey Table, but I needed every second to get through this month! Lesson learned and I’m going to ease up a bit in December.

So – here’s how I progressed forward in November:

Coursera – Data Analysis and Statistical Inference

Let’s start with the absolute best online education experience I’ve had so far: the 9-week long soup-to-nuts introduction to Probability and Statistics given in Coursera’s Data Analysis and Statistical Inference course put together by Duke University. Take a look at the Syllabus on their course page to see the list of topics covered.

This is where I’d recommend anyone who is serious about Data Science to begin. You just can’t pursue this without mathematics and this course will give you a very solid foundation.

My undergraduate degree (C’93) was in Mathematics and Physics. But I learned more about statistical methods through this course than I ever did in my undergraduate and graduate studies. These 9 weeks were intense. There were 7 units’ worth of material, quizzes, labs, optional practice homework assignments, a mid-term exam, a final exam and a course-long project (you can read mine – Inventory Damage Analysis) to demonstrate your new-found knowledge and analyze actual data. Fortunately, there were built-in breaks (7 units over 9 weeks) that allowed you to get caught up if needed or have extra time to review the material.

The lecture videos were professionally produced (hats off to Dr. Mine Çetinkaya-Rundel and her team), the logical progression of topics felt natural, the labs & quizzes enhanced my understanding (rather than feeling trivial or unrelated) and the exams were challenging without being impossible. All in all, this is a course you definitely should take if you’re reading this blog and heading on the same journey.

Coursera – Statistical Inference

This is the 6th of 9 courses in Johns Hopkins’ Data Science Specialization. So far, I’ve been very happy with the courses in this series, with the previous course on Reproducible Research being the highlight to date.

However, there is a bit of a stumble with the Statistical Inference class. It covered a lot of the same material as found in the Duke University course (described above) but felt very rushed and was a lot less polished.

Topics included Probability, Bayes’ Theorem, Normal & Poisson distributions, Confidence Intervals, Hypothesis tests, Power and Bootstrap methodology. Although R was used as the primary demonstration tool, make no mistake: this is a mathematics course. If I hadn’t already been deep into the Duke course, I would’ve struggled mightily with this offering.

I found the practice homework assignments with explanations very helpful for the quizzes, and the assigned project did help a number of the concepts gel (you can see my submissions, Exploration of the Exponential Distribution and Analysis of Tooth Growth Data).

My biggest complaint (aside from the rushed pace) was the sloppiness of some aspects of this course. The lecture videos did not perfectly align with the provided notes. In fact, one of the provided pdfs had several slides devoted to a topic (the Jackknife) for which there was no associated video, and I ended up discovering it on YouTube. The slides themselves were riddled with spelling errors. While I understand that scientists, mathematicians and technologists are notoriously bad spellers, it was distracting and I just wanted to offer the instructor personal tutoring on how to use SpellCheck.

Coursera – Data Manipulation at Scale: Systems and Algorithms

I wrote an early review last month of the Data Science at Scale Specialization offering at Coursera given by University of Washington. I’ll repeat what I wrote as nothing has changed after having completed this 1st offering in their specialization.

In looking at the subject matter from the Data Science at Scale specialization (MapReduce, Hadoop, Parallelization, etc) I thought it would be a nice complement to get a “big picture” view of the current state of technology.

What you get instead is a very sloppy, typo-ridden, outdated (videos are from 2013), disjointed set of video lectures from a professor who is obviously smart & well versed in the subject, but is a lousy educator. He represents the type of professor whose class you skip because you can learn more from the book than you can from him.

The exercises have minimal direct relevance to the lecture material and assume you already have proficiency in R, SQL and Python (which, fortunately, I do – but I highly empathized with those who didn’t).

In one of the online exercises, you are asked to randomize a set of data and then provide the mean of a particular variable in that data set. Well, common sense would tell you that everyone is going to get a slightly different answer. But whoever set up the online assignment thought that only HIS answer could be correct, and every student on the discussion forums reported they could not get this marked correctly. Unforgivable is the fact that there was no “community TA” patrolling the message boards looking for reported issues like I saw in all of the other Coursera offerings. This entire program seems… abandoned.

Don’t waste your money on this one.

Update: Well after the deadline passed on the online exercise, the professor did get on the message boards, apologized for being too overwhelmed to pay attention to what’s happening in the course and corrected the quiz to allow for a range of means resulting from a random sampling.
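
To see why a single “correct” answer can’t work here, consider this small sketch. The data set, the sample size and the helper names (`sample_mean`, `is_acceptable`) are all hypothetical, made up for illustration rather than taken from the actual exercise. It draws several random samples from the same data and shows that a grader needs to accept a range around the expected mean instead of one exact value.

```python
import random
import statistics


def sample_mean(values, k, seed):
    """Mean of k values drawn at random (without replacement) from values."""
    rng = random.Random(seed)  # each seed stands in for a different student
    return statistics.mean(rng.sample(values, k))


def is_acceptable(answer, expected_mean, tolerance):
    """Grade an answer as correct if it falls within +/- tolerance of the expected mean."""
    return abs(answer - expected_mean) <= tolerance


# Hypothetical data: the integers 1..100, whose true mean is 50.5.
values = list(range(1, 101))
true_mean = statistics.mean(values)

# Five "students" each randomize the data and report the mean of a 30-row sample.
means = [sample_mean(values, 30, seed) for seed in range(5)]

for m in means:
    print(round(m, 2), "exact match:", m == true_mean)
```

Each student’s mean lands near 50.5, but an exact-match grader would likely reject every one of them. A check like `is_acceptable(m, true_mean, tolerance)` with a sensibly chosen tolerance is the kind of fix the professor eventually applied.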

He attempted to cover a lot of different technologies and techniques in this course (25 different lecture series, each with 4 – 5 videos) but ultimately failed to instill anything substantial. And the constant correcting of the slides’ typos, right in the middle of his lecture, was so damn… unprofessional. How hard would it be to pause the recording, correct the slides, and continue?

Hands-on Programming with R

I watched a free online webinar about R given by Garrett Grolemund and was impressed by it enough to acquire his book Hands-On Programming with R: Write Your Own Functions and Simulations. It’s a relatively short book and is meant not only for those new to R, but also those new to programming in general.

There are plenty of introduction-to-programming books out there that use Java, Python or some other language as the foundation. But one doesn’t really think of R as a “language” but more as a tool to do some math stuff. But it is indeed a fully developed language with a rich syntax, and this book fills a gap by focusing on the programming aspects, whereas all of the other books in my collection are about libraries, algorithms, graphics and other techniques in R.

So while the beginning of the book may be too basic for an experienced programmer who knows what variables are and has written a conditional statement or two in their day, there were two sections that I found were more than worth the cost of this book:

  • An in-depth discussion of how R does its variable scoping
  • Performance tuning with a focus on vectorized processes – a concept that’s just not found in Java and other languages.

You get hands-on experience creating a couple of large, intricate programs (a deck of cards shuffler and a slot machine with complex payout rules) that make use of the concepts. Well done, Garrett!

Other stuff

SAP HANA – As much as I love this technology and hoped to be able to swim in the world of this in-memory database, I learned that it was no longer on my company’s priority list for 2016. So I’ve abandoned the books I was reading and the hands-on exploration of SAP HANA One (the HANA offering on Amazon Web Services). Maybe I will pick it up again someday, but for now I dropped it to allow time for other topics.

Coursera – Practical Predictive Analytics: Models and Methods – This is the second offering in the “Data Science at Scale Specialization” that I discussed at length above. To my surprise, it started one week after the first course and I accidentally enrolled in it (rather than completing the other first and then attempting this one). It turns out the two had nothing to do with each other, with this course focusing more on Data Science techniques (Random Forests, Clustering, etc).

However, like the first course, it provided very little practical education and was more of a confusing overview. At the end of the course, the final project was to work on a Kaggle dataset and submit your analysis as a pdf for review. That’s like spending hours on typo-ridden slides discussing wildlife and then sending you out on an African safari to hunt and bring home a lion.

The only positive that came out of this was that it did introduce me to Kaggle and I’ve been having some fun there, which I will discuss next month.

Doing Data Science: Straight Talk from the Frontline – This book has the potential to become one of my favorites. However, after Chapter 4, I realized I was not technically or mathematically ready for it.

Its audience is someone who’s gone through extensive studying on the subject already and has some hands-on experience with R and various analytic techniques (even if just at an academic level). It’s not a tutorial or a reference book. It’s a frank discussion about what it’s like to do Data Science in the real world, with practical (and somewhat advanced) discussions of the methodology used.

I loved the blunt style of the authors (Cathy O’Neil, Rachel Schutt) that makes you feel like you’re in a coffee shop with them and talking plain dirt about how your day went. What sold me on this book was the following sidebar on page 56 that I landed on while flipping through the pages:

WTF: So Is It an Algorithm or a Model?

Sold. I will be picking up this book again next year when I’m better prepared to do so.