How I Take Online Courses for Data Science (Part 2 – Self-quizzing)

The Struggle

In my last blog post, I shared how I take notes while working through a MOOC for Data Science. I proceeded happily in this fashion for months, ringing each bell as I completed course after course in various specializations.

Around February of this year, however, a sinking feeling began to settle in: I just wasn’t retaining a lot of the information I was learning. Sure, I was scoring 100s on all the quizzes and completing the assignments on time without any issues, but I felt uneasy.

A month after I worked on the code for a gradient descent algorithm for a lab assignment, you think I had any clue what gradient descent was?

Two months after I learned how to create a support vector machine model in Python, you think I recalled what library to import to even start?

Three months after I learned to separate data into training & test groups in R, you think I could remember a single command to do so?

NO – I found myself constantly going back to older notes for the most basic commands. I was spending all of my time on StackOverflow looking up answers to the most basic questions (like how to reverse a Python list, or how to generate 10 random integers in R). If I was going to work seriously in the Data Science realm, I knew I needed a solid, fundamental level of proficiency with the tools and techniques I expected to use.
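To see just how basic these stumbles were, here are the R flavors of those two questions (the values are only illustrative):

# reverse a vector, R's analogue of reversing a Python list
rev(c(3, 1, 4, 1, 5))  # 5 1 4 1 3

# draw 10 random integers, here between 1 and 100
sample(1:100, size = 10, replace = TRUE)

These are one-liners, and I still couldn’t produce them from memory.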

Enter Mnemosyne

Quite a while ago, I had gotten it into my mind to learn Japanese. My sole motivation: in my academic career, the only thing I absolutely sucked at was foreign languages, and it wasn’t for lack of effort. In my 30s, I wanted to wipe that blemish off my record by tackling one of the hardest languages for native English speakers to pursue.

I ran through all 3 levels of Rosetta Stone. I listened to every minute of Pimsleur’s entire Japanese collection. I had more books & videos than I knew what to do with. The most valuable tool I used, however, was the open-source flashcard system, Mnemosyne.

I confess, I didn’t try out different flashcard programs and settle on this one as the best. What I did want was a tool to help me identify the concepts I was struggling with and beat me over the head with them until they became second nature.

From their website:

Mnemosyne uses a sophisticated algorithm to schedule the best time for a card to come up for review. Difficult cards that you tend to forget quickly will be scheduled more often, while Mnemosyne won’t waste your time on things you remember well.
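Mnemosyne’s scheduler descends from the SM-2 family of spaced-repetition algorithms. Here is a minimal sketch of that idea in R; the function and its constants are my own illustration, not Mnemosyne’s actual implementation:

# Sketch of an SM-2-style interval update (illustrative constants only).
# grade: 0 (total blank) through 5 (perfect recall).
next_interval <- function(prev_interval, easiness, grade) {
  if (grade < 3) return(1)  # forgotten: see the card again tomorrow
  # easy answers nudge the easiness factor up; hard ones drag it down
  easiness <- max(1.3, easiness + 0.1 - (5 - grade) * 0.08)
  round(prev_interval * easiness)  # well-known cards wait longer
}

next_interval(6, 2.5, grade = 5)  # 16 -- pushed out over two weeks
next_interval(6, 2.5, grade = 2)  # 1  -- forgotten, back tomorrow

Those two calls show the behavior the quote above describes: cards you flub come right back, while cards you know recede further and further into the future.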

How I Take Online Courses for Data Science (Part 1- Note-taking)

Since June of last year, I have completed 27 online courses on various Data Science topics (see my roadmap) and am in the midst of 3 others. I thought I’d share some of the techniques I’ve refined during this time and hope you find some nuggets that will improve your own self-paced education experience.

Time planning

Once you decide to start learning about Data Science, you soon discover a vast amount of resources available. I learned in my first couple of months of this journey that my appetite for knowledge far exceeded the number of hours in a day (let alone the time I actually had available). So at the beginning of each month, I set aside some time to plan: I create a spreadsheet grid of the courses, books and other activities I intend to pursue that month and spread the work out day by day.

Sometimes a task is just to spend x number of minutes on a topic. For specific courses with deadlines, I can be more precise about which lessons to watch and how much time to spend, and stay on track with the syllabus. I also block out the days I know I’ll be traveling (I commute 800 miles between NY and SC!). When a specific task is accomplished, I color that cell green.

[screenshot: my monthly schedule spreadsheet]

Yes, this may be more OCD than you’d be willing to sign up for. But in addition to keeping me from getting overloaded, there’s a certain amount of positive reinforcement in watching the green fill in, giving me that thrill of accomplishment. It also forces me to be realistic about what can be done that month: I’ve had to pass on courses I would normally have signed up for because I could see they just wouldn’t fit into the daily schedule.

A Fun Little Success Story in R

I was up late last night having more fun than I’ve had in a while. We can debate our respective definitions of “fun”, but for me it’s setting myself against a programming challenge, enjoying the struggle as much as the eventual achievement.

I’m no R expert, but last night’s work is a sign that I’m heading in the right direction. It’s not so much a reflection of my skill set as of how much I enjoyed it. This is what I’m meant to be doing!

The Setup

I’m working on a project involving word prediction. The goal is to develop an app that can predict the next word you may type based on the previous word. We’re given a rather large corpus of text containing hundreds of thousands of tweets, blog posts and news articles. From that corpus, we need to train a model that finds the most likely words to appear after any chosen word.

The whole art of Natural Language Processing, and the packages that have been developed to work in this area, is too complex to get into here. Instead, I want to jump to the point after I’ve processed this corpus and have a collection of 2-word n-grams and their frequencies in the corpus. For example:

how_are    23        sometimes_I    26
how_is     18        sometimes_but  13
how_about  10        sometimes_.    16

In this simplified collection, a user typing in how would have are, is and about presented as options. Typing in sometimes would present the user with I, . and but. There’s more complexity to this, but it’s sufficient for the next stage.

So at this stage, I am in possession of the following:

  • first_words – a vector of words that appeared first in the 2-word n-gram. (how, how, how, sometimes, sometimes, sometimes)
  • second_words – a vector of words that appeared second in the 2-word n-gram. (are, is, about, I, but, .)
  • ngram_counts – a vector of counts associated with the word pairs. (23, 18, 10, 26, 13, 16)

These vectors are aligned so that the same index is associated with the same n-gram and its count.
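For concreteness, here are those three vectors as plain R objects (the names and values come straight from the example above), along with a naive lookup that shows the alignment:

first_words  <- c("how", "how", "how", "sometimes", "sometimes", "sometimes")
second_words <- c("are", "is", "about", "I", "but", ".")
ngram_counts <- c(23, 18, 10, 26, 13, 16)

# every second word observed after "how", with its count
idx <- first_words == "how"
setNames(ngram_counts[idx], second_words[idx])
#  are    is about
#   23    18    10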

Attempt #1 – A simple matrix

My initial idea was to create a matrix with named rows and columns, with the counts as values, so that a quick lookup could be done. For example, M["how", "are"] would yield 23.
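The original code for this attempt isn’t shown here, but a sketch of the idea in R, reusing the vectors from above, looks something like this:

# one row per distinct first word, one column per distinct second word
rows <- unique(first_words)
cols <- unique(second_words)
M <- matrix(0, nrow = length(rows), ncol = length(cols),
            dimnames = list(rows, cols))

# fill in the observed counts via matrix indexing on the name pairs
M[cbind(first_words, second_words)] <- ngram_counts

M["how", "are"]  # 23

With the toy vocabulary this works nicely; with a real corpus, a dense matrix over every first/second word pair grows enormous and almost entirely zero-filled, which is presumably why this was only attempt #1.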
