A Fun Little Success Story in R

I was up late last night having more fun than I’ve had in a while. We can debate our respective definitions of “fun”, but for me it’s setting myself against a programming challenge, enjoying the struggle as much as the eventual achievement.

I’m no R expert, but last night’s work is a sign that I’m heading in the right direction. That’s less a reflection of my skill set than of how much I enjoyed it. This is what I’m meant to be doing!

The Setup

I’m working on a project involving word prediction. The goal is to develop an app that can predict the next word you may type based on the previous word. We’re given a rather large corpus of texts, containing hundreds of thousands of tweets, blogs and news articles. From that corpus, we need to train a model that finds the most likely words to appear after any chosen word.

The whole art of Natural Language Processing, and the packages that have been developed for it, is too complex to get into here. Instead, I want to jump to the point after I’ve processed this corpus and have a collection of 2-word n-grams and their frequencies in the corpus. For example:

how_are 23
how_is 18
how_about 10
sometimes_I 26
sometimes_but 13
sometimes_. 16

In this simplified collection, a user typing in how would have are, is and about presented as options. Typing in sometimes would present the user with I, . and but. There’s more complexity to this, but it’s sufficient for the next stage.

So at this stage, I am in possession of the following:

  • first_words – a vector of words that appeared first in the 2-word n-gram. (how, how, how, sometimes, sometimes, sometimes)
  • second_words – a vector of words that appeared second in the 2-word n-gram. (are, is, about, I, but, .)
  • ngram_counts – a vector of counts associated with the word pairs. (23, 18, 10, 26, 13, 16)

These vectors are aligned so that the same index refers to the same n-gram and its count.

Attempt #1 – A simple matrix

My initial idea was to create a matrix with named rows and columns, with counts as values, so that a quick lookup could be done. For example, M['how', 'are'] would yield 23.
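As a sketch of that idea (my own illustration, using the toy vectors from above, not the actual project code), the lookup matrix might be built like this:

```r
# Toy data from the example above
first_words  <- c("how", "how", "how", "sometimes", "sometimes", "sometimes")
second_words <- c("are", "is", "about", "I", "but", ".")
ngram_counts <- c(23, 18, 10, 26, 13, 16)

# One row per distinct first word, one column per distinct second word,
# initialized to zero for pairs never seen in the corpus
M <- matrix(0,
            nrow = length(unique(first_words)),
            ncol = length(unique(second_words)),
            dimnames = list(unique(first_words), unique(second_words)))

# Indexing with a two-column character matrix fills each (first, second) cell
M[cbind(first_words, second_words)] <- ngram_counts

M["how", "are"]  # 23
```

The catch with a full corpus, of course, is that almost every (first, second) pair never occurs, so a dense matrix like this is overwhelmingly zeros.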



Simple file scraper in R

I’ve been in IT for 20 years now and I’m happy that the thrill of accomplishing something with a new tool has not left me.

I’ve been spending a lot of time the past few months learning and swimming in R. Yesterday, an opportunity presented itself at work where I wanted to extract customer numbers from over 13,000 order files I had sitting in a directory.

My bread & butter in the past has been either Java or (God help me) Microsoft Access. I could’ve written something rather quickly in either of those tools to accomplish what I wanted.

But I subscribe to the “use-it-or-lose-it” philosophy and decided to put my knowledge of R to the test. Fortunately, the customer number I was after was located in the same row and same columns in each file, so I was spared having to do any serious text parsing (grep & regex still elude me).

The script below scanned through all 13,000 files and produced what I needed in less than 30 seconds.

setwd("C:/Users/Bill Kimler/Documents/Orders")

# Every order file sits in one directory
files <- list.files("./OrderHistory")

customerNumbers <- vector()

for (i in 1:length(files)) {
  openFile <- file(paste0("./OrderHistory/", files[i]))
  # The customer number is always on line 4, characters 38-47
  customer <- substring(readLines(openFile, n = 5)[4], 38, 47)
  close(openFile)
  customerNumbers <- append(customerNumbers, customer)
}

I’m happy that I’ve finally put R to use at work. I’m still a far cry from doing serious data analysis with it, but baby steps, man. Baby steps.


update: Nov 11, 2015

In a discussion forum in the Reproducible Research course I took in November, someone posed the question: Is anyone else already using R at work?

I replied with a link to this blog post as a timely example of where I indeed had.

Then a beautiful thing happened. A TA of the course replied with some very constructive feedback that propelled me forward in my understanding of sapply and the use of custom functions. I’m sure he won’t mind me posting it here.

Bill, I think it is great you are learning by doing. In a spirit of “subtleties you might want to be aware of as you do more advanced things”, this version might be a bit faster:

setwd("C:/Users/Bill Kimler/Documents/Orders")
files <- list.files("./OrderHistory")

extractCustomers <- function(filename) {
  openFile <- file(paste0("./OrderHistory/", filename))
  customer <- substring(readLines(openFile, n = 5)[4], 38, 47)
  close(openFile)
  return(customer)
}

customerNumbers <- sapply(files, extractCustomers)

The key difference between the two is that with the for loop and append, R has no idea how much memory to allocate for the final object. As the object grows, R has to copy it to a larger block of memory, and that can significantly slow things down as the data set gets really big, since it has to copy everything to date (plus some room) several times over.

In contrast, using the apply family, it is effectively able to go “I will be doing the same thing on 13,000 files and making the result into a vector; I had better allocate memory for a vector of 13,000 at the start.”

It may well not make much difference in this case, as a lot of the running time is file I/O, which will be the same in both cases (it still has to read 13,000 files either way). But as the data size gets huge, thoughts like this can make a noticeable difference. In the capstone, writing efficient code can mean the difference between processing the data in 14 hours and processing it in 18 minutes. If it takes 18 minutes, that gives you a lot more scope to play with options for analysis.
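To illustrate the TA’s point with a toy comparison of my own (not from the original discussion): growing a vector one element at a time forces repeated reallocation, while filling a preallocated vector does not.

```r
# Growing: R may have to copy the vector to a bigger block over and over
grow <- function(n) {
  v <- vector()
  for (i in 1:n) v <- append(v, i)
  v
}

# Preallocating: the full block is reserved once, then filled in place
prealloc <- function(n) {
  v <- numeric(n)
  for (i in 1:n) v[i] <- i
  v
}

system.time(grow(10000))      # slows disproportionately as n grows
system.time(prealloc(10000))
identical(grow(100), prealloc(100))  # TRUE -- same result either way
```

The two functions produce identical results; only the memory behavior differs, which is exactly the gap between my append loop and the sapply version.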

What a beautiful human being! I couldn’t be more grateful and hope to be in a position someday to likewise help someone in a similar fashion.

zoRk – using the swirl package to learn R

You discover early on that you need to learn R if you’re going to explore the realm of data science.

So how to learn? I started by downloading R and R Studio and installing them on my Windows laptop. I acquired a couple of books on R from Amazon – but found that I needed some more basic, hands-on introduction to R before I could really absorb what those books covered.

As I started taking the Coursera R Programming course, they recommended that I should simultaneously work through the R lessons offered in swirl – “a software package for the R programming language that turns the R console into an interactive learning environment.”

You basically learn R through R using R.

I installed it, loaded the library and with a simple

swirl()

I was on my way. And I fell in love with this way of learning R.
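For anyone who wants to follow along, the whole setup amounts to just a few lines (swirl is on CRAN):

```r
install.packages("swirl")  # one-time install from CRAN
library(swirl)             # load the package
swirl()                    # start the interactive lessons
```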

It’s not the only way, for sure. I always love to learn from books and I will be back to my Amazon purchases shortly. The Coursera lectures are good and provide excellent theoretical background on how R works (I just finished the brain-stretching lectures on Scoping Rules).

But for just starting out with R, nothing beats the simple, no-frills, text-only, hands-on methodology of swirl.


swirl offers a number of different courses but I’m as green as green can be, so I’m working through “R Programming”.

The lessons are short (10 – 20 minutes each) and very logical in their progression of concepts. I come from an education background (a former Physics teacher) and I was as passionate about the pedagogy as I was about the science. Whoever authored “R Programming” remembers what it was like to start from the beginning. Everything is based on founding principles and builds upon concepts without belaboring points (as Khan Academy is apt to do on occasion).

With no videos, fancy graphics, or voice-overs, and, more importantly, with its emphasis on hands-on practice as you learn, swirl is a very distraction-free environment in which to learn R.

You know what it reminds me of? (You have to be a child of the 80’s to appreciate this.)

Zork