A Fun Little Success Story in R

I was up late last night having more fun than I’ve had in a while. We can debate our respective definitions of “fun”, but for me it’s setting myself against a programming challenge, enjoying the struggle as much as the eventual achievement.

I’m no R expert, but last night’s work is a sign that I’m heading in the right direction. That’s less a statement about my skill set than about how much I enjoyed it. This is what I’m meant to be doing!

The Setup

I’m working on a project involving word prediction. The goal is to develop an app that can predict the next word you may type based on the previous word. We’re given a rather large corpus of texts, containing hundreds of thousands of tweets, blogs and news articles. From that corpus, we need to train a model that finds the most likely words to appear after any chosen word.

The whole art of Natural Language Processing, and the packages developed to work in this area, is too complex to get into here. Instead, I want to jump to the point where I’ve processed the corpus and have a collection of 2-word n-grams and their frequencies. For example:

how_are     23        sometimes_I      26
how_is      18        sometimes_but    13
how_about   10        sometimes_.      16

In this simplified collection, a user typing in how would have are, is and about presented as options. Typing in sometimes would present the user with I, . and but. There’s more complexity to this, but it’s sufficient for the next stage.

So at this stage, I am in possession of the following:

  • first_words – a vector of words that appeared first in the 2-word n-gram. (how, how, how, sometimes, sometimes, sometimes)
  • second_words – a vector of words that appeared second in the 2-word n-gram. (are, is, about, I, but, .)
  • ngram_counts – a vector of counts associated with the word pairs. (23, 18, 10, 26, 13, 16)

These vectors are aligned so that the same index refers to the same n-gram and its count.
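
Written out as R code, the toy example above boils down to three aligned vectors:

# Toy data from the example above: three vectors aligned by index
first_words  <- c("how", "how", "how", "sometimes", "sometimes", "sometimes")
second_words <- c("are", "is", "about", "I", "but", ".")
ngram_counts <- c(23, 18, 10, 26, 13, 16)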

Attempt #1 – A simple matrix

My initial idea was to create a matrix with named rows and columns, with counts as values, so that a quick lookup could be done. For example, M['how', 'are'] would yield 23.

# Number of word pairs, and the unique words that will label the rows and columns
ngrams_count <- length(first_words)
unique_first_words <- unique(first_words)
unique_second_words <- unique(second_words)

# Initialize an all-zero matrix: one row per first word, one column per second word
M <- matrix(0,
            nrow = length(unique_first_words),
            ncol = length(unique_second_words),
            dimnames = list(unique_first_words, unique_second_words))

# Brute force: walk every n-gram and drop its count into the matching cell
for (i in 1:ngrams_count) {
   M[first_words[i], second_words[i]] <- ngram_counts[i]
}

The columns and rows would represent the unique values of words found. So with the simple example above, we’d end up with a 2 x 6 matrix, with how and sometimes as the rows and the six 2nd words as the columns. Once the matrix has been initialized, we can walk through all of the word pairs and place their count in the right element of the matrix. I recognized right away that this is old-school coding, especially using the for loop in a brute-force assignment of values. But I needed a starting point and this is where I set my flag.
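
Once the loop finishes, a lookup is just a matter of indexing by the two words. With the toy data:

M["how", "are"]
# [1] 23
M["sometimes", "is"]   # this pair never appeared, so it keeps its 0 placeholder
# [1] 0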

I created some statistics for this process to see how the matrix grew as I processed more items in the corpus, which had a total of 3.37 million documents in it (each document is a tweet, news article or blog post).
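
The exact bookkeeping isn’t worth reproducing here, but the numbers below were gathered along these lines, with system.time() for the elapsed seconds and object.size() for the memory footprint (a sketch, not the literal logging code):

# Sketch of the per-checkpoint measurements (not the literal logging code)
elapsed <- system.time({
  for (i in 1:ngrams_count) {
    M[first_words[i], second_words[i]] <- ngram_counts[i]
  }
})["elapsed"]
format(object.size(M), units = "MB")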

[Table: documents processed vs. matrix size (object.size) and processing time]

Two things became obvious as the number of docs processed increased:

  • I was going to run out of memory soon. The amount of memory consumed by the matrix (object.size(M)) was quickly growing beyond the means of my computer. With just over a tenth of one percent of the documents processed, I was already knocking on 1 GB of memory.
  • I was going to grow old waiting for the process to complete. 54 seconds isn’t a lot, but with the time growing exponentially (along with everything else), I wasn’t going to get much further before I found myself waiting hours (assuming memory crashes didn’t occur first).

Attempt #2 – Speed up matrix data assignments

As I learned from a comment on one of my posts last year, a lot of speed can be gained by vectorizing your code, and for loops are the first place to look.

Fortunately, there is indeed a way to vectorize the population of data in a matrix. Instead of looping through each n-gram one at a time, I can do the assignment in one shot with the following:

M[cbind(first_words, second_words)] = ngram_counts

I’m basically indexing M with a matrix of indices, the first column being the first words and the second column the second words. The ngram_counts values fall right in line.

So instead of a loop that assigns M['how', 'are'] <- 23, then M['how', 'is'] <- 18, and so on, it all happens in one shot. Conceptually, each row of the index matrix pairs up with the corresponding count:

      first column   second column   value assigned
[1,]  "how"          "are"           23
[2,]  "how"          "is"            18
 ...                                 ...

Watch what happens to the processing speeds:

[Table: documents processed vs. processing time with vectorized assignment]

For 5000 documents, processing time dropped from 54 seconds down to 3 seconds! Progress!

I plowed forward with 10,000 documents but had to stop after that step once I saw that the memory used by the matrix had reached 1.6 GB. I have 16 GB of memory on my machine, only a portion of which is available for non-OS processes. And I had only gotten 0.3% of the way through the corpus! Time to address the memory issue.

Attempt #3 – Wielding the Sparse Matrix

The reason the matrix() plan wasn’t going so well is pretty obvious. In the last run above, with 10,000 documents, a matrix was constructed to handle 14,360 x 14,950 potential word pairs. That’s almost 215 million possible combinations. However, only about 71,000 two-word n-grams were actually seen, which means 99.97% of the matrix was just placeholder 0 values. That’s a LOT of wasted storage space.
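
Just to put numbers on that, a quick back-of-the-envelope check:

# Back-of-the-envelope sparsity check for the 10,000-document run
total_cells <- 14360 * 14950    # ~214.7 million possible word pairs
nonzero     <- 71000            # two-word n-grams actually observed
1 - nonzero / total_cells       # ~0.9997, i.e. about 99.97% of cells hold only 0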

When you begin working in R (or any programming language, for that matter), you quickly realize that you are very unlikely to encounter a problem that hasn’t been seen a dozen times before. In this case, the makers of the Matrix package created a sparseMatrix object that doesn’t consume any memory for values that are never assigned.

  library(Matrix)
  # Create an empty sparse matrix with the right dimensions and word labels;
  # the single 0 at [1, 1] is just a placeholder to satisfy i/j/x
  sM <- sparseMatrix(i = 1, j = 1, x = 0,
                     dims = c(length(unique_first_words),
                              length(unique_second_words)),
                     dimnames = list(unique_first_words, unique_second_words))

A sparseMatrix can work just like a regular matrix (i.e. sM['how', 'are'] <- 23). But instead of filling never-seen word pairs with 0, it stores nothing for them at all.

i and j are supposed to be integer vectors of the row and column indices where values actually exist, and x a vector of the values themselves. I couldn’t figure out how to pass the indexes of the unique-word vectors I had set up (i.e. send through [1, 3] instead of ['how', 'about']), but I learned that I could just send through a single 0 for the first cell [1, 1] and it would automatically create a matrix of the correct dimensions, filled with nothing, based on the dims passed (except for that first cell, which is now 0). I can then assign values as needed in the next step.
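
For what it’s worth, here’s a sketch of what the i/j/x route would look like with the toy data, using match() to turn the words into integer indices. It’s not the approach I actually took, just an illustration of what those arguments expect:

# Sketch only: hand sparseMatrix() integer indices and counts directly
# (Matrix package already loaded above).
# match() gives each word's position in the corresponding unique-word vector.
i_idx <- match(first_words, unique_first_words)
j_idx <- match(second_words, unique_second_words)
sM2 <- sparseMatrix(i = i_idx, j = j_idx, x = ngram_counts,
                    dims = c(length(unique_first_words),
                             length(unique_second_words)),
                    dimnames = list(unique_first_words, unique_second_words))
sM2["how", "are"]   # 23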

So with the sparseMatrix in hand, I set about assigning the values in a vectorized manner, just like above.

sM[cbind(first_words, second_words)] = ngram_counts

Error in .TM.repl.i.mat(as(x, "TsparseMatrix"), i = i, value = value) : 
  such indexing must be by logical or 2-column numeric matrix

WHAT??? I can’t use the vectorized technique here? Well, with the memory problem out of the way, hopefully the speed of a for loop would at least be bearable:

# Back to the brute-force loop, this time filling the sparse matrix
for (i in 1:ngrams_count) {
    sM[first_words[i], second_words[i]] <- ngram_counts[i]
}

Memory, indeed, was improved tremendously: for 5,000 documents processed, only 1.5 MB were consumed by the sparseMatrix versus over 687 MB for a regular matrix. But the cost in time was brutal. The for loop took nearly twice as long as my first attempt. So while I wouldn’t be blowing up my computer’s memory, I certainly would be growing cobwebs waiting for the data to fill.

Attempt #4 – data.table

At this point, I abandoned the matrix approach and pursued a more database-familiar process. The data.table package is a nifty one I came across in one of the JHU Coursera classes. It operates as an in-memory table with quick lookup capabilities. For example, I can find all of the ‘how’ records with

dt[first_word=='how',]

   first_word second_word count
1:        how         are    23
2:        how          is    18
3:        how       about    10

So here’s how I utilized a data.table, and yes, the data assignment step can be vectorized!

library(data.table)
# Build the table directly from the three aligned vectors
dt <- data.table(first_words, second_words, ngram_counts)
names(dt) <- c("first_word", "second_word", "count")
# Index on first_word for fast lookups
setkey(dt, first_word)
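
With the key set on first_word, pulling up the candidates for a word and sorting them by count is a quick chained call. With the toy data:

dt[first_word == "how"][order(-count)]
#    first_word second_word count
# 1:        how         are    23
# 2:        how          is    18
# 3:        how       about    10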

How did this method’s statistics fare?

[Table: data.table processing times and memory usage]

Not only were the times incredibly fast, but the memory allocation was remarkably small! I was so emboldened by the success that I processed 337,000 documents (10% of the corpus), and the resulting data.table storage was a meager 40.4 MB!

Last night, around 11 pm, I was quite the sight to see, dancing around, pumping my fist in the air, high-fiving… no one.

But still, it was a very good night!
