I was up late last night having more fun than I’ve had in a while. We can debate our respective definitions of “fun”, but for me it’s setting myself against a programming challenge, enjoying the struggle as much as the eventual achievement.
I’m no R expert, but last night’s work is a sign that I’m heading in the right direction. That’s less a reflection of my skill set than of how much I enjoyed it. This is what I’m meant to be doing!
I’m working on a project involving word prediction. The goal is to develop an app that can predict the next word you may type based on the previous word. We’re given a rather large corpus of texts, containing hundreds of thousands of tweets, blogs and news articles. From that corpus, we need to train a model that finds the most likely words to appear after any chosen word.
The whole art of Natural Language Processing, and the packages that have been developed to work in this area, are too complex to get into here. Instead, I want to jump to the point after I’ve processed this corpus and have a collection of 2-word n-grams and their frequency in the corpus. For example:

| n-gram | count |
| --- | --- |
| how are | 23 |
| how is | 18 |
| how about | 10 |
| sometimes I | 26 |
| sometimes but | 13 |
| sometimes . | 16 |
In this simplified collection, a user typing in how would have are, is and about presented as options, in order of frequency. Typing in sometimes would present the user with I, . and but. There’s more complexity to this, but it’s sufficient for the next stage.
So at this stage, I am in possession of the following:
- first_words – a vector of words that appeared first in the 2-word n-gram. (how, how, how, sometimes, sometimes, sometimes)
- second_words – a vector of words that appeared second in the 2-word n-gram. (are, is, about, I, but, .)
- ngram_counts – a vector of counts associated with the word pairs. (23, 18, 10, 26, 13, 16)
These vectors are aligned so that the same index refers to the same n-gram and its count.
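Using the toy data above, the three aligned vectors look like this in R (the variable names match the list above; the data is the simplified example, not the real corpus):

```r
# Three parallel vectors: index i describes one 2-word n-gram
first_words  <- c("how", "how", "how", "sometimes", "sometimes", "sometimes")
second_words <- c("are", "is", "about", "I", "but", ".")
ngram_counts <- c(23, 18, 10, 26, 13, 16)

# Index 1 in each vector describes the same n-gram:
# "how are" occurred 23 times
paste(first_words[1], second_words[1])  # "how are"
ngram_counts[1]                         # 23
```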
Attempt #1 – A simple matrix
My initial idea was to create a matrix with named rows and columns, with the counts as values, so that a quick lookup could be done. For example, M['how', 'are'] would yield 23.
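A minimal sketch of that idea, using the toy vectors from earlier (the construction here, zero-filling and then assigning via a two-column character index, is one way to build such a matrix, not necessarily the exact code from the project):

```r
first_words  <- c("how", "how", "how", "sometimes", "sometimes", "sometimes")
second_words <- c("are", "is", "about", "I", "but", ".")
ngram_counts <- c(23, 18, 10, 26, 13, 16)

rows <- unique(first_words)
cols <- unique(second_words)

# Initialise with zeros: any pair never seen in the corpus keeps count 0
M <- matrix(0, nrow = length(rows), ncol = length(cols),
            dimnames = list(rows, cols))

# Indexing a named matrix with a two-column character matrix
# assigns each (row, col) pair in one step
M[cbind(first_words, second_words)] <- ngram_counts

M["how", "are"]     # 23
M["sometimes", "I"] # 26
```

With named dimensions, the lookup is a single subscript operation, which is exactly the quick access described above.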