Simple file scraper in R

I’ve been in IT for 20 years now and I’m happy that the thrill of accomplishing something with a new tool has not left me.

I’ve been spending a lot of time the past few months learning and swimming in R. Yesterday, an opportunity presented itself at work: I wanted to extract customer numbers from over 13,000 order files I had sitting in a directory.

My bread & butter in the past has been either Java or (God help me) Microsoft Access. I could’ve written something rather quickly in either of those tools to accomplish what I wanted.

But I subscribe to the “use-it-or-lose-it” philosophy and decided to put my knowledge of R to the test. Fortunately, the customer number I was after was located in the same row and the same columns in each file, so I was spared having to do any serious text parsing (grep & regex still elude me).
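To make "same row, same columns" concrete, here is a minimal sketch of pulling a fixed-position field out of a line of text with substring(). The sample line and the customer number in it are made up for illustration; only the column positions (38 through 47) come from the script in this post.

```r
# A made-up line: 37 spaces of padding, then a 10-character customer number,
# so the number occupies columns 38 through 47 (hypothetical sample data).
line <- paste0(strrep(" ", 37), "CUST001234", "  trailing text")

# substring() uses 1-based positions, and both the first and last
# positions are included in the result.
customer <- substring(line, 38, 47)
print(customer)
```

Because the positions never change from file to file, no pattern matching is needed. If the field ever moved around, a regular expression would be the more robust choice.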

The below script scanned through all 13,000 files and produced what I needed in less than 30 seconds.

setwd("C:/Users/Bill Kimler/Documents/Orders")

files <- list.files("./OrderHistory")

customerNumbers <- vector()

for (i in seq_along(files)){
 openFile <- file(paste0("./OrderHistory/",files[i]))
 # The customer number sits on line 4, columns 38-47, of every file
 customer <- substring(readLines(openFile, n = 5)[4], 38, 47)
 close(openFile)  # release the connection before moving to the next file
 customerNumbers <- append(customerNumbers, customer)
}

I’m happy that I’ve finally put R to use at work. I’m still a far cry from doing serious data analysis with it, but baby steps, man. Baby steps.

update: Nov 11, 2015

In a discussion forum in the Reproducible Research course I took in November, someone posed the question: Is anyone else already using R at work?

I replied with a link to this blog post as a timely example of where I indeed had.

Then a beautiful thing happened. A TA of the course replied with some very constructive feedback that propelled me forward in my understanding of sapply and the use of custom functions. I’m sure he won’t mind me posting it here.

Bill, I think it is great you are learning by doing. In a spirit of “subtleties you might want to be aware of as you do more advanced things”, this version might be a bit faster:

setwd("C:/Users/Bill Kimler/Documents/Orders")
files <- list.files("./OrderHistory")
extractCustomers <- function (filename){
 openFile <- file(paste0("./OrderHistory/",filename))
 customer <- substring(readLines(openFile, n = 5)[4], 38, 47)
 close(openFile)  # release the connection
 return(customer)
}
customerNumbers <- sapply(files, extractCustomers)

The key difference between the two is that using the for loop with append, R has no idea how much memory to allocate for the final object. As the object grows, R has to copy it to a larger block, and that can significantly slow things down as the data set gets really big, because it has to copy everything to date (plus some room) several times.

In contrast, using the apply family, it is effectively able to go “I will be doing the same thing on 13000 files and making the result into a vector, I had better allocate memory for a vector of 13000 at the start”.

It may well not make much difference in this case, as a lot of the running time is file I/O which will be the same in both cases (it still has to read 13000 files either way) but as the data size gets huge, thoughts like this can make a noticeable difference. In the capstone, writing efficient code can mean the difference between processing the data taking 14 hours, and processing the data taking 18 minutes. If it takes 18 minutes, that gives you a lot more scope to play with options for analysis.

What a beautiful human being! I couldn’t be more grateful and hope to be in a position someday to likewise help someone in a similar fashion.
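For anyone who wants to try the TA's suggestion without a directory of real order files, here is a minimal, self-contained sketch. It assumes a made-up file layout (customer number on line 4, columns 38 through 47) and writes 50 throwaway files to a temporary directory; the names extractCustomer, grown, prealloc and applied are my own for illustration. It uses vapply, a stricter cousin of sapply that also states the expected result type up front.

```r
# Build a throwaway directory of fake order files (hypothetical layout:
# customer number on line 4, columns 38-47).
demoDir <- tempfile("OrderHistoryDemo")
dir.create(demoDir)
ids <- sprintf("C%09d", 1:50)  # 10-character made-up customer numbers
for (k in seq_along(ids)) {
  fakeOrder <- c("header 1", "header 2", "header 3",
                 paste0(strrep(" ", 37), ids[k]),
                 "line 5")
  writeLines(fakeOrder, file.path(demoDir, sprintf("order_%03d.txt", k)))
}

# full.names = TRUE returns complete paths, so no paste0() is needed later.
files <- list.files(demoDir, full.names = TRUE)

# Given a path, readLines() opens and closes the file by itself.
extractCustomer <- function(path) {
  substring(readLines(path, n = 4)[4], 38, 47)
}

# Approach 1: grow the result one element at a time with append().
grown <- character(0)
for (f in files) grown <- append(grown, extractCustomer(f))

# Approach 2: allocate the full-length vector first, then fill it in.
prealloc <- character(length(files))
for (i in seq_along(files)) prealloc[i] <- extractCustomer(files[i])

# Approach 3: let vapply() size the result; character(1) says each call
# must return exactly one string.
applied <- vapply(files, extractCustomer, character(1), USE.NAMES = FALSE)

# All three approaches should produce the same vector.
stopifnot(identical(grown, prealloc), identical(grown, applied))
```

With only 50 small files, the three approaches should be indistinguishable in speed, since reading files dominates the running time. Wrapping each approach in system.time() on a much larger input is one way to see whether the difference the TA describes shows up.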


Accomplishments – September 2015

August was a bit “light” for summer vacation reasons, but I attacked September with renewed vigor! I look back to when I started this journey in June, and I marvel at how much I’ve learned since then. And I’ve also been humbled at how much more there is to learn (years’ worth!). However, in the spirit of documenting what’s been accomplished, here’s the record for September 2015:

Coursera – Exploratory Data Analysis

This is the fourth course in the Coursera Data Science Specialization track and by far the most enlightening one so far. The initial exploration of a data set is vital to future insights. In addition to covering the various R graphics packages (especially ggplot), the course introduced clustering (e.g. k-means), which provided a taste of things to come.

Data Science requires hard work. It’s not a simple set of skills that can easily be achieved by a set of short tutorials. The last part of this course on clustering gives a taste of that fact. You will need an understanding of some higher level mathematics (statistics and linear algebra) in order to make use of the tools and algorithms that have been developed. No pain, no gain, folks. If you’re not willing to sweat for your eigenvalues, then get out of the gym!

Coursera – Machine Learning

Speaking of mental stretches, this 11-week course on Machine Learning was one of the most challenging I’ve encountered since graduate school. It was heavy on the mathematics (especially Linear Algebra) and introduced me to a new mathematics software package called Octave (an open source rendition of MATLAB).

This has been the most challenging course I’ve taken so far. The professor recorded 110 lecture videos on a wide variety of topics from basic linear algebra to pattern recognition in photographs to determine numeric digits. You will not be an expert in the subject after this course (after all, are you an expert in Physics after a single class?). But the notes I took and the challenging quizzes and lab exercises this course demanded will provide a wealth of material that I will be referring to for years to come!

Make no mistake – this is an advanced course. But this is where modern Data Science is at. You need to know this material if you’re serious about the subject.

The two items above are what I completed in September. But for the record, I pursued a number of other tracks throughout the month. I’ll write more about each in the month I complete it, but briefly, this month also contained daily work in:

Doing Data Science – What a great book written by two women who developed a course on Data Science at Columbia University! This book feels like a friend or coworker who pulls me aside and says, “Here’s what Data Science is really about.”

Data Smart – I’m working through this book for a second time. I’ve read the book already, but now I’m going back to the beginning and working through every Excel example, taking detailed notes on every step. Chapter 2 on K-means clustering makes so much more sense now, especially tied in with the Exploratory Data Analysis course described above.

SAP HANA Administration and Getting Started with SAP Lumira – I’ve read more than halfway through both of these books in September. HANA is an in-memory, lightning-quick database, and Lumira is one of the coolest interactive reporting platforms I’ve ever gotten my hands on.

I gave a demonstration this morning to my company’s CEO of a HANA dataset consisting of about a million records that were sliced & diced in no time at all using Lumira’s beautiful & intuitive graph development platform. I’ve been working with data & reporting for 20 years now, and I still am smiling from this leap in technology.

Coursera – Data Analysis and Statistical Inference – Finally, I began a new course that reinforced the fundamentals of probability and statistics. So far, basic statistics (mean, variance, normal distribution, Bernoulli distribution) as well as Bayes’ Theorem and hypothesis testing were covered in detail in the first three weeks. This foundation of statistics is essential to further progress in Data Science. And so far, this has been the highest quality of any Coursera course I’ve worked through.