I’ve been in IT for 20 years now and I’m happy that the thrill of accomplishing something with a new tool has not left me.
I’ve been spending a lot of time the past few months learning and swimming in R. Yesterday, an opportunity presented itself at work where I wanted to extract customer numbers from over 13,000 order files I had sitting in a directory.
My bread & butter in the past has been either Java or (God help me) Microsoft Access. I could’ve written something rather quickly in either of those tools to accomplish what I wanted.
But I subscribe to the “use-it-or-lose-it” philosophy and decided to put my knowledge of R to the test. Fortunately, the customer number I was after was located in the same row and same columns in each file, so I was spared having to do any serious text parsing (grep & regex still elude me).
The below script scanned through all 13,000 files and produced what I needed in less than 30 seconds.
setwd("C:/Users/Bill Kimler/Documents/Orders")
files <- list.files("./OrderHistory")
customerNumbers <- vector()
for (i in 1:length(files)) {
  openFile <- file(paste0("./OrderHistory/", files[i]))
  customer <- substring(readLines(openFile, n = 5)[4], 38, 47)
  close(openFile)
  customerNumbers <- append(customerNumbers, customer)
}
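For anyone curious what that `substring(..., 38, 47)` call is doing: it pulls a fixed-width field out of the fourth line of each file. Here’s a minimal sketch with a fabricated order file (the sample text and layout here are invented for illustration; only the column positions match my real files):

```r
# Fabricated 5-line order file; the customer number sits in
# columns 38-47 of line 4, mirroring the real layout.
tmp <- tempfile(fileext = ".txt")
writeLines(c("HEADER",
             "ORDER 0001",
             "DATE 2015-10-01",
             "ORDER 0001 DATE 2015-10-01 CUSTOMER: CUST-00042 STATUS: OPEN",
             "FOOTER"), tmp)

# Read only the first 5 lines, keep line 4, slice out columns 38-47.
con <- file(tmp)
customer <- substring(readLines(con, n = 5)[4], 38, 47)
close(con)
customer  # "CUST-00042"
```

Reading with `n = 5` means R stops after the fifth line, which is part of why the script gets through 13,000 files so quickly.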
I’m happy that I’ve finally put R to use at work. I’m still a far cry from doing serious data analysis with it, but baby steps, man. Baby steps.
update: Nov 11, 2015
In a discussion forum in the Reproducible Research course I took in November, someone posed the question: Is anyone else already using R at work?
I replied with a link to this blog post as a timely example of where I indeed had.
Then a beautiful thing happened. A TA of the course replied with some very constructive feedback that propelled me forward in my understanding of sapply and the use of custom functions. I’m sure he won’t mind me posting it here.
Bill, I think it is great you are learning by doing. In a spirit of “subtleties you might want to be aware of as you do more advanced things,” this version might be a bit faster:
setwd("C:/Users/Bill Kimler/Documents/Orders")
files <- list.files("./OrderHistory")
extractCustomers <- function(filename) {
  openFile <- file(paste0("./OrderHistory/", filename))
  customer <- substring(readLines(openFile, n = 5)[4], 38, 47)
  close(openFile)
  return(customer)
}
customerNumbers <- sapply(files, extractCustomers)

The key difference between the two is that with the for loop and append, R has no idea how much memory to allocate for the final object, so as it gets up in size it has to copy the object to a larger block, and that can significantly slow things down as the data set gets really big, as it has to copy everything to date, plus some room, several times.
In contrast, using the apply family, it is effectively able to go, “I will be doing the same thing on 13000 files and making the result into a vector, so I had better allocate memory for a vector of 13000 at the start.”
It may well not make much difference in this case, as a lot of the running time is file I/O which will be the same in both cases (it still has to read 13000 files either way) but as the data size gets huge, thoughts like this can make a noticeable difference. In the capstone, writing efficient code can mean the difference between processing the data taking 14 hours, and processing the data taking 18 minutes. If it takes 18 minutes, that gives you a lot more scope to play with options for analysis.
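To make the TA’s point concrete, here is a small sketch (on made-up data, not my order files) contrasting the append-and-grow pattern with a vector preallocated to its final size. Both produce identical results; the preallocated version simply avoids the repeated copy-to-a-larger-block step:

```r
# Fabricated input: pretend each element is line 4 of an order file,
# with the customer number in columns 38-47.
lines4 <- sprintf("%37sCUST-%05d", "", 1:5)

# Growing with append: R may have to copy the vector as it grows.
grown <- vector()
for (i in seq_along(lines4)) {
  grown <- append(grown, substring(lines4[i], 38, 47))
}

# Preallocated: the vector is sized once up front, then filled in place,
# which is what sapply effectively does behind the scenes.
prealloc <- character(length(lines4))
for (i in seq_along(lines4)) {
  prealloc[i] <- substring(lines4[i], 38, 47)
}

identical(grown, prealloc)  # TRUE
```

With only five elements the difference is invisible, but the copy-and-grow cost compounds as the result vector gets large.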
What a beautiful human being! I couldn’t be more grateful and hope to be in a position someday to likewise help someone in a similar fashion.