I’m a little delayed in getting my October review recorded here. I’m not sure what I was thinking, but here in November I find myself overloaded taking 4 courses at the same time as well as working through two books. I’m looking forward to a little bit of relaxation in December, but Thanksgiving week will be a nailbiter as projects/exams for these 4 courses come due.
Coursera – Reproducible Research
The fifth course in the Johns Hopkins Data Science Specialization track was a little bit of a breather. The focus here was on the methodical recording of all steps taken while performing a formal data analysis so that others can follow in your footsteps. In my former life as a scientist (physics), this type of thought process is very natural to me: a full disclosure of all methods, assumptions and calculations. But I see every day in business how this thought process is completely lacking. How many reports are written & distributed without any supporting documentation about assumptions, how data is gathered and cleaned, etc.?
Various documentation techniques were covered in this course (we got a heavy dose of markdown and knitr) to produce documents that execute the code contained within them! Very cool technology.
What I will remember most is the following video that shatters the myth of scientists being a cooperative, open, self-correcting bunch of intellectuals. Instead, as seen in the Duke Cancer Research Scandal, this can be very far from the truth. This video is a riveting 30-minute presentation on the behind-the-scenes attempt to reproduce the results Duke researchers had published and the labyrinth of issues the team had to navigate in attempting to do so.
SAP HANA books
Although my 2-year-old cheap tablet has a battery life of about 90 minutes, I was able to completely read, electronic cover to electronic cover, the following two books:
- Getting Started with SAP Lumira – Lumira is a very nice analytics visualization tool that wonderfully complements the high-speed data management of SAP HANA. It produces gorgeous images in a very intuitive fashion. I recommend this book.
- SAP HANA Administration – A bit tougher to read, but I slogged through it to get a better understanding of HANA. I think it got me about 25% there.
During one of the JHU Data Science Courses, the topic of grep and regex came up. I’ve had very little exposure to this very useful technique of searching for text, so I worked through a very nice Regex tutorial over at regexone.com. 16 self-paced lessons with plenty of interactive problems to test out your skills make this a highly recommended place to go.
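For a taste of what grep-style searching with a regex looks like, here is a minimal sketch in Python. The sample lines, the pattern and the `grep` helper are all made up for illustration; they are not taken from the course or from the regexone.com lessons.

```python
import re

# Hypothetical sample lines, invented for this example.
lines = [
    "2014-10-03 course=Reproducible Research status=done",
    "2014-11-20 course=Statistical Inference status=in progress",
    "no date on this line",
]

# ^ anchors the match at the start of the line.
# \d{4} means exactly four digits; the parentheses capture each part.
date_pattern = re.compile(r"^(\d{4})-(\d{2})-(\d{2})")


def grep(pattern, text_lines):
    """Return the lines containing a match, like a very basic grep."""
    return [line for line in text_lines if pattern.search(line)]


matching = grep(date_pattern, lines)

# Pull the captured year (group 1) out of every matching line.
years = [date_pattern.search(line).group(1) for line in matching]

for line in matching:
    print(line)
```

The pattern does double duty here: it filters the lines, and its capture groups extract the pieces you care about.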
Coursera – Data Analysis and Statistical Inference – I started this course in September and it wraps up at the end of November. But what an amazing course! Better than some college classes I paid thousands of dollars to take. It’s far from trivial, but definitely accessible to the non-calculus student.
Data Smart – Continuing my second pass through this book and actually working through all of the Excel examples is giving me a deep appreciation for what’s happening under the hood in some of these R functions. John Foreman walks you through the theory and practical implementation in Excel of a number of the common Data Science techniques I’ve been reading about (K-means clustering, Optimization modeling, Network Graphs, etc.). And after spending hours on the spreadsheet, only to learn you can get the same result with one line of R code… Well, that just means you’ve earned the right to use the R code. I’m halfway through this book and can’t recommend it more highly.
Coursera – Data Science at Scale Specialization – Lest you think I write this blog just to give a thumbs-up to everything, let me emphatically NOT endorse this specialization from the University of Washington. This 3-course + Capstone project offering covers a lot of topics that aren’t in the other Data Science specialization from Johns Hopkins. The JHU specialization stays solely within the realm of R and focuses on the practical, beginner techniques of modern data analysis.
In looking at the subject matter from the Data Science at Scale specialization (MapReduce, Hadoop, Parallelization, etc.), I thought it would be a nice complement to get a “big picture” view of the current state of technology.
What you get instead is a very sloppy, typo-ridden, outdated (videos are from 2013), disjointed set of video lectures from a professor who is obviously smart & well versed in the subject, but is a lousy educator. He represents the type of professor whose class you skip because you can learn more from the book than you can from him.
The exercises have minimal direct relevance to the lecture material and assume you already have proficiency in R, SQL and Python (which, fortunately, I do – but I highly empathized with those who didn’t).
In one of the online exercises, you are asked to randomize a set of data and then provide the mean of a particular variable in that data set. Well, common sense would tell you that everyone is going to get a slightly different answer. But whoever set up the online assignment thought that only HIS answer could be correct, and every student on the discussion forums reported they could not get this marked correct. Unforgivable is the fact that there was no “community TA” patrolling the message boards looking for reported issues like I saw in all of the other Coursera offerings. This entire program seems… abandoned.
Don’t waste your money on this one.
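The problem with that randomized exercise can be sketched in a few lines of Python. This is only an illustration under assumptions: I am assuming "randomize" meant drawing a random sample from the data, and the data, the `sample_mean` helper and the `is_close_enough` tolerance check are hypothetical, not anything from the actual assignment.

```python
import random
import statistics

# Hypothetical data: 1,000 made-up values, not the course's data set.
population = [float(i % 37) for i in range(1000)]


def sample_mean(data, k, seed=None):
    """Mean of k values drawn at random (without replacement) from data."""
    rng = random.Random(seed)
    return statistics.mean(rng.sample(data, k))


# Different seeds stand in for different students.
# Their sample means will usually differ slightly.
mean_a = sample_mean(population, 100, seed=1)
mean_b = sample_mean(population, 100, seed=2)

# Reusing a seed reproduces the same draw, and so the same mean.
mean_c = sample_mean(population, 100, seed=1)


def is_close_enough(answer, data, tolerance):
    """Hypothetical grading rule: accept answers near the full-data mean."""
    return abs(answer - statistics.mean(data)) <= tolerance


print(mean_a, mean_b, mean_c)
```

Either fix would have worked for a grader: tell everyone which seed to use, or accept any answer within a sensible tolerance instead of one exact value.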