Accomplishments – August 2015

Third month of the journey and the jumble of topics that Data Science encompasses started to become clearer to me.

Coursera – Programming for Everybody (Python)

The first Data Science book I grabbed in June at the good old Barnes & Noble was Data Science from Scratch by Joel Grus. I figured the title was just right for me, a beginner just starting out. I was wrong.

Two things became clear after the 3rd chapter:

  • I needed to brush up on my math. Sure, I was working through the examples and seeing results on the screen – but other than aesthetic appeal, I was not processing what I was accomplishing.
  • Python is the language of choice here. There was an earlier chapter devoted to giving a whirlwind tour of Python, but it just wasn’t enough for me to comfortably absorb the concepts while I was battling syntax.

So I set the book aside and took the 10-week intro course: Programming for Everybody (Python). Since this was an introduction to programming and I’m an experienced Java developer, I did find about half the course to be redundant (covering basics like variables, loops, if-then, etc.). I also found the instructor to be quite dorky – not in an adorable way, but in an annoying, trying-too-hard-to-be-funny-and-nerdy sort of way that made some of the lectures longer than they needed to be. You’ll see what I mean when he tries on a wizard outfit – or when he stops a lecture to brew some tea.

Still, it was a good introduction to Python. There were relevant quizzes to reinforce key points, and small programming exercises that allowed you to flex the skills you developed.

However, after completing this course, I realized that I could not pursue parallel tracks learning both Python and R at the same time – at least not at this beginner stage. So here ends my Python journey for now. I’ll pick it back up next year.

Coursera – R Programming

In the second course of the Coursera/Johns Hopkins Data Science Specialization Track, I got my hands dirty with the tool that I’ve been using every day since. I felt this 4-week intro provided a sufficient foundation for what was to come after.

Topics included:

  • Installing R & various packages
  • Data types
  • Reading in data
  • Subsetting data
  • Creating your own functions
  • Code optimizations and advanced topics like scoping

Overall, it was a good course and the quizzes and programming assignments were challenging. Although I’m still not even halfway through the program, I feel they could have spent more time on R fundamentals and saved the advanced topics like scoping and performance for a later date – once the student has had more hands-on experience with R. At this point, all of that is a distant memory.

Coursera – Getting and Cleaning Data

This was the third course in the Specialization Track and so far, I feel like I’ve been getting more than my money’s worth on this program. Getting and Cleaning Data exists in a world that I’m very familiar with – the world of crappy, hard-to-get-at data. This course was FULL of practical tips on how to perform initial data reviews and handle incomplete or just plain wrong data, plus methods for acquiring and parsing text, XML, JSON and SQL data into R for further analysis.

My first professional accomplishments in technology came from MS Access ODBC’ing its way into a mainframe database. And years of attempting to parse & review data gave me an appreciation for how much of Data Science is actually spent on the data prep (some say up to 80%).

I will be reviewing my notes from this class for many years to come, I’m sure.
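
The course itself works in R. Purely as an illustration of the kind of workflow described above (pulling records out of JSON, XML and SQL sources, then setting aside values that are missing or plainly wrong), here is a minimal Python sketch using only the standard library. The sample data, the field names (`id`, `temp`) and the helper names are all made up for this example and do not come from the course.

```python
import json
import sqlite3
import xml.etree.ElementTree as ET

# All sample data below is invented for illustration.
JSON_TEXT = '[{"id": 1, "temp": "21.5"}, {"id": 2, "temp": null}]'
XML_TEXT = "<rows><row id='3' temp='19.0'/><row id='4' temp='n/a'/></rows>"


def to_float(value):
    """Return value as a float, or None if it is missing or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def load_json(text):
    """Parse a JSON array of records into (id, temp) tuples."""
    return [(rec["id"], to_float(rec.get("temp"))) for rec in json.loads(text)]


def load_xml(text):
    """Parse <row id=... temp=...> elements into (id, temp) tuples."""
    root = ET.fromstring(text)
    return [(int(row.get("id")), to_float(row.get("temp"))) for row in root.iter("row")]


def load_sql():
    """Build a throwaway in-memory SQLite table and read it back."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (id INTEGER, temp TEXT)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", [(5, "22.25"), (6, "")])
    rows = conn.execute("SELECT id, temp FROM readings ORDER BY id").fetchall()
    conn.close()
    return [(rid, to_float(temp)) for rid, temp in rows]


# Combine the three sources, then keep only records with a usable value.
records = load_json(JSON_TEXT) + load_xml(XML_TEXT) + load_sql()
clean = [(rid, temp) for rid, temp in records if temp is not None]
print(f"{len(records)} records read, {len(clean)} usable")
```

Funnelling every source through one small conversion helper keeps the "what counts as bad data" decision in a single place, which is roughly the habit the course encourages, whatever the language.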

Data Smart

Data Smart: Using Data Science to Transform Information into Insight by John Foreman has to be one of the best books on the subject that anyone could read.

Using only Excel, Foreman covers a wide range of Data Science topics, from k-means clustering to Machine Learning to Predictive Analytics.

Is Excel the best tool for the job for any of these techniques? Hell no. But if you want to learn about what these techniques are and get hands-on with small datasets to try it out yourself, you can do no better than this book.

His explanations assume you know NOTHING about the topics and he guides you with kid gloves as he takes you through step by step in an Excel workbook, explaining every formula and technique along the way.

I completed my initial read of this book in August and it provided me with a comfort level on a number of topics that I’m starting to see pop up again in some advanced courses. It by no means made me an expert, but it did provide a solid introduction to that world. I’m working my way through the book a second time, hand-typing every formula into Excel to increase my mastery.

Best surprise: Learning how to use Excel’s Solver tool for linear optimization. That’s just plain cool.

Working with Data using SAP HANA Studio

SAP (and some of its partners) set up a channel on YouTube called SAP HANA Academy. In it, you will find hundreds of SAP HANA related videos, from administration to installation to development.

I watched all 24 videos in the Working with Data using SAP HANA Studio playlist. They were a bit outdated (HANA is now on SPS10; most of these videos were from SPS5). There was a lot of redundancy among the videos, as each was created independently of the others with no assumption that the viewer was following a designated path.

This YouTube channel is a wealth and a mess of information. Do not mistake this for a real “academy” that replaces a formal course on this subject. The sequence of the videos in the playlist does not progress in a logical fashion (why would you want to learn about stored procedures immediately after you just saw a fundamental video on how to connect to an SAP HANA db from HANA Studio?).

But if your budget is $0 like mine was, there was plenty of useful information contained in these videos and I was able to play along in the AWS HANA One instance I created.

Statistics

In addition to the online Khan Academy learning experience that I had written about before, I dabbled in a couple of books about probability and statistics (namely, Statistics for Dummies and Statistics in a Nutshell).

But the best book on the subject came as a $1 purchase at a library used book sale. Comprehending Behavioral Statistics has to be one of the greatest, easiest to follow books on the subject. If you want to begin, this is the best place to go. I’ve read about two-tailed t-tests in other books – but never really understood them until this one.

There are a ton of exercises at the end of each chapter with answers in the back of the book. The pace is nice and slow. The chapters cover narrowly scoped topics that are easily digestible. What a great find!
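
For readers who have not met the two-tailed t-test mentioned above, here is a minimal Python sketch of the one-sample version, assuming made-up measurements and a hypothesized mean of 5.0. It is not taken from the book. The critical value 2.262 is the standard table entry for a two-tailed test at the 0.05 level with 9 degrees of freedom, and the function name `one_sample_t` is just a label chosen for this example.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """Return the t statistic for testing whether the sample mean differs from mu0."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
    return (mean - mu0) / (sd / math.sqrt(n))


# Invented data: ten measurements, hypothesized population mean of 5.0.
sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.2, 5.7, 5.3, 5.5, 5.9]
mu0 = 5.0

t_stat = one_sample_t(sample, mu0)

# Table value for a two-tailed test, alpha = 0.05, df = n - 1 = 9.
T_CRIT = 2.262

# "Two-tailed" means a large deviation in either direction counts,
# so the absolute value of t is compared with the critical value.
reject = abs(t_stat) > T_CRIT
print(f"t = {t_stat:.2f}, reject the null hypothesis: {reject}")
```

The absolute value is the whole point of the "two-tailed" part: the test asks whether the mean is different from the hypothesized value, not specifically higher or lower.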