Accomplishments – May 2016

As a reminder, a one-page summary of all the courses, books & videos I’ve reviewed in the past year can be found on my Journey Roadmap page.

The highlight of May was attending ODSC East (Open Data Science Conference) in Boston. I wrote extensively about it in my previous blog post, so I won’t repeat anything about it here.

May provided a few more steps forward in my Data Science knowledge. I also have spent a good portion of June emptying my house in preparation for a sale, for which I now wait. And wait. And wait…

I’ll be increasing this blog to a weekly frequency, with some topics of interest like:

  • Note-taking methods for MOOCs
  • Using Mnemosyne to stay sharp on techniques
  • The power of multi-core processing in R
  • Blogs I follow
  • … and whatever else strikes my fancy

But for now, I’m well overdue for my May recap, so without further ado:

Completed Items

Coursera – Machine Learning: Classification

This was the third course in the 6-part Machine Learning series developed by the University of Washington. Once again, the team of Emily Fox (professor at UoW) and Carlos Guestrin (founder & CEO of Dato) hit a home run with this Python-based course.

The topics covered were standard in the realm of Classification:

  • Linear & Logistic Classifiers
  • Overfitting
  • Decision Trees
  • Handling missing data
  • Boosting & AdaBoost
  • Precision & Recall
  • Huge datasets and Stochastic Gradient Descent

As with the previous courses, the online quizzes were very relevant and tested your comprehension of the material. There were also engaging IPython notebooks that reviewed the lecture material, walked you through Python code implementing the algorithms discussed, and explored how those algorithms handle various extreme situations using real data.
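To give a flavor of the kind of exercise involved – this is my own sketch in scikit-learn, not one of the course’s actual notebooks: train a boosted classifier and score it with precision and recall, two of the topics above (the synthetic dataset and parameters are my stand-ins).

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the course supplied real datasets
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# AdaBoost builds an ensemble of weak learners, reweighting mistakes each round
model = AdaBoostClassifier(n_estimators=50).fit(X_train, y_train)
pred = model.predict(X_test)
print("precision: %.3f  recall: %.3f" %
      (precision_score(y_test, pred), recall_score(y_test, pred)))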


I’m not an experienced Python developer, so while I did find these assignments challenging, they provided a good amount of skeleton code as well as checkpoints to test against, and I was able to complete all the assignments successfully. More importantly, getting my hands dirty with the algorithms truly solidified the theory covered in the lectures.

The fourth course in the series, Clustering & Retrieval, begins next week and I cannot wait!

Getting Started with Python Data Analysis


by Phuong Vo.T.H, Martin Czygan

Back in January, I purchased nearly a dozen books from Packt Publishing as they held a post-holiday sale where you could get unlimited electronic books (various formats, including non-DRM’d pdf) for $5 each. Let me tell you, I went to town.

Well, I can see why these books were so inexpensive.  They’re unreviewed, non-proofread whitepapers written by authors looking to pad their resumes as being “published authors”.

“Getting Started with Python Data Analysis” had a promising synopsis:

With this book, we will get you started with Python data analysis and show you what its advantages are.

Topics covered included:

  • pandas
  • numpy
  • matplotlib
  • various data retrieval libraries
  • scikit-learn

Unfortunately, the book was riddled with code samples that would not run. For example:

from matplotlib import pyplot as plt
import numpy as np
x = np.linspace(0, 3, 6)
y = np.power(x,2)
line = plt.plot(y, color='red', linewidth=2.0)  # plt.plot returns a *list* of Line2D objects
line.set_linestyle('--')  # AttributeError: 'list' object has no attribute 'set_linestyle'
plt.setp(line, marker='o')
plt.show()

generated an error on the line.set_linestyle call (plt.plot returns a list of Line2D objects, not a single line, so there is no set_linestyle method to call). Instead, after reviewing a different Python book, I was able to get the snippet to run by passing the line style directly to plot:

from matplotlib import pyplot as plt
import numpy as np
x = np.linspace(0, 3, 6)
y = np.power(x,2)
line = plt.plot(y, color='red', linewidth=2.0, linestyle='--')  # set the style at plot time
plt.setp(line, marker='o')  # plt.setp happily accepts the list plt.plot returns
plt.show()

The book also shows little awareness of what topics have already been covered, and concepts get dropped in out of the blue without reference. For example, one contour plot example introduces np.newaxis, which is never explained before or after that snippet.
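To show what np.newaxis is actually doing there – this is my own minimal reconstruction of that style of snippet, not the book’s actual code:

from matplotlib import pyplot as plt
import numpy as np

x = np.linspace(-1, 1, 255)
y = np.linspace(-2, 2, 300)
# np.newaxis turns the 1-D x into a column vector, so broadcasting
# it against y produces a full 2-D grid of z values
z = np.sin(y) * np.cos(x[:, np.newaxis])
plt.contour(x, y, z.T)  # transpose so z's shape matches (len(y), len(x))
plt.show()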


Finally, the spelling mistakes were really distracting. In one figure, the misspelling of “unequal” in the graph is bad enough – but the fact that the same word is misspelled differently again in the accompanying code caused me a bit of head-scratching.


Typos and code errors aside, the material is light on every subject it covers. Give this book a pass. I’ll let you know if I uncover something better.

Taming Big Data with Apache Spark and Python – Hands On!

At ODSC, I heard quite a bit about Spark. I’m only passingly familiar with Big Data topics like Hadoop and MapReduce, but I came away with the understanding that Apache Spark is an exciting new platform that bridges the world of big data storage with data science analysis, bringing to bear many of the techniques I’ve been learning without requiring you to master the world of clustered data storage and retrieval.

For $25, I tried out the Udemy course “Taming Big Data with Apache Spark and Python” and I truly believe I got more than my money’s worth out of this course.

For a hands-on introduction to Spark, I don’t see how this can be beat. You are walked through installing Spark locally, you’re given some sample data to work with, and you have your first Python-based Spark script up and running in no time. You will need some experience with Python (certainly not expert-level) to get the most out of the course.
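To give a sense of scale, those first scripts are only a dozen lines or so. Here’s a sketch in that spirit – my own stand-in, not the course’s actual exercise; the u.data file name and its tab-separated format are assumptions modeled on MovieLens-style sample data:

# Count how often each rating appears in a ratings file
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
sc = SparkContext(conf=conf)

lines = sc.textFile("u.data")                     # userID, movieID, rating, timestamp
ratings = lines.map(lambda line: line.split()[2])
result = ratings.countByValue()                   # an action: returns a plain Python dict

for rating, count in sorted(result.items()):
    print("%s: %d" % (rating, count))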

Throughout the 46 videos, the instructor (Frank Kane) does an excellent job of introducing the topic, providing theory (“here’s what we’re gonna do”), stepping through the code (“here’s how we’re going to do it”) and recapping (“here’s what we just did”).

There weren’t any quizzes or formal graded assignments like you would find in your typical Coursera or edX course, but that was just fine here. You were certainly given “try it yourself” exercises whose solutions were covered in a subsequent video.

I found the material easily accessible for the most part, although I did struggle with the advanced topic of network graphs (finding friend connections iteratively). At the end, you attempt to set up an Amazon Web Services instance to try out Spark on a much larger data set (as opposed to the small samples you practice with on your laptop). However, despite setting everything up successfully, I kept running into memory errors with the AWS instance and couldn’t get the script to run. Debugging would have required a more advanced understanding of Spark than this course provided.
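For what it’s worth, my guess – and it is only a guess, not a verified fix – is that the answer involved the kind of memory and partitioning knobs the course didn’t cover, something along these lines (the bucket path is hypothetical):

# Educated guesses at what might have helped: give each executor more
# memory, and split the input into more partitions so no single task
# holds too much data at once.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("BigDataJob")
        .set("spark.executor.memory", "2g"))      # a real Spark config key

sc = SparkContext(conf=conf)
data = sc.textFile("s3n://my-bucket/input.txt")   # hypothetical bucket path
data = data.repartition(100)                      # more, smaller partitions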

All in all, I recommend this course as a great starting point.

Looking ahead to June

Well, technically, there’s only a week left in June as I write this. And I did take a 2-week hiatus from data science learning to deal with prepping my house for sale. However, I am in the midst of the following:

  • Continuing to pore through An Introduction to Statistical Learning with Applications in R, which has become my “bible” for machine learning techniques.
  • I’ve been interested enough in Spark to pursue a second course in the subject. On the edX platform, “BerkeleyX: CS105x Introduction to Apache Spark” kicked off last week and it looks very promising. Through the generosity of Databricks, all students are given a free account to set up a 6-GB Spark instance to be used for the course.
  • I’m taking a step back to review some old material. Next week, the fourth course in UoW’s 6-part Machine Learning series starts up, and I’m looking back at the notes & code for the 2nd & 3rd courses.
  • I’m continuing to putter around with the JHU Data Science Capstone on the subject of Natural Language Processing. I keep stalling out in Week #2 as other items constantly take higher priority. I’m most likely going to have to re-register in July and finish this off for good.
  • I’m also spending more time honing my skills in R and Python through the use of Mnemosyne. I’ll have a full blog on that subject in a couple of weeks.

As always, I look forward to your comments!

May Interlude – Returning from ODSC East

I returned recently from ODSC East, a 3-day Data Science Conference held in Boston. I am no stranger to tech conferences, having attended numerous SAP, SalesForce.com and other major events across the country. But this was the first ODSC I’ve attended (not surprising, since I only found my calling for data science last summer). And it may very well be my last.

Everyone’s experience was certainly unique. No doubt, there were many who felt this was the best 3-day experience of their lives. I also know personally of a few who were so frustrated by the event, they bailed out on Sunday.

So what I write below are my observations and experiences. If you are considering going to one in the future, please do seek out other opinions. But for the $800 I shelled out for this weekend (registration, hotel, travel, food, parking), I could have gotten 10 Coursera courses or 15 books instead, and benefitted far more than I did from this conference.

My goals were simple:

  • Learn data science (technology, methodology, state of the industry, etc)
  • Explore career opportunities
  • Network

For me, this conference fell short on all three counts. More below:

The Friday training workshops

If you paid a little extra, you could attend two 4-hour workshops on Friday. There were a number you could choose from, but it’s a gamble. If you pick wrong, you just wasted 4 hours.

Unfortunately, the first session I picked, “Business and Science Domain Applications of Big Data Analytics Using R”, was a bad choice. It was a very disjointed presentation, bouncing between reading research papers verbatim off a big screen and imagining visualizations on an empty wall. It was a stream-of-consciousness exploration of a number of machine learning techniques (like nearest neighbors and principal component analysis), but none of it was clear to someone who came to learn. There was no mention of “big data” (the iris dataset does not count as big data), nor was there much R to be seen – although there was a surprise Python code review randomly thrown into the middle.

The afternoon session, “Hadoop Hands On”, was a little better. I only cursorily knew what Hadoop was (the yellow puffy elephant mascot was the extent of my knowledge) and hoped this workshop would fill the void. David Whitehouse from Cloudera did a great job explaining the role Hadoop plays in the overall Big Data stack, and how other platforms like Spark and Kafka fit in. He even got under the hood to explain how data is stored and gave some working demonstrations of scripts to pull various data together.

Disappointingly, there was very little “hands-on” to be seen. He did bring 20 USB drives containing a 5 GB Cloudera virtual instance, which were passed around the 200 or so members of the audience to copy to their local machines. But aside from that, it was pretty much a 4-hour lecture – certainly engaging for the first 3 hours. I really do wish this workshop had been rethought to allow for more hands-on activity.

There were many other workshops I could have attended, including:

  • Intro to Data Visualization
  • Using Open Data with APIs
  • Intro to Python for Data Science
  • Adaptive Deep Learning for Vision and Language
  • Interactive Data Visualizations in R with Shiny and ggplot2
  • …quite a few others

From what I saw on the Twitter feed, there were mixed reviews on these as well. Some were praised, others were panned.

The Saturday morning keynotes

Saturday morning started with a packed house of maybe 600+ data scientists to hear the following four “celebrities” speak:

  • Ingo Mierswa, RapidMiner
  • Lukas Biewald, Crowdflower
  • the famous Kirk Borne, Booz Allen Hamilton
  • Stefan Karpinski, Julia Computing

All four were engaging. Of course, Kirk Borne knocked it out of the park. I was most pleasantly surprised by Stefan Karpinski, who spoke about the Julia programming language. Unless he was BS’ing up there, it’s a programming language especially suited for speedy data analytics and definitely worth exploring.

My recommendation for the organizers: In future sessions, please do go out of your way to provide a panel of speakers that’s just a bit more diverse than 4 white guys. We don’t have to look far. In my non-scientific scan of the attendees, they came in all shapes, sizes and colors. We could certainly stand to see something a little more representative in the keynote speakers.

The Saturday & Sunday sessions


The “meat” of the conference was the nearly 100 sessions to choose from over the weekend. A couple of quick observations, and then some highlights.

Organizers were not prepared for the size of the crowd. You would have thought every data scientist in the world was in attendance. The section of the convention center to which the conference was confined was too small for the volume of people. Starting with the registration process Friday morning (long, slow lines you’d expect from a new Disney World ride), it was clear there was a logistics problem. There were coffee & bagels provided, but as one person observed, “Nothing sadder than seeing data scientists in front of an empty coffee urn”.

These sessions were first come, first served; pre-registration would have been very welcome here. Quite a few sessions were packed, with every seat taken and the floor filled with sitting participants trying to make do. In one session, an organizer had to kick about 20 attendees out because the room was over capacity. Once it became clear that sessions could fill up so quickly, attendees jostled to get into rooms with the fervor of Black Friday shoppers at the mall.

Scheduling was erratic and inconsistent. There were three sources for the schedule: the web site, a printed sheet handed out at registration, and a mobile app. The three didn’t always agree. I can certainly appreciate that last-minute changes do occur – but when something as fundamental as lunch changes, you can see there was room for improvement in the scheduling department.

The mobile app was awful for trying to find a session of interest. The conference had several major tracks: Disruptive Data, Big Data, Data Visualization, and Data Science for Good. You were free to pick sessions from any of the tracks – but the mobile app would only show you one track at a time and one room at a time. Trying to determine what’s playing at 12:30pm? You had to navigate each track separately and click on each room; it could take five minutes just to figure out where to go. One attendee was generous enough to create and share a Google Doc with a simple grid view of all the sessions. Sometimes, simple is better.


Finally, they need to build in buffer time between sessions (5-10 minutes) to allow travel from room to room, and to let rooms clear out before attendees arrive for the next session. It got pretty ugly when 150 people tried to get into a room that was still packed from the previous session that had just let out.

Workshops were often just lectures. Don’t advertise something as a workshop if it’s just a speaker talking for an hour. You’re supposed to work at a workshop. If you don’t intend to make it a hands-on experience, fine. There’s certainly plenty to be learned from lectures as well. Just don’t call it a workshop.

Sessions of note

There were a couple of sessions that I do want to give praise to.

  • Dhiana Deva (Spotify) gave a very energetic, down-to-earth tour of various machine learning techniques. Titled “Machine Learning for Everyone”, her talk certainly delivered on its promise, making the most of the 30 minutes she had and providing a great service to brand-new data scientists.
  • Allen Downey (Olin College) led a great hands-on workshop on “Bayesian Statistics Made Simple”. A full appreciation of Bayesian statistics has eluded me, but with a combination of Google Sheets and IPython notebooks, I now have a much deeper understanding of the subject (a toy version of the core idea appears after this list). Next month, I will be reading his freely available book, “Think Bayes”.
  • Jared Lander (Lander Analytics) led my favorite workshop: “ggplot2: from scratch to compelling graphs”. My fingers are still sore from keeping up.  The best quote from the conference came from him as he took ggplot2 requests from the audience: “It’s going to be disgusting, but let’s do it. Oh God, it became a big pile of vomit!”
  • Ted Kwartler (Liberty Mutual Insurance) had a very interesting session, “Introduction to Text Mining in R”. With Ted, you felt like you’d sidled up to an experienced coworker who tells you to toss away the textbook and shows you how it’s really done. He was generous enough to share his R code, cultivated over years of battle-tested experience.
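As promised above, here is the core move from Downey’s workshop as I understood it, reduced to a toy example of my own (the coin setup is mine, not his): a prior times a likelihood, normalized, gives a posterior.

# Which of two coins (one fair, one biased) produced an observed heads?
priors = {"fair": 0.5, "trick": 0.5}        # P(hypothesis)
likelihoods = {"fair": 0.5, "trick": 0.75}  # P(heads | hypothesis)

unnorm = {h: priors[h] * likelihoods[h] for h in priors}
total = sum(unnorm.values())
posteriors = {h: p / total for h, p in unnorm.items()}

print(posteriors)  # {'fair': 0.4, 'trick': 0.6}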

The Career Fair

I have a more personal blog post coming soon in which I will tell my story. But for the purposes of this post: I recently left my job as CIO of a large distribution company (my place of employment for 19 years) and am in the process of selling my house in New York and relocating to South Carolina. I’m not actively seeking employment – but I will be once my move is complete (hopefully no later than the end of the summer).

I dropped in at the Career Fair to do a little reconnaissance on what companies are looking for: skills, experience, location of employees, etc. What I discovered there was not all that encouraging.

First, I observed that the booths were divided roughly evenly between educational programs looking to take on students (like Rutgers, MIT, General Assembly) and corporate recruiters. I’m not planning to plunk down $20k-$40k on a degree program, so I passed on the former. The remaining 10 tables were staffed by corporations like Johnson & Johnson, Nielsen, McKinsey & Company and CVS. The stories I heard there were all very similar:

  • Go to our web site. You will see exactly what positions we are looking for
  • A minimum of a Master’s or PhD in data analysis
  • Need to live in or near a big city (Boston, NY, Washington, Chicago)

It also appeared that Data Architects (those versed in Hadoop, Spark, storage, etc.) were more in demand than the data scientists who live in R or Python notebooks.

I didn’t walk away from the Career Fair with any extraordinary level of confidence. But then again, it was a very, very small sample of companies that are looking to hire – and you know what they say about extrapolating from small samples.

Overall

My overall impressions were the following:

  • This is a conference that’s experiencing growing pains. Hopefully, they learn and improve with each iteration
  • QA needs to be performed on sessions in advance, and expected standards of delivery should be set (e.g. workshops should be interactive, presentations should be concise, with slides provided in advance)
  • I would have benefitted from more organized assistance with networking and socialization. A networking beer party in a crowded dim room that’s too loud to have any conversation doesn’t work for me. It’d work just fine if I were showing up with 5 coworkers. But I knew absolutely no one there – so a better environment for breaking the ice would have been appreciated
  • Guidance on level of expertise needed at various sessions. Maybe tracks labeled: Beginner, Intermediate, Expert
  • Scheduling needs to be fixed and the app overhauled

ODSC has a lot of potential, and I may attend again next year and hopefully walk away feeling like I got my money’s worth.

Accomplishments – Apr 2016

April was a light month for me. It marked the end of my 19-year career with my previous employer, as well as the beginning of preparations to sell my house in NY (and an eventual relocation to South Carolina – and hopefully, a new career in data science).

But my personal story is for another blog, another week. Despite the “real world” distractions, I was able to move forward a little bit in my self-driven DS education.

Completed Items

Coursera – Machine Learning: Regression

This was the second course in the 6-part Machine Learning series developed by the University of Washington. I gave the first class (Machine Learning Foundations) a very positive review last month. Where that class was a hands-on, high-level overview of all the topics covered in the specialization, this course gets very detailed on various Regression topics including:

  • Simple and multiple linear regression
  • Gradient descent vs. closed-form solutions for minimizing the residual sum of squares
  • Feature selection using ridge regression and lasso
  • Model selection using validation sets and cross-validation
  • Non-parametric methods like nearest neighbors and kernel regression

This was far from a walk in the park. The lectures, given exclusively by Emily Fox, were some of the clearest I’ve seen on the subject. Her explanation of lasso and how it causes coefficients to drop to zero was just about the best I’ve encountered. She provided video simulations, optional math derivations (hint: don’t let them be optional for you), and annotated lecture notes for each week.
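If you want to see that coefficient-zeroing behavior for yourself, here’s a quick sketch of my own in scikit-learn (the synthetic data is a stand-in, not course material):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Only 2 of the 5 features actually matter in this synthetic data
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
true_coefs = np.array([3.0, 0.0, 0.0, 1.5, 0.0])
y = X.dot(true_coefs) + 0.1 * rng.randn(100)

for alpha in [0.01, 0.1, 1.0]:
    lasso = Lasso(alpha=alpha).fit(X, y)  # L1 penalty: zeroes out weak features
    ridge = Ridge(alpha=alpha).fit(X, y)  # L2 penalty: only shrinks them
    print("alpha=%g lasso=%s ridge=%s" %
          (alpha, np.round(lasso.coef_, 2), np.round(ridge.coef_, 2)))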

I thought the quiz questions were non-trivial but very relevant to the material covered. The self-guided programming assignments were delivered as IPython notebooks and required you to develop your own functions implementing the algorithms discussed in the lectures; while challenging, they were not impossible to complete.

The only drawback to this otherwise excellent course was the default use of GraphLab Create, a commercial ML library that serves as an alternative to scikit-learn and pandas (you get to use it for free for up to a year). While you are free to use standard, open libraries for the programming assignments, if you’re not familiar with them (like me), you’re somewhat forced to use GraphLab Create. It didn’t detract from the knowledge I gained about the ML techniques, but in hindsight I wish I had gained experience with the standard Python libraries in the process.

I’ve moved on to the third course in this series, Machine Learning: Classification, and now, with a little more Python confidence under my belt, I’m going to attempt the assignments using standard Python libraries.

Natural Language Processing YouTube videos

If you’ve got a little free time on your hands, you can digest a series of 102 videos on the topic of Natural Language Processing given by Stanford professors Dan Jurafsky & Chris Manning. This very specific offshoot of machine learning and statistical techniques is the subject of the JHU Data Science Specialization Capstone Project that I still need to complete. So while I wait for that session to open at the end of May, I spent a little time here familiarizing myself with the topic.

Wow – what a deep and rich subject area this is. The things we take for granted – Google searches, spellcheck, next-word anticipation, Siri question translation – all have roots in NLP, and it was a fascinating trip from basic techniques like word-frequency measurements through n-grams and smoothing techniques. I only made it through video #32, but I feel a lot better versed on this subject and I’ve bookmarked the YouTube playlist as a go-to source of information on NLP.
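For the curious, the bigram counting and add-one smoothing covered in the early videos boil down to something like this toy sketch of mine (the corpus is made up; this is my summary of the technique, not code from the lectures):

from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size

def smoothed_prob(w1, w2):
    # P(w2 | w1) with add-one (Laplace) smoothing
    return (bigrams[(w1, w2)] + 1.0) / (unigrams[w1] + V)

print(smoothed_prob("the", "cat"))  # a bigram we saw: high probability
print(smoothed_prob("the", "ran"))  # an unseen bigram still gets some mass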

The Theory That Would Not Die

How Bayes’ Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy, by Sharon Bertsch McGrayne

This was an interesting read I picked up with an Amazon Kindle gift card. I was on the hunt for a good description of how Bayesian inference is performed – and I’m still on that hunt after completing this book.

This is not a mathematics book; it’s a history book – and a fairly interesting one at that. It was a welcome break from dealing with program libraries and mathematical derivations. Instead, you see the evolution of a mathematical idea over the course of centuries and how that idea had some very real-world implications.

You will need to get past the first two chapters before it becomes a page-turner, but the wait is well worth it. You see where Bayesian concepts competed with other ideas (and lost out just as often as they won). You also read how progress was made – and impeded – by the personalities of the talents involved. Based on this book, you’d think the great mathematicians were brilliant jerks.

In all, you won’t find any but the lightest of mathematical equations, but you will get an appreciation of why we need data science in the world.

Abandoned Items

Data Science from Scratch: First Principles with Python


Data Science from Scratch was one of the first books I picked up and reviewed back in August, 2015. However, being so new to the subject and having had zero exposure to Python, I had to shelve this book for a later date.

Well, April 2016 was that later date, and while I no longer felt the subject matter or choice of programming language to be an impediment, I still abandoned this book halfway through.

For me, it failed on two levels:

  • Expanding my knowledge of Python
  • Expanding my knowledge of Data Science

Given that those two items make up the title of the book, that’s a serious issue.

From a Python perspective, I understand that this wasn’t a Python book – I’ve got other books for that. But when providing code examples for data analysis concepts, the author chose to implement methods from scratch and ignore the well-established libraries used in the real world (like scikit-learn, numpy or pandas). Yes, “from scratch” is also in the title, and on that point the book held true. But take, for example, the chapter on Statistics: instead of using well-established functions for common statistical calculations, you create your own methods to calculate mean, variance, correlation, etc. You build new methods on top of old methods, and before you know it, you’re creating your own library of functions. Only in the “further references” section at the end of the chapter do you see a suggestion to go look up the numpy documentation.
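To illustrate what I mean – this is my own reconstruction of the book’s style, not its actual code:

import numpy as np

# Hand-roll mean and variance the way the book does...
def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)  # each new function is built on top of the previous one
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(variance(data))        # 4.571... (hand-rolled sample variance)
print(np.var(data, ddof=1))  # ...when numpy gives the same number in one call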

The chapters are very short, and as such they can’t convey the data science concepts very well. With half of each chapter spent developing your own code for algorithms already established elsewhere, there isn’t much room left to properly discuss topics like gradient descent, Bayes’ theorem or hypothesis testing. You are better served on each of these topics elsewhere.

Overall, I did not see a lot of value in what this book offers.

Looking ahead to May

My primary goal for May is to put my house up for sale. There’s a lot of work to be done in order to make it sell-worthy, but somewhere I hope to be working on the following tracks:

  • An Introduction to Statistical Learning with Applications in R – hit Chapters 6 – 7
  • Udemy – Taming Big Data with Apache Spark (started in April, will finish up in May)
  • Coursera – Machine Learning: Classification – the third course in the 6-course series (discussed above)
  • Getting Started with Python Data Analysis – a book covering the numpy, matplotlib, pandas and scikit-learn libraries.
  • Khan Academy – Linear Algebra track – I feel like I could use some expertise in this area