May Interlude – Returning from ODSC East

I returned recently from ODSC East, a 3-day data science conference held in Boston. I am no stranger to tech conferences, having attended numerous SAP, Salesforce.com and other major events across the country. But this was the first ODSC I’ve attended (not surprising, since I only found my calling for data science last summer). And it may very well be my last.

Everyone’s experience was certainly unique. No doubt, there were many who felt this was the best 3-day experience of their lives. I also know personally of a few who were so frustrated by the event, they bailed out on Sunday.

So what I write below are my observations and experiences. If you are considering going to one in the future, please do seek out other opinions. But for the $800 I shelled out for this weekend (registration, hotel, travel, food, parking), I could have gotten 10 Coursera courses or 15 books instead, and benefitted far more than I did from this conference.

My goals were simple:

  • Learn data science (technology, methodology, state of the industry, etc)
  • Explore career opportunities
  • Network

For me, this conference fell short on all three counts. More below:

The Friday training workshops

If you paid a little extra, you could attend two 4-hour workshops on Friday. There were a number to choose from, but it was a gamble: pick wrong, and you’ve just wasted 4 hours.

Unfortunately, the first session I picked, “Business and Science Domain Applications of Big Data Analytics Using R”, was a bad pick. It was a very disjointed presentation, bouncing between reading research papers verbatim from a big screen and imagining visualizations on an empty wall. It was a stream-of-consciousness exploration of a number of machine learning techniques (like nearest neighbors and principal component analysis), but none of it was clear to someone who came to learn. There was no mention of “big data” (the iris dataset does not count as big data), nor was there much R to be seen. There was, however, a surprise Python code review thrown randomly into the middle.

The afternoon session, “Hadoop Hands On”, was a little better. I knew only cursorily what Hadoop was (that there’s a yellow puffy elephant was the extent of my knowledge) and hoped this workshop would fill the void. David Whitehouse from Cloudera did a great job explaining the role Hadoop plays in the overall Big Data stack, and how other platforms like Spark and Kafka fit in. He even got under the hood to explain how data is stored and gave some working demonstrations of scripts to pull various data together.

Disappointingly, there was very little “hands-on” to be seen. He did bring 20 USB drives containing a 5 GB Cloudera virtual instance, which were passed around the 200 or so members of the audience to copy to their local machines. But aside from that, it was pretty much a 4-hour lecture, and one that was certainly engaging for the first 3 hours. I really wish this workshop had been rethought to allow for more hands-on activity.

There were many other workshops I could have attended, including:

  • Intro to Data Visualization
  • Using Open Data with APIs
  • Intro to Python for Data Science
  • Adaptive Deep Learning for Vision and Language
  • Interactive Data Visualizations in R with Shiny and ggplot2
  • …quite a few others

From what I saw on the Twitter feed, there were mixed reviews on these as well. Some were praised, others were panned.

The Saturday morning keynotes

Saturday morning started with a packed house of maybe 600+ data scientists to hear the following four “celebrities” speak:

  • Ingo Mierswa, RapidMiner
  • Lukas Biewald, Crowdflower
  • the famous Kirk Borne, Booz Allen Hamilton
  • Stefan Karpinski, Julia Computing

All four were engaging. Of course, Kirk Borne knocked it out of the park. I was most pleasantly surprised by Stefan Karpinski, who spoke about the Julia programming language. Unless he was BS’ing up there, it’s a programming language especially suited for speedy data analytics and definitely worth exploring.

My recommendation for the organizers: In future sessions, please do go out of your way to provide a panel of speakers that’s just a bit more diverse than 4 white guys. We don’t have to look far. In my non-scientific scan of the attendees, they came in all shapes, sizes and colors. We could certainly stand to see something a little more representative in the keynote speakers.

The Saturday & Sunday sessions

The “meat” of the conference was the nearly 100 sessions to choose from over the weekend. A couple of quick observations, then some highlights.

Organizers were not prepared for the size of the crowd. You would have thought every data scientist in the world was in attendance. The section of the convention center to which this conference was confined was simply too small for the volume of people. Starting with the registration process Friday morning (long, slow lines you’d expect from a new Disney World ride), it was clear there was a logistics problem. There were coffee & bagels provided, but as one person observed, “Nothing sadder than seeing data scientists in front of an empty coffee urn”.

These sessions were first come, first served; I think pre-registration would have been very welcome here. Quite a few of the sessions were packed, with every seat taken and the floor filled with sitting participants trying to make do. In one of the sessions, an organizer had to kick about 20 attendees out because the room was over capacity. Once it became clear that sessions could fill up so quickly, attendees jostled to get into rooms with the fervor of Black Friday shoppers at the mall.

Scheduling was erratic and inconsistent. There were three sources of data for the schedule: the web site, a printed sheet handed out at registration, and a mobile app. Sometimes the three didn’t agree. I can certainly appreciate that last-minute changes do occur – but when something as fundamental as lunch changes, you know there’s room for improvement in the scheduling department.

The mobile app was awful for trying to find a session of interest. The conference had several major tracks: Disruptive Data, Big Data, Data Visualization, and Data Science for Good. You were free to pick sessions from any of the tracks – but the mobile app would only show you one track at a time and one room at a time. Trying to determine what’s on at 12:30pm? You had to navigate each track separately and click on each room. It could take five minutes to figure out where to go. One of the attendees was generous enough to create and share a Google Doc with a simple grid view of all the sessions. Sometimes, simple is better.

Finally, they need to build in buffer time between sessions (5-10 minutes) to allow travel time from room to room, as well as to let rooms clear out before attendees arrive for the next session. It got pretty ugly when 150 people tried to get into a room still packed from the session that had just let out.

Workshops were often just lectures. Don’t advertise something as a workshop if it’s just a speaker talking for an hour. You’re supposed to work at a workshop. If you don’t intend to make it a hands-on experience, fine. There’s certainly plenty to be learned from lectures as well. Just don’t call it a workshop.

Sessions of note

There were a few sessions that I do want to single out for praise.

  • Dhiana Deva (Spotify) gave a very energetic, down-to-earth tour of various machine learning techniques. Titled “Machine Learning for Everyone”, she certainly delivered on her promise, making the most of the 30 minutes she had and providing a great service to brand-new data scientists.
  • Allen Downey (Olin College) led a great hands-on workshop on “Bayesian Statistics Made Simple”. A full appreciation of Bayesian statistics has eluded me, but with his combination of Google Sheets and iPython notebooks, I now have a much deeper understanding of the subject (a minimal sketch of this kind of update follows this list). Next month, I will be reading his freely available book, “Think Bayes”.
  • Jared Lander (Lander Analytics) led my favorite workshop: “ggplot2: from scratch to compelling graphs”. My fingers are still sore from keeping up.  The best quote from the conference came from him as he took ggplot2 requests from the audience: “It’s going to be disgusting, but let’s do it. Oh God, it became a big pile of vomit!”
  • Ted Kwartler (Liberty Mutual Insurance) had a very interesting session, “Introduction to Text Mining in R”. With Ted, you felt like you’d sidled up to an experienced coworker who tosses away the textbook and shows you how it’s really done. He was generous enough to share his R code, cultivated over years of battle-tested experience.
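
On that Bayes point: here is a minimal sketch of the kind of table-method update Downey teaches, applied to the classic “cookie problem” from Think Bayes (the bowl counts come from that example; the code itself is my own reconstruction, not his workshop material). Two bowls of cookies, we draw a vanilla cookie, and we ask which bowl it likely came from:

```python
# Classic cookie problem: Bowl 1 holds 30 vanilla / 10 chocolate cookies,
# Bowl 2 holds 20 of each. We drew a vanilla cookie; which bowl was it from?
priors = {"Bowl 1": 0.5, "Bowl 2": 0.5}               # equal belief in each bowl
likelihoods = {"Bowl 1": 30 / 40, "Bowl 2": 20 / 40}  # P(vanilla | bowl)

# Bayes' rule as a table: multiply prior by likelihood, then normalize
unnormalized = {b: priors[b] * likelihoods[b] for b in priors}
total = sum(unnormalized.values())
posteriors = {b: p / total for b, p in unnormalized.items()}

print(posteriors)  # {'Bowl 1': 0.6, 'Bowl 2': 0.4}
```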

The Career Fair

I have a future blog post coming soon in which I will get more personal and tell my story. But for the purposes of this post: I recently left my job as CIO of a large distribution company (my place of employment for 19 years) and am in the process of selling my house in New York and relocating to South Carolina. I’m not actively seeking employment – but I will be once my move is permanent (hopefully no later than the end of the summer).

I dropped in at the Career Fair to do a little reconnaissance on what companies are looking for: skills, experience, location of employees, etc. What I discovered there was not all that encouraging.

First, I observed that the booths were divided roughly evenly between educational programs looking to take on students (like Rutgers, MIT, General Assembly) and corporate recruiters. I’m not planning to plunk down $20k-$40k on a degree program, so I passed on those. The remaining 10 tables were staffed by corporations like Johnson & Johnson, Nielsen, McKinsey & Company and CVS. The stories I heard there were all very similar:

  • Go to our web site. You will see exactly what positions we are looking for
  • A minimum of a Master’s or PhD in data analysis
  • Need to live in or near a big city (Boston, NY, Washington, Chicago)

It also appeared that Data Architects (those versed in Hadoop, Spark, storage, etc.) were more in demand than the data scientists who live in R or Python notebooks.

I didn’t walk away from the Career Fair with any extraordinary level of confidence. But then again, it was a very, very small sample of companies that are looking to hire – and you know what they say about extrapolating from small samples.

Overall

My overall impressions were the following:

  • This is a conference that’s experiencing growing pains. Hopefully, they learn and improve with each iteration
  • QA needs to be performed on the sessions in advance, and expected standards of delivery should be spelled out (e.g. workshops should be interactive; presentations should be concise, with slides provided in advance)
  • I would have benefitted from more organized assistance with networking and socialization. A networking beer party in a crowded dim room that’s too loud to have any conversation doesn’t work for me. It’d work just fine if I were showing up with 5 coworkers. But I knew absolutely no one there – so a better environment for breaking the ice would have been appreciated
  • Guidance on level of expertise needed at various sessions. Maybe tracks labeled: Beginner, Intermediate, Expert
  • Scheduling needs to be fixed and the app overhauled

ODSC has a lot of potential, and I may attend again next year and hopefully walk away feeling like I got my money’s worth.

Accomplishments – Apr 2016

April was a light month for me. It marked the end of my 19-year career with my previous employer, as well as the beginning of preparations to sell my house in NY (and an eventual relocation to South Carolina – and hopefully, a new career in data science).

But my personal story is for another blog, another week. Despite the “real world” distractions, I was able to move forward a little bit in my self-driven DS education.

Completed Items

Coursera – Machine Learning: Regression

This was the second course in the 6-part Machine Learning series developed by the University of Washington. I gave the first class (Machine Learning Foundations) a very positive review last month. Where that class was a hands-on, high-level overview of all the topics covered in the specialization, this course gets very detailed on various regression topics, including:

  • Simple and multiple linear regression
  • Gradient descent vs. closed-form solutions for minimizing the residual sum of squares (a quick sketch of the contrast follows this list)
  • Feature selection using ridge regression and lasso
  • Model selection using validation sets and cross-validation
  • Non-parametric solutions like nearest neighbors and kernel regression
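
As an aside on that gradient descent bullet, the contrast is easy to see in code. Below is a rough sketch on synthetic data (my own toy example, not a course assignment): the same simple linear model fit once via the closed-form normal equations and once via gradient descent on the residual sum of squares.

```python
import numpy as np

# Synthetic data: y = 3 + 2x + noise
np.random.seed(0)
x = np.random.uniform(0, 10, 100)
y = 3.0 + 2.0 * x + np.random.normal(0, 1, 100)
X = np.column_stack([np.ones_like(x), x])  # prepend an intercept column

# 1) Closed form: solve the normal equations (X^T X) w = X^T y,
#    which minimizes the residual sum of squares exactly.
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# 2) Gradient descent on the same RSS objective; the step size must be
#    small here because the features aren't scaled.
w = np.zeros(2)
step = 1e-4
for _ in range(10_000):
    gradient = -2 * X.T @ (y - X @ w)  # gradient of the RSS
    w -= step * gradient

print(w_closed)  # roughly [3, 2]
print(w)         # should closely match the closed-form answer
```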

This was far from a walk in the park. The lectures, given exclusively by Emily Fox, were some of the clearest I’ve seen on the subject. Her explanation of lasso, and how it causes coefficients to drop to zero, was just about the best treatment of the topic I’ve encountered. She provided video simulations, optional math derivations (hint: don’t let them be optional for you), and annotated lecture notes for each week.
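
Her point about lasso is easy to demonstrate with a toy example of my own (using scikit-learn rather than the course’s tooling): as the L1 penalty alpha grows, the coefficients of the irrelevant features are driven exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Five features, but only the first two actually influence y
np.random.seed(1)
X = np.random.normal(size=(200, 5))
y = 4.0 * X[:, 0] + 2.0 * X[:, 1] + np.random.normal(size=200)

for alpha in [0.01, 0.1, 1.0]:
    model = Lasso(alpha=alpha).fit(X, y)
    print(alpha, np.round(model.coef_, 2))
# As alpha grows, the three noise coefficients hit exactly 0, and the
# two real coefficients shrink toward 0 as well.
```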

I thought the quiz questions were non-trivial but very relevant to the material covered. The self-guided programming assignments were delivered as iPython Notebooks and required you to develop your own functions implementing the algorithms discussed in the lectures; while challenging, they were not impossible to create.

The only drawback to this otherwise excellent course was the default use of GraphLab Create, a commercial ML library that’s an alternative to scikit-learn and pandas (you get to use it for free for up to 1 year). While you are free to use standard, open libraries for the programming assignments, if you’re not familiar with them (like me), you’re somewhat forced to use GraphLab Create. It didn’t detract from the knowledge I gained about the ML techniques, but in hindsight I wish I could have gained knowledge and experience with the standard Python libraries in the process.

I’ve moved on to the third course in this series, Machine Learning: Classification, and now, with a little more Python confidence under my belt, I’m going to attempt the assignments using standard Python libraries.
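
For what it’s worth, here is the sort of workflow I have in mind, sketched on a scikit-learn toy dataset (the course’s actual data and assignments will certainly differ):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a bundled toy classification dataset and hold out a test set
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

# Scale the features, then fit a logistic regression classifier
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```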

Natural Language Processing YouTube videos

If you’ve got a little free time on your hands, you can digest a series of 102 videos on the topic of Natural Language Processing given by Stanford professors Dan Jurafsky & Chris Manning. This very specific offshoot of machine learning and statistical techniques is the subject of the JHU Data Science Specialization Capstone Project that I still need to complete. So while I wait for that session to open at the end of May, I spent a little time here familiarizing myself with the topic.

Wow – what a deep and rich subject area this is. The things we take for granted: Google searches, spellcheck, next-word anticipation, Siri question translation… all of these have roots in NLP, and it was a fascinating trip through basic techniques like word-frequency measurements on up to n-grams and smoothing. I only made it through video #32, but I feel a lot better versed on this subject and have bookmarked this YouTube playlist as a go-to source of information on NLP.
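
To give a flavor of that early material, here is a tiny sketch of two of those techniques, bigram counts and add-one (Laplace) smoothing; the toy corpus is my own, not from the videos.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size

def bigram_prob(w1, w2):
    # P(w2 | w1) with add-one smoothing: (count(w1 w2) + 1) / (count(w1) + V)
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

print(bigram_prob("the", "cat"))  # a seen bigram: relatively high
print(bigram_prob("the", "dog"))  # an unseen bigram: small but nonzero
```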

The Theory That Would Not Die

How Bayes’ Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy, by Sharon Bertsch McGrayne

This was an interesting read I picked up with an Amazon Kindle gift card. I was on the hunt for a good description of how Bayesian inference is performed – and I’m still on that hunt after completing this book.

This is not a mathematics book; it’s a history book – and a fairly interesting one at that. It was a welcome break from dealing with program libraries and mathematical derivations. Instead, you see the evolution of a mathematical idea over the course of centuries, and how that idea had some very real-world implications.

You will need to get past the first two chapters before it becomes a page-turner, but the wait is well worth it. You see where Bayesian concepts competed with other ideas (and lost out just as often as they won). You also read how progress is made, as well as impeded, by the personalities of the talents involved. Based on this book, you’d think the great mathematicians were brilliant jerks.

In all, you won’t find any but the lightest of mathematical equations, but you will get an appreciation of why we need data science in the world.

Abandoned Items

Data Science from Scratch: First Principles with Python

Data Science from Scratch was one of the first books I picked up and reviewed back in August 2015. However, being so new to the subject and having had zero exposure to Python, I had to shelve this book for a later date.

Well, April 2016 was that later date, and while I no longer felt the subject matter or choice of programming language to be an impediment, I still abandoned this book halfway through.

For me, it failed on two levels:

  • Expanding my knowledge of Python
  • Expanding my knowledge of Data Science

Given that those two items are contained in the title of the book, that’s a serious issue.

From a Python perspective, I understand that this wasn’t a Python book. I’ve got other books for that. But when providing code examples for data analysis concepts, the author chose to implement methods from scratch and ignore the well-established libraries used in the real world (like scikit-learn, numpy or pandas). Yes, “from scratch” is also in the title, and on that point the book held true. But take for example the chapter on Statistics: instead of using the well-established functions for common statistical calculations, you create your own methods to calculate mean, variance, correlation, etc. You build new methods on top of old methods, and before you know it, you’re creating your own library of functions. Only in the “further references” section at the end of the chapter do you see a suggestion to go look up the numpy documentation.
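
To give a feel for the difference, here is a rough side-by-side in the book’s style (my own paraphrase, not its actual code) versus the numpy one-liners the chapter only points to at the end:

```python
import numpy as np

# "From scratch": build the statistics yourself
def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)  # sample variance

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(mean(data), variance(data))           # 5.0  4.571...

# ...versus the established library calls:
print(np.mean(data), np.var(data, ddof=1))  # same answers
```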

The chapters were very short, and as such could not convey the data science concepts very well. And with half of each chapter spent developing your own code for algorithms already established elsewhere, there is not a lot of room left to properly discuss topics like gradient descent, Bayes’ theorem or hypothesis testing. You are better served elsewhere on each of these topics.

Overall, I did not see a lot of value in what this book offers.

Looking ahead to May

My primary goal for May is to put my house up for sale. There’s a lot of work to be done in order to make it sell-worthy, but somewhere I hope to be working on the following tracks:

  • An Introduction to Statistical Learning with Applications in R – hit Chapters 6-7
  • Udemy – Taming Big Data with Apache Spark (started in April, will finish up in May)
  • Coursera – Machine Learning – 03 – Classification – The third course in the 6 course series (discussed above)
  • Getting Started with Python Data Analysis – a book that covers the numpy, matplotlib, pandas and scikit-learn libraries
  • Khan Academy – Linear Algebra track – I feel like I could use some expertise in this area