March Interlude – Data Science Podcasts I Follow

I commute (via car) from NY to South Carolina on a regular basis. That 13-hour drive is a great opportunity to listen to new music, get lost in an audiobook or be entertained & informed by podcasts.

The only app I’ve ever used for podcast management is Podcast Addict. I enter the URLs of the podcasts I want to follow and it notifies me when new episodes are available. Best of all, I can download those podcasts to my phone over a wifi connection and listen to them offline while I travel.

Below are the podcasts I follow religiously, having not missed an episode for months. I list these in no particular order. Click on the podcast logo to be taken to their site.


Back in July I wrote about an incredible book called The Data Science Handbook, a captivating collection of interviews with 25 data scientists from a variety of backgrounds who are out there, every day, doing it!

Renee Teate has implemented a podcast version of this book (although she has no association with it to my knowledge). To date, she’s conducted 7 interviews with a diverse group of data scientists and she’s just recently “complained” that she’s nowhere near done with the original list of individuals she brainstormed when she first developed this idea. One of the most captivating interviews was with Safia Abdalla (@captainsafia), a student at Northwestern University who’s already leading Python Data Science groups and speaking (inter)nationally at conferences!

Podcast frequency: Every other week

Follow Renee on Twitter (@BecomingDataSci).



With Data Skeptic, you really get two podcasts in one. On one hand, you get mini-episodes – brief, non-technical discussions between Kyle (the “expert”) and Linh Da (the “non-expert”) on pure Data Science topics like R-squared, multiple regression, the Bonferroni Correction, etc. And I truly mean they are non-technical. Kyle explains these concepts with metaphors, analogies, similes and perhaps one day, in song. Sometimes it’s refreshing to pull your head out of a book of math equations and hear something explained like you’re a 5-year-old (or at least like you’re an undergraduate student).

The other episodes are longer in nature and feature Kyle interviewing various experts on Data-Science-in-the-world topics. Recent topics include: “Models of Mental Simulation”, “Auditing Algorithms” and one of my favorite episodes: “Detecting Pseudo-profound BS”.

Podcast frequency:  Weekly, alternating between mini-episodes and longer length episodes.



Partially Derivative is the frat party of Data Science podcasts. With 25% of each episode devoted to discussing specialty beers and another 25% devoted to begging listeners for beer donations, the remaining 50% involves high-spirited discussions about Data Science in the news, ranging from controversial journal publications to politics.

The three hosts (Jonathan, Vidya and Chris) are founders of the online analytics engine, Popily, so they’ve been out there risking their necks to try to make a living off of Data Science. The least you can do is drop in and listen to their data science bar conversations.

Podcast frequency: Whenever they feel like it: sometimes weekly, sometimes biweekly, sometimes they miss February.



On the more serious side, Katie and Ben of Linear Digressions cover data science topics (usually requested by listeners) in more depth. Each episode focuses on a single topic such as Neural Nets, p-Values and natural language translation (Yiddish, anyone?).

Katie is the anchor of this podcast and, given her background in experimental physics (having worked at the Large Hadron Collider at CERN), I find her expansion into data science to be especially interesting. Recently, Katie announced that she will be spending significant time volunteering on the White House’s recent National Cancer Initiative. I applaud her efforts and accomplishments.

Ben seems like a nice guy too.

Podcast frequency: Weekly (sometimes more than 1 episode in a given week)



Not So Standard Deviations is a podcast by Roger Peng (of Coursera’s JHU Data Science Specialization fame) and Hilary Parker (Senior Data Analyst at Etsy) where they discuss a wide variety of miscellaneous, unrelated topics.

It’s a light-hearted conversation between the two, where they dive into “controversial” topics like the deficiencies of MS Excel as a data science tool, why Roger doesn’t use Netflix, and the R package cat, a God-as-my-witness working R package devoted to furry & purry cat functions.

Podcast Frequency: Every other week



The O’Reilly Data Show Podcast dives more into the Big Data side of things, covering the various architectures and movers & players in the terabyte world.

Each episode features a lengthy interview with a key player in the big data world. Recent episodes spotlighted M.C. Srivas, co-founder of MapR (which holds various speed records for data querying), Fang Yu, co-founder and CTO of DataVisor about using Apache Spark to predict security attack vectors in real-time and Eric Colson, former VP of data science and engineering at Netflix, about the need for human input in the data-science world.

Podcast Frequency: Every other week



Signal is NOT a Data Science podcast and I do not recall how I ended up subscribing to it. But it is among my favorites.

It focuses exclusively on the drug industry: drug trials, testing procedures, pricing, availability, competition, business acquisitions, investments, politics, etc.

This is the most professionally produced podcast on this list and is an absolutely riveting look at what goes on in the world of prescription drugs. The most recent episode, “How much are we willing to pay for cures” takes you through the history of treatments for hepatitis C, which is now curable with a high success rate, but at a price.

Frequency: Every other week.


Do you have any podcasts among your favorites that you’d recommend? I’m eager to hear your suggestions below in the comments.

Accomplishments – Feb 2016

It was a good month. A very good month. A month where I’m actually sitting down on the last day of the month to summarize my activities.

9 months. Bite-sized chunks. And I’m amazed at how far I’ve come from just knowing about Data Science and Big Data as buzzwords that a technology executive should know something about.

But what a deep, rich topic this is! And you couldn’t ask for a more open community, seemingly bending over backwards to bring more into the fold. I can name a large number of influences in my journey – but I’ll save that for a later post in March.

Now, onto the Good, the OK and the Ugly for this month:

The Good

Coursera – Developing Data Products

After a couple of weak courses, this 9th and final formal course in the Johns Hopkins University Data Science Specialization was one of the most fun. It was a welcome, creative break from the heavily mathematical and theoretical Regression and Machine Learning courses.

In this course, you learn how to take all of the analytics you developed in R and share it with the universe in an interactive, browser-based format using the Shiny development platform. Go ahead and take a look at my project. It’s a simple, interactive graph, constructed in a few hours (and as a novice, at that). Here’s an absolutely amazing Shiny application developed by some guy in New Zealand, if you want to see an example of the full potential of this platform.

You also learn how to create presentations using Slidify and RStudio Presenter, where you transform your R code into PowerPoint-like slides which can be easily regenerated if you alter your R analysis. My very brief “pitch” presentation for the small app I created can be found on Rpubs.com.

Content-wise, this was a much lighter course. I learned more from the tutorials on the various application and package websites than from the course lectures. But this course did expose me to a large number of tools that can be used to share whatever data insights I might be lucky to find in the future. And the final course project to get my feet wet in those technologies was worth the price of admission!

Now… onto the final step of this JHU Specialization – the Capstone Project!

Statistics Done Wrong

What a fun, little book this was. It’s like having a big brother who, after you’ve taken many semesters of hard-core statistics, pulls you aside, gives you a cold beer and tells it to you like it is. Alex Reinhart should be given a lot of credit for surveying the field of science and analysis, pointing out the traps that the professionals fall into and what you should do to avoid them.

Remember that p-value thing you spent a lot of time worrying about? This book gives you the straight scoop on this and other subjects that all scientists and statisticians should know before publicizing results. It’s light on the math (the text assumes you’ve already done the time with the equations elsewhere) but it is heavy on the real-world pitfalls of interpreting the math incorrectly and shares stories from the field where it’s bitten some research groups in a tender spot.

The chapters are short and bite-sized, but the value-per-page ratio is among the best of any book I’ve read so far. You should grab and digest this work – but only after you’ve put your time in with other statistics books and courses elsewhere. This is something that needs to be read after you’ve “learned” statistics, to help iron out the wrinkles in your knowledge.

Stanford’s Statistical Learning

The Stanford Coursera course on Machine Learning, by Andrew Ng, is famous for launching the Coursera platform and was one of the first courses I took. But there are these other guys at Stanford who may have one-upped Mr. Ng with their own edX course.

Starring the “Click and Clack” of Machine Learning (Trevor Hastie and Rob Tibshirani), these loveable guys take you on a guided tour of topics like regression, classification methods, cross-validation, trees and random forests, support vector machines and the like.

This 9-week course comes with an accompanying textbook (also free), An Introduction to Statistical Learning with Applications in R (6th printing, 2015), which is rich in examples and includes lab exercises in R and exercises with each chapter.

You don’t need to register for the course to see the videos. A full list of lectures is available online already. However, the course offers challenging quizzes at the end of each lecture (1 or 2 questions each). The goal is to pass with 50% (which is more challenging than you think), but the questions require thought to extend what was learned in that lesson and you will learn much from getting a question wrong. The course also has a very active discussion forum that is far more enriching than any forum in the courses I’ve taken to date.

I’ve made it through Week 7 (with two more weeks to go in March). While I’ve attempted to read the textbook along with the lecture videos, I plan to go back through the text a second time once this course is over and work through all of the labs and exercises again. This is one course I’ll be referring to time and time again.

Learning Python

After 6 months of exclusively working in R, I decided I wanted to be bilingual and gain knowledge about Python. I’ll refer you to the Programming section of my Journey’s Roadmap to refresh you on the little I’ve done in Python so far. But with an Amazon gift card I got for Christmas, I took the plunge on the 1,500-page tome, Learning Python, by Mark Lutz. I’m proud to say that I’ve made it through 500 pages in February and am on track to finish it sometime in April (assuming I skip over the index pages at the end).

This definitely shouldn’t be the first thing you dive into to learn Python. Take a gentle tutorial or two first.  Read Python for Informatics to get a handle on the basics. Maybe write a program or two on your own to get a feel for the language. Then, definitely acquire this book.

It does not take the “learn by doing projects” approach that Hello! Python failed miserably at. It really does read like a textbook, covering very narrow topics in each of its many, many chapters, punctuated with a large number of small, demonstrative examples. But you will need a book like this to fill in the gaps that the other books leave. See my complaint about how **kwargs was introduced in my Hello! Python diatribe. Well, after Chapter 18 of “Learning Python” I think I understand what it’s all about. I certainly won’t remember **kwargs six months from now, but when I encounter it again in code I may be reviewing, I’ll know exactly where I’ll be going for a refresher.
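For anyone else who trips over **kwargs, here is a minimal sketch of the idea. It is not taken from the book; the function name `describe` and the sample values are made up purely for illustration. Extra positional arguments are collected into a tuple (`*args`), extra keyword arguments are collected into a dict (`**kwargs`), and the same `**` syntax can unpack a dict at the call site.

```python
def describe(title, *args, **kwargs):
    """Return a one-line summary of how the arguments were received.

    `describe` is a hypothetical helper used only for illustration.
    """
    # args is a tuple holding any extra positional arguments.
    # kwargs is a dict holding any extra keyword arguments.
    parts = [title]
    parts.extend(str(value) for value in args)
    # Sort the keys so the output does not depend on call order.
    parts.extend(f"{key}={value}" for key, value in sorted(kwargs.items()))
    return ", ".join(parts)


# Keyword arguments passed directly are gathered into kwargs.
summary = describe("book", 1500, author="Lutz", chapters=41)

# The same ** syntax unpacks an existing dict into keyword arguments.
options = {"author": "Lutz", "chapters": 41}
same_summary = describe("book", 1500, **options)

print(summary)
```

Both calls should produce the same string, which is the whole point: `**` gathers keywords inside the function definition and scatters them at the call.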

The book may seem slow-paced and repetitive at first, but that was fine with me. Despite claims I read about Python’s simplicity being its strength, I’ve come to learn that it is indeed a very rich, complex, powerful language. This text, with on-topic Q&A’s at the end of each chapter, appears to guide you through this language better than most books out there.

I’ll follow up next month, as I plan to be up through Chapter 30 (out of 41) of this 10 lb weapon of knowledge.

The OK

Nothing to report in this category this month. The activities I pursued were either excellent – or clunkers. See below.

The Ugly

edX & Columbia University – Machine Learning for Data Science and Analytics

Columbia University announced a three-course “X-Series” in Data Science several months back. I took the first course, “Statistical Thinking for Data Science and Analytics”, and reviewed it negatively last month.

I took the second course in the series, “Machine Learning for Data Science and Analytics”, hoping for a better outcome. Instead, it slid even further backwards in terms of meeting expectations.

Here’s what I wrote on a message board in a thread where students were expressing how baffled they were by quiz questions that were completely irrelevant to the covered material (which was itself irrelevant to Machine Learning):

I appreciate all of the comments in this thread. At the end of the day, you have to ask yourself “Are you learning anything?”. And after having completed “Statistical Thinking for Data Science and Analytics” and made it through the end of week 3 of “Machine Learning for Data Science and Analytics” I can honestly say “no” and will be discontinuing this series from Columbia University.

I appreciate the hard work they put into this, but it’s apparent they do not have a comprehensive Data Science curriculum. This series is a mashup of lectures where they get some people from the math department, some people from the computer science department and some people from the applied sciences (i.e. biology), string together some unrelated lectures and call it a “Data Science” program.

This is not coming from someone who is bitter about the homework grades (I’ve scored very high on the practice problems), but after watching meandering lectures on NP-hard problems and then struggling with the language and bugs of the assignments, I’m sitting here realizing that I’m not moving forward in my interest in Data Science.

Aside from the material, which was a patchwork of irrelevant subjects, there were too many bugs in the quiz assignments (e.g. you had to type in 4.63, as 4.6 would be marked wrong, without any indication of the expected precision in the answer). In fact, I just checked back after having abandoned this course two weeks ago and see this announcement:

Dear Students,

We have removed homework 4e, Ensemble Classifiers, from grading. It will not count towards your overall grade in the course. In addition, we have changed the passing grade for this course to be 50% or higher.

Thank you for all your comments and feedback,

Course Team.

Sadly, similar announcements were posted every week. Stay away from this program.

Mathematical Monk – Machine Learning

The 19-yr-old Data Science wunderkind, Safia Abdalla, turned me on to the Mathematical Monk, an enigmatic post-doc at Duke University who posted 160 (!) lecture videos on YouTube on various topics in Machine Learning.

Using a style reminiscent of Khan Academy, each lecture is 10 to 15 minutes in length, given in a smooth voice and utilizing a colorful digital pen. Unfortunately, those are its only strengths.

I watched and took notes on the first 15 videos and found them to be unnecessarily confusing, far too heavy on the math theory (no practical examples covered), and very inconsistent in the variable notation. As a former Physics teacher with a degree in Mathematics, I consider inconsistent variables the biggest sin you can commit in teaching math. What is X vs x? Or X vs X vs x vs x. Does x represent a single value or a vector of values? It changes from lecture to lecture and sometimes within the lecture itself.

I know this guy is very smart, and far more of an expert in the field than I. But even so, early on he muffed the clichéd Cancer Test table, stating each square represented whether a test should be given, rather than the rates at which the test gives false positives and negatives.
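To make the distinction concrete, here is a minimal sketch of how the four squares of such a table turn into error rates. The counts are invented for illustration only (they come from neither the lecture nor any real test), and the function name `error_rates` is hypothetical.

```python
def error_rates(true_pos, false_neg, false_pos, true_neg):
    """Compute error rates from the four cells of a 2x2 test table.

    The cells are counts of people: rows are the true condition,
    columns are the test result.
    """
    sick = true_pos + false_neg        # everyone who has the disease
    healthy = false_pos + true_neg     # everyone who does not
    return {
        # Share of healthy people the test wrongly flags.
        "false_positive_rate": false_pos / healthy,
        # Share of sick people the test misses.
        "false_negative_rate": false_neg / sick,
        # Share of positive results that are actually sick.
        "positive_predictive_value": true_pos / (true_pos + false_pos),
    }


# Invented counts: 100 sick and 900 healthy people out of 1,000 tested.
rates = error_rates(true_pos=90, false_neg=10, false_pos=45, true_neg=855)

for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Each square is a count of outcomes, and the rates fall out by dividing within a row (or, for the predictive value, within a column). None of the squares says whether a test should be given.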

Based on the quality of those first 15 videos, I did not proceed with the remaining 145. Maybe they got better over time?

Coursera – University of Washington – Data Science at Scale

Coursera offers various “Specializations” – groupings of courses given by the same University on a given topic (for example, the Johns Hopkins Data Science Specialization). You pre-pay up front at a discount and you have one year to complete all of the courses in the specialization, with the option to retake courses if your schedule forces you to do so. That flexibility and discount really work in favor of those students who are committed to completing it.

I’ve already expounded at length on how disappointing the three courses in the Data Science at Scale Specialization were. But I did watch all of the videos and complete the assignments. And I may have learned a thing or two – even if it fell below my expectations.

However, there is one final course in this specialization – the Capstone Project. So far, it’s been delayed for 4 consecutive months and there’s no indication that it will ever be released. I’ve asked for a refund and so far Coursera has refused.


Just initiated

I’ll write more about these at the end of March!