As a reminder, a one-page summary of all the courses, books & videos I’ve reviewed in the past year can be found on my Journey Roadmap page.
South Carolina is hot. We went to a “fall festival” in late September where the autumn temperature peaked at 100 degrees with nary a breeze to be found. For a few moments, I questioned my decision to relocate South, but I’m sure the payoff will occur in the months ahead.
I hit a big milestone this month in that my sabbatical from employment has come to an end and I have begun a job search in earnest. Although I’ve submitted my resume to a number of job postings, I’ve not had any bites yet – even for interviews. I’ve observed that Data Science job openings are in abundance around big cities (NYC, Boston, Washington DC, Chicago, San Francisco). I’m not seeing a ton of opportunity here in South Carolina (Greenville area), but I’m looking!
So here’s how I passed the time in September:
edX – Berkeley U – CS110x Big Data Analysis with Apache Spark
I’ve been pursuing 3 main areas of study:
- Apache Spark
I learned early on that the areas of Data Science and Big Data are very different, although they are often used interchangeably in mass-media articles. Big Data is the technical feat of handling massive amounts of data (storing and retrieving), in amounts more than any single computer or disk can handle. Data Science is the set of techniques for deriving useful and actionable information from data (of any size – not just large amounts).
Apache Spark is a very interesting bridge between the two. Like Hadoop, it is an architecture for handling data spread out across a cluster of many machines (Big Data), and it comes with a rich library of algorithms (Data Science) optimized for running in parallel over these machines and combining the results. You can work in Apache Spark using Java, Scala (a Java-like language) or Python (PySpark).
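To make the "run in parallel, then combine" idea concrete, here is a minimal sketch in plain Python. It is not Spark code and does not use the Spark API; the function names (`partition`, `partial_stats`, `combine`, `distributed_mean`) are hypothetical, and the "machines" are just chunks of a list. It only illustrates the pattern of computing a small partial result per partition and merging those results.

```python
from functools import reduce


def partition(data, num_partitions):
    """Split a list into roughly equal chunks, standing in for data spread across machines."""
    if not data:
        raise ValueError("data must not be empty")
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_stats(chunk):
    """Work done independently on each partition: a (count, sum) pair."""
    return (len(chunk), sum(chunk))


def combine(a, b):
    """Merge two partial results; addition is associative, so merge order does not matter."""
    return (a[0] + b[0], a[1] + b[1])


def distributed_mean(data, num_partitions=4):
    """Compute a mean by mapping over partitions and reducing the partial results."""
    chunks = partition(data, num_partitions)
    partials = map(partial_stats, chunks)     # the "map" step, one result per partition
    count, total = reduce(combine, partials)  # the "reduce" step, merging partial results
    return total / count


print(distributed_mean(list(range(1, 101))))
```

The key design point is that each partition returns something small (a count and a sum) rather than its raw data, so only tiny results need to travel between machines.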
This second course, provided by Berkeley University on the edX platform, dives into Machine Learning algorithms using Apache Spark. As in the first course (CS105x Introduction to Apache Spark), the lecture videos were brief and not very in-depth. However, the meat & potatoes were the various interactive labs where 90% of the learning occurs. These labs were extraordinary. They provided links to Spark API documentation where appropriate, walked you through simple examples before sending you on your own to code and provided many checkpoints to make sure you were doing things correctly.
Hosted online with the Community Edition of DataBricks (thank you!), these labs used real-world data and many different scenarios to work through. Each lab took the better part of a week to complete, and I did turn to the very active discussion forum for questions and hints. What I liked best about the labs was the repetition of major activities. For example, I would slog through the construction of a Machine Learning Pipeline to see how effective a Classification Tree model is and then be asked to do the same thing all over again for a RandomForest model. With each repetition, you gain confidence and understanding.
You should have a solid foundation in Python first before attempting this course. Familiarity with lambda functions, data types like lists and arrays and basic numpy array functions is required. You will also learn about special Apache Spark functionality as the machine learning library used is not the same as sklearn.
I had decided back in July that I wanted to go through this text again, taking notes like I did back in my college days, making sure I understood every page before turning to the next. I worked through all of the lab exercises, adding the various R packages and functions to my ever-growing list of Mnemosyne cards. I attempted every end-of-chapter assignment and gained a lot of confidence in my ability to execute various Machine Learning techniques off of the top of my head.
Once again, I am indebted to Amir Sadoughi who has published his unofficial solutions set to this book’s exercises. I checked my answers against his, found some techniques to add to my repertoire and even challenged a couple of his approaches!
This text is for those who enjoy the math aspect of Data Science as well as the data analytics part, so be prepared to see lots of equations – supported with paragraphs of explanatory text and numerous figures. This is a college-level text, not a “Data Science for Dummies” book. But I truly do believe that this is the level you should aspire to reach.
Revisited Coursera & UoW Machine Learning Courses
Back in August, I wrote about the 4th of 6 courses in the Python-based Coursera & University of Washington: Machine Learning Specialization. Towards the end of the course, it was announced that Apple had purchased Turi, the Machine Learning platform founded by one of the course’s instructors and used throughout the specialization.
Since then, to the disappointment of the students pursuing this series of courses, the series has come to a halt and not much formal communication has been forthcoming regarding the future (namely the development of courses 5 & 6).
I took advantage of this lull in the action to review Courses 2-4 (Course 1 was just a high-level overview) and spend time reviewing the lecture notes and lab assignments. As with the Statistical Learning book above, the fast pace of the course with weekly deadlines has you sometimes focusing more on getting the assignments done than really appreciating the algorithms and methodology.
I worked through each lab again where the focus was primarily on building from scratch many of the routines covered in the lectures. For example, you build by hand, piece by piece, the functions that go into a gradient descent algorithm that’s used to find the optimal coefficients in linear regression.
You won’t use SciKit Learn or pandas here. Instead of pandas, you use Turi’s Graphlab SFrame architecture to do most of the data loading and some of the machine learning routines. But most of the time, you are using standard Python to build the ML algorithms yourself in a manner that really helps tie the theory to the computer code.
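To give a flavor of that build-it-yourself style, here is a minimal sketch of gradient descent for simple linear regression in standard Python. It is not the course's lab code: the function names, the step size, the tolerance and the toy data are all assumptions chosen for illustration. The loop repeatedly nudges the intercept and slope against the gradient of the residual sum of squares until the gradient is nearly zero.

```python
def predict(intercept, slope, x):
    """Prediction of a one-feature linear model."""
    return intercept + slope * x


def gradient(intercept, slope, xs, ys):
    """Gradient of the residual sum of squares with respect to (intercept, slope)."""
    errors = [predict(intercept, slope, x) - y for x, y in zip(xs, ys)]
    d_intercept = 2 * sum(errors)
    d_slope = 2 * sum(e * x for e, x in zip(errors, xs))
    return d_intercept, d_slope


def gradient_descent(xs, ys, step_size=0.01, tolerance=1e-8, max_iterations=100_000):
    """Step downhill until the gradient magnitude falls below the tolerance."""
    intercept, slope = 0.0, 0.0
    for _ in range(max_iterations):
        d_intercept, d_slope = gradient(intercept, slope, xs, ys)
        if (d_intercept ** 2 + d_slope ** 2) ** 0.5 < tolerance:
            break
        intercept -= step_size * d_intercept
        slope -= step_size * d_slope
    return intercept, slope


# Toy data generated from y = 1 + 2x (an assumption for illustration only)
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
print(gradient_descent(xs, ys))
```

The step size matters: too large and the updates overshoot and diverge, too small and convergence takes many iterations. The value used here was picked to suit this toy data and would need tuning for anything else.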
I’m still hoping that this series continues – it has been one of the better ones I’ve encountered so far.
This month, I was able to work through only half of this 440-page book by Sebastian Raschka. The focus so far has been on SciKit Learn and the many machine learning functions and supporting routines contained therein. I thought the book got off to a rough start with implementing Python classes for very early (like 1960s early) attempts at machine learning such as the Perceptron and Adaptive Linear Neuron models. Once I got past this chapter, I saw that it was primarily there for historical purposes and all of the effort spent understanding the pages of code was really not necessary.
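For readers curious what one of these early models looks like as a Python class, here is a minimal perceptron sketch in standard Python. It is not the book's implementation; the class layout, parameter names and the toy AND-gate data are assumptions for illustration. The model predicts +1 or -1 from a weighted sum and adjusts its weights only when it misclassifies a sample.

```python
class Perceptron:
    """A bare-bones perceptron classifier for labels -1 and +1."""

    def __init__(self, learning_rate=0.1, epochs=10):
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.weights = []
        self.bias = 0.0

    def net_input(self, x):
        """Weighted sum of the features plus the bias."""
        return sum(w * xi for w, xi in zip(self.weights, x)) + self.bias

    def predict(self, x):
        """Threshold the net input at zero."""
        return 1 if self.net_input(x) >= 0.0 else -1

    def fit(self, X, y):
        """Sweep over the data, updating weights on each misclassified sample."""
        self.weights = [0.0] * len(X[0])
        self.bias = 0.0
        for _ in range(self.epochs):
            errors = 0
            for x, target in zip(X, y):
                update = self.learning_rate * (target - self.predict(x))
                if update != 0.0:
                    self.weights = [w + update * xi for w, xi in zip(self.weights, x)]
                    self.bias += update
                    errors += 1
            if errors == 0:  # every sample classified correctly, stop early
                break
        return self


# Toy, linearly separable data: the logical AND of two inputs
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [-1, -1, -1, 1]
model = Perceptron(learning_rate=1.0, epochs=20).fit(X, y)
print([model.predict(x) for x in X])
```

A perceptron of this kind only converges when the classes can be separated by a straight line, which is one reason these early models are mostly of historical interest.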
After this minor speedbump, this book really becomes a tour-de-force review of the sklearn package. Using a variety of real-world datasets, a generous amount of sample code was provided to show how the concepts translated into application. Concepts covered in the first half of this book included:
- Logistic regression
- Decision Trees
- Principal Component Analysis
- Kernel functions
- Hyperparameter tuning
- Ensemble learning like bagging and boosting
This is a difficult but very rewarding book to read. In some places, it could be daunting to experience a topic for the first time. But if you’ve encountered the topic before, you get a much deeper understanding of the concept. For me, I had an A-HA! moment with Support Vector Machines even though I had encountered the subject in two other courses before.
You will need a solid foundation in Python first, as you encounter Python classes right from the first few pages. Familiarity with numpy and pandas is also assumed. The sample code is well documented and (as far as I can tell) bug free! I’m looking forward to the second half of the book where there are topics covered that I haven’t seen before (like “Embedding a Machine Learning Model into a Web Application”).
- edX – Berkeley U – CS120x Distributed Machine Learning with Apache Spark
- Finishing Python Machine Learning
- Coursera – The R Programming Environment – the start of a new specialization that focuses on R as a programming development platform
- Practical Data Science with R – a book I purchased months ago but never got around to reading
- Think Bayes: Bayesian Statistics Made Simple – a freely available book whose author spoke at ODSC in Boston.
- Coursera – Process Mining: Data Science in Action – a potentially interesting spin on Data Science which applies many of the techniques learned so far to the data collected by corporate ERP systems (like SAP, Oracle, etc) to derive and enhance business process understanding. Having lived in the corporate world for the past 19 years, I’m intrigued by the concept and will be giving this course a trial.
- Once again enter the ranks of the employed!