The Journey’s Roadmap

Herein lies the overall roadmap of the Data Science Journey I began back in June 2015. It is current as of Sept 2016.

This is my roadmap. It may not, and most likely will not, work for you, dear reader. It was improvised, full of false starts, trailbacks, dead ends, shortcuts and lost stints in the woods.

But perhaps the breadcrumb trail below would be of some value for you in finding your own way. I’ve included links with more details for each course/book/tutorial. Look even in the headers for links. Godspeed and good luck!

Coursera – Data Science Specialization

Title Topics Reviewed
(5 star rating)
The Data Scientist’s Toolbox Introduction to R, RStudio, Git and Github July 2015
R Programming Overview of R, R data types and objects, reading and writing data; Control structures, functions, scoping rules, dates and times; Loop functions, debugging tools; Simulation, code profiling Aug 2015
Getting and Cleaning Data Obtaining data from the web, from APIs, from databases and in various formats. Data cleaning, tidy data, downstream data analysis tasks. Raw data, processing instructions, codebooks, and processed data. Collecting, cleaning, and sharing data. Aug 2015
Exploratory Data Analysis Summarizing data, sharpening potential hypotheses about data, plotting systems, principles of constructing data graphics, multivariate statistical techniques used to visualize high-dimensional data. Sept 2015
Reproducible Research Using R markdown using knitr, integrate live R code into a literate statistical program, organize a data analysis so that it is reproducible and accessible to others. Oct 2015
Statistical Inference Perform inferential tasks in highly targeted settings, be able to use  the skills developed as a roadmap for more complex inferential challenges. Nov 2015
Regression Models Fit regression models, interpret coefficients, investigate residuals and variability.  Special cases of regression models including use of dummy variables and multivariable adjustment. Poisson and logistic regression. Dec 2015
Practical Machine Learning Model based and algorithmic machine learning methods. Regression, classification trees, Naive Bayes, and random forests. Build prediction functions including data collection, feature creation, algorithms, and evaluation. Jan 2016
Developing Data Products Create interactive, online applications using shiny.io, RCharts, plotly and other tools. Generate HTML5 presentations from R code using slidify and RStudio Presenter. Develop and publish your own R library packages. Feb 2016

Coursera – Data Science at Scale Specialization

Data Manipulation at Scale: Systems and Algorithms Common patterns, challenges, and approaches associated with data science projects. Programming models associated with scalable data manipulation, including relational algebra, mapreduce, and other data flow models. Database technology adapted for large-scale analytics, including the concepts driving parallel databases, parallel query processing, and in-database analytics. Key-value stores and NoSQL systems. MapReduce, Hadoop and Spark. Big Data systems for graphs, arrays, and streams. Nov 2015
Practical Predictive Analytics: Models and Methods Design effective experiments and analyze the results. Resampling methods. Classification methods (rules, trees, random forests), associated optimization methods (gradient descent and variants). Unsupervised learning concepts and methods. Large-scale graph analytics, including structural query, traversals and recursive queries, PageRank, and community detection. Nov 2015
Communicating Data Science Results Graph Analysis in the Cloud. Elastic MapReduce and the Pig language to perform graph analysis over a moderately large dataset, about 600GB. Visualization: effective communication of quantitative results by linking perception, cognition, and algorithms to exploit the enormous bandwidth of the human visual cortex. The importance of reproducibility in data science and how the commercial cloud can help support reproducible research. Jan 2016
Capstone Project Who knows? This course has never been made available to students who prepaid for this specialization. Coursera is balking at any form of refund.

**Update** The Capstone project was finally released at the end of March 2016 (months after I had completed the last course). But by that point, I had lost interest in this specialization altogether and had moved on.

Feb 2016
(the coveted 0-star rating)

Coursera & University of Washington: Machine Learning Specialization

Machine Learning Foundations: A Case Study Approach “Hands-on experience with machine learning from a series of practical case-studies.  At the end of the first course you will have studied how to predict house prices based on house-level features, analyze sentiment from user reviews, retrieve documents of interest, recommend products, and search for images.  Through hands-on practice with these use cases, you will be able to apply machine learning methods in a wide range of domains” This is the first course in the six-part Machine Learning Specialization. Mar 2016
Coursera – Machine Learning: Regression “In this course, you will explore regularized linear regression models for the task of prediction and feature selection.  You will be able to handle very large sets of features and select between models of various complexity.  You will also analyze the impact of aspects of your data — such as outliers — on your selected models and predictions.  To fit these models, you will implement optimization algorithms that scale to large datasets.”
This is the second course in the six-part Machine Learning Specialization.
Apr 2016
Coursera – Machine Learning: Classification “In this course, you will create classifiers that provide state-of-the-art performance on a variety of tasks.  You will become familiar with the most successful techniques, which are most widely used in practice, including logistic regression, decision trees and boosting.  In addition, you will be able to design and implement the underlying algorithms that can learn these models at scale, using stochastic gradient ascent.  You will implement these techniques on real-world, large-scale machine learning tasks.  You will also address significant tasks you will face in real-world applications of ML, including handling missing data and measuring precision and recall to evaluate a classifier.”
This is the third course in the six-part Machine Learning Specialization.
May 2016
Coursera – Machine Learning: Clustering and Retrieval “In this third case study, finding similar documents, you will examine similarity-based algorithms for retrieval. In this course, you will also examine structured representations for describing the documents in the corpus, including clustering and mixed membership models, such as latent Dirichlet allocation (LDA). You will implement expectation maximization (EM) to learn the document clusterings, and see how to scale the methods using MapReduce.”
This is the fourth course in the six-part Machine Learning Specialization.
Aug 2016

edX & Columbia U: Data Science and Analytics XSeries

Statistical Thinking for Data Science and Analytics “You will learn how data scientists exercise statistical thinking in designing data collection, derive insights from visualizing data, obtain supporting evidence for data-based decisions and construct models for predicting future trends from data.” This is the first course in the three-part Data Science and Analytics XSeries. Jan 2016
Machine Learning for Data Science and Analytics What machine learning is and how it is related to statistics and data analysis. How machine learning uses computer algorithms to search for patterns in data. How to use data patterns to make decisions and predictions with real-world examples from healthcare involving genomics and preterm birth. How to uncover hidden themes in large collections of documents using topic modeling. How to prepare data, deal with missing data and create custom data analysis solutions for different industries. Basic and frequently used algorithmic techniques including sorting, searching, greedy algorithms and dynamic programming Feb 2016

Other Data Science Related Courses

Coursera – Machine Learning Advanced machine learning algorithms. Anti-spam, image recognition, clustering, recommender systems. Linear and logistic regression. Neural networks, support vector machines, unsupervised learning, anomaly detection, recommender systems, large scale machine learning. Sept 2015
Kaggle & the Titanic Data Tutorial “Always wanted to compete in a Kaggle competition but not sure you have the right skillset? This interactive tutorial by Kaggle and DataCamp on Machine Learning offers the solution. Step-by-step you will learn through fun coding exercises [using R] how to predict survival rate for Kaggle’s Titanic competition using Machine Learning techniques. Upload your results and see your ranking go up!” Dec 2015
Stanford’s Statistical Learning Introductory-level course in supervised learning, with a focus on regression and classification methods. The syllabus includes: linear and polynomial regression, logistic regression and linear discriminant analysis; cross-validation and the bootstrap, model selection and regularization methods (ridge and lasso); nonlinear models, splines and generalized additive models; tree-based methods, random forests and boosting; support-vector machines. Some unsupervised learning methods are discussed: principal components and clustering (k-means and hierarchical). Feb 2016
Mar 2016
Mathematical Monk – Machine Learning 160 Khan Academy-like lectures about Machine Learning available on YouTube. Feb 2016
Podcasts I Listen To A list of 7 Data-Science related podcasts that I listen to every week. Mar 2016
Stanford Natural Language Processing Video Series “This course covers a broad range of topics in natural language processing, including word and sentence tokenization, text classification and sentiment analysis, spelling correction, information extraction, parsing, meaning extraction, and question answering. We will also introduce the underlying theory from probability, statistics, and machine learning that are crucial for the field, and cover fundamental algorithms like n-gram language modeling, naive bayes and maxent classifiers, sequence models like Hidden Markov Models, probabilistic dependency and constituent parsing, and vector-space models of meaning.” Apr 2016

Apache Spark

Udemy – Taming Big Data with Apache Spark and Python – Hands On! Learn and master the art of framing data analysis problems as Spark problems through over 15 hands-on examples, and then scale them up to run on cloud computing services in this course.

  • Learn the concepts of Spark’s Resilient Distributed Datasets (RDDs)
  • Develop and run Spark jobs quickly using Python
  • Translate complex analysis problems into iterative or multi-stage Spark scripts
  • Scale up to larger data sets using Amazon’s Elastic MapReduce service
  • Understand how Hadoop YARN distributes Spark across computing clusters
  • Learn about other Spark technologies, like Spark SQL, Spark Streaming, and GraphX

By the end of this course, you’ll be running code that analyzes gigabytes worth of information – in the cloud – in a matter of minutes.

May 2016
edX – Berkeley U – CS105x Introduction to Apache Spark This statistics and data analysis course will teach you the basics of working with Spark and will provide you with the necessary foundation for diving deeper into Spark. You’ll learn about Spark’s architecture and programming model, including commonly used APIs. After completing this course, you’ll be able to write and debug basic Spark applications. This course will also explain how to use Spark’s web user interface (UI), how to recognize common coding errors, and how to proactively prevent errors. The focus of this course will be Spark Core and Spark SQL. The course assignments include word counting and Web Server Log Mining using real world datasets and parallel processing with PySpark.

This is the first course in the Data Science and Engineering with Apache Spark series.

Aug 2016
edX – Berkeley U – CS110x Big Data Analysis with Apache Spark This course will attempt to articulate the expected output of Data Scientists and then teach students how to use PySpark (part of Apache Spark) to deliver against these expectations. The course assignments include Prediction using Machine Learning algorithms, Collaborative Filtering, and Textual Entity Recognition exercises that teach students how to manipulate datasets using parallel processing with PySpark, Spark SQL, and Spark Machine Learning Pipelines. This course covers advanced undergraduate-level material. It requires a programming background and experience with Python, PySpark (part of Apache Spark), and Spark SQL.

This is the second course in the Data Science and Engineering with Apache Spark series.

Sept 2016

Data Science Books

Data Smart: Using Data Science to Transform Information into Insight Working primarily in Excel: Mathematical optimization, including non-linear programming and genetic algorithms. Clustering via k-means, spherical k-means, and graph modularity. Data mining in graphs, such as outlier detection. Supervised AI through logistic regression, ensemble models, and bag-of-words models. Forecasting, seasonal adjustments, and prediction intervals through Monte Carlo simulation. Moving from spreadsheets into the R programming language. Aug 2015
and again
Dec 2015
R and Data Mining – Examples and Case Studies Importing data from files and databases; Decision trees and random forests; Regression; Clustering; Outlier detection; Time Series Analysis; Text mining Jan 2016
An Introduction to Statistical Learning with Applications in R Textbook that accompanies the Stanford Statistical Learning Course above. All of the same topics are covered, but this book stands alone due to its richness of material, hands-on R labs and productive exercises. Mar 2016
Sep 2016

Learning Bayesian Models with R “Learning Bayesian Models with R starts by giving you a comprehensive coverage of the Bayesian Machine Learning models and the R packages that implement them. It begins with an introduction to the fundamentals of probability theory and R programming for those who are new to the subject. Then the book covers some of the important machine learning methods, both supervised and unsupervised learning, implemented using Bayesian Inference and R.” Mar 2016
Data Science from Scratch: First Principles with Python “Data science libraries, frameworks, modules, and toolkits are great for doing data science, but they’re also a good way to dive into the discipline without actually understanding data science. In this book, you’ll learn how many of the most fundamental data science tools and algorithms work by implementing them from scratch. If you have an aptitude for mathematics and some programming skills, author Joel Grus will help you get comfortable with the math and statistics at the core of data science, and with hacking skills you need to get started as a data scientist. Today’s messy glut of data holds answers to questions no one’s even thought to ask. This book provides you with the know-how to dig those answers out.” Apr 2016
Getting Started with Python Data Analysis “The book starts by introducing the principles of data analysis and supported libraries, along with NumPy basics for statistic and data processing. Next it provides an overview of the Pandas package and uses its powerful features to solve data processing problems. Moving on, the book takes you through a brief overview of the Matplotlib API and some common plotting functions for DataFrame such as plot. Next, it will teach you to manipulate the time and data structure, and load and store data in a file or database using Python packages. The book will also teach you how to apply powerful packages in Python to process raw data into pure and helpful data using examples.

Finally, the book gives you a brief overview of machine learning algorithms, that is, applying data analysis results to make decisions or build helpful products, such as recommendations and predictions using scikit-learn.”

May 2016
Python Machine Learning “Python Machine Learning gives you access to the world of machine learning and demonstrates why Python is one of the world’s leading data science languages. If you want to ask better questions of data, or need to improve and extend the capabilities of your machine learning systems, this practical data science book is invaluable. Covering a wide range of powerful Python libraries, including scikit-learn, Theano, and Keras, and featuring guidance and tips on everything from sentiment analysis to neural networks, you’ll soon be able to answer some of the most important questions facing you and your organization.” Sept 2016

Mathematics

Khan Academy – Probability and Statistics Independent and dependent events; probability and combinatorics, descriptive statistics, random variables and probability distributions, regression, inferential statistics July 2015
Comprehending Behavioral Statistics Introduction to statistics, normal and t distributions, hypothesis testing, ANOVA Aug 2015
Coursera – Data Analysis and Statistical Inference Use statistical software (R) to summarize data numerically and visually, and to perform data analysis. Apply estimation and testing methods (confidence intervals and hypothesis tests) to analyze single variables and the relationship between two variables. Model and investigate relationships between two or more variables within a regression framework. Interpret results correctly. Complete a research project that employs simple statistical inference and modeling techniques Nov 2015

(note the 6 star rating!)
Statistics for Dummies Random variables; the binomial, normal, t-, and sampling distributions; and the Central Limit Theorem. Data analysis tools for regression, confidence intervals, hypothesis tests, and two-way tables. Dec 2015
Statistics II for Dummies Key topics such as sorting and testing models, using regression to make predictions, performing variance analysis (ANOVA), drawing test conclusions with chi-squares, and making comparisons with the Rank Sum Test. Jan 2016
Statistics Done Wrong “guide to the most popular statistical errors and slip-ups committed by scientists every day, in the lab and in peer-reviewed journals.” Feb 2016

Programming

Codecademy – Python Python syntax. Strings & Console input. Date & time. Conditionals and control flow. Functions. Lists, dictionaries and functions. Loops. Bitwise operations. Classes. File input/output. July 2015
Programming for Everybody (Getting Started with Python) Introduction to programming, variables, conditional code, functions, loops and iteration Aug 2015
Python Data Structures Strings, files, lists, dictionaries, tuples Aug 2015
Hands-on Programming with R Load data, assemble and disassemble data objects, navigate R’s environment system, write your own functions, and use all of R’s programming tools Nov 2015
Hello! Python Covers the building blocks of Python programming and gives a gentle introduction to more advanced topics such as object-oriented programming, functional programming, network programming, and program design. Dec 2015
Python for Informatics “The goal of this book is to provide an Informatics-oriented introduction to programming. The primary difference between a computer science approach and the Informatics approach taken in this book is a greater focus on using Python to solve data analysis problems common in the world of Informatics.” Jan 2016
Learning Python Get a comprehensive, in-depth introduction to the core Python language with this hands-on book. Based on author Mark Lutz’s popular training course, this updated fifth edition will help you quickly write efficient, high-quality code with Python. It’s an ideal way to begin, whether you’re new to programming or a professional developer versed in other languages. Complete with quizzes, exercises, and helpful illustrations, this easy-to-follow, self-paced tutorial gets you started with both Python 2.7 and 3.3, the latest releases in the 3.X and 2.X lines, plus all other releases in common use today. Feb 2016
DataCamp – Intro to Python for Data Science & Intermediate Python for Data Science Python basics: variables, comments, types, lists, functions and packages. Introduction to numpy, matplotlib and pandas. Mar 2016
qwikLABS Amazon Web Services Tutorials Hands-on labs activities covering such topics as Creating Amazon EC2 instances (Linux & Windows), Elastic Load Balancing, Working with Amazon Redshift and Exploring Google Ngrams with Amazon EMR Mar 2016

General Books

Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die “An introduction for everyone. In this rich, fascinating — surprisingly accessible — introduction, leading expert Eric Siegel reveals how predictive analytics works, and how it affects everyone every day. Rather than a “how to” for hands-on techies, the book serves lay readers and experts alike by covering new case studies and the latest state-of-the-art techniques.” Jan 2016
The Theory That Would Not Die: How Bayes’ Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy “Bayes’ rule appears to be a straightforward, one-line theorem: by updating our initial beliefs with objective new information, we get a new and improved belief. To its adherents, it is an elegant statement about learning from experience. To its opponents, it is subjectivity run amok. In the first-ever account of Bayes’ rule for general readers, Sharon Bertsch McGrayne explores this controversial theorem and the human obsessions surrounding it. She traces its discovery by an amateur mathematician in the 1740s through its development into roughly its modern form by French scientist Pierre Simon Laplace. She reveals why respected statisticians rendered it professionally taboo for 150 years at the same time that practitioners relied on it to solve crises involving great uncertainty and scanty information, even breaking Germany’s Enigma code during World War II, and explains how the advent of off-the-shelf computer technology in the 1980s proved to be a game-changer. Today, Bayes’ rule is used everywhere from DNA de-coding to Homeland Security. Drawing on primary source material and interviews with statisticians and other scientists, The Theory That Would Not Die is the riveting account of how a seemingly simple theorem ignited one of the greatest controversies of all time.” Apr 2016
The Signal and the Noise: Why So Many Predictions Fail – but Some Don’t (Nate Silver) “Nate Silver built an innovative system for predicting baseball performance, predicted the 2008 election within a hair’s breadth, and became a national sensation as a blogger—all by the time he was thirty. He solidified his standing as the nation’s foremost political forecaster with his near perfect prediction of the 2012 election. Silver is the founder and editor in chief of FiveThirtyEight.com.

Drawing on his own groundbreaking work, Silver examines the world of prediction, investigating how we can distinguish a true signal from a universe of noisy data. Most predictions fail, often at great cost to society, because most of us have a poor understanding of probability and uncertainty. Both experts and laypeople mistake more confident predictions for more accurate ones. But overconfidence is often the reason for failure. If our appreciation of uncertainty improves, our predictions can get better too. This is the “prediction paradox”: The more humility we have about our ability to make predictions, the more successful we can be in planning for the future.”

Aug 2016
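
A small aside on the “one-line theorem” mentioned in the blurb for The Theory That Would Not Die above: the idea of updating an initial belief with new information can be sketched in a few lines of Python. This is only an illustration under made-up numbers; the function name bayes_update and the figures (a 1% prior, a 90% hit rate, a 5% false-alarm rate) are my own assumptions, not taken from any of the books listed.

```python
def bayes_update(prior, likelihood, false_positive_rate):
    """Return P(H | E) from P(H), P(E | H) and P(E | not H).

    All three inputs are hypothetical illustration values.
    """
    # Total probability of seeing the evidence at all.
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    # Bayes' rule: posterior = P(E | H) * P(H) / P(E).
    return likelihood * prior / evidence


# Start from a 1% initial belief, then observe one piece of evidence.
posterior = bayes_update(prior=0.01, likelihood=0.9, false_positive_rate=0.05)
print(round(posterior, 3))  # 0.154

# Yesterday's posterior becomes today's prior when more evidence arrives.
posterior_again = bayes_update(posterior, 0.9, 0.05)
print(posterior_again > posterior)  # True
```

Even with a fairly reliable signal, a single observation only lifts a 1% belief to about 15%; it takes repeated evidence to become confident.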