Sunday, 14 April 2013

Data Analysis @ Coursera - a review


Here is my review of the Data Analysis course at Coursera, which I recently completed.

There are several "data X" and "big data Y" courses nowadays, and it's quite difficult to know up front if the course you signed up for is the course you need. I'll try to outline what this particular course is, and what you can expect from it.

First off, this is a Data Analysis course in R. Knowing R is a prerequisite, and if you come to this course without any knowledge of R, expecting to pick up the basics along the way, it will be quite challenging. Completing Prof. Roger Peng's R course is the ideal way to ease into the material for this course.

This course teaches you statistics while trying to make sense of the data. Numerous data sets are worked through, and playing with data is an ideal way to understand the statistical concepts in practice. The initial part of the course revisits R's strengths in graphing, data cleaning and munging, and goes at a relaxed pace; somewhere in the middle the speed picks up, and the last few weeks become quite hectic. The second part of the course is pretty much a machine learning course, with clustering and classification algorithms explained in quick succession. If you have been through Prof. Andrew Ng's machine learning course, the difference here is that very little mathematics is involved. The classification methods used (such as random forests, which I had never used before) are explained from a "how to use it" point of view, and the mathematical basics are not covered.

Since it does not try to get into the mathematical basis of every method, it covers much more ground, such as ensemble learning and ways to do model averaging. Although knowledge of the math is certainly useful, this course showed that it is possible to do predictive modelling quite effectively simply by knowing the methods and learning how to apply them. It is therefore a practitioner's course in Data Analysis.
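To give a flavour of what model averaging means in practice, here is a minimal sketch. It is not code from the course (which uses R); I have written it in plain Python purely for illustration, and the three "models" and their predicted probabilities are made-up numbers, not real results.

```python
def average_predictions(prediction_lists):
    """Average per-example predicted probabilities across several models."""
    n_models = len(prediction_lists)
    # zip(*...) groups the predictions for the same example together
    return [sum(preds) / n_models for preds in zip(*prediction_lists)]


# Hypothetical predicted probabilities of the positive class for four
# examples, from three different (imaginary) classifiers.
model_a = [0.9, 0.2, 0.6, 0.4]
model_b = [0.8, 0.4, 0.3, 0.7]
model_c = [0.7, 0.1, 0.8, 0.3]

averaged = average_predictions([model_a, model_b, model_c])

# Turn the averaged probabilities into class labels with a 0.5 threshold
# (the threshold is an assumption chosen for this example).
labels = [1 if p >= 0.5 else 0 for p in averaged]
print(averaged)
print(labels)
```

In this toy example the second model on its own would have disagreed with the other two on the last two examples; averaging lets the majority opinion win without needing to know anything about how each model works internally.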

The difference between this course and the Machine Learning course is that this one is much more exploratory. In machine learning problems, the goal is often just to get to the lowest misclassification rate on the validation/test set. Here, the emphasis is much more on interpreting and explaining the data (usually graphically) and understanding how a few features (especially if the dataset has relatively many dimensions) are responsible for most of the variance. I often struggled with this, because in a classification problem it was relatively easy to do dimensionality reduction (using Principal Component Analysis) and then use multiple classifiers such as SVMs, neural nets or random forests, while it was relatively hard to explain feature variance on data that had been processed through PCA.
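The interpretation problem with PCA can be seen in a small sketch. This is my own illustration in plain Python, not course material: the data and the feature names are invented, and the first principal component is found with simple power iteration on the covariance matrix rather than with a library routine.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_principal_component(cov, iterations=200):
    """Power iteration: returns (unit-length loadings, variance explained)."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance along the component
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return v, eigenvalue


# Invented toy data: the first two features move together, the third is noise.
feature_names = ["feature_1", "feature_2", "feature_3"]  # hypothetical names
rows = [
    (2.0, 4.1, 1.0),
    (3.0, 6.2, 0.5),
    (4.0, 7.9, 1.2),
    (5.0, 10.1, 0.8),
    (6.0, 12.0, 1.1),
]

cov = covariance_matrix(rows)
loadings, eigenvalue = first_principal_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = eigenvalue / total_variance

for name, weight in zip(feature_names, loadings):
    print(name, round(weight, 3))
print("fraction of variance explained:", round(explained, 3))
```

The single component captures almost all of the variance, which is great for a classifier, but its loadings mix the first two features together. Once the data has been projected onto such a component, a statement like "feature_2 drives the outcome" can no longer be read directly off the model.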

The assignments in this course also reflect its exploratory nature and are peer assessed using a rubric. Both assignments require one to write a data analysis explaining the motivation, the methods used to clean the data, the methods used to classify or cluster, the statistical tests used and so on. Sticking to what the rubric demands is therefore quite important; straying from it, even if your write-up is excellent, leads to lower scores.

All said and done, this is an excellent course to improve your knowledge of data analysis, statistics and machine learning.

Here are my suggestions to make this course even better:

  • More assignments. The assignments are quite big, and they take a fair bit of time. I would prefer shorter assignments that give one the opportunity to play with more data sets.
  • The course probably tries to cover too much material. I was quite confused by the various tests for statistical significance, which were explained in bits and pieces in various parts of the course, and only later did I develop some understanding of which test should be used in which case. If this course included more material, it could certainly be split into a basic and an advanced data analysis course.
  • I would love to have this course offered in a language-agnostic way. For simple one- and two-liners, R is bearable. Writing code longer than that makes one itch for a 'real' language (I'm working my way through Data analysis in Clojure; hopefully, with the knowledge gained there, I can finally say goodbye to R).
  • The course should probably include some material on handling analysis of big data. R holds its data largely in memory, and datasets that push this limit make any analysis difficult. I think Clojure/Incanter is a good combination of an excellent language married to a robust toolkit, but I have yet to run classification on large data sets using that toolkit, so that calls for more experimentation (and a blog post too).