Saturday 2 March 2013

Machine learning endeavour - Mid term review

Mid term review
It’s been a few months since I posted my machine learning goals, and this seems an appropriate time to review what’s been done so far and what’s left to go.
I had started out with a goal of 5 ML-related courses, and I am on track to do a little more than I had planned. These are the courses that I've now targeted for completion (all on Coursera):
No  Course name                              Completed    Certificate
1   Probabilistic Graphical Models           June 2012
2   Computing for Data Analysis              Dec 2012
3   Machine Learning                         Dec 2012
4   Neural Networks for Machine Learning     Dec 2012
5   Data Analysis                            in progress
6   Linear and Discrete Optimization         in progress
7   Computational Methods for Data Analysis  in progress  non-certificate course
8   Natural Language Processing              in progress
I also took a large part of the “Design and Analysis of Algorithms I” course, which I could not complete due to overlapping course schedules.
My take on each of these courses is in another blog post.
I had another goal of entering 4 Kaggle competitions and finishing in the top 50%, which appears to be a little more difficult to achieve. I entered more than 2 competitions; my best placing was 55/940 (roughly the top 6%) in one competition (the digit recognizer training competition), with no significant placings in the others.
My perspective is that competitions are probably not the best place to learn skills, for a few reasons.
First, competition data usually needs quite a bit of pre-processing and exploratory analysis before you can figure out the approach.
Second, most competition data is too large to fit in a typical PC’s memory, and so requires big data and/or parallelization approaches such as Apache Mahout or CUDA, plus knowledge of the accompanying toolset. A significant part of your time would therefore be spent learning a toolset which is useful, but orthogonal to the goal of getting feedback on your machine learning approach.
Third, getting high on a competition leaderboard usually requires ensemble methods, in other words, trying out several models and then using model averaging.
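To make the third point concrete, here is a minimal sketch of model averaging in plain Python. It is only an illustration, not what any particular competition entry used: the three "models" are hypothetical, their class probabilities are made-up numbers, and the function names average_predictions and predict_labels are my own.

```python
from typing import List, Sequence

# Assumed layout: probs[i][k] is the probability of class k for sample i.
Probs = Sequence[Sequence[float]]


def average_predictions(per_model: Sequence[Probs]) -> List[List[float]]:
    """Average the class probabilities predicted by several models."""
    n_models = len(per_model)
    averaged = []
    # zip(*per_model) groups the rows that belong to the same sample.
    for rows in zip(*per_model):
        # zip(*rows) groups the probabilities that belong to the same class.
        averaged.append([sum(col) / n_models for col in zip(*rows)])
    return averaged


def predict_labels(probs: Probs) -> List[int]:
    """Pick the most probable class for each sample."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in probs]


# Made-up probabilities from three hypothetical models (2 samples, 2 classes).
model_a = [[0.6, 0.4], [0.3, 0.7]]
model_b = [[0.4, 0.6], [0.2, 0.8]]
model_c = [[0.8, 0.2], [0.6, 0.4]]

averaged = average_predictions([model_a, model_b, model_c])
ensemble_labels = predict_labels(averaged)
print(ensemble_labels)  # prints [0, 1]
```

Taken alone, the hypothetical models disagree (model_b predicts class 1 for the first sample, model_c predicts class 0 for the second), while the averaged prediction sides with the majority of the probability mass. Real ensembles often weight the models or stack them instead of taking a plain mean.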

However, Kaggle competitions are a great way to hone your skills once you have a good grasp of the toolset that you plan to use, and the ups and downs on the leaderboard are quite a bit more exciting than assignments in a course. And it’s certainly possible to get decent results with simpler methods such as regression.
Hopefully my next ‘status update’ will see me close to completion. :)
