Saturday, 23 February 2013

Machine learning: using Kaggle Digit Recognizer as a test bed for ML algorithms

Originally posted on Posterous (Sept 24 2012), which is shutting down

The Machine Learning course by Prof. Andrew Ng is an excellent introduction to machine learning (ML henceforth). While doing the course, I came across Kaggle's handwritten digit recognition problem. It is a Kaggle 'non-competition' that works as a test bed for ML algorithms.
What follows is a series of blog posts that describe how the algorithms taught in the Coursera ML course can be applied to the digit recognition problem.
One of the algorithms taught in the ML course is logistic regression. Logistic regression is used in classification problems, such as classifying email as spam or non-spam. Digit recognition is also a classification problem. While spam classification needs only 2 classes, digit recognition requires an image to be classified as one of the single-digit numbers from 0 to 9, that is, into one of 10 classes.
The available data:
Kaggle provides 28000 images of training data in a CSV file. Each image is represented by a 28x28 pixel matrix. For logistic regression, this can be thought of as a feature vector with 784 features (28 x 28 = 784), one per pixel. The value of lambda (the regularization parameter, which helps prevent overfitting) has been fixed at 0.1.
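To make the "image as a feature vector" idea concrete, here is a minimal Python sketch of reading such a file. It assumes each row holds a label followed by one column per pixel, with pixel values from 0 to 255; the column names, the sample rows, and the scaling to [0, 1] are illustrative choices of mine, not taken from the Octave code. The sample uses a 2x2 image (4 pixels) instead of 784 columns to stay readable.

```python
import csv
import io

# Hypothetical sample in the assumed layout: a label column followed by one
# column per pixel. A real row would have 784 pixel columns.
SAMPLE_CSV = "label,pixel0,pixel1,pixel2,pixel3\n7,0,255,128,0\n2,64,0,0,255\n"


def load_examples(text, n_pixels):
    """Return (features, labels); pixel values are scaled to [0, 1]."""
    features, labels = [], []
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    for row in reader:
        values = [int(v) for v in row]
        if len(values) != n_pixels + 1:
            raise ValueError("unexpected number of columns")
        labels.append(values[0])
        features.append([v / 255.0 for v in values[1:]])
    return features, labels


X, y = load_examples(SAMPLE_CSV, n_pixels=4)
print(y)          # the labels, one per image
print(len(X[0]))  # the number of features per image
```

For the full-size images, the same function would be called with n_pixels=784.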
The algorithm:
The algorithm learns the parameters (784 weights plus one bias term) from the training set, and then predicts the digit for each image in the test set. Since multi-class logistic regression is used, one such set of parameters is learned for each digit, and the algorithm outputs 10 scores per input image, one for each digit. The digit with the highest score is taken as the prediction.
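The steps above can be sketched in Python roughly as follows. This is a simplified one-vs-all version that uses plain gradient descent on a tiny made-up dataset (3 classes, 3 features); it is not the Octave code from the repo. The learning rate, the iteration count, and all function names are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    # Logistic function, written to avoid overflow for large negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(theta, x):
    # theta[0] is the bias term; theta[1:] are the per-feature weights.
    return sigmoid(theta[0] + sum(w * v for w, v in zip(theta[1:], x)))


def train_binary(X, targets, lam, alpha=0.5, iters=500):
    """Batch gradient descent for one regularized binary classifier."""
    m, n = len(X), len(X[0])
    theta = [0.0] * (n + 1)
    for _ in range(iters):
        grad = [0.0] * (n + 1)
        for x, t in zip(X, targets):
            err = score(theta, x) - t
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        theta[0] -= alpha * grad[0] / m  # the bias is not regularized
        for j in range(1, n + 1):
            theta[j] -= alpha * (grad[j] + lam * theta[j]) / m
    return theta


def train_one_vs_all(X, y, n_classes, lam):
    # One classifier per class: "is this example class k, or not?"
    return [
        train_binary(X, [1.0 if label == k else 0.0 for label in y], lam)
        for k in range(n_classes)
    ]


def predict(thetas, x):
    # One score per class; the class with the highest score wins.
    scores = [score(theta, x) for theta in thetas]
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up toy data: each class "lights up" a different feature.
X = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
     [0.0, 1.0, 0.0], [0.1, 0.9, 0.0],
     [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
y = [0, 0, 1, 1, 2, 2]

thetas = train_one_vs_all(X, y, n_classes=3, lam=0.1)
print([predict(thetas, x) for x in X])
```

For the digit problem, n_classes would be 10 and each theta would hold 785 values. A pure-Python loop like this would be slow at that size, so a vectorized implementation is the practical choice.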
The output:
The classification accuracy using multi-class regularized logistic regression came to approximately 91%. Kaggle mentions that only a part of the test set is used to compute this score, so the score on the entire test set could be better or worse.
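For reference, classification accuracy here simply means the fraction of images whose predicted digit matches the true digit. A minimal Python sketch, using made-up labels rather than real results:

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Hypothetical labels for five images; four of the five predictions match.
print(accuracy([7, 2, 1, 0, 4], [7, 2, 1, 0, 9]))  # 0.8
```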
Next steps:
  • Tune the value of the lambda parameter to get a better fit
  • Use a different algorithm, such as an SVM or an artificial neural network
The Octave source code can be found in the GitHub repo.

