Saturday, 23 February 2013

Kaggle digit recognizer using SVM

Originally posted on Posterous (Oct 7, 2012), which is shutting down.
This is the second post in the series, solving Kaggle digit recognition with the algorithms taught in the Coursera ML course.
For the second try, a Support Vector Machine was the algorithm used. SVMs in their modern soft-margin form were invented (or discovered) in 1995 and quickly became a tool of choice for both regression and classification. Although multi-class classification with SVMs is not as straightforward as with logistic regression, the prominent toolkits (LIBSVM is the one I used) have support for it.
I first did some pre-processing to scale the digit pixel values from 0-255 down to the 0-1 range, which is recommended for SVMs. I wrote some Clojure snippets to do the scaling.
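The gist embed for this post was lost with Posterous; here is a minimal sketch of the scaling step (the function names, and the assumption of a label-first CSV row as in Kaggle's train.csv, are mine):

```clojure
(require '[clojure.string :as str])

;; Scale pixel values from 0-255 down to the 0-1 range; the label
;; (first column of Kaggle's train.csv) is passed through untouched.
(defn scale-row [row]
  (cons (first row) (map #(/ % 255.0) (rest row))))

(defn scale-csv-line [line]
  (scale-row (map #(Long/parseLong %) (str/split line #","))))

;; (scale-csv-line "5,0,255,51") => (5 0.0 1.0 0.2)
```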

I then used LIBSVM to find the optimal values of the constants (C and gamma) used by the SVM. The only difficulty I encountered was that when using the
"-v" (cross-validation) option, LIBSVM does not generate a model file. I had to resort to using the Python interface to save the model file, and then ran svm-predict to get the predictions.
Note that the test file also needs to be scaled to the same range (from 0-255 to 0-1), and each test row needs a dummy label 'y' (which should be between 0 and 9).
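A sketch of the LIBSVM command sequence described above (the file names and the C/gamma values shown are placeholders, not the ones I actually used):

```shell
# 5-fold cross-validation to score a (C, gamma) pair;
# note: with -v, svm-train reports accuracy but writes NO model file
svm-train -v 5 -c 8 -g 0.03125 train.scaled

# re-train on the full set WITHOUT -v so the model file gets written
svm-train -c 8 -g 0.03125 train.scaled digits.model

# test.scaled must use the same 0-1 scaling and carry a dummy label in [0, 9]
svm-predict test.scaled digits.model predictions.txt
```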
With a vanilla SVM, the Kaggle score was at 97% :)

Machine learning- using Kaggle Digit Recognizer as a test bed for ML algorithms

Originally posted on Posterous (Sept 24 2012), which is shutting down

The Machine Learning course by Prof. Andrew Ng is an excellent introduction to machine learning (ML henceforth). While doing the course, I came across Kaggle's handwritten digit recognition problem. The digit recognition problem is a Kaggle 'non-competition' that works as a test bed for ML.
What follows is a series of blog posts that describes how the algorithms taught in the Coursera ML course can be applied to the digit recognition problem.
One of the algorithms taught in the ML course is logistic regression. Logistic regression is an algorithm used in classification problems, such as classifying email as spam or non-spam. Digit recognition is also a classification problem. While classifying email as spam or non-spam requires 2 classes, digit recognition requires an image of a digit to be classified as a single digit from 0 to 9, i.e. into one of 10 classes.
The available data:
Kaggle provides 42000 images of training data in a CSV file. Each image is represented by a 28x28 pixel matrix; for logistic regression, this can be thought of as a feature vector with 784 features. The value of lambda (a regularization parameter to prevent overfitting) was fixed at 0.1.
The algorithm:
The algorithm learns the parameters (784 plus one bias term) from the training set, and then predicts the class of the digit for each image in the test set. Since multi-class (one-vs-all) logistic regression is used, the algorithm outputs 10 scores per input image, one for each digit. The digit with the highest score is taken as the output.
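My code for this is in Octave (linked below), but the prediction step is easy to sketch; here it is in Clojure, with toy weights rather than learned ones:

```clojure
;; One-vs-all prediction: one weight vector per digit class.
(defn sigmoid [z] (/ 1.0 (+ 1.0 (Math/exp (- z)))))

(defn score [theta x]
  ;; x includes a leading 1 for the bias term
  (sigmoid (reduce + (map * theta x))))

(defn predict [thetas x]
  ;; index (= digit) of the classifier with the highest score
  (first (apply max-key second
                (map-indexed (fn [digit theta] [digit (score theta x)]) thetas))))

;; toy example: 3 classes, a single feature plus the bias
;; (predict [[0 1] [0 -1] [1 0]] [1 2]) => 0
```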
The output:
The classification accuracy using multi-class regularized logistic regression came to approximately 91%. Kaggle mentions that only a part of the test set is evaluated, so when the entire test set is evaluated, the score could be better or worse.
Next steps:
  • Tune the lambda parameter to get a better fit
  • Use a different algorithm such as SVM or Artificial Neural Networks
The Octave source code can be found in the GitHub repo.

Hacking my education- Machine learning Endeavour

Originally posted on Posterous (Oct 9, 2012), which is shutting down.
I have had an interest in machine learning for a few years. In spite of the numerous resources available online, I had never really settled into a system of organized study, for various reasons.
With the advent of Coursera and other MOOCs, it has become relatively easy to gain knowledge in a particular field of study, with all the advantages of a classroom-like environment (great teachers, a well-rounded and structured syllabus and motivated fellow students) without the attendant disadvantages (cost, distance and time).
When I started on this endeavour in Dec 2011, I did not think I would get this far. Having left college 12 years ago, it was quite an effort to get started studying ML, and I was told that getting into machine learning would not be easy because of the math prerequisites. However, close to 10 months down the line, I realize that nothing is impossible.
So this is my attempt to list the goals of my machine learning endeavour and chart my progress so far. I’ve checked the syllabi of a few master’s courses in machine learning, and here are some of the courses that are required to be taken.
I plan to finish at least 5 courses related to ML before May 31st 2013, from the ones in this list:
  1. Probabilistic Graphical Models @ Coursera
  2. Machine Learning @ Coursera
  3. Neural Networks @ Coursera
  4. Computing for Data Analysis (a course on the R language) @ Coursera
  5. Image and Video processing @ Coursera
  6. Natural Language processing @ Coursera
  7. Optimization @ Coursera
  8. Statistics 101 @ Udacity.
Real-world machine learning is more than just taking courses; it should involve working on some projects. Since I’m mostly self-studying, it would be difficult to find a proctored project to work on. However, there’s Kaggle (a site that hosts machine learning competitions) to the rescue. My goal is to compete in and finish at least 4 competitions in the top 50% of the field.
My progress so far:
  1. A course in Linear Algebra @ GuruPrevails-completed Jan 2012
  2. Probability & Statistics @ GuruPrevails-completed Mar 2012.
  3. Probabilistic Graphical Models @ Coursera – completed successfully in June 2012
  4. Machine Learning @ Coursera – in progress, ends Oct 2012.
  5. Neural Networks @ Coursera – in progress, ends Dec 2012
  6. Computing for Data Analysis (a course on the R language) @ Coursera – in progress, ends Oct 2012.
  7. Algorithms – Design and Analysis @ Coursera – didn’t complete beyond the 4th assignment.
  8. Kaggle digit recognizer “learning” competition: currently 23rd/400 teams (Oct 2012).
Along the way, I’ve realized that my learning methods are sub-optimal, there are other ways to learn than just ‘being motivated and working hard’. I signed up for the “learn more study less” course by Scott H Young, and although it is too early to see any results, the course material is certainly worth it.
I’m off for a much-needed break to Roopkund (it’s been 4 years since I hiked in the Himalayas :( ), and it’s back to the Machine Learning Endeavour after that!

Clojure zippers - 2

Continuation of the earlier post on zippers.

So what if you want a tree with content in the parent nodes? Something like this:

We'll have to use records or maps as nodes, and then use functions (usually keywords) to get the content out of a parent. Alex has a blog post on using records; here I'll stick with using a map.
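The original gist is gone; a minimal reconstruction of the map-based zipper (the :children key name is my choice) might look like:

```clojure
(require '[clojure.zip :as zip])

;; a tree node is a map: {:self <content> :children [<nodes>...]}
(defn map-zip [root]
  (zip/zipper
    :children                           ; branch?   - any node with a :children key
    :children                           ; children  - fetch the child vector
    (fn [node children]                 ; make-node - assoc children back onto the map
      (assoc node :children (vec children)))
    root))

(def t (map-zip {:self 1 :children [{:self 2} {:self 3}]}))

(:self (zip/node (zip/down t)))         ;; => 2
(zip/root (zip/append-child t {:self 9}))
;; => {:self 1 :children [{:self 2} {:self 3} {:self 9}]}
```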

Note that "make-node" simply has to assoc(iate) the provided "children" vector onto the existing map to add children.
On a side note, the clojure.zip source is an excellent introduction to how to use metadata.

Paths to the root node
Although clojure.zip has the 'path' function, which gives the list of nodes traversed from the root to the current node, it returns whole nodes rather than just the content of :self. That is easily remedied:
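A sketch of that fix (the map-backed zipper and the helper name self-path are mine, standing in for the lost gist):

```clojure
(require '[clojure.zip :as zip])

(defn map-zip [root]
  (zip/zipper :children :children
              (fn [node children] (assoc node :children (vec children)))
              root))

(defn self-path [loc]
  ;; zip/path returns whole nodes from the root down to loc's parent;
  ;; keep only each node's :self, then append the current node's own :self
  (conj (mapv :self (zip/path loc)) (:self (zip/node loc))))

(def t (map-zip {:self 1 :children [{:self 2 :children [{:self 4}]}]}))
(self-path (zip/down (zip/down t)))   ;; => [1 2 4]
```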

Adding children to all leaf nodes

Adding children to all the leaf nodes is a useful feature when programming a board game, where all the prospective moves are stored in a tree and we wish to look 4/5/6/n moves ahead. In such cases, we can grow each leaf node (find the next possible set of moves).

We use a separate function called add-children to create the vector of children to attach to each leaf node.
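The embedded snippet is lost; here is a reconstruction of the idea. The add-children below is a toy move generator, and skip-subtree is my helper to stop zip/next from descending into the freshly added children (which would grow them too):

```clojure
(require '[clojure.zip :as zip])

(defn map-zip [root]
  (zip/zipper :children :children
              (fn [node children] (assoc node :children (vec children)))
              root))

;; toy move generator: each leaf spawns two child positions
(defn add-children [node]
  [{:self (* 10 (:self node))} {:self (inc (* 10 (:self node)))}])

(defn- skip-subtree [loc]
  ;; advance depth-first like zip/next, but WITHOUT descending into loc
  (loop [l loc]
    (cond
      (zip/right l) (zip/right l)
      (zip/up l)    (recur (zip/up l))
      :else         [(zip/node l) :end])))   ; same end sentinel zip/next uses

(defn grow-leaves [z gen]
  (loop [loc z]
    (if (zip/end? loc)
      (zip/root loc)
      (if (zip/branch? loc)
        (recur (zip/next loc))
        ;; leaf: attach a fresh set of children, then step over them
        (recur (skip-subtree
                 (zip/edit loc #(assoc % :children (gen %)))))))))

;; (grow-leaves (map-zip {:self 1 :children [{:self 2} {:self 3}]}) add-children)
;; => {:self 1 :children [{:self 2 :children [{:self 20} {:self 21}]}
;;                        {:self 3 :children [{:self 30} {:self 31}]}]}
```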

clojure zippers

I bumped into Clojure's zipper data structure while trying to implement a Minimax algorithm for a game. There are some excellent resources on zippers; Brian's is a good introduction.
Why should you use zippers when you already have tree-seq in clojure.core? That's a good question.
tree-seq is good for navigating a tree and printing out its nodes; zippers provide the ability to edit a tree, which tree-seq does not.
A couple of things stumped me when I first encountered zippers. 
One of them is that a zipper declared with vector-zip doesn't have any content in its parent nodes. In other words, all the numeric elements of the following vector are leaf nodes.

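A small example in place of the lost embed, with a nested vector standing in for the tree:

```clojure
(require '[clojure.zip :as zip])

(def vz (zip/vector-zip [1 [2 [3 4]] 5]))

(zip/node (zip/down vz))                  ;; => 1, a leaf
(zip/node (zip/right (zip/down vz)))      ;; => [2 [3 4]], a parent node...
(zip/branch? (zip/right (zip/down vz)))   ;; => true: it's a vector, not "content"
```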

Brian calls the head of the tree the "tomorrow tree". Actually, all the parent elements (those that have branches) are vectors (the "tomorrow tree") and do not contain any "content". (The diagram visualizing the tree was lost with the original post.)

Tree editing

Since Clojure is functional, each tree edit returns a new (copy of the) tree. So the basic idea is: make some edits and call zip/root to get the root of the tree. Note that zip/root calls zip/node on the way out and returns a plain node, so you need to create a zipper again if you want to traverse the edited tree.
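A small editing example in place of the lost embed:

```clojure
(require '[clojure.zip :as zip])

(def edited
  (-> (zip/vector-zip [1 [2 3]])
      zip/down               ;; move to the leaf 1
      (zip/edit inc)         ;; returns a NEW loc; the original tree is untouched
      zip/root))             ;; zip back up and return the root NODE

edited                                          ;; => [2 [2 3]]
;; zip/root gave us a plain node, so re-zip before traversing again:
(zip/node (zip/down (zip/vector-zip edited)))   ;; => 2
```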