This post documents my attempts on the Kaggle event recommendation competition.
This competition was the first non-training competition I entered, and it was a good learning experience. The competition's goal was to rank the set of events presented to a user by how likely that particular user is to click "Yes" to attending each event.
The data consists of several 'parts', such as:
1. A given user's response of 'yes' or 'no' to an event.
2. Some data about the user, such as location, gender, age, etc.
3. Some data about the event, such as the time and place.
4. A list of friends for a set of users.
5. A list of attendees for a set of events.
My initial approach was to treat this as a classification problem: given a [user, event] tuple, assign a label of 1 if the user is likely to click "Yes" for that event. I started out by hand-coding features such as:
1. The percentage of friends who marked 'yes', 'no', or 'maybe' for that event.
2. Location features: parsing the location fields of the event and the user to determine city, state, and country, plus a set of binary features indicating whether the user is in the same country, state, and city as the event.
3. The event popularity.
4. Time-based features, such as whether a user is old enough to attend an event, and the difference in hours between the user seeing the event and the event starting.
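As a rough illustration, the first feature above can be sketched like this (the data shapes and names here are hypothetical; my actual feature extraction lived in the Clojure sources linked at the end):

```python
def friend_response_features(user_friends, event_responses, user_id, event_id):
    """Fraction of a user's friends who marked yes/no/maybe for an event.

    user_friends: dict mapping user_id -> set of friend ids (assumed shape)
    event_responses: dict mapping event_id -> {user_id: 'yes'/'no'/'maybe'}
    """
    friends = user_friends.get(user_id, set())
    responses = event_responses.get(event_id, {})
    counts = {"yes": 0, "no": 0, "maybe": 0}
    if not friends:
        return {k: 0.0 for k in counts}
    for f in friends:
        r = responses.get(f)
        if r in counts:
            counts[r] += 1
    # Normalize by the total friend count, not just friends who responded.
    return {k: v / len(friends) for k, v in counts.items()}
```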
I initially used the SVM classifier from libSVM, but I ran into a couple of issues with it. First, libSVM's built-in scaling didn't work very well (or rather, the results were worse than when I scaled the features outside libSVM), and second, libSVM would often predict the same class for all the tuples in the test set.
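Scaling features outside the classifier is straightforward; a minimal sketch of what I mean (fit the statistics on the training set only, then apply them to both splits) might look like:

```python
import numpy as np

def fit_scaler(X_train):
    # Compute per-feature mean and std on the training set only.
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return mean, std

def apply_scaler(X, mean, std):
    # Apply the *training* statistics to any split, including test data.
    return (X - mean) / std
```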
I then used a logistic regression classifier with a tiny regularization value (0.005), which gave a classification rate of about 74% locally. The submission to Kaggle resulted in a 47% score, which was better than random guessing but worse than the baseline of sorting events by popularity.
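For reference, a bare-bones logistic regression with a small L2 penalty can be sketched as follows. This is an illustrative reimplementation, not the solver I actually used; the 0.005 default mirrors the regularization value mentioned above:

```python
import numpy as np

def train_logreg(X, y, lam=0.005, lr=0.1, epochs=500):
    """Logistic regression via gradient descent with a small L2 penalty."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))      # predicted probabilities
        grad_w = X.T @ (p - y) / n + lam * w        # log-loss + L2 gradient
        grad_b = (p - y).mean()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict_proba(X, w, b):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))
```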
At first sight, this competition resembled a typical collaborative filtering (CF) problem where each user gives a 'rating' of 1 or 0 to an event. The only catch was that, in addition to the user-event matrix, there were additional features for both users and events that could not be used directly in CF solvers.
My second approach was to solve the CF problem. I had a choice between Restricted Boltzmann Machines (described quite nicely in the Neural Networks course at Coursera) and the usual matrix factorization approach. I used the excellent GraphLab package, which has a toolkit for solving CF problems. The input is the sparse user-event matrix with 1's wherever a user marked yes for an event. I used the ALS solver in the CF toolkit (which finds a low-rank approximation of the recommendation matrix), and after some experimentation with step sizes it gave an error of 7%. My submission with the results from the ALS solver again gave a similar score of 45% on the Kaggle leaderboard.
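GraphLab's ALS implementation is far more scalable, but the alternating updates it performs can be sketched on a small dense matrix like this (a toy version under my own naming, not GraphLab's API):

```python
import numpy as np

def als(R, mask, k=2, lam=0.1, iters=20, seed=0):
    """Minimal alternating least squares for R ~= U @ V.T.

    R: ratings matrix (1 where a user said yes); mask: 1 for observed entries.
    This is a toy dense version; real CF solvers work on large sparse data.
    """
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.normal(scale=0.1, size=(m, k))
    V = rng.normal(scale=0.1, size=(n, k))
    I = lam * np.eye(k)
    for _ in range(iters):
        for i in range(m):               # fix V, solve for each user row
            obs = mask[i] > 0
            if obs.any():
                A = V[obs].T @ V[obs] + I
                U[i] = np.linalg.solve(A, V[obs].T @ R[i, obs])
        for j in range(n):               # fix U, solve for each event column
            obs = mask[:, j] > 0
            if obs.any():
                A = U[obs].T @ U[obs] + I
                V[j] = np.linalg.solve(A, U[obs].T @ R[obs, j])
    return U, V
```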
I had almost given up on the competition when I got the idea of averaging the two models. On averaging the scores from both models with equal weights, the leaderboard score exceeded the popularity baseline by a small margin, at about 51%.
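The averaging itself is trivial; a sketch, assuming both models' scores are first min-max scaled to a comparable range (an assumption for the example — the normalization I actually used was cruder):

```python
import numpy as np

def blend_scores(clf_scores, cf_scores, w=0.5):
    """Equal-weight blend (w=0.5) of classifier and CF scores per event."""
    def minmax(s):
        # Rescale to [0, 1] so the two models are on a comparable range.
        s = np.asarray(s, dtype=float)
        rng = s.max() - s.min()
        return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)
    return w * minmax(clf_scores) + (1 - w) * minmax(cf_scores)
```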
What I learnt from this competition
Apart from the scoring and algorithms bit, I realized that I needed a whole new set of skills for this kind of data analysis. The most important was tooling to clean and sort data. Since I had some experience with Clojure, I decided to put it to the test. Initially I tried to load the data in memory, which, as expected, resulted in OutOfMemoryErrors despite having 16 GB of RAM. Fortunately, I discovered Incanter, which gives Clojure capabilities similar to R. (As an aside, if you're looking at playing with R, do check out the Data Analysis course at Coursera by Jeff Leek; the capabilities of R described in the course are a revelation for the R newbie.) Incanter had similar memory issues, but thanks to its excellent integration with MongoDB, I was able to load the datasets into MongoDB and dissect the data as required.
The second facet that was new to me was the dirtiness of, and variation within, the data. There were multiple sets of users: about 3k+ users in the test and train datasets, about 38k users who had features such as age and gender described, and a much larger set of users (about 1 million+, if I recall correctly) who were friends of the earlier sets but had nothing else known about them. It was a similar situation with the events, where a small set were well described and a large set had no features associated with them.
The third notable learning was the size of the data. While the dataset would not qualify as 'big data', I ran into memory issues with Octave even on sparse matrices while attempting to run Principal Component Analysis on the user-friend matrix. This matrix had about 38k rows (only the well-described users) and about 210k entries. I was unable to run "svds" on this sparse matrix, as I was unable to scale the features to zero mean and unit standard deviation for the covariance computation — centering destroys the sparsity. Perhaps using GraphLab to get the matrix decomposition would have been a better idea.
At the end of the day, these submissions ranked in the middle of the leaderboard. Hoping to do better next time.
Here are my Clojure and Incanter sources on GitHub.
Edit: Here's another take on the same competition.