This post documents my attempts on the Kaggle event recommendation competition.
This competition was the first non-training competition I entered, and it was a good learning experience. The competition's goal was to rank the set of events presented to a user by how likely that particular user is to click "Yes" to attending each event.
The data consists of several 'parts', such as:
1. A given user's response of 'yes' or 'no' to an event.
2. Some data about the user, such as location, gender, age, etc.
3. Some data about the event, such as the time and place.
4. A list of friends for a set of users.
5. A list of attendees for a set of events.
My initial approach was to treat this as a classification problem: given a [user, event] tuple, assign a label of 1 if the user is likely to click "Yes" for that event. I started out by hand-coding features such as:
1. The percentage of friends who marked 'yes', 'no', or 'maybe' for that event.
2. Location features: parsing the location fields of the event and the user to determine city, state, and country, plus a set of binary features indicating whether the user is in the same country, state, and city as the event.
3. The event popularity.
4. Time-based features, such as whether a user is old enough to attend an event, and the difference in hours between the user seeing the event and the event starting.
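As a rough illustration, the first feature above can be sketched like this (the data shapes and names here are hypothetical; my actual feature extraction lived in the Clojure sources linked at the end):

```python
def friend_response_features(user_friends, event_responses, user_id, event_id):
    """Fraction of a user's friends who marked yes/no/maybe for an event.

    user_friends: dict mapping user_id -> set of friend ids (assumed shape)
    event_responses: dict mapping event_id -> {user_id: 'yes'/'no'/'maybe'}
    """
    friends = user_friends.get(user_id, set())
    responses = event_responses.get(event_id, {})
    counts = {"yes": 0, "no": 0, "maybe": 0}
    if not friends:
        return {k: 0.0 for k in counts}
    for f in friends:
        r = responses.get(f)
        if r in counts:
            counts[r] += 1
    # Normalize by the total friend count, not just friends who responded.
    return {k: v / len(friends) for k, v in counts.items()}
```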
I initially used the SVM classifier from libSVM, but I ran into a couple of issues with it. First, libSVM's built-in scaling didn't work very well (or rather, the results were worse than when I scaled the features outside libSVM), and second, libSVM would often predict the same class for all the tuples in the test set.
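Scaling features outside the classifier is straightforward; a minimal sketch of what I mean (fit the statistics on the training set only, then apply them to both splits) might look like:

```python
import numpy as np

def fit_scaler(X_train):
    # Compute per-feature mean and std on the training set only.
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return mean, std

def apply_scaler(X, mean, std):
    # Apply the *training* statistics to any split, including test data.
    return (X - mean) / std
```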
I then used a logistic regression classifier with a tiny regularization value (0.005), which gave a classification rate of about 74% locally. The submission to Kaggle resulted in a 47% score, which was better than random guessing but worse than the baseline of sorting events by popularity.
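For reference, a bare-bones logistic regression with a small L2 penalty can be sketched as follows. This is an illustrative reimplementation, not the solver I actually used; the 0.005 default mirrors the regularization value mentioned above:

```python
import numpy as np

def train_logreg(X, y, lam=0.005, lr=0.1, epochs=500):
    """Logistic regression via gradient descent with a small L2 penalty."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))      # predicted probabilities
        grad_w = X.T @ (p - y) / n + lam * w        # log-loss + L2 gradient
        grad_b = (p - y).mean()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict_proba(X, w, b):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))
```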
At first sight, this competition resembled a typical collaborative filtering (CF) problem where each user gives a 'rating' of 1 or 0 to an event. The only catch was that, in addition to the user-event matrix, there were additional features for both users and events that could not be used directly in CF solvers.
My second approach was to solve the CF problem. I had a choice between Restricted Boltzmann Machines (described quite nicely in the Neural Networks course at Coursera) and the usual matrix factorization approach. I used the excellent GraphLab package, which has a toolkit for solving CF problems. The input is the sparse user-event matrix with 1's wherever a user marked yes for an event. I used the ALS solver in the CF toolkit (which finds a low-rank approximation of the recommendation matrix), and after some experimentation with step sizes it gave an error of 7%. My submission with the results from the ALS solver again gave a similar score of 45% on the Kaggle leaderboard.
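GraphLab's ALS implementation is far more scalable, but the alternating updates it performs can be sketched on a small dense matrix like this (a toy version under my own naming, not GraphLab's API):

```python
import numpy as np

def als(R, mask, k=2, lam=0.1, iters=20, seed=0):
    """Minimal alternating least squares for R ~= U @ V.T.

    R: ratings matrix (1 where a user said yes); mask: 1 for observed entries.
    This is a toy dense version; real CF solvers work on large sparse data.
    """
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.normal(scale=0.1, size=(m, k))
    V = rng.normal(scale=0.1, size=(n, k))
    I = lam * np.eye(k)
    for _ in range(iters):
        for i in range(m):               # fix V, solve for each user row
            obs = mask[i] > 0
            if obs.any():
                A = V[obs].T @ V[obs] + I
                U[i] = np.linalg.solve(A, V[obs].T @ R[i, obs])
        for j in range(n):               # fix U, solve for each event column
            obs = mask[:, j] > 0
            if obs.any():
                A = U[obs].T @ U[obs] + I
                V[j] = np.linalg.solve(A, U[obs].T @ R[obs, j])
    return U, V
```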
I had almost given up on the competition when I got the idea of averaging the two models. On averaging the scores from both models with equal weights, the leaderboard score exceeded the popularity baseline by a small margin, at about 51%.
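The averaging itself is trivial; a sketch, assuming both models' scores are first min-max scaled to a comparable range (an assumption for the example — the normalization I actually used was cruder):

```python
import numpy as np

def blend_scores(clf_scores, cf_scores, w=0.5):
    """Equal-weight blend (w=0.5) of classifier and CF scores per event."""
    def minmax(s):
        # Rescale to [0, 1] so the two models are on a comparable range.
        s = np.asarray(s, dtype=float)
        rng = s.max() - s.min()
        return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)
    return w * minmax(clf_scores) + (1 - w) * minmax(cf_scores)
```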
What I learnt from this competition
Apart from the scoring and algorithms bit, I realized that I needed a whole new set of skills for this kind of data analysis. The most important was tooling to clean and sort data. Since I had some experience with Clojure, I decided to put it to the test. Initially I tried to load the data in memory, which, as expected, resulted in OutOfMemoryErrors despite having 16 GB of RAM. Fortunately, I discovered Incanter, which gives Clojure capabilities similar to R. (As an aside, if you're looking at playing with R, do check out the Data Analysis course at Coursera by Jeff Leek; the capabilities of R described in the course are a revelation for the R newbie.) Incanter had similar memory issues, but thanks to its excellent integration with MongoDB, I was able to load the datasets into MongoDB and dissect the data as required.
The second facet that was new to me was the dirtiness of, and variation within, the data. There were multiple sets of users: about 3k+ users in the test and train datasets, about 38k users who had features such as age and gender described, and a much larger set of users (about 1 million+, if I recall correctly) who were friends of the earlier sets but had nothing else known about them. It was a similar situation with the events, where a small set were well described and a large set had no features associated with them.
The third notable learning was the size of the data. While the dataset would not qualify as 'big data', I ran into memory issues with Octave even on sparse matrices while attempting to run Principal Component Analysis on the user-friend matrix. This matrix had about 38k rows (only the well-described users) and about 210k entries. I was unable to run "svds" on this sparse matrix, as I was unable to scale the features to zero mean and unit standard deviation for the covariance computation — centering destroys the sparsity. Perhaps using GraphLab to get the matrix decomposition would have been a better idea.
At the end of the day, these submissions ranked in the middle of the leaderboard. Hoping to do better next time.
Here are my Clojure and Incanter sources on GitHub.
Edit: Here's another take on the same competition.