Introduction
The Microsoft Author-Paper Identification challenge is a machine learning competition hosted at Kaggle.com, and this post documents my attempts at the problem.
The Data
The data consisted of a training set and a validation set, both released early in the competition. The training data consisted of the following comma-separated files:
- Paper.csv, where each record described a paper: its id, publication year, and journal or conference.
- Author.csv, where each record described an author, his or her affiliation, etc.
- Journal.csv and Conference.csv, which described each journal and conference.
- PaperAuthor.csv, a noisy dataset that listed authors and the papers ascribed to them.
The training file consisted of an AuthorId followed by lists of PaperIds that were correctly and incorrectly assigned to that author. The submission format required each line to contain an author id followed by a list of paper ids, in decreasing order of the probability that the paper was authored by the given author.
For example: Author1, P1 P2 P3
where the probability that Author1 wrote P1 is the highest and the probability that Author1 wrote P3 is the lowest.
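The ranking step above can be sketched as follows; the function name and the sample (paper id, probability) pairs are illustrative, not from the competition code:

```python
# Sketch: format one submission line from (paper_id, probability) pairs,
# sorting papers by decreasing probability as the format requires.
def submission_line(author_id, paper_probs):
    """paper_probs: list of (paper_id, probability) tuples."""
    ranked = sorted(paper_probs, key=lambda pair: pair[1], reverse=True)
    return "{}, {}".format(author_id, " ".join(str(pid) for pid, _ in ranked))

print(submission_line(1, [(3, 0.2), (1, 0.9), (2, 0.5)]))  # prints "1, 1 2 3"
```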
Analysis
Although the evaluation metric was Mean Average Precision, this is at heart a classification problem: given an (AuthorId, PaperId) tuple, predict the probability that the paper was authored by that author.
Provided along with the data was the code for a simple baseline model that generated 5 features and then used a RandomForest classifier to classify each tuple.
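The baseline setup looks roughly like this sketch; the random feature matrix stands in for the 5 provided features, since the actual baseline code isn't reproduced here:

```python
# Sketch: a RandomForest over a small feature matrix, predicting
# P(author wrote paper) for each (author, paper) tuple.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X_train = rng.rand(200, 5)            # 5 hand-built features per tuple (stand-ins)
y_train = rng.randint(0, 2, 200)      # 1 = paper correctly assigned to author

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Probability that each new tuple is a correct author-paper assignment
probs = clf.predict_proba(rng.rand(10, 5))[:, 1]
```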
Data Cleaning
I wasn't too comfortable using SQL to extract features, so I loaded the CSV files into MongoDB using mongoimport. The data cleaning stage took a while: I converted the contents to lowercase, removed newlines embedded within some dataset entries, and removed English stop words.
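The cleaning steps can be sketched like this; the stop word set here is a tiny illustrative subset, not the full English list actually used:

```python
# Sketch of the cleaning described above: lowercase the text, collapse
# embedded newlines, and drop English stop words.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "for"}  # illustrative subset

def clean(text):
    text = text.lower().replace("\n", " ")
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(clean("A Survey\nof Machine Learning"))  # prints "survey machine learning"
```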
Approach:
The 5 initial features provided were commonsense ones, such as the number of coauthors who published in the same journal. Treating this as a classification problem, the out-of-bag accuracy (using RandomForests) was already around 95%.
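scikit-learn reports the out-of-bag score directly when `oob_score=True`; the sketch below uses a synthetic, learnable target in place of the real feature matrix, so the exact score is illustrative:

```python
# Sketch: out-of-bag accuracy from a RandomForest, as reported above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(500, 5)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # synthetic, learnable target

clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
clf.fit(X, y)
print(clf.oob_score_)  # fraction of out-of-bag samples classified correctly
```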
I noticed at this point that to improve my score, I would have to start using features derived from text fields such as AuthorName, PaperName, and Affiliation (e.g. university).
In some cases the author name was simply marked wrong. For example, if 'Andrew B' was marked as the author name for 99 papers and 'Kerning A' was the author name for the 100th paper, this was obviously a mistake. I calculated the Jaccard similarity on the following sets of features:
- Author names: intersection of the names in the author set with the current author's name.
- Co-author names: intersection of this paper's co-author names with the current author's known co-authors.
- Affiliation: intersection of this paper's affiliation with the current author's affiliations.
- Title: the Jaccard similarity computed the same way over title words.
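The similarity underlying all four features can be sketched as |A ∩ B| / |A ∪ B| over sets of lowercased tokens; the example strings are illustrative:

```python
# Sketch: Jaccard similarity between two token sets, as used for the
# name/co-author/affiliation/title features.
def jaccard(a, b):
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0  # define similarity of two empty fields as 0
    return len(set_a & set_b) / len(set_a | set_b)

print(jaccard("andrew b smith", "andrew smith"))  # 2 shared / 3 total = 2/3
```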
Adding all of these features improved my validation set submission score to about 95%. Despite toying around further, I didn't make much improvement on the score after this.
I also tried using a Naive Bayes classifier to classify the 'bag of title words' into 2 classes, 'current author' or 'other'. Unfortunately, the projected running time looked like 4 days of continuous CPU time, and with Bangalore's power failures it was unlikely to ever finish, so I abandoned the approach.
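For reference, the abandoned idea could have been sketched with scikit-learn's multinomial Naive Bayes rather than a from-scratch implementation; the titles and labels below are made up for illustration:

```python
# Sketch: Naive Bayes over a bag of title words, labelling titles as
# 'current author' (1) vs 'other' (0).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

titles = ["sparse matrix factorization", "matrix completion methods",
          "protein folding dynamics", "molecular protein structures"]
labels = [1, 1, 0, 0]  # 1 = by the current author (illustrative labels)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(titles)   # bag-of-title-words counts
nb = MultinomialNB().fit(X, labels)

# Probability a new title belongs to the current author
probs = nb.predict_proba(vectorizer.transform(["matrix methods"]))[:, 1]
```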
My last score improvement came unexpectedly: I noticed that several author-paper entries that were correctly marked had duplicate entries in the dataset. I was surprised by this rather obvious mistake (or goofup, however you want to look at it), but adding this feature (the number of paper-author occurrences) improved the classification rate by a whole percent.
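The duplicate-count feature amounts to counting how often each (author id, paper id) pair occurs; the rows below are made-up stand-ins for PaperAuthor.csv records:

```python
# Sketch: count occurrences of each (author_id, paper_id) pair; pairs
# appearing more than once correlated with correct assignments.
from collections import Counter

pairs = [(1, 10), (1, 10), (1, 11), (2, 20), (1, 10)]  # illustrative rows
pair_counts = Counter(pairs)

print(pair_counts[(1, 10)])  # prints 3
```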
Learning and Notes:
- This competition required creating hand-coded features from the multiple files that comprised the dataset. The ability to quickly try out a feature and see whether it improved the score was critical. I spent a lot of time implementing a Naive Bayes classifier from scratch without first testing its classification accuracy on a small part of the dataset, and finally had to abandon the feature for lack of CPU hours.
- This was probably one competition where Hadoop/MapReduce skills wouldn't have hurt, since the datasets were large enough that parallelization would have helped.
- The time required for training and prediction with RandomForests was really small, probably on the order of 10 minutes or less. This was very different from other competitions (where I used deep neural nets, for example) in which I would watch plots of training and validation errors for hours on end.
- Observing quirks of the data is important, as the duplicate paper-author ids demonstrated.
- I didn't visualize or graph anything in this competition, and that's something I should improve upon, although it wasn't obvious what visualization would have been useful for in this case.
This was the first competition where I got into the top 25%, and my overall Kaggle ranking was inside the top 1%.