The competition has ended and I got the best score (competing alone) since I started doing this a couple of years ago: 190 / 1004 on the private leaderboard (with 0.92401 AUC). I am writing this post to link the code I used for it and to add a brief summary of what I did.
Here is the Bitbucket repository, made public after the competition ended. On GitHub I keep a public Kaggle code repository so I can centralize all the code from the different competitions.
I run my analyses on a very old computer (a Dell D830 well past its life expectancy), so while holding all the data in memory is possible, other options are more comfortable. I loaded everything into an SQLite database:
sqlite> .schema
CREATE TABLE bids (bid_id integer, bidder_id text, auction text, merchandise text, device text, time real, country text, ip text, url text);
CREATE TABLE train (bidder_id text, payment_account text, address text, outcome real);
CREATE TABLE test (bidder_id text, payment_account text, address text);
CREATE INDEX idx1 ON bids(bidder_id);
CREATE INDEX idx2 ON train(bidder_id);
CREATE INDEX idx3 ON test(bidder_id);
(First I edited the original CSV files to remove the header.)
I also created some indexes on the bidder_id column, as it is the one I query most often. In the end, the database is stored in a file with a size of 1.3 GB. I did not copy that file to my Dropbox as I do with other data files; instead, I generated it once on each computer I used during the competition.
I did the analysis in R and I interacted with the database using RSQLite, which was very convenient.
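The basic workflow fits in a few lines. Here is a minimal sketch; the database filename is my assumption, since the post does not name the file:

```r
# Minimal RSQLite sketch: open the database and run a query.
# "facebook.db" is an assumed filename -- the post does not specify it.
library(RSQLite)

con <- dbConnect(SQLite(), dbname = "facebook.db")

# Queries come back as regular data.frames
n_bids <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM bids")$n

dbDisconnect(con)
```

Getting query results back as ordinary data.frames is what makes this setup so convenient for feature building.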
Code, wiki and issues
I use a bitbucket.org git repository for storing the code. I use Bitbucket instead of github.com because the former allows creating private repositories for free. Other than that, they are similar in terms of functionality as far as I am concerned.
Within the Bitbucket repository I have a wiki page titled Features to build. Every time I think of some feature that might be useful, I add it there. When I have written the code for it, I cross it out.
For any other random idea that comes to mind while developing, or even while doing something else entirely, I create an issue so I can come back to it later.
Also, very importantly, every time I submit a solution that increases my ranking in the leaderboard, I create a new tag with the score:
$ git tag
0.85729
0.86792
0.87816
0.88370
0.90379
0.91274
That way I can always recover the code that yielded the highest score (for instance, this generated my best submission, and that is the code uploaded to the GitHub repository).
In my opinion, this is the main point of this competition. There is no downloadable data matrix with all the features already built-in; instead, participants must query the database in order to extract the most relevant features according to their understanding of the problem. I think this is what makes this competition particularly relevant (and fun): it is close to a real data analysis problem, where cleaning up and feature extraction are the first things to do.
For instance, is it relevant to know how many bids a user makes?
SELECT bidder_id, COUNT(bidder_id) AS bid_count FROM bids GROUP BY bidder_id;
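Running that query from R and attaching the result to the training set might look like this. This is a sketch: it assumes `con` is an open RSQLite connection to the competition database, and the column names follow the schema above.

```r
# Sketch: compute the per-bidder bid count and join it onto train.
# Assumes `con` is an open RSQLite connection to the competition database.
bid_counts <- dbGetQuery(con, "
  SELECT bidder_id, COUNT(bidder_id) AS bid_count
  FROM bids
  GROUP BY bidder_id")

train <- merge(train, bid_counts, by = "bidder_id", all.x = TRUE)
train$bid_count[is.na(train$bid_count)] <- 0  # bidders with no recorded bids
```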
In order to inspect these variables, I like to use ggplot's
geom_density to compare the distribution of values for both groups:
ggplot(train) + geom_density(aes(x = bid_count, fill = factor(outcome)), alpha = 0.4)
Yes, it is a very good variable (please notice that I log-transformed it beforehand). In fact, with this variable alone I obtain an AUC of 0.829 during cross-validation.
So, intuitively, what are bots? What should their defining characteristics be? The number of hypotheses we can build is nearly infinite, but just for starters:
- They bid a lot in very short times.
- They bid faster than a human user.
- They may have weird user-agents.
- Perhaps some IPs are mostly used by bots.
And so on and so forth. The possibilities are endless. In the end I tried creating as many features as I could think of (see the repository).
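As an illustration of the kind of features these hypotheses translate into, here is a sketch (not the actual feature set): it assumes `con` is an open RSQLite connection, and the column names follow the schema above.

```r
# Sketch of two hypothesis-driven features, assuming an open RSQLite
# connection `con` to the competition database.

# How spread out is each bidder? Distinct IPs and devices per bidder.
spread <- dbGetQuery(con, "
  SELECT bidder_id,
         COUNT(DISTINCT ip)     AS n_ips,
         COUNT(DISTINCT device) AS n_devices
  FROM bids
  GROUP BY bidder_id")

# How fast does each bidder bid? Median gap between consecutive bids,
# computed in R from the ordered timestamps.
ts <- dbGetQuery(con, "SELECT bidder_id, time FROM bids ORDER BY bidder_id, time")
gaps <- tapply(ts$time, ts$bidder_id,
               function(t) if (length(t) > 1) median(diff(t)) else NA)
```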
At first I was not paying much attention to proper model training. I just had the classifier.R file and played around a bit with the different parameters of several classifiers. This is far from optimal, but it was enough at the beginning.
Then I started taking hyperparameter tuning seriously and wrote classifier_caret.R, which uses this excellent library and allowed me to do much better algorithm tuning using cross-validation. It is possible to get the train function from caret to use the area under the ROC curve as the evaluation metric. Here is a very good Stack Overflow thread discussing it.
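For reference, the usual way to make caret's train optimize AUC is through twoClassSummary. This is a sketch under the assumption that outcome has been recoded as a factor with valid R names as levels, which caret requires when classProbs = TRUE:

```r
# Sketch: caret cross-validation optimizing the area under the ROC curve.
# Assumes `train` has an `outcome` factor with levels c("human", "robot")
# (caret needs syntactically valid class levels when classProbs = TRUE).
library(caret)

ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,          # needed for ROC computation
                     summaryFunction = twoClassSummary)

fit <- train(outcome ~ ., data = train,
             method = "rf",           # or any other caret method
             metric = "ROC",          # optimize AUC instead of accuracy
             trControl = ctrl)
```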
A bit further down the road, I tried an ensemble of classifiers, coded in classifier_ensemble.R, but I could not get anything useful out of it. My most successful submission used the single-classifier caret setup instead.
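For what it is worth, the simplest form such an ensemble can take is an average of predicted class probabilities. This is a generic sketch, not the actual contents of classifier_ensemble.R:

```r
# Generic probability-averaging sketch -- not the actual code in
# classifier_ensemble.R. p1 and p2 stand for predicted P(robot)
# from two different models on the same test bidders.
p1 <- c(0.10, 0.80, 0.55)
p2 <- c(0.20, 0.90, 0.45)

p_ensemble <- (p1 + p2) / 2   # equal-weight blend of the two classifiers
```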
So, this is not my field (and the more I study in my own field, the less I think I have one). I thought that perhaps someone had already studied this problem and published some nice conclusions about it.
Though I would like to say that I read a good part of the available literature on the subject and extracted useful information from it, the truth is that I did some preliminary research, wrote a list of papers I should read, and then did not have time to do so. In any case, if anyone is interested, here is the list I put together.
I also thought of checking web forums where people ask for bots in case I could get more insights from those posts, but real life also needed my time.
This was lots of fun. There are a couple more competitions like this one running right now and I would love to join them if I have enough free time. It was a pity that this competition did not allow teams, as 99% of the difficulty was in the feature-building step. I would have loved to have more time to invest in this challenge (for instance, to read the bibliography cited above), as I think I missed some relevant variables that could probably have been created easily.