In two days, this Kaggle competition will end. I took part in it because it was the kind of competition I enjoy: the problem is offered as is, as you would find it in a real-world environment, meaning that the building of the dataset, the feature engineering and all the associated decisions are part of the fun.
However, I didn't want to use machine learning. The competition forums are already full of people training neural networks, so I tried another approach. I just wanted to see how the data was structured, and then check whether there was some extremely low-hanging fruit I could pick to beat the benchmark.
All the code I wrote can be found here. My current score on the private leaderboard is 0.3598, which is not bad, all things considered. No models were trained during this low-cost experiment. The README.md file in that folder includes some notes I took during development; they may make sense.
Data was offered as a set of CSV files, so the first thing I did was to insert them into several SQLite tables so I could query without reading the whole thing into memory (I run everything on a 2007 laptop. Low-cost, I said.)
What is being asked of us? To predict whether a certain user is going to reorder a certain product. For this, we are given a history of orders per user (some users have more than others), the distance in time between them in relative terms (we know an order happened 3 days after the previous one, but we don't know whether it was November at the time) and some other information I didn't use.
We can do something very simple just to start. For instance, we can assume the next order is going to be the same as the last one:
```sql
.headers on
.mode csv
.output /tmp/submission.csv
SELECT t.order_id, products
FROM (
    SELECT m.user_id, order_id, GROUP_CONCAT(product_id, ' ') AS products
    FROM order_products__prior opp
    INNER JOIN (SELECT MAX(order_id) AS max_order_id, user_id
                FROM orders
                WHERE eval_set = 'prior'
                GROUP BY user_id) m
        ON m.max_order_id = opp.order_id
    GROUP BY order_id, m.user_id
) b
INNER JOIN (SELECT DISTINCT user_id, order_id
            FROM orders
            WHERE eval_set = 'test') t
    ON t.user_id = b.user_id
ORDER BY t.order_id DESC;
```
This produces a score of 0.2360713. As you can see, it's pure SQL.
We can do something a bit better: let's study, per user, how large the average order is, and predict that many of the user's most frequently ordered products. This already requires a bit of Python (perhaps everything could be done in SQL, but using Python for some of the computations made the whole thing much more convenient to write). Still, the code is quite short:
```python
import sqlite3
import csv
import gzip
from collections import defaultdict

if __name__ == '__main__':
    conn = sqlite3.connect('data/instacart.db')
    c = conn.cursor()

    # Obtain the average number of sales per customer
    q = """
    SELECT user_id, AVG(n_items) AS avg_items
    FROM (
        SELECT o.user_id AS user_id, o.order_id, COUNT(*) AS n_items
        FROM order_products__prior opp
        JOIN orders AS o ON o.order_id = opp.order_id
        GROUP BY o.user_id, o.order_id
    ) b
    GROUP BY user_id
    """

    # This does the same (?; re-check) and is much faster
    q = """
    SELECT user_id,
           MIN(n_items) AS min_items,
           AVG(n_items) AS avg_items,
           MAX(n_items) AS max_items
    FROM orders o
    INNER JOIN (SELECT order_id, COUNT(*) AS n_items
                FROM order_products__prior
                GROUP BY order_id) avg
        ON avg.order_id = o.order_id
    GROUP BY user_id
    """

    print("Getting order stats...")
    c.execute(q)
    order_stats = dict()
    print("Assigning to dictionary...")
    for user_id, min_items, avg_items, max_items in c:
        order_stats[user_id] = (min_items, avg_items, max_items)

    # For every customer, sort the bought items in descending popularity
    q = """
    SELECT o.user_id AS user_id,
           opp.product_id AS product_id,
           COUNT(*) AS n  -- improve here, probably
    FROM order_products__prior opp
    JOIN orders o ON o.order_id = opp.order_id
    GROUP BY o.user_id, opp.product_id
    ORDER BY o.user_id, n DESC
    """
    print("Getting product frequency...")
    c.execute(q)
    print("Assigning next order per user...")
    next_order = defaultdict(list)
    for user_id, product_id, n in c:
        if len(next_order[user_id]) < order_stats[user_id][1]:  # the average
            next_order[user_id].append(product_id)

    # Now let's just assign orders
    print("Generating CSV file...")
    q = "SELECT order_id, user_id FROM orders WHERE eval_set = 'test'"
    c.execute(q)
    result = [['order_id', 'products']]
    for order_id, user_id in c:
        result.append([order_id, " ".join(str(x) for x in next_order[user_id])])

    # Write compressed CSV file
    with gzip.open('/tmp/submission.csv.gz', 'wt') as f:
        csvwriter = csv.writer(f, delimiter=',', quotechar='"')
        for row in result:
            csvwriter.writerow(row)
```
You can see I was playing around a bit with the query at the beginning, trying to do some optimizations that made the whole thing run a lot faster. This yields 0.3282527 on the public leaderboard, which is not bad at all.
Without moving too far from this approach, how can we improve the result? We can weight each past order so that its importance in predicting the next one varies. For instance, if something was just ordered, we could decrease its weight in the next one, but that would require studying the periodicity for every user, and I wanted to keep this as simple as possible. There's a generic rule we could try instead: seasonal products have seasonal weights. Given that Instacart offers a lot of fruits and vegetables, I'm just going to assume that everything is seasonal: an order from last week carries a high weight, an order from six months ago carries almost no weight, and the weight climbs back up until orders from exactly one year ago (the maximum time difference in the dataset) reach the maximum again.
This is done without much difficulty here using a cosine; the weights are written into a temporary table so I can easily join against it with a query not very different from the one used in the first attempt. Several variations of this approach gave me the 0.3598 public score I ended up with.
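The post doesn't give the exact weighting formula, but the shape it describes (maximum for recent orders, near zero at six months, maximum again at one year) can be realized as a cosine over the days since the order. A minimal sketch; the function name, the 365-day period and the 0-to-1 scaling are my own assumptions, not the author's exact constants:

```python
import math

def seasonal_weight(days_ago, period=365.0):
    """Cosine weight with a one-year period: orders from last week and
    from ~one year ago get a weight close to 1, orders from ~six months
    ago get a weight close to 0. Not the author's exact formula."""
    return 0.5 * (1.0 + math.cos(2.0 * math.pi * days_ago / period))
```

A table of `(user_id, order_id, weight)` rows built from this function could then be joined against the product-frequency query so that `COUNT(*)` becomes a weighted sum.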