My take on Kaggle's "Instacart Market Basket Analysis"

In two days, this Kaggle competition will end. I took part in it because it was the kind of competition I enjoy: the problem is offered as is, as you would find it in a real-world environment, meaning that the building of the dataset, the feature engineering and all the associated decisions are part of the fun.

However, I didn't want to use machine learning. The competition forums are already full of people training neural networks, so I tried another approach. I just wanted to see how the data was structured, and then see if there was some extremely low hanging fruit I could pick up to obtain something better than the benchmark.

All the code I wrote can be found here. My current score in the private Leaderboard is 0.3598, which is not bad, all things considered. No models were trained during this low-cost experiment. The README.md file in that folder includes some notes I took during the development; they may make sense.

Data was offered as a set of CSV files, so the first thing I did was to insert them into several SQLite tables so I could query without reading the whole thing into memory (I run everything on a 2007 laptop. Low-cost, I said.)

What is being asked of us? To predict whether a certain user is going to reorder a certain product. For this, we are given a historical of orders per user (some have more, some have less), the distance in time between them in relative terms (we know an order happened 3 days after the last one, but we don't know if it was November at the time) and some other information I didn't use.

We can do something very simple just to start. For instance, we can assume the next order is going to be the same as the last one:

.headers on
.mode csv
.output /tmp/submission.csv

SELECT t.order_id, products FROM (
    SELECT m.user_id, order_id, GROUP_CONCAT(product_id, ' ') AS products

    FROM order_products__prior opp

    INNER JOIN (SELECT MAX(order_id) AS max_order_id,
                       user_id
                FROM orders
                WHERE eval_set = 'prior'
                GROUP BY user_id) m ON m.max_order_id = opp.order_id
    GROUP BY order_id, m.user_id) b

INNER JOIN (SELECT DISTINCT user_id, order_id
            FROM orders
            WHERE eval_set = 'test') t ON t.user_id = b.user_id

ORDER BY t.order_id DESC;

This produces a score of 0.2360713. As you can see, it's pure SQL.

We can do something a bit better: let's study, per user, how long is the average order, and reorder the most common products. This already requires a bit of Python (perhaps everything could be done in SQL, but using Python to add some computations made the whole thing much more convenient to write.) Still, the code is quite short:

import sqlite3
import pandas as pd
import csv
import gzip
from collections import defaultdict

if __name__ == '__main__':
    conn = sqlite3.connect('data/instacart.db')
    c = conn.cursor()

    # Obtain the average number of sales per customer
    q = """
    SELECT user_id, AVG(n_items) as avg_items 
    FROM (
        SELECT o.user_id AS user_id, o.order_id, COUNT(*) as n_items 
        FROM order_products__prior opp 
        JOIN orders AS o ON o.order_id = opp.order_id
        GROUP by o.user_id, o.order_id
    ) b
    GROUP BY user_id
    """
    # This does the same (?; re-check) and is much faster
    q = """
    SELECT user_id, 
           MIN(n_items) AS min_items, 
           AVG(n_items) AS avg_items,
           MAX(n_items) AS max_items
    FROM orders o
    INNER JOIN (SELECT order_id, COUNT(*) AS n_items
                FROM order_products__prior
                GROUP BY order_id) avg ON avg.order_id = o.order_id
    GROUP BY user_id
    """
    print "Getting order stats..."
    c.execute(q)

    order_stats = dict()

    print "Assigning to dictionary..."
    for row in c:
        order_stats[row[0]] = (row[1], row[2], row[3])

    # For every customer, sort the bought items in descending popularity
    q = """
    SELECT o.user_id AS user_id,
           opp.product_id AS product_id,
           COUNT(*) AS n -- improve here, probably
    FROM order_products__prior opp
    JOIN orders o ON o.order_id = opp.order_id
    GROUP BY o.user_id, opp.product_id
    ORDER BY o.user_id, n DESC
    """

    print "Getting product frequency..."
    c.execute(q)

    print "Assigning next order per user..."
    next_order = defaultdict(list)
    for row in c:
        if len(next_order[row[0]]) < order_stats[row[0]][1]: # the average
            next_order[row[0]].append(row[1])

    # Now just let's assign orders
    print "Generating CSV file..."
    q = "SELECT order_id, user_id FROM orders WHERE eval_set = 'test'"
    c.execute(q)
    result = []
    result.append(['order_id', 'products'])
    for row in c:
        result.append([row[0], " ".join([str(x) for x in next_order[row[1]]])])

    # Write compressed CSV file
    with gzip.open('/tmp/submission.csv.gz', 'wb') as f:
        csvwriter = csv.writer(f, delimiter = ',', quotechar = '"')
        for row in result:
            csvwriter.writerow(row)

You can see I was playing around a bit with the query at the beginning, trying to do some optimizations that made the whole thing run a lot faster. This yields 0.3282527 on the public leaderboard, which is not bad at all.

Without moving too much from this approach, how can we improve the result? We can weight each order to vary their importance in the next order. For instance, if something was just ordered, we can decrease the weight in the next one, but we'd have to study the periodicity for every user, and I wanted to keep this as simple as possible. There's a generic rule we could try: seasonal products have seasonal weights. Given that Instacart offers a lot of fruits and vegetables, I'm just going to assume that everything is seasonal, and that an order from next week carries a certain weight, but an order from six months ago carries almost no weight, all the way up to orders from exactly one year ago (the maximum time difference in the dataset), where the weights reach their maximum again.

This is done without much difficulty here using a cosine and written into a temporary table so I can easily join after using a query not very different from the one used in the first attempt. Several variations of this approach gave me the 0.3598 public score I ended up with.

There is no comment system. If you want to tell me something about this article, you can do so via e-mail or Mastodon.