jayflaunts-logs

Consolidates S3 access logs for the jayflaunts.jays.net podcast bucket into a single SQLite database for download analytics (2017 → present).

What this repo is

The S3 bucket jayflaunts.jays.net stores .mp3 files for Jay’s podcast and has had S3 server access logging enabled since 2017. Each listener request generates one tiny log file in s3://jayflaunts.jays.net/logs/; there are millions of these files. Most are S3stat housekeeping traffic; the ones we care about are REST.GET.OBJECT requests on .mp3 keys with HTTP status 200 or 206.
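The filter described above can be sketched as follows. This is a simplified stand-in, assuming the standard S3 server access log layout (space-delimited fields, bracketed timestamp, quoted request URI); the sample line and function name are illustrative, not taken from import_logs.py.

```python
import re

# Matches the leading fields of an S3 server access log line:
# bucket owner, bucket, [timestamp], remote IP, requester, request ID,
# operation, key, "request-URI", status. Trailing fields are ignored here.
LINE_RE = re.compile(
    r'^\S+ \S+ \[(?P<ts>[^\]]+)\] (?P<ip>\S+) \S+ \S+ '
    r'(?P<op>\S+) (?P<key>\S+) "[^"]*" (?P<status>\d{3})'
)

def is_download(line):
    """True if the line is a real podcast download: a GET of an .mp3
    object answered with 200 (full) or 206 (range)."""
    m = LINE_RE.match(line)
    if not m:
        return False
    return (m.group("op") == "REST.GET.OBJECT"
            and m.group("key").endswith(".mp3")
            and m.group("status") in ("200", "206"))

# Fabricated sample line in the assumed layout.
sample = ('79a5 jayflaunts.jays.net [15/Oct/2017:12:00:01 +0000] 192.0.2.7 '
          '- 3E57 REST.GET.OBJECT 021.mp3 "GET /021.mp3 HTTP/1.1" 206 - '
          '1048576 52428800 12 9 "-" "AppleCoreMedia" -')
print(is_download(sample))  # True
```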

Files

File             Purpose
import_logs.py   Parses log files → SQLite. Idempotent/incremental.
stats.py         Query and print download stats from the DB.
downloads.db     SQLite database (not committed).
logs/            Local mirror of s3://jayflaunts.jays.net/logs/ (not committed).

Workflow

1. Sync log files from S3 (run in chunks; there are millions of files)

Sync a single month (recommended; the 2017 months run to roughly 30k files each, later ones more):

aws s3 sync s3://jayflaunts.jays.net/logs/ ./logs/ --exclude "*" --include "2018-01-*"

Or sync everything (slow, use only when catching up in bulk):

aws s3 sync s3://jayflaunts.jays.net/logs/ ./logs/

As of April 2026, only Oct–Dec 2017 has been pulled down (~80k files).

2. Import into SQLite

python3 import_logs.py

Already-processed months are tracked in the processed_months table and skipped on subsequent runs. Sync a full month, then run this — the importer processes all files for a month atomically and marks the whole month done in one commit.
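The month-at-a-time commit can be sketched like this. It is a simplified stand-in for what import_logs.py does, not the real importer: the function name and row shape are illustrative, and the real script also parses the log files themselves.

```python
import sqlite3

def import_month(db_path, month, rows):
    """Insert all download rows for one month and mark the month processed
    in a single transaction: either the whole month lands, or none of it.
    `rows` is an iterable of tuples matching the downloads columns
    (timestamp .. log_file) in schema order."""
    con = sqlite3.connect(db_path)
    try:
        done = con.execute(
            "SELECT 1 FROM processed_months WHERE month = ?", (month,)
        ).fetchone()
        if done:
            return False  # idempotent: skip months imported earlier
        with con:  # one transaction -> one commit
            con.executemany(
                "INSERT INTO downloads (timestamp, remote_ip, operation, key,"
                " http_status, bytes_sent, object_size, referer, user_agent,"
                " log_file) VALUES (?,?,?,?,?,?,?,?,?,?)", rows)
            con.execute(
                "INSERT INTO processed_months (month, processed_at)"
                " VALUES (?, datetime('now'))", (month,))
        return True
    finally:
        con.close()
```

Doing the inserts and the processed_months mark inside one `with con:` block means a crash mid-month leaves no partial state, so a rerun simply redoes the month from scratch.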

3. Query stats

python3 stats.py                  # full report
python3 stats.py --monthly        # downloads per month
python3 stats.py --episodes       # all-time per episode
python3 stats.py --year 2018      # one year detail
python3 stats.py --episode earl4  # history for one episode (partial match)
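The --monthly report boils down to grouping on the month prefix of the ISO 8601 timestamp. A sketch of the idea (illustrative SQL, not necessarily stats.py's exact query):

```python
import sqlite3

def monthly_downloads(con):
    """Downloads per month: ISO 8601 timestamps sort and group cleanly
    on their first seven characters (YYYY-MM)."""
    return con.execute(
        "SELECT substr(timestamp, 1, 7) AS month, COUNT(*) AS n"
        " FROM downloads"
        " WHERE http_status IN (200, 206)"
        " GROUP BY month ORDER BY month"
    ).fetchall()

# Demo against a throwaway in-memory database with a cut-down schema.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE downloads (timestamp TEXT, http_status INTEGER)")
con.executemany("INSERT INTO downloads VALUES (?, ?)",
                [("2017-10-15T12:00:01Z", 206),
                 ("2017-10-20T08:30:00Z", 200),
                 ("2017-11-02T09:00:00Z", 200)])
print(monthly_downloads(con))  # [('2017-10', 2), ('2017-11', 1)]
```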

Database schema

CREATE TABLE downloads (
    id          INTEGER PRIMARY KEY,
    timestamp   TEXT,      -- ISO 8601 UTC
    remote_ip   TEXT,
    operation   TEXT,      -- always REST.GET.OBJECT
    key         TEXT,      -- e.g. "021.mp3"
    http_status INTEGER,   -- 200 or 206
    bytes_sent  INTEGER,   -- bytes sent in this request (< object_size for range reqs)
    object_size INTEGER,   -- total file size
    referer     TEXT,
    user_agent  TEXT,
    log_file    TEXT       -- source log filename
);

CREATE TABLE processed_months (
    month        TEXT PRIMARY KEY,  -- YYYY-MM
    processed_at TEXT NOT NULL
);
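Given this schema, the --episode partial match can be expressed as a LIKE filter over key. A sketch (illustrative SQL and function name, not necessarily what stats.py runs):

```python
import sqlite3

def episode_history(con, fragment):
    """Per-month download counts for every key containing `fragment`,
    mirroring the --episode partial match."""
    return con.execute(
        "SELECT key, substr(timestamp, 1, 7) AS month, COUNT(*)"
        " FROM downloads"
        " WHERE key LIKE '%' || ? || '%' AND http_status IN (200, 206)"
        " GROUP BY key, month ORDER BY key, month",
        (fragment,)
    ).fetchall()

# Demo against a throwaway in-memory database with a cut-down schema.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE downloads (timestamp TEXT, key TEXT,"
            " http_status INTEGER)")
con.executemany("INSERT INTO downloads VALUES (?, ?, ?)",
                [("2017-10-15T12:00:01Z", "earl4.mp3", 206),
                 ("2017-10-16T12:00:01Z", "earl4.mp3", 200),
                 ("2017-10-17T12:00:01Z", "021.mp3", 200)])
print(episode_history(con, "earl4"))  # [('earl4.mp3', '2017-10', 2)]
```

Binding the fragment as a parameter and concatenating the `%` wildcards in SQL keeps the query safe against stray quoting in user input.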

Notes

  • HTTP 206 = range/partial request, common from podcast clients. Counted as a download.
  • HTTP 200 = full download.
  • bytes_sent < object_size is normal for 206 responses (seeked/skipped).
  • The S3 log format carries a timestamp both in the log file name and inside each log line. We use the one inside the line (field 3, bracketed), since it reflects the actual request time; it is normalized to UTC ISO 8601 on import.
  • Most log files contain only S3stat bucket-listing requests, not real downloads. Signal density is low (~567 downloads across 80k log files for late 2017).
  • As of April 2026:
    • 15,123,155 log files total in S3 across 2017–2026, which is why a full sync is impractical.
    • The latest log files are dated 2026-04-26, so the bucket is still actively logging; roughly 8.5 years of data remain to be imported, month by month.
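The timestamp normalization described in the notes can be sketched as below, assuming the bracketed in-line time uses the standard S3 log format (day/Mon/year:HH:MM:SS offset); the function name is illustrative.

```python
from datetime import datetime, timezone

def normalize_ts(bracketed):
    """Convert the bracketed in-line S3 log timestamp, e.g.
    '15/Oct/2017:12:00:01 +0000', to UTC ISO 8601."""
    dt = datetime.strptime(bracketed, "%d/%b/%Y:%H:%M:%S %z")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

print(normalize_ts("15/Oct/2017:12:00:01 +0000"))  # 2017-10-15T12:00:01Z
```

Note that non-UTC offsets are shifted, not dropped: a `-0500` time lands on the correct UTC instant, which may be a different calendar day.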