via-athena

Faster alternative to the local sync workflow in the parent directory. Instead of downloading 15M log files, Athena queries them directly in S3 and returns only the ~50k–100k rows that are actual podcast downloads. You download one small CSV; the local sync never happens.

Why

The aws-logs/ workflow requires syncing ~30k files per month, one month at a time, before importing. With 15M log files spanning 8.5 years, that is impractical. Athena instead scans all the files in parallel on AWS infrastructure and returns only the signal rows.

Prerequisites

  • AWS CLI configured with credentials that have:
    • s3:GetObject + s3:ListBucket on s3://jayflaunts.jays.net/logs/
    • s3:PutObject on your Athena output bucket
    • athena:StartQueryExecution, athena:GetQueryExecution, athena:GetQueryResults
  • An S3 bucket (or prefix) for Athena to write results to
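
A quick way to confirm the credentials can actually see the logs before running anything (a hedged snippet; the bucket and prefix come from the bullets above):

import boto3

# Confirm s3:ListBucket works by listing a single key under the log prefix.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="jayflaunts.jays.net", Prefix="logs/", MaxKeys=1)
assert resp["KeyCount"] == 1, "no log files visible; check permissions"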

One-time setup

python3 create_table.py [--region us-east-1]

Creates the jayflaunts_logs database and access_logs table directly in the AWS Glue Data Catalog via boto3 — bypasses Athena DDL entirely, which avoids engine-version syntax issues. Requires IAM permissions for glue:CreateDatabase and glue:CreateTable. The table is permanent; run this once.
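
For reference, a minimal sketch of the Glue calls create_table.py presumably makes (column list abbreviated and the access-log regex elided; the actual script defines the full S3 server access log schema):

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Database + external table written straight into the Glue Data Catalog,
# with no Athena DDL involved. Run once; re-running raises AlreadyExistsException.
glue.create_database(DatabaseInput={"Name": "jayflaunts_logs"})
glue.create_table(
    DatabaseName="jayflaunts_logs",
    TableInput={
        "Name": "access_logs",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"EXTERNAL": "TRUE"},
        "StorageDescriptor": {
            # Abbreviated: the real table has one column per access-log field.
            "Columns": [
                {"Name": "requestdatetime", "Type": "string"},
                {"Name": "requesturi", "Type": "string"},
                {"Name": "httpstatus", "Type": "string"},
            ],
            "Location": "s3://jayflaunts.jays.net/logs/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.RegexSerDe",
                # The standard S3 access-log regex goes here; elided for brevity.
                "Parameters": {"input.regex": "..."},
            },
        },
    },
)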

Workflow

1. Run the export query

In the Athena Query Editor, select the jayflaunts_logs database in the dropdown, then paste in export.sql and run it.

The query scans all 15M log files on AWS infrastructure — typically 1–5 minutes. When it finishes, click Download results in the Query Editor to get the CSV. Save it as downloads.csv in this directory.

2. Import into SQLite

python3 import_csv.py

Replaces all rows in ../downloads.db (the same database stats.py uses) and repopulates processed_months from the actual timestamps in the data.
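
Roughly what that amounts to (a sketch; the column names are illustrative, import_csv.py defines the real schema):

import csv
import sqlite3

conn = sqlite3.connect("../downloads.db")
with open("downloads.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Full replace: wipe the table, then bulk-insert the Athena export.
conn.execute("DELETE FROM downloads")
conn.executemany(
    "INSERT INTO downloads (timestamp, episode, ip) VALUES (:timestamp, :episode, :ip)",
    rows,
)

# Re-derive processed_months from the timestamps themselves
# (assumes ISO-8601 timestamps and a single-column table).
conn.execute("DELETE FROM processed_months")
conn.execute(
    "INSERT INTO processed_months "
    "SELECT DISTINCT strftime('%Y-%m', timestamp) FROM downloads"
)
conn.commit()
conn.close()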

3. Query stats

python3 ../stats.py
python3 ../stats.py --monthly
python3 ../stats.py --episodes
python3 ../stats.py --year 2018
python3 ../stats.py --episode earl4

CLI alternative

If you prefer the command line over the Query Editor, setup.sh and run_export.sh automate the same steps via aws athena commands.
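
The same flow in Python, for anyone who would rather use boto3 than the shell scripts (a sketch; the results bucket and region are placeholders):

import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start the export query; Athena writes the result CSV to the output location.
qid = athena.start_query_execution(
    QueryString=open("export.sql").read(),
    QueryExecutionContext={"Database": "jayflaunts_logs"},
    ResultConfiguration={"OutputLocation": "s3://YOUR-ATHENA-RESULTS-BUCKET/"},
)["QueryExecutionId"]

# Poll until the query finishes.
while True:
    status = athena.get_query_execution(QueryExecutionId=qid)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(5)
assert state == "SUCCEEDED", state

# Athena names the result object <QueryExecutionId>.csv under the output location.
boto3.client("s3").download_file("YOUR-ATHENA-RESULTS-BUCKET", f"{qid}.csv", "downloads.csv")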

Files

File              Purpose
create_table.py   Create the Glue database + access_logs table via boto3 (run once)
setup.sql         Athena DDL version of the same setup (safe to re-run)
setup.sh          Run the setup from the CLI via aws athena
export.sql        Extraction query used by run_export.sh
run_export.sh     Run the export query and download the results CSV
import_csv.py     CSV → ../downloads.db
downloads.csv     Athena output (not committed; produced by run_export.sh or a Query Editor download)

Cost

Athena charges ~$5 per TB scanned. S3 access log files are tiny (< 1 KB each). 15M files × ~500 bytes ≈ 7.5 GB → roughly $0.04 for a full history scan.
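
The same arithmetic, if you want to plug in different assumptions:

files, avg_bytes, usd_per_tb = 15_000_000, 500, 5.00
scanned_tb = files * avg_bytes / 1e12      # 0.0075 TB scanned
print(f"${scanned_tb * usd_per_tb:.3f}")   # ≈ $0.04 for the full history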

Notes

  • import_csv.py does a full replace of the downloads table each run. Re-run it whenever you want to refresh with a new Athena export.
  • The parent workflow (import_logs.py) and this one share ../downloads.db. Using one will overwrite the other’s data. Athena covers all years, so after a successful Athena import there is no reason to use the local workflow.
  • setup.sql uses CREATE ... IF NOT EXISTS so it is safe to re-run.