# via-athena
Faster alternative to the local sync workflow in the parent directory. Instead of downloading 15M log files, Athena queries them directly in S3 and returns only the ~50k–100k rows that are actual podcast downloads. You download one small CSV; the local sync never happens.
## Why
The aws-logs/ workflow requires syncing ~30k files per month, one month
at a time, before importing. With 15M total log files across 8.5 years,
that is impractical. Athena scans all files in parallel on AWS infrastructure
and returns only the signal rows.
## Prerequisites
- AWS CLI configured with credentials that have:
  - `s3:GetObject` + `s3:ListBucket` on `s3://jayflaunts.jays.net/logs/`
  - `s3:PutObject` on your Athena output bucket
  - `athena:StartQueryExecution`, `athena:GetQueryExecution`, `athena:GetQueryResults`
- An S3 bucket (or prefix) for Athena to write results to
## One-time setup
`python3 create_table.py [--region us-east-1]`
Creates the `jayflaunts_logs` database and `access_logs` table directly in the
AWS Glue Data Catalog via boto3, bypassing Athena DDL entirely and the
engine-version syntax issues that come with it. Requires IAM permissions for
`glue:CreateDatabase` and `glue:CreateTable`. The table is permanent; run this once.
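For orientation, the Glue calls reduce to roughly the sketch below. The column
list, SerDe regex, and region are placeholders for illustration; the
authoritative schema is whatever `create_table.py` actually defines.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create the database; tolerate re-runs.
try:
    glue.create_database(DatabaseInput={"Name": "jayflaunts_logs"})
except glue.exceptions.AlreadyExistsException:
    pass

# Register an external table over the raw S3 access logs. The column list and
# the SerDe regex below are placeholders, not the real schema.
glue.create_table(
    DatabaseName="jayflaunts_logs",
    TableInput={
        "Name": "access_logs",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"EXTERNAL": "TRUE"},
        "StorageDescriptor": {
            "Columns": [
                {"Name": "request_time", "Type": "string"},
                {"Name": "operation", "Type": "string"},
                {"Name": "key", "Type": "string"},
                {"Name": "http_status", "Type": "string"},
            ],
            "Location": "s3://jayflaunts.jays.net/logs/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.RegexSerDe",
                "Parameters": {"input.regex": "<standard S3 access-log regex>"},
            },
        },
    },
)
```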
## Workflow
### 1. Run the export query
Select the `jayflaunts_logs` database in the dropdown, then paste `export.sql` into the Athena Query Editor and run it.
The query scans all 15M log files on AWS infrastructure, which typically takes 1–5 minutes.
When it finishes, click **Download results** in the Query Editor to get the CSV, and save it as `downloads.csv` in this directory.
### 2. Import into SQLite
`python3 import_csv.py`
Replaces all rows in `../downloads.db` (the same database `stats.py` uses)
and repopulates `processed_months` from the actual timestamps in the data.
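For reference, the import step amounts to roughly the following sketch. The CSV
column names and table schemas here are assumptions; `import_csv.py` itself is
the source of truth.

```python
import csv
import sqlite3

# Hypothetical column names; adjust to match the Athena export and the real schema.
conn = sqlite3.connect("../downloads.db")
with open("downloads.csv", newline="") as f:
    rows = list(csv.DictReader(f))

with conn:  # single transaction: either the whole replace lands or nothing does
    conn.execute("DELETE FROM downloads")
    conn.executemany(
        "INSERT INTO downloads (timestamp, episode, ip, bytes_sent) VALUES (?, ?, ?, ?)",
        [(r["timestamp"], r["episode"], r["ip"], r["bytes_sent"]) for r in rows],
    )

    # Rebuild processed_months from the timestamps actually present in the export.
    conn.execute("DELETE FROM processed_months")
    months = sorted({r["timestamp"][:7] for r in rows})  # 'YYYY-MM' prefixes
    conn.executemany("INSERT INTO processed_months (month) VALUES (?)", [(m,) for m in months])
conn.close()
```

Because the replace runs inside one transaction, a failed import leaves the
previous data intact.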
### 3. Query stats
- `python3 ../stats.py`
- `python3 ../stats.py --monthly`
- `python3 ../stats.py --episodes`
- `python3 ../stats.py --year 2018`
- `python3 ../stats.py --episode earl4`
## CLI alternative
If you prefer the command line over the Query Editor, `setup.sh` and
`run_export.sh` automate the same steps via `aws athena` commands.
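A boto3 equivalent of those scripts looks roughly like the sketch below. The
results bucket, region, and polling interval are assumptions; the committed
shell scripts are authoritative.

```python
import time
import boto3

# Assumed names: the region and Athena results bucket/prefix are placeholders.
REGION = "us-east-1"
OUTPUT = "s3://my-athena-results/jayflaunts/"

athena = boto3.client("athena", region_name=REGION)
s3 = boto3.client("s3", region_name=REGION)

# Start the export query against the jayflaunts_logs database.
qid = athena.start_query_execution(
    QueryString=open("export.sql").read(),
    QueryExecutionContext={"Database": "jayflaunts_logs"},
    ResultConfiguration={"OutputLocation": OUTPUT},
)["QueryExecutionId"]

# Poll until Athena reports a terminal state (a few minutes for the full scan).
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(10)

if state == "SUCCEEDED":
    # Athena writes the result set as <QueryExecutionId>.csv under OutputLocation.
    bucket, prefix = OUTPUT[len("s3://"):].split("/", 1)
    s3.download_file(bucket, f"{prefix}{qid}.csv", "downloads.csv")
```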
## Files
| File | Purpose |
|---|---|
| `create_table.py` | Create the Glue database + `access_logs` table via boto3 (run once) |
| `setup.sql` | Create Athena DB + external table (run once) |
| `export.sql` | Extraction query used by `run_export.sh` |
| `run_export.sh` | Run the query and download the results CSV |
| `import_csv.py` | CSV → `../downloads.db` |
| `downloads.csv` | Athena output (not committed; produced by `run_export.sh`) |
## Cost
Athena charges ~$5 per TB scanned. S3 access log files are tiny (< 1 KB each). 15M files × ~500 bytes ≈ 7.5 GB → roughly $0.04 for a full history scan.
## Notes
- `import_csv.py` does a full replace of the downloads table each run. Re-run it whenever you want to refresh with a new Athena export.
- The parent workflow (`import_logs.py`) and this one share `../downloads.db`. Using one will overwrite the other's data. Athena covers all years, so after a successful Athena import there is no reason to use the local workflow.
- `setup.sql` uses `CREATE ... IF NOT EXISTS`, so it is safe to re-run.