Best Practices

Recipes, patterns, and integration guides for getting the most out of xlfilldown.

CLI Recipes

Common command-line patterns for everyday use.


Basic hierarchical fill to SQLite

The most common use case: a grouped .xlsx export with Region/Country/City hierarchy. Fill down, add audit columns, and load into SQLite.

$ xlfilldown db \
    --infile quarterly_sales.xlsx \
    --insheet "Q4 Data" \
    --header-row 1 \
    --fill-cols '["Region","Country","City"]' \
    --db sales.db \
    --row-hash \
    --excel-row-numbers \
    --drop-blank-rows

Independent fill using column letters

When columns are independent (not hierarchical), use independent mode. Reference columns by their Excel letter instead of header name; this is useful when headers are long or contain special characters.

$ xlfilldown xlsx \
    --infile inventory.xlsx \
    --insheet "Stock" \
    --header-row 1 \
    --fill-cols-letters A C E \
    --fill-mode independent \
    --outfile inventory_filled.xlsx

Replace an existing table on re-run

When running the same ingest repeatedly (e.g. weekly report refresh), use --if-exists replace to drop and recreate the table each time.

$ xlfilldown db \
    --infile weekly_report.xlsx \
    --insheet "Data" \
    --header-row 1 \
    --fill-cols '["Department","Team"]' \
    --db reports.db \
    --table weekly_headcount \
    --if-exists replace

Append incremental data

For incremental loads where each run adds new rows (e.g. monthly exports appended to an annual table), use --if-exists append.

$ xlfilldown db \
    --infile march_data.xlsx \
    --insheet "Sheet1" \
    --header-row 1 \
    --fill-cols '["Category","Subcategory"]' \
    --db annual.db \
    --table transactions \
    --row-hash \
    --excel-row-numbers \
    --if-exists append

Note: When appending, the existing table schema must exactly match the expected columns (including row_hash and excel_row if enabled). xlfilldown validates this before inserting.
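
You can also run the same check as a pre-flight step in your own scripts. A minimal sketch using the stdlib sqlite3 module; the expected column list here is hypothetical and should match your own append target:

```python
import sqlite3

def existing_columns(db_path: str, table: str) -> list[str]:
    """Return the column names of an existing SQLite table ([] if it doesn't exist)."""
    with sqlite3.connect(db_path) as conn:
        info = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    return [row[1] for row in info]  # row[1] holds the column name

# Hypothetical expected schema for the append target
expected = ["row_hash", "excel_row", "Category", "Subcategory", "Amount"]
actual = existing_columns("annual.db", "transactions")
if actual and actual != expected:
    raise SystemExit(f"Schema mismatch: {actual} != {expected}")
```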


Raw ingest for audit baseline

Load a sheet exactly as-is (no fill-down) but with audit columns. Useful for keeping a raw copy alongside the filled version for comparison.

# Raw copy
$ xlfilldown db \
    --infile data.xlsx --insheet "Sheet1" --header-row 1 \
    --ingest-mode raw \
    --db audit.db --table raw_data \
    --row-hash --excel-row-numbers --if-exists replace

# Filled copy in the same database
$ xlfilldown db \
    --infile data.xlsx --insheet "Sheet1" --header-row 1 \
    --fill-cols '["Region","Country"]' \
    --db audit.db --table filled_data \
    --row-hash --excel-row-numbers --if-exists replace

Filter rows with require-non-null

Drop rows where critical columns are blank after filling. Combine with --drop-blank-rows to also remove spacer rows.

$ xlfilldown db \
    --infile report.xlsx \
    --insheet "Financials" \
    --header-row 3 \
    --fill-cols '["Division","BU"]' \
    --drop-blank-rows \
    --require-non-null '["Division","Amount"]' \
    --db clean.db

Python API Recipes

Using xlfilldown programmatically in scripts and applications.


Basic script usage

The simplest way to use xlfilldown in a Python script: one import and one function call take you from messy .xlsx to clean SQLite.

from xlfilldown.api import ingest_excel_to_sqlite

summary = ingest_excel_to_sqlite(
    file="report.xlsx",
    sheet="Sheet1",
    header_row=1,
    fill_cols=["Region", "Country"],
    db="output.db",
    row_hash=True,
    excel_row_numbers=True,
    if_exists="replace",
)

print(f"Ingested {summary['rows_ingested']} rows into {summary['table']}")
print(f"Columns: {summary['columns']}")

Excel-to-Excel transformation

Read a messy .xlsx, fill it down, and write a clean .xlsx. No database involved.

from xlfilldown.api import ingest_excel_to_excel

summary = ingest_excel_to_excel(
    file="raw_export.xlsx",
    sheet="Data",
    header_row=1,
    fill_cols=["Department", "Team", "Role"],
    fill_mode="hierarchical",
    outfile="clean_export.xlsx",
    outsheet="Filled",
    drop_blank_rows=True,
    if_exists="replace",
)

print(f"Wrote {summary['rows_written']} rows")

Process multiple sheets from one workbook

Loop through several sheets and load each into its own SQLite table.

from xlfilldown.api import ingest_excel_to_sqlite

sheets = [
    ("North America", ["Country", "State"]),
    ("EMEA", ["Country", "City"]),
    ("APAC", ["Country", "Region"]),
]

for sheet_name, cols in sheets:
    summary = ingest_excel_to_sqlite(
        file="global_sales.xlsx",
        sheet=sheet_name,
        header_row=1,
        fill_cols=cols,
        db="sales.db",
        row_hash=True,
        if_exists="replace",
    )
    print(f"{sheet_name}: {summary['rows_ingested']} rows")

Using column letters in the API

When you know column positions but not header names (common with inconsistent exports), use fill_cols_letters.

from xlfilldown.api import ingest_excel_to_sqlite

summary = ingest_excel_to_sqlite(
    file="messy_export.xlsx",
    sheet="Sheet1",
    header_row=2,                   # Headers on row 2
    fill_cols_letters=["A", "B", "C"],  # First three columns
    db="output.db",
    if_exists="replace",
)

Integration Patterns

Embed xlfilldown into web apps, pipelines, and automation workflows.


Flask upload endpoint

Accept an .xlsx upload, process it with xlfilldown, and return a summary. Users upload via a form or API call.

import os
import tempfile
from flask import Flask, request, jsonify
from xlfilldown.api import ingest_excel_to_sqlite

app = Flask(__name__)

@app.route("/ingest", methods=["POST"])
def ingest():
    uploaded = request.files["file"]
    sheet = request.form.get("sheet", "Sheet1")
    fill_cols = request.form.getlist("fill_cols")

    with tempfile.TemporaryDirectory() as tmp:
        inpath = os.path.join(tmp, "upload.xlsx")
        dbpath = os.path.join(tmp, "result.db")
        uploaded.save(inpath)

        summary = ingest_excel_to_sqlite(
            file=inpath,
            sheet=sheet,
            header_row=1,
            fill_cols=fill_cols,
            db=dbpath,
            row_hash=True,
            excel_row_numbers=True,
            if_exists="replace",
        )

    return jsonify(summary)

FastAPI upload endpoint

Same idea with FastAPI and async file handling.

import json
import tempfile
from pathlib import Path
from fastapi import FastAPI, UploadFile, Form
from xlfilldown.api import ingest_excel_to_sqlite

app = FastAPI()

@app.post("/ingest")
async def ingest(
    file: UploadFile,
    sheet: str = Form("Sheet1"),
    fill_cols: str = Form('["Region","Country"]'),
):
    cols = json.loads(fill_cols)

    with tempfile.NamedTemporaryFile(suffix=".xlsx", delete=False) as tmp:
        tmp.write(await file.read())
        tmp_path = tmp.name

    db_path = tmp_path.replace(".xlsx", ".db")
    summary = ingest_excel_to_sqlite(
        file=tmp_path,
        sheet=sheet,
        header_row=1,
        fill_cols=cols,
        db=db_path,
        row_hash=True,
        if_exists="replace",
    )

    Path(tmp_path).unlink(missing_ok=True)
    Path(db_path).unlink(missing_ok=True)  # the result db is temporary too; the summary is the response
    return summary

Shell script for cron / CI

A simple shell script that runs in cron or a CI pipeline. Picks up a report from a drop directory, processes it, and logs the result.

#!/bin/bash
# /opt/etl/ingest_weekly.sh
# Runs every Monday at 6am via cron

set -euo pipefail

INFILE="/data/incoming/weekly_report.xlsx"
DB="/data/warehouse/reports.db"
LOG="/var/log/etl/weekly_ingest.log"

echo "[$(date -Iseconds)] Starting ingest" >> "$LOG"

xlfilldown db \
    --infile "$INFILE" \
    --insheet "Data" \
    --header-row 1 \
    --fill-cols '["Region","Country","City"]' \
    --db "$DB" \
    --table weekly_sales \
    --row-hash \
    --excel-row-numbers \
    --drop-blank-rows \
    --if-exists replace \
    >> "$LOG" 2>&1

echo "[$(date -Iseconds)] Done" >> "$LOG"

Docker one-liner

Run xlfilldown in a container without installing anything on the host. Mount your data directory and go.

$ docker run --rm \
    -v /path/to/data:/data \
    python:3.12-slim \
    sh -c "pip install -q xlfilldown && xlfilldown db \
      --infile /data/report.xlsx \
      --insheet Sheet1 \
      --header-row 1 \
      --fill-cols '[\"Region\",\"Country\"]' \
      --db /data/output.db \
      --row-hash --excel-row-numbers \
      --if-exists replace"

Airflow task

Use xlfilldown as a step in an Airflow DAG, here via the TaskFlow API (a classic PythonOperator works the same way).

from airflow.decorators import task

@task()
def ingest_sales_report(infile: str, db: str):
    from xlfilldown.api import ingest_excel_to_sqlite

    summary = ingest_excel_to_sqlite(
        file=infile,
        sheet="Sheet1",
        header_row=1,
        fill_cols=["Region", "Country"],
        db=db,
        row_hash=True,
        excel_row_numbers=True,
        drop_blank_rows=True,
        if_exists="replace",
    )
    return summary

Data Quality Patterns

Use audit columns and hashing for validation and debugging.


Deduplication with row hashes

After ingesting with --row-hash, query for duplicate rows. Identical content produces identical hashes.

-- Find duplicate rows
SELECT row_hash, COUNT(*) AS cnt
FROM Sheet1
GROUP BY row_hash
HAVING cnt > 1
ORDER BY cnt DESC;

-- Keep only the first occurrence
DELETE FROM Sheet1
WHERE rowid NOT IN (
    SELECT MIN(rowid) FROM Sheet1 GROUP BY row_hash
);

Trace back to the source spreadsheet

With --excel-row-numbers, every record carries its original .xlsx row number. When a downstream user reports a problem, you can pinpoint the exact source row.

-- "This revenue figure looks wrong": find the source row
SELECT excel_row, Region, Country, City, Product, Revenue
FROM Sheet1
WHERE Revenue = '99999';

-- Result: excel_row = 47
-- Open sales_report.xlsx, go to row 47, verify the value

Compare raw vs filled data

Ingest twice (once raw, once filled) and compare. Useful for validating that the fill logic matches your expectations.

from xlfilldown.api import ingest_excel_to_sqlite

# Raw ingest
ingest_excel_to_sqlite(
    file="data.xlsx", sheet="Sheet1", header_row=1,
    ingest_mode="raw",
    db="compare.db", table="raw",
    excel_row_numbers=True, if_exists="replace",
)

# Filled ingest
ingest_excel_to_sqlite(
    file="data.xlsx", sheet="Sheet1", header_row=1,
    fill_cols=["Region", "Country"],
    db="compare.db", table="filled",
    excel_row_numbers=True, if_exists="replace",
)

Then compare the two tables in SQL:

-- Find rows where fill-down changed a value
SELECT
    r.excel_row,
    r.Region AS raw_region,
    f.Region AS filled_region,
    r.Country AS raw_country,
    f.Country AS filled_country
FROM raw r
JOIN filled f ON r.excel_row = f.excel_row
WHERE r.Region != f.Region OR r.Country != f.Country
   OR (r.Region IS NULL AND f.Region IS NOT NULL)
   OR (r.Country IS NULL AND f.Country IS NOT NULL);
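
If you'd rather drive the comparison from the same Python script, here's a minimal sketch using the stdlib sqlite3 module. It checks one column at a time and assumes the raw/filled tables created above; the helper name is my own:

```python
import sqlite3

def fill_changes(db_path: str, col: str) -> list[tuple]:
    """Rows where the filled table has a value the raw table left blank."""
    query = f'''
        SELECT r.excel_row, f."{col}"
        FROM raw r
        JOIN filled f ON r.excel_row = f.excel_row
        WHERE r."{col}" IS NULL AND f."{col}" IS NOT NULL
    '''
    with sqlite3.connect(db_path) as conn:
        return conn.execute(query).fetchall()

# e.g. fill_changes("compare.db", "Region")
```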

Design Patterns

Architectural approaches for common scenarios.


Hierarchical vs independent - choosing the right mode

Use hierarchical when columns have a parent-child relationship (Region > Country > City). A change in a parent column should invalidate stale child values.

Use independent when columns are unrelated dimensions that happen to share blank patterns. For example, "Salesperson" and "Product Line" might both be sparse but don't depend on each other.

Rule of thumb: If you would use a tree diagram to describe the relationship between columns, use hierarchical. If the columns are flat attributes, use independent.
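
To make the difference concrete, here is a small self-contained sketch of the two fill semantics. This illustrates the behavior only; it is not xlfilldown's actual implementation:

```python
def fill_hierarchical(rows, cols):
    """Carry values down; a fresh value in a parent column resets the
    carried values of every column to its right (its children)."""
    carried = {c: None for c in cols}
    out = []
    for row in rows:
        row = dict(row)
        for i, col in enumerate(cols):
            if row[col] not in (None, ""):
                carried[col] = row[col]
                for child in cols[i + 1:]:  # parent changed: drop stale children
                    carried[child] = None
            else:
                row[col] = carried[col]
        out.append(row)
    return out

def fill_independent(rows, cols):
    """Carry each column down independently of the others."""
    carried = {c: None for c in cols}
    out = []
    for row in rows:
        row = dict(row)
        for col in cols:
            if row[col] not in (None, ""):
                carried[col] = row[col]
            else:
                row[col] = carried[col]
        out.append(row)
    return out

rows = [
    {"Region": "EMEA", "Country": "France"},
    {"Region": "",     "Country": ""},
    {"Region": "APAC", "Country": ""},
]
# Hierarchical: the fresh "APAC" invalidates the carried Country,
# so row 3's Country stays empty. Independent: "France" is carried in.
```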


Header row is not row 1

Many .xlsx exports have title rows, blank rows, or metadata above the actual headers. Use --header-row to point at the correct row. Everything above that row is ignored.

# Headers are on row 3 (rows 1-2 are title/metadata)
$ xlfilldown db \
    --infile finance_report.xlsx \
    --insheet "P&L" \
    --header-row 3 \
    --fill-cols '["Account Group","Account"]' \
    --db finance.db

Batch size tuning for large files

The default batch size of 1000 rows works well for most files. For very large sheets (100K+ rows), a larger batch size reduces SQLite transaction overhead.

$ xlfilldown db \
    --infile huge_export.xlsx \
    --insheet "Data" \
    --header-row 1 \
    --fill-cols '["Category"]' \
    --db warehouse.db \
    --batch-size 5000

Combining require-non-null with letters

Mix header-name and column-letter approaches for filtering. Both are merged and de-duplicated internally.

$ xlfilldown db \
    --infile data.xlsx \
    --insheet "Sheet1" \
    --header-row 1 \
    --fill-cols-letters A B C \
    --require-non-null '["Amount"]' \
    --require-non-null-letters A \
    --db filtered.db

Quick Reference

At-a-glance summary of the most common options.

Task                          Key Options
----                          -----------
Fill by header names          --fill-cols '["Col1","Col2"]'
Fill by column letters        --fill-cols-letters A B C
Hierarchical fill (default)   --fill-mode hierarchical
Independent fill              --fill-mode independent
Skip fill-down                --ingest-mode raw
Add row hashes                --row-hash
Add source row numbers        --excel-row-numbers
Drop spacer rows              --drop-blank-rows
Require non-null columns      --require-non-null '["Col"]'
Overwrite on re-run           --if-exists replace
Append to existing            --if-exists append
Custom table name             --table my_table
Tune batch size               --batch-size 5000

API Reference

Key functions and their return values.

Function                   Output            Returns
--------                   ------            -------
ingest_excel_to_sqlite()   SQLite database   table, columns, rows_ingested, row_hash, excel_row_numbers
ingest_excel_to_excel()    .xlsx file        workbook, sheet, columns, rows_written, row_hash, excel_row_numbers
normalize_headers()        list[str]         trimmed, normalized header strings
canon_list()               str               JSON-format canonical string
sha256_hex()               str               64-char hex digest
qident()                   str               safely quoted SQLite identifier
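
The canon_list/sha256_hex pair describes the row-hash pipeline: a row's cell values are serialized to a canonical string, then digested. As an illustration only (a stdlib sketch of the same shape; xlfilldown's exact canonicalization may differ), a content hash of this kind looks like:

```python
import hashlib
import json

def row_hash(values: list) -> str:
    """Canonical-JSON-serialize a row's cell values, then SHA-256 hex digest."""
    canon = json.dumps(values, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()

h = row_hash(["EMEA", "France", "Paris", "1234.5"])
# 64 hex characters; identical cell values always produce an identical hash
```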