@lewtun
Last active April 11, 2026 10:44
Trackio debug

Primary root cause

  • In Spaces, Trackio forces SQLite into DELETE journal mode on every connection, not just once at DB creation. See trl-internal/lib/python3.11/site-packages/trackio/sqlite_storage.py:43 and trl-internal/lib/python3.11/site-packages/trackio/sqlite_storage.py:101.
  • Write paths do take a per-project file lock, for example in init_db and bulk_log. See trl-internal/lib/python3.11/site-packages/trackio/sqlite_storage.py:143 and trl-internal/lib/python3.11/site-packages/trackio/sqlite_storage.py:623.
  • Read/UI paths do not take that lock. get_alerts and get_logs open DB connections directly, and those connections still run _configure_sqlite_pragmas(). See trl-internal/lib/python3.11/site-packages/trackio/sqlite_storage.py:809 and trl-internal/lib/python3.11/site-packages/trackio/sqlite_storage.py:967. Your container traces are failing in that pragma path before the actual SELECT.
  • So the UI polling endpoints are not really “read-only” from a DB-behavior perspective. They open unsynchronized connections and run a journal-setting pragma while concurrent writes are happening.

That gives a plausible failure chain:

  1. Concurrent log writes enter bulk_log.
  2. One writer acquires the Trackio file lock and then blocks inside SQLite because of concurrent readers / connection setup on the same DB.
  3. Other writers wait on the Trackio file lock and fail with "Could not acquire database lock after 10 seconds".
  4. Once the DB gets into a bad state, read paths start raising sqlite3.DatabaseError with the messages "database disk image is malformed" and "file is not a database".
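
Steps 2 and 3 of this chain can be sketched with plain threads. This is an illustration under stated assumptions, not Trackio's implementation: a threading.Lock stands in for the per-project file lock, a sleep stands in for a writer blocked inside SQLite, and the 10-second timeout is shortened to 0.2 seconds so the example runs quickly. The names write_batch and LOCK_TIMEOUT_S are hypothetical.

```python
import threading
import time

db_lock = threading.Lock()  # stand-in for the per-project file lock
LOCK_TIMEOUT_S = 0.2        # shortened stand-in for the 10 s timeout


def write_batch(name: str, hold_s: float, results: dict) -> None:
    """Try to take the lock, then 'write' for hold_s seconds."""
    if not db_lock.acquire(timeout=LOCK_TIMEOUT_S):
        results[name] = "timeout"  # analogous to the lock-acquisition error
        return
    try:
        time.sleep(hold_s)  # stands in for a write blocked inside SQLite
        results[name] = "ok"
    finally:
        db_lock.release()


results: dict = {}

# writer-1 takes the lock and stays blocked for longer than the timeout.
slow = threading.Thread(target=write_batch, args=("writer-1", 1.0, results))
slow.start()
time.sleep(0.05)  # give writer-1 time to acquire the lock first

# The remaining writers queue up behind it and give up.
others = [
    threading.Thread(target=write_batch, args=(f"writer-{i}", 0.0, results))
    for i in (2, 3)
]
for thread in others:
    thread.start()
for thread in [slow, *others]:
    thread.join()

print(results)
```

The point of the sketch is that the writers that time out are not themselves slow. They fail only because the single lock holder is stuck on something the lock does not cover.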

Why Trackio doesn’t recover

  • The server retry queue only catches sqlite3.OperationalError. See trl-internal/lib/python3.11/site-packages/trackio/server.py:39 and trl-internal/lib/python3.11/site-packages/trackio/server.py:417.
  • Your actual failures are OSError and sqlite3.DatabaseError, so they bypass the queue entirely and bubble up.
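
The exception hierarchy explains why those errors slip past. In the standard library, sqlite3.OperationalError is a subclass of sqlite3.DatabaseError, so a handler for the former never matches the latter, and OSError is unrelated to both. The classify helper below is a hypothetical stand-in for a retry handler that only catches OperationalError; it is not Trackio's server code.

```python
import sqlite3


def classify(exc: Exception) -> str:
    """Report whether a handler for OperationalError alone would catch exc."""
    try:
        raise exc
    except sqlite3.OperationalError:
        return "retried"
    except Exception:
        return "propagates"


outcomes = {
    "locked": classify(sqlite3.OperationalError("database is locked")),
    "malformed": classify(
        sqlite3.DatabaseError("database disk image is malformed")
    ),
    "os_error": classify(OSError("Could not acquire database lock")),
}

# OperationalError derives from DatabaseError, not the other way around.
operational_is_database = issubclass(
    sqlite3.OperationalError, sqlite3.DatabaseError
)
database_is_operational = issubclass(
    sqlite3.DatabaseError, sqlite3.OperationalError
)

print(outcomes, operational_is_database, database_is_operational)
```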

Secondary corruption vector

  • If this Space has dataset sync enabled, Trackio has another unsafe path: export_to_parquet() and import_from_parquet() touch the same DB without taking the same process lock. See trl-internal/lib/python3.11/site-packages/trackio/sqlite_storage.py:330, trl-internal/lib/python3.11/site-packages/trackio/sqlite_storage.py:442, and trl-internal/lib/python3.11/site-packages/trackio/sqlite_storage.py:545.
  • import_from_parquet() rewrites tables with if_exists="replace" and no process lock. If that path is active, it can turn lock contention into actual DB corruption very easily.

Bottom line

The core design bug is: Trackio serializes writes with its own file lock, but it does not serialize reads, and in Spaces every connection still mutates SQLite connection state with PRAGMA journal_mode = DELETE. Under heavy concurrent logging plus UI polling, that is enough to produce exactly the lock timeout and malformed-database pattern you saw.
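
One possible direction for testing the hypothesis, sketched here with the standard library only, is to give read paths a connection that cannot write and that runs no pragmas at all. This is an assumption about what a mitigation could look like, not a patch against Trackio: open_read_only is a hypothetical helper, and it relies on SQLite's URI syntax with mode=ro.

```python
import os
import sqlite3
import tempfile
from pathlib import Path


def open_read_only(path: str) -> sqlite3.Connection:
    """Open a connection that cannot write and runs no pragmas."""
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


db_path = os.path.join(tempfile.mkdtemp(), "demo.db")

# A writer creates the table and logs one row.
writer = sqlite3.connect(db_path)
writer.execute("CREATE TABLE logs (step INTEGER, loss REAL)")
writer.execute("INSERT INTO logs VALUES (1, 0.5)")
writer.commit()
writer.close()

# The reader can SELECT but any write attempt is rejected by SQLite.
reader = open_read_only(db_path)
rows = reader.execute("SELECT step, loss FROM logs").fetchall()
try:
    reader.execute("INSERT INTO logs VALUES (2, 0.4)")
    write_rejected = False
except sqlite3.OperationalError:
    write_rejected = True
finally:
    reader.close()

print(rows, write_rejected)
```

This only narrows what readers can do to the file. It does not by itself synchronize readers with writers, so it would be a way to check whether the pragma path is the trigger rather than a complete fix.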

If you want, I can turn this into a concrete minimal patch against the installed Trackio code to validate the hypothesis.

@lewtun

lewtun commented Apr 11, 2026

Debug script

#!/usr/bin/env python
import argparse
import re
from pathlib import Path

from datasets import load_dataset

from trl import SFTConfig, SFTTrainer


DEFAULT_MODEL_ID = "Qwen/Qwen3-4B-Instruct-2507"
DEFAULT_DATASET_NAME = "trl-lib/Capybara"
DEFAULT_PROJECT = "sft-trackio-429-debug"
DEFAULT_TRACKIO_SPACE_ID = "lewtun/sft-trackio-429-debug"


def sanitize_name(value: str) -> str:
    sanitized = re.sub(r"[^A-Za-z0-9._-]+", "-", value).strip("-").lower()
    if not sanitized:
        raise ValueError(f"Could not derive a valid name from {value!r}")
    return sanitized


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Minimal Qwen 4B SFT Trackio repro.")
    parser.add_argument("--run-name", required=True, help="Unique Trackio run name.")
    parser.add_argument(
        "--model-id",
        default=DEFAULT_MODEL_ID,
        help=f"Model to fine-tune. Defaults to {DEFAULT_MODEL_ID}.",
    )
    parser.add_argument(
        "--dataset-name",
        default=DEFAULT_DATASET_NAME,
        help=f"Public dataset to use. Defaults to {DEFAULT_DATASET_NAME}.",
    )
    parser.add_argument(
        "--project",
        default=DEFAULT_PROJECT,
        help=f"Shared Trackio project. Defaults to {DEFAULT_PROJECT}.",
    )
    parser.add_argument(
        "--trackio-space-id",
        default=DEFAULT_TRACKIO_SPACE_ID,
        help=f"Shared Trackio Space. Defaults to {DEFAULT_TRACKIO_SPACE_ID}.",
    )
    parser.add_argument("--hub-model-id", help="Hub repo to push the final model to.")
    parser.add_argument("--output-dir", help="Local output directory.")
    parser.add_argument("--max-steps", type=int, default=1000, help="Number of optimizer steps.")
    parser.add_argument("--max-length", type=int, default=1024, help="Token sequence length.")
    parser.add_argument("--dataset-num-proc", type=int, default=1, help="Dataset preprocessing workers.")
    parser.add_argument("--learning-rate", type=float, default=1e-6, help="AdamW learning rate.")
    parser.add_argument("--seed", type=int, default=42, help="Random seed.")
    return parser.parse_args()


def build_training_args(args: argparse.Namespace) -> SFTConfig:
    sanitized_run_name = sanitize_name(args.run_name)
    output_dir = args.output_dir or f"scratch/outputs/{sanitized_run_name}"
    hub_model_id = args.hub_model_id or f"lewtun/qwen3-4b-sft-trackio-429-debug-{sanitized_run_name}"

    Path(output_dir).mkdir(parents=True, exist_ok=True)

    return SFTConfig(
        output_dir=output_dir,
        hub_model_id=hub_model_id,
        run_name=args.run_name,
        project=args.project,
        trackio_space_id=args.trackio_space_id,
        bf16=True,
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": False},
        max_length=args.max_length,
        packing=False,
        dataset_num_proc=args.dataset_num_proc,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,
        learning_rate=args.learning_rate,
        logging_strategy="steps",
        logging_steps=1,
        eval_strategy="no",
        save_strategy="no",
        max_steps=args.max_steps,
        report_to="trackio",
        push_to_hub=True,
        seed=args.seed,
        model_init_kwargs={"dtype": "bfloat16"},
    )


def main() -> None:
    args = parse_args()

    train_dataset = load_dataset(args.dataset_name, split="train")
    training_args = build_training_args(args)

    trainer = SFTTrainer(
        model=args.model_id,
        args=training_args,
        train_dataset=train_dataset,
    )

    trainer.train()
    trainer.save_model(training_args.output_dir)
    trainer.push_to_hub(dataset_name=args.dataset_name)


if __name__ == "__main__":
    main()

@lewtun

lewtun commented Apr 11, 2026

Strangely, we don't seem to render any logs at all, even though the bucket exists: https://huggingface.co/buckets/lewtun/sft-trackio-429-debug-bucket

Space: https://huggingface.co/spaces/lewtun/sft-trackio-429-debug

Screenshot 2026-04-11 at 12 39 12
