Trackio debug

Primary root cause

In Spaces, Trackio forces SQLite into DELETE journal mode on every connection, not just once at DB creation. See trl-internal/lib/python3.11/site-packages/trackio/sqlite_storage.py:43 and trl-internal/lib/python3.11/site-packages/trackio/sqlite_storage.py:101.
Write paths do take a per-project file lock, for example in init_db and bulk_log. See trl-internal/lib/python3.11/site-packages/trackio/sqlite_storage.py:143 and trl-internal/lib/python3.11/site-packages/trackio/sqlite_storage.py:623.
Read/UI paths do not take that lock. get_alerts and get_logs open DB connections directly, and those connections still run _configure_sqlite_pragmas(). See trl-internal/lib/python3.11/site-packages/trackio/sqlite_storage.py:809 and trl-internal/lib/python3.11/site-packages/trackio/sqlite_storage.py:967. Your container traces are failing in that pragma path before the actual SELECT.
So the UI polling endpoints are not really “read-only” from a DB-behavior perspective. They open unsynchronized connections and run a journal-setting pragma while concurrent writes are happening.
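To make the "not really read-only" point concrete, here is a minimal standalone sketch using only the standard library. It is not Trackio's code: `connect_like_read_path` is a hypothetical stand-in for a read path that configures pragmas on connect, and the demo assumes, purely for illustration, that the file was previously in WAL mode. It shows that a connection which only intends to SELECT can still change the database's persistent journal mode.

```python
import os
import sqlite3
import tempfile


def connect_like_read_path(db_path: str) -> tuple[sqlite3.Connection, str, str]:
    """Hypothetical read path: set the journal mode on connect, then hand back the connection."""
    conn = sqlite3.connect(db_path, timeout=1.0)
    before = conn.execute("PRAGMA journal_mode").fetchone()[0]
    # WAL is a persistent property of the database file, so switching away
    # from it changes state that every other connection will observe.
    after = conn.execute("PRAGMA journal_mode = DELETE").fetchone()[0]
    return conn, before, after


def demo() -> tuple[str, str]:
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "demo.db")

        # Assumption for the demo: some earlier process left the file in WAL mode.
        writer = sqlite3.connect(db_path)
        writer.execute("PRAGMA journal_mode = WAL")
        writer.execute("CREATE TABLE metrics (step INTEGER, value REAL)")
        writer.commit()
        writer.close()

        # The "reader" changes the journal mode before it runs its SELECT.
        reader, before, after = connect_like_read_path(db_path)
        reader.execute("SELECT COUNT(*) FROM metrics").fetchone()
        reader.close()
        return before, after


before, after = demo()
print(before, "->", after)
```

If the file is already in DELETE mode the pragma is effectively a no-op, so this sketch only shows why the pragma cannot be treated as harmless in general. It does not reproduce the corruption itself.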

That gives a plausible failure chain:

Concurrent log writes enter bulk_log.
One writer acquires the Trackio file lock and then blocks inside SQLite because of concurrent readers / connection setup on the same DB.
Other writers wait on the Trackio file lock and hit "Could not acquire database lock after 10 seconds".
Once the DB gets into a bad state, read paths start throwing sqlite3.DatabaseError with the messages "database disk image is malformed" and "file is not a database".
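The lock-timeout step of that chain can be sketched in isolation. The code below is an in-process analogy only: `ProjectLock` is a hypothetical stand-in built on `threading.Lock`, not Trackio's file lock, and the `time.sleep` stands in for a writer stuck inside SQLite. The timeouts are shortened so the example finishes quickly.

```python
import threading
import time


class ProjectLock:
    """Hypothetical stand-in for a per-project lock with an acquire timeout."""

    def __init__(self, timeout: float) -> None:
        self._lock = threading.Lock()
        self._timeout = timeout

    def __enter__(self) -> "ProjectLock":
        if not self._lock.acquire(timeout=self._timeout):
            raise TimeoutError(
                f"Could not acquire database lock after {self._timeout} seconds"
            )
        return self

    def __exit__(self, *exc_info) -> None:
        self._lock.release()


def simulate(hold_seconds: float, timeout: float, waiters: int) -> list[str]:
    lock = ProjectLock(timeout)
    results: list[str] = []
    holder_ready = threading.Event()

    def stalled_writer() -> None:
        with lock:
            holder_ready.set()
            time.sleep(hold_seconds)  # stands in for a write blocked inside SQLite
        results.append("holder: done")

    def waiting_writer(index: int) -> None:
        try:
            with lock:
                results.append(f"writer {index}: wrote")
        except TimeoutError:
            results.append(f"writer {index}: timeout")

    holder = threading.Thread(target=stalled_writer)
    holder.start()
    holder_ready.wait()  # make sure the holder owns the lock first

    threads = [threading.Thread(target=waiting_writer, args=(i,)) for i in range(waiters)]
    for thread in threads:
        thread.start()
    for thread in threads + [holder]:
        thread.join()
    return results


outcome = simulate(hold_seconds=0.5, timeout=0.1, waiters=3)
print(outcome)
```

The point of the sketch is that the waiting writers fail without ever touching the database: one stalled lock holder is enough to turn every other write into a timeout.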

Why Trackio doesn’t recover

The server retry queue only catches sqlite3.OperationalError. See trl-internal/lib/python3.11/site-packages/trackio/server.py:39 and trl-internal/lib/python3.11/site-packages/trackio/server.py:417.
Your actual failures are OSError and sqlite3.DatabaseError, so they bypass the queue entirely and bubble up.
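The reason those errors bypass the queue follows from Python's exception hierarchy: `sqlite3.OperationalError` is a subclass of `sqlite3.DatabaseError`, not the other way round, and `OSError` is unrelated to both. The sketch below uses a hypothetical `run_with_retry` wrapper (not Trackio's actual queue) that catches only `OperationalError`, and counts how many attempts each exception type gets.

```python
import sqlite3


def run_with_retry(operation, attempts: int = 3):
    """Hypothetical retry wrapper that only treats OperationalError as retryable."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except sqlite3.OperationalError as err:
            last_error = err  # retry on the next loop iteration
    raise last_error


def attempts_made(exc: Exception) -> int:
    """Count how many times an always-failing operation is attempted."""
    calls = 0

    def failing():
        nonlocal calls
        calls += 1
        raise exc

    try:
        run_with_retry(failing)
    except Exception:
        pass  # only the attempt count matters here
    return calls


locked = attempts_made(sqlite3.OperationalError("database is locked"))
malformed = attempts_made(sqlite3.DatabaseError("database disk image is malformed"))
os_error = attempts_made(OSError("Could not acquire database lock after 10 seconds"))
print(locked, malformed, os_error)
```

Only the `OperationalError` case gets more than one attempt. The other two propagate on the first failure.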

Secondary corruption vector

If this Space has dataset sync enabled, Trackio has another unsafe path: export_to_parquet() and import_from_parquet() touch the same DB without taking the same process lock. See trl-internal/lib/python3.11/site-packages/trackio/sqlite_storage.py:330, trl-internal/lib/python3.11/site-packages/trackio/sqlite_storage.py:442, and trl-internal/lib/python3.11/site-packages/trackio/sqlite_storage.py:545.
import_from_parquet() rewrites tables with if_exists="replace" and no process lock. If that path is active, it can turn lock contention into actual DB corruption very easily.
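As a rough illustration of why an unlocked table rewrite is risky, the sketch below replaces a table by dropping and recreating it while another connection polls it. This is an assumption-laden model, not Trackio's or pandas' actual code path: `replace_table` and `poll_row_count` are hypothetical helpers, and the connection runs in autocommit mode so that each statement commits on its own, which may not match how the real import batches its work.

```python
import os
import sqlite3
import tempfile


def replace_table(db_path: str, rows, midway=None) -> None:
    """Hypothetical unlocked 'replace': drop, recreate, reinsert."""
    # isolation_level=None means autocommit: every statement commits independently.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute("DROP TABLE IF EXISTS metrics")
        if midway is not None:
            midway()  # another connection looks at the DB at this point
        conn.execute("CREATE TABLE metrics (step INTEGER, value REAL)")
        conn.executemany("INSERT INTO metrics VALUES (?, ?)", rows)
    finally:
        conn.close()


def poll_row_count(db_path: str) -> str:
    """Hypothetical UI poll: count rows, reporting any SQLite error as text."""
    conn = sqlite3.connect(db_path)
    try:
        return str(conn.execute("SELECT COUNT(*) FROM metrics").fetchone()[0])
    except sqlite3.OperationalError as err:
        return f"error: {err}"
    finally:
        conn.close()


def demo() -> tuple[str, str, str]:
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "demo.db")
        replace_table(db_path, [(1, 0.9), (2, 0.7)])
        count_before = poll_row_count(db_path)

        seen_midway: list[str] = []
        replace_table(
            db_path,
            [(1, 0.9), (2, 0.7), (3, 0.5)],
            midway=lambda: seen_midway.append(poll_row_count(db_path)),
        )
        count_after = poll_row_count(db_path)
        return count_before, seen_midway[0], count_after


count_before, midway_result, count_after = demo()
print(count_before, midway_result, count_after)
```

A concurrent writer hitting the same window would be inserting into a table that is about to be recreated, which is the kind of interleaving a shared lock around both paths would rule out.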

Bottom line

The core design bug is: Trackio serializes writes with its own file lock, but it does not serialize reads, and in Spaces every connection still mutates SQLite connection state with PRAGMA journal_mode = DELETE. Under heavy concurrent logging plus UI polling, that is enough to produce exactly the lock timeout and malformed-database pattern you saw.

If you want, I can turn this into a concrete minimal patch against the installed Trackio code to validate the hypothesis.

#!/usr/bin/env python
import argparse
import re
from pathlib import Path

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

DEFAULT_MODEL_ID = "Qwen/Qwen3-4B-Instruct-2507"
DEFAULT_DATASET_NAME = "trl-lib/Capybara"
DEFAULT_PROJECT = "sft-trackio-429-debug"
DEFAULT_TRACKIO_SPACE_ID = "lewtun/sft-trackio-429-debug"


def sanitize_name(value: str) -> str:
    sanitized = re.sub(r"[^A-Za-z0-9._-]+", "-", value).strip("-").lower()
    if not sanitized:
        raise ValueError(f"Could not derive a valid name from {value!r}")
    return sanitized


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Minimal Qwen 4B SFT Trackio repro.")
    parser.add_argument("--run-name", required=True, help="Unique Trackio run name.")
    parser.add_argument(
        "--model-id",
        default=DEFAULT_MODEL_ID,
        help=f"Model to fine-tune. Defaults to {DEFAULT_MODEL_ID}.",
    )
    parser.add_argument(
        "--dataset-name",
        default=DEFAULT_DATASET_NAME,
        help=f"Public dataset to use. Defaults to {DEFAULT_DATASET_NAME}.",
    )
    parser.add_argument(
        "--project",
        default=DEFAULT_PROJECT,
        help=f"Shared Trackio project. Defaults to {DEFAULT_PROJECT}.",
    )
    parser.add_argument(
        "--trackio-space-id",
        default=DEFAULT_TRACKIO_SPACE_ID,
        help=f"Shared Trackio Space. Defaults to {DEFAULT_TRACKIO_SPACE_ID}.",
    )
    parser.add_argument("--hub-model-id", help="Hub repo to push the final model to.")
    parser.add_argument("--output-dir", help="Local output directory.")
    parser.add_argument("--max-steps", type=int, default=1000, help="Number of optimizer steps.")
    parser.add_argument("--max-length", type=int, default=1024, help="Token sequence length.")
    parser.add_argument("--dataset-num-proc", type=int, default=1, help="Dataset preprocessing workers.")
    parser.add_argument("--learning-rate", type=float, default=1e-6, help="AdamW learning rate.")
    parser.add_argument("--seed", type=int, default=42, help="Random seed.")
    return parser.parse_args()


def build_training_args(args: argparse.Namespace) -> SFTConfig:
    sanitized_run_name = sanitize_name(args.run_name)
    output_dir = args.output_dir or f"scratch/outputs/{sanitized_run_name}"
    hub_model_id = args.hub_model_id or f"lewtun/qwen3-4b-sft-trackio-429-debug-{sanitized_run_name}"
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    return SFTConfig(
        output_dir=output_dir,
        hub_model_id=hub_model_id,
        run_name=args.run_name,
        project=args.project,
        trackio_space_id=args.trackio_space_id,
        bf16=True,
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": False},
        max_length=args.max_length,
        packing=False,
        dataset_num_proc=args.dataset_num_proc,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,
        learning_rate=args.learning_rate,
        logging_strategy="steps",
        logging_steps=1,
        eval_strategy="no",
        save_strategy="no",
        max_steps=args.max_steps,
        report_to="trackio",
        push_to_hub=True,
        seed=args.seed,
        model_init_kwargs={"dtype": "bfloat16"},
    )


def main() -> None:
    args = parse_args()
    train_dataset = load_dataset(args.dataset_name, split="train")
    training_args = build_training_args(args)

    trainer = SFTTrainer(
        model=args.model_id,
        args=training_args,
        train_dataset=train_dataset,
    )
    trainer.train()
    trainer.save_model(training_args.output_dir)
    trainer.push_to_hub(dataset_name=args.dataset_name)


if __name__ == "__main__":
    main()

lewtun/trackio-db-bug.md


lewtun commented Apr 11, 2026
