@Nirav-Madhani
Created April 3, 2026 03:42
Task RL Training (GRPO on GSM8K) - Self-contained Colab notebook
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"gpuType": "A100"
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
},
"accelerator": "GPU"
},
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Task RL Training (GRPO on GSM8K)\n",
"\n",
"**Step 8** of the meditation training pipeline. Trains the model to solve math problems using GRPO with binary correctness reward.\n",
"\n",
"The model still uses its meditation ability (from Step 7) — it meditates then solves.\n",
"\n",
"## Setup\n",
"1. **GPU**: Change runtime to GPU (A100 recommended) via Runtime > Change runtime type\n",
"2. **Secrets**: Add these in the Secrets panel (left sidebar):\n",
" - `HF_TOKEN` — HuggingFace token (read/write)\n",
"3. **Run all cells** (Ctrl+F9)\n",
"\n",
"**No judge needed** — reward is purely programmatic (answer matches GSM8K ground truth).\n",
"\n",
"Auto-resumes from latest HF checkpoint. Checkpoints upload to HF every 10 steps.\n",
"\n",
"**Model**: LFM2.5-1.2B-Thinking (SFT + Meditation RL + fresh LoRA for Task RL)\n",
"**Dataset**: GSM8K train (7473 problems)\n",
"**Reward**: Binary correctness (1.0 if correct, 0.0 if wrong)\n",
"**Repo**: Nirav-Madhani/LFM2.5-1.2B-Meditation"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Cell 1: Load secrets\n",
"from google.colab import userdata\n",
"import os\n",
"\n",
"os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')\n",
"\n",
"# Optional: Gemini key (not needed for Task RL, but set if available)\n",
"try:\n",
" os.environ['GEMINI_PAID_KEY'] = userdata.get('GEMINI_PAID_KEY')\n",
"except Exception:\n",
" pass\n",
"\n",
"print('Secrets loaded')\n",
"print(f'HF_TOKEN: {os.environ[\"HF_TOKEN\"][:8]}...')"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Cell 2: Check GPU\n",
"import torch\n",
"if torch.cuda.is_available():\n",
" name = torch.cuda.get_device_name(0)\n",
" vram = torch.cuda.get_device_properties(0).total_memory / 1024**3\n",
" print(f'GPU: {name} ({vram:.1f} GB)')\n",
"else:\n",
" raise RuntimeError('No GPU! Change runtime type: Runtime -> Change runtime type -> GPU')"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Cell 3: Download training script from HuggingFace\n",
"!pip install -q huggingface_hub\n",
"\n",
"from huggingface_hub import hf_hub_download\n",
"from pathlib import Path\n",
"\n",
"WORK_DIR = Path('/content/meditation')\n",
"WORK_DIR.mkdir(parents=True, exist_ok=True)\n",
"\n",
"HF_REPO = 'Nirav-Madhani/LFM2.5-1.2B-Meditation'\n",
"HF_TOKEN = os.environ['HF_TOKEN']\n",
"\n",
"script_path = hf_hub_download(\n",
" repo_id=HF_REPO,\n",
" filename='task-rl-training.py',\n",
" local_dir=WORK_DIR,\n",
" token=HF_TOKEN,\n",
")\n",
"print(f'Training script: {script_path}')\n",
"\n",
"print('\\nFiles in work dir:')\n",
"for f in sorted(WORK_DIR.rglob('*')):\n",
" if f.is_file():\n",
" print(f' {f.relative_to(WORK_DIR)} ({f.stat().st_size/1024:.0f} KB)')"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Cell 4: Run Task RL training\n",
"# Downloads SFT + meditation RL checkpoints from HF automatically\n",
"# Dataset (GSM8K) is downloaded from HuggingFace Datasets\n",
"# No judge API needed — reward is binary correctness\n",
"\n",
"!cd /content/meditation && python -u task-rl-training.py"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Cell 5 (optional): Check GPU memory\n",
"import torch\n",
"if torch.cuda.is_available():\n",
" alloc = torch.cuda.memory_allocated() / 1024**3\n",
" total = torch.cuda.get_device_properties(0).total_memory / 1024**3\n",
" print(f'GPU Memory: {alloc:.1f} / {total:.1f} GB ({alloc/total*100:.0f}%)')"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Cell 6 (optional): List checkpoints on HuggingFace\n",
"from huggingface_hub import HfApi\n",
"api = HfApi(token=os.environ['HF_TOKEN'])\n",
"files = api.list_repo_files('Nirav-Madhani/LFM2.5-1.2B-Meditation')\n",
"ckpts = sorted([f for f in files if 'task_rl' in f and 'checkpoint' in f])\n",
"print(f'Task RL checkpoints on HF ({len(ckpts)}):')\n",
"for c in ckpts:\n",
" print(f' {c}')"
],
"execution_count": null,
"outputs": []
}
]
}
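
The notebook describes the reward as binary correctness against the GSM8K ground truth, optimized with GRPO. The actual logic lives in `task-rl-training.py`, which is not shown here, so the following is only a minimal sketch of how such a reward and the group-relative advantage could look. All function names are hypothetical, and two things are assumptions: that the prediction is the last number in the completion, and that the advantage is normalized with the population standard deviation. The only dataset fact relied on is that GSM8K reference answers end with `#### <number>`.

```python
import re
import statistics

# Matches integers and decimals, allowing thousands separators (e.g. "1,234.5").
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def gsm8k_ground_truth(answer_field: str) -> str:
    """Return the final answer from a GSM8K `answer` field ("... #### 72")."""
    return answer_field.split("####")[-1].strip().replace(",", "")


def extract_last_number(completion: str):
    """Assumption: the model's final answer is the last number it writes."""
    matches = _NUMBER.findall(completion)
    return matches[-1].replace(",", "") if matches else None


def binary_reward(completion: str, answer_field: str) -> float:
    """1.0 if the predicted number equals the reference numerically, else 0.0."""
    pred = extract_last_number(completion)
    if pred is None:
        return 0.0
    try:
        truth = float(gsm8k_ground_truth(answer_field))
        return 1.0 if float(pred) == truth else 0.0
    except ValueError:
        return 0.0


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each reward normalized within its sampling group.

    Assumption: population standard deviation; implementations differ here.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    reference = "She sells 9 eggs at $2 each. 9 * 2 = 18 #### 18"
    completions = ["9 * 2 = 18, so the answer is 18.", "The answer is 17."]
    rewards = [binary_reward(c, reference) for c in completions]
    print(rewards, group_advantages(rewards))
```

With a binary reward, a group in which every sample is correct (or every sample is wrong) yields all-zero advantages under this normalization, so only prompts with mixed outcomes contribute a learning signal.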