Created
April 3, 2026 03:42
Task RL Training (GRPO on GSM8K) - Self-contained Colab notebook
| { | |
| "nbformat": 4, | |
| "nbformat_minor": 0, | |
| "metadata": { | |
| "colab": { | |
| "provenance": [], | |
| "gpuType": "A100" | |
| }, | |
| "kernelspec": { | |
| "name": "python3", | |
| "display_name": "Python 3" | |
| }, | |
| "language_info": { | |
| "name": "python" | |
| }, | |
| "accelerator": "GPU" | |
| }, | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "# Task RL Training (GRPO on GSM8K)\n", | |
| "\n", | |
| "**Step 8** of the meditation training pipeline. Trains the model to solve math problems using GRPO with a binary correctness reward.\n", | |
| "\n", | |
| "The model still uses its meditation ability (from Step 7) — it meditates then solves.\n", | |
| "\n", | |
| "## Setup\n", | |
| "1. **GPU**: Change runtime to GPU (A100 recommended) via Runtime > Change runtime type\n", | |
| "2. **Secrets**: Add these in the Secrets panel (left sidebar):\n", | |
| "   - `HF_TOKEN`: Hugging Face token with read/write access\n", | |
| "3. **Run all cells** (Ctrl+F9)\n", | |
| "\n", | |
| "**No judge needed** — reward is purely programmatic (answer matches GSM8K ground truth).\n", | |
| "\n", | |
| "Auto-resumes from latest HF checkpoint. Checkpoints upload to HF every 10 steps.\n", | |
| "\n", | |
| "- **Model**: LFM2.5-1.2B-Thinking (SFT + Meditation RL + fresh LoRA for Task RL)\n", | |
| "- **Dataset**: GSM8K train (7473 problems)\n", | |
| "- **Reward**: Binary correctness (1.0 if correct, 0.0 if wrong)\n", | |
| "- **Repo**: Nirav-Madhani/LFM2.5-1.2B-Meditation" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": {}, | |
| "source": [ | |
| "# Cell 1: Load secrets\n", | |
| "from google.colab import userdata\n", | |
| "import os\n", | |
| "\n", | |
| "os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')\n", | |
| "\n", | |
| "# Optional: Gemini key (not needed for Task RL, but set if available)\n", | |
| "try:\n", | |
| "    os.environ['GEMINI_PAID_KEY'] = userdata.get('GEMINI_PAID_KEY')\n", | |
| "except Exception:\n", | |
| "    pass\n", | |
| "\n", | |
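| "print('Secrets loaded')\n", | |
| "# Confirm the token is present without echoing any part of it into the cell output\n", | |
| "print(f'HF_TOKEN set: {bool(os.environ.get(\"HF_TOKEN\"))}')" | |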
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": {}, | |
| "source": [ | |
| "# Cell 2: Check GPU\n", | |
| "import torch\n", | |
| "if torch.cuda.is_available():\n", | |
| "    name = torch.cuda.get_device_name(0)\n", | |
| "    vram = torch.cuda.get_device_properties(0).total_memory / 1024**3\n", | |
| "    print(f'GPU: {name} ({vram:.1f} GB)')\n", | |
| "else:\n", | |
| "    raise RuntimeError('No GPU! Change runtime type: Runtime -> Change runtime type -> GPU')" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": {}, | |
| "source": [ | |
| "# Cell 3: Download training script from HuggingFace\n", | |
| "!pip install -q huggingface_hub\n", | |
| "\n", | |
| "from huggingface_hub import hf_hub_download\n", | |
| "from pathlib import Path\n", | |
| "\n", | |
| "WORK_DIR = Path('/content/meditation')\n", | |
| "WORK_DIR.mkdir(parents=True, exist_ok=True)\n", | |
| "\n", | |
| "HF_REPO = 'Nirav-Madhani/LFM2.5-1.2B-Meditation'\n", | |
| "HF_TOKEN = os.environ['HF_TOKEN']\n", | |
| "\n", | |
| "script_path = hf_hub_download(\n", | |
| "    repo_id=HF_REPO,\n", | |
| "    filename='task-rl-training.py',\n", | |
| "    local_dir=WORK_DIR,\n", | |
| "    token=HF_TOKEN,\n", | |
| ")\n", | |
| "print(f'Training script: {script_path}')\n", | |
| "\n", | |
| "print('\\nFiles in work dir:')\n", | |
| "for f in sorted(WORK_DIR.rglob('*')):\n", | |
| "    if f.is_file():\n", | |
| "        print(f'  {f.relative_to(WORK_DIR)} ({f.stat().st_size/1024:.0f} KB)')" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": {}, | |
| "source": [ | |
| "# Cell 4: Run Task RL training\n", | |
| "# Downloads SFT + meditation RL checkpoints from HF automatically\n", | |
| "# Dataset (GSM8K) is downloaded from HuggingFace Datasets\n", | |
| "# No judge API needed — reward is binary correctness\n", | |
| "\n", | |
| "!cd /content/meditation && python -u task-rl-training.py" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": {}, | |
| "source": [ | |
| "# Cell 5 (optional): Check GPU memory\n", | |
| "import torch\n", | |
| "if torch.cuda.is_available():\n", | |
| "    alloc = torch.cuda.memory_allocated() / 1024**3\n", | |
| "    total = torch.cuda.get_device_properties(0).total_memory / 1024**3\n", | |
| "    print(f'GPU Memory: {alloc:.1f} / {total:.1f} GB ({alloc/total*100:.0f}%)')" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "metadata": {}, | |
| "source": [ | |
| "# Cell 6 (optional): List checkpoints on HuggingFace\n", | |
| "from huggingface_hub import HfApi\n", | |
| "api = HfApi(token=os.environ['HF_TOKEN'])\n", | |
| "files = api.list_repo_files('Nirav-Madhani/LFM2.5-1.2B-Meditation')\n", | |
| "ckpts = sorted([f for f in files if 'task_rl' in f and 'checkpoint' in f])\n", | |
| "print(f'Task RL checkpoints on HF ({len(ckpts)}):')\n", | |
| "for c in ckpts:\n", | |
| "    print(f'  {c}')" | |
| ], | |
| "execution_count": null, | |
| "outputs": [] | |
| } | |
| ] | |
| } |
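The notebook describes the reward as binary correctness against the GSM8K ground truth, but the scoring code lives in `task-rl-training.py`, which is not shown here. The sketch below is one plausible way such a reward could be computed, not the script's actual implementation. It assumes the answer is the last number in the model's completion and that the reference follows the GSM8K convention of ending with `#### <number>`. The function names (`extract_final_number`, `binary_reward`, `group_advantages`) are hypothetical. The last function illustrates the group-relative normalization commonly used in GRPO, here with the population standard deviation; the real script may use a different variant.

```python
import re
from statistics import mean, pstdev

# Matches integers and decimals, allowing thousands separators (e.g. "1,234.5").
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number found in `text` as a float, or None."""
    # GSM8K references end with '#### <answer>'; if the marker is present,
    # only look at what follows it.
    if "####" in text:
        text = text.rsplit("####", 1)[1]
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def binary_reward(completion, reference):
    """1.0 if the completion's final number equals the reference answer, else 0.0."""
    pred = extract_final_number(completion)
    gold = extract_final_number(reference)
    if pred is None or gold is None:
        return 0.0
    return 1.0 if abs(pred - gold) < 1e-6 else 0.0


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    reference = "5 + 13 = 18\n#### 18"
    completions = [
        "She has 5 + 13 = 18 apples. The answer is 18.",
        "The answer is 17",
        "I am not sure.",
        "So the total is 18",
    ]
    rewards = [binary_reward(c, reference) for c in completions]
    print(rewards)
    print(group_advantages(rewards))
```

With a binary reward, a group in which every completion is correct (or every one is wrong) yields all-zero advantages under this normalization, so such prompts contribute no gradient signal for that step.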