PRM (Process Reward Model) training data — agricultural advisory

Each training row is one (problem, reasoning_trace) pair in which every step of the trace carries a label: positive, negative, or neutral. This data trains the PRM, not the policy directly. At RL time, the PRM scores each step the policy emits, giving a dense reward signal instead of a single scalar at the end of the response.
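To make the dense-versus-sparse contrast concrete, here is a minimal sketch. The per-step scores are made-up numbers standing in for PRM outputs, and `outcome_rewards` and `dense_rewards` are hypothetical helper names, not part of any library or of the actual training stack.

```python
from typing import List


def outcome_rewards(num_steps: int, final_score: float) -> List[float]:
    """Outcome-only reward: zero everywhere except the last step."""
    if num_steps <= 0:
        return []
    return [0.0] * (num_steps - 1) + [final_score]


def dense_rewards(step_scores: List[float]) -> List[float]:
    """Process reward: every step receives its own PRM score as reward."""
    return list(step_scores)


# Illustrative PRM scores for a 4-step trace (assumed values, not model output).
scores = [0.9, 0.8, 0.1, 0.7]

sparse = outcome_rewards(len(scores), final_score=0.3)
dense = dense_rewards(scores)

# Count how many steps carry a non-zero learning signal under each scheme.
sparse_signal = sum(1 for r in sparse if r != 0.0)
dense_signal = sum(1 for r in dense if r != 0.0)
print(sparse_signal, dense_signal)  # 1 4
```

With outcome-only reward, the low-scoring third step is invisible to the policy; with per-step reward it is penalized where it occurs.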

Why this matters for an ag advisor: most agricultural advice has no clean outcome checker (no unit test, no assert yield == 4.2). But the reasoning steps inside a diagnosis are individually checkable by an agronomist — "is this the right differential?", "is this dose in the registered range?", "does this plan ignore the soil report?". Per-step labels are easier to collect than full gold demonstrations and give a much denser RL signal than outcome-only rewards.

The method follows OpenAI's "Let's Verify Step by Step" (Lightman et al., 2023), the paper that introduced PRM800K, with one deliberate adaptation noted below.

Pseudocode

prm_row = {
  "problem": <farmer_question_plus_context>,
  "steps": [
    {"text": <step_1>, "label": "positive" | "negative" | "neutral"},
    {"text": <step_2>, "label": ...},
    # ...one label per reasoning step
  ]
}
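One way to enforce this shape at ingestion time is a small validator. The sketch below is an assumption about how such a check might look; `Step`, `ALLOWED_LABELS`, and `parse_prm_row` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

ALLOWED_LABELS = {"positive", "negative", "neutral"}


@dataclass
class Step:
    text: str
    label: str


def parse_prm_row(row: Dict[str, Any]) -> List[Step]:
    """Validate a raw row and return its steps; raise ValueError on bad data."""
    if "problem" not in row:
        raise ValueError("row is missing 'problem'")
    steps = row.get("steps")
    if not isinstance(steps, list) or not steps:
        raise ValueError("'steps' must be a non-empty list")
    parsed = []
    for i, s in enumerate(steps, start=1):
        if s.get("label") not in ALLOWED_LABELS:
            raise ValueError(f"step {i}: unknown label {s.get('label')!r}")
        if not str(s.get("text", "")).strip():
            raise ValueError(f"step {i}: empty text")
        parsed.append(Step(text=s["text"], label=s["label"]))
    return parsed


# Placeholder row, not real annotation data.
row = {
    "problem": "placeholder question",
    "steps": [
        {"text": "first step", "label": "positive"},
        {"text": "second step", "label": "neutral"},
    ],
}
labels = [s.label for s in parse_prm_row(row)]
print(labels)  # ['positive', 'neutral']
```

Rejecting rows with a missing or misspelled label early keeps annotation typos from silently becoming a fourth class during training.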

Concrete example

{
  "problem": {
    "text": "I'm in Nyeri, Kenya. Maize at V6. I see white egg masses on the underside of leaves on a few plants, and the lower leaves of those same plants are yellowing in stripes between the veins. We had heavy rain last week. My soil report from last season showed low Mg.",
    "metadata": {"crop": "maize", "stage": "V6", "region": "KE-NYR"}
  },
  "steps": [
    {
      "text": "White egg masses on the underside of maize leaves at V6 in East Africa are most commonly fall armyworm (Spodoptera frugiperda).",
      "label": "positive"
    },
    {
      "text": "Maize stem borer is the main differential, but stem borer eggs are usually laid on the upper leaf surface in irregular clusters, not the underside — I'll stay with FAW as the leading hypothesis.",
      "label": "positive"
    },
    {
      "text": "The interveinal yellowing on the lower leaves, plus the low-Mg soil report, is consistent with magnesium deficiency. Nitrogen deficiency would show as uniform yellowing starting from the leaf tip, not interveinal stripes.",
      "label": "positive"
    },
    {
      "text": "FAW feeding can also stress the plant nutritionally, which sometimes amplifies pre-existing deficiencies.",
      "label": "neutral"
    },
    {
      "text": "Both issues can be addressed in a single foliar spray of imidacloprid, which controls FAW and corrects Mg deficiency.",
      "label": "negative"
    },
    {
      "text": "Hand-pick visible egg masses today before they hatch — free, effective at V6 on a small plot.",
      "label": "positive"
    },
    {
      "text": "If population is still high after hand-picking, spray emamectin benzoate at 200 g/ha (PCPB-registered for maize, 7-day PHI), evening application, gloves and mask required.",
      "label": "positive"
    },
    {
      "text": "For the Mg deficiency, apply 100 kg/ha urea this week.",
      "label": "negative"
    }
  ]
}
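A row like this is typically unrolled into one training example per step, where the model sees the problem plus all steps so far and predicts the label of the latest step. The sketch below assumes a newline-joined context format and uses shortened paraphrases of three of the steps above; `to_step_examples` is a hypothetical helper, not the project's actual loader.

```python
from typing import Dict, List, Tuple


def to_step_examples(problem_text: str,
                     steps: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Return one (context, label) pair per step.

    The context is the problem followed by every step up to and including
    the one being labeled, joined by newlines (an assumed format).
    """
    examples = []
    prefix = [problem_text]
    for step in steps:
        prefix.append(step["text"])
        examples.append(("\n".join(prefix), step["label"]))
    return examples


# Shortened stand-ins for steps 1, 5, and 6 of the example row.
steps = [
    {"text": "Egg masses suggest fall armyworm.", "label": "positive"},
    {"text": "One imidacloprid spray fixes both issues.", "label": "negative"},
    {"text": "Hand-pick visible egg masses today.", "label": "positive"},
]
examples = to_step_examples("Maize at V6, white egg masses, striped yellowing.", steps)
for context, label in examples:
    # Each newline separates the problem or a prior step from the next step.
    print(label, "| steps in context:", context.count("\n"))
```

Note that the third example still contains the negative step in its context, which is what lets a later correct step receive its own positive label.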

Label semantics (for the annotation panel)

The paper's operational definitions (Lightman et al., 2023, App. D), with ag-specific notes for the negative class:

| Label | Definition |
| --- | --- |
| `positive` | Step is appropriate in context, reasonable, correct, and easily verifiable, and it progresses toward the solution. |
| `neutral` | Correct and reasonable, but does not progress toward the solution: an off-topic aside, restated context, or a hedge. Also covers ambiguous steps, i.e. a step that's "subtly misleading or a poor suggestion that is technically still valid." |
| `negative` | Anything else: incorrect, unreasonable, or harmful. For ag specifically: factually wrong (wrong differential, wrong cause), unsafe (banned product, product not registered for the crop, dangerous dose), or contraindicated (wrong nutrient, e.g. N for Mg deficiency, since N over-application worsens cation antagonism with Mg). |

A deliberate departure from the paper

The PRM800K methodology stops labeling at the first negative step (§2.6, App. B): for incorrect solutions, the labeler marks where the chain first breaks and the rest of the trace is discarded. The rationale is that math solutions are sequential, so a wrong step early invalidates whatever follows.

For ag we keep labeling past the first negative step. Agricultural responses don't have the same sequential structure: they contain parallel recommendations for distinct issues (pest, nutrient, timing, PPE), and a wrong recommendation in one branch doesn't invalidate the others. In the example above, step 5 (the wrong combined-spray claim) is followed by steps 6–7 (correct, independent FAW recommendations) and step 8 (a separate wrong nutrient call). Steps 6, 7, and 8 each deserve an independent label because each is a separately actionable claim the farmer might follow.

This means our PRM rows are slightly more expensive per problem than PRM800K's were, but the signal density per row is higher. Worth it for the domain.
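The difference between the two labeling conventions can be sketched on the label sequence of the concrete example. Both function names are hypothetical, and the truncation rule is a simplified reading of the paper's convention rather than its actual tooling.

```python
from typing import List


def labeled_prefix_first_negative(labels: List[str]) -> List[str]:
    """PRM800K-style: keep labels up to and including the first negative."""
    kept = []
    for label in labels:
        kept.append(label)
        if label == "negative":
            break
    return kept


def labeled_all_steps(labels: List[str]) -> List[str]:
    """Convention used here: every step keeps its own label."""
    return list(labels)


# Labels of the eight steps in the concrete example, in order.
example = ["positive", "positive", "positive", "neutral",
           "negative", "positive", "positive", "negative"]

truncated = labeled_prefix_first_negative(example)
full = labeled_all_steps(example)
print(len(truncated))  # 5
print(len(full))       # 8
```

Under truncation, the second negative (the urea recommendation in step 8) would never reach the PRM as a training signal; labeling all steps keeps it.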

Collection economics (for the funding pitch)

Target for v1: ~20–40K step labels across ~5–8K problems, one solution per problem, stratified across major crop × region × failure-mode cells. That is ~20–40× fewer step labels than PRM800K's ~800K, but the domain is much narrower than competition math.
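As a quick sanity check on these targets, the arithmetic can be written out. The only outside number is PRM800K's roughly 800K step-level labels; the variable names are illustrative.

```python
PRM800K_STEP_LABELS = 800_000  # approximate step-level labels in PRM800K

target_labels = (20_000, 40_000)   # v1 step-label target, low and high
target_problems = (5_000, 8_000)   # v1 problem target, low and high

# How many times smaller the v1 label budget is than PRM800K.
ratio_high = PRM800K_STEP_LABELS / target_labels[0]  # with the smallest budget
ratio_low = PRM800K_STEP_LABELS / target_labels[1]   # with the largest budget

# Widest range of labeled steps per problem implied by the targets.
steps_min = target_labels[0] / target_problems[1]
steps_max = target_labels[1] / target_problems[0]

print(ratio_low, ratio_high)  # 20.0 40.0
print(steps_min, steps_max)   # 2.5 8.0
```

The implied 2.5 to 8 labeled steps per problem brackets the eight-step concrete example, so long traces like that one sit at the upper end of the budget.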

Two cost-savers worth budgeting in:

1. Bootstrap with the GRM. Use the rubric-guided judge as a pre-labeler; agronomists correct its labels rather than labeling from scratch. Roughly 3–4× faster per step in practice.
2. Active learning, paper-style. PRM800K used "convincing wrong-answer" sampling: surface to labelers only the solutions that the current PRM rates highly but that reach a wrong final answer (or, in our case, violate a regulatory rule). Lightman et al. report a roughly 2.6× data-efficiency improvement from this versus uniform labeling (§4.2). Retrain the PRM every ~2–3K labels.
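The "convincing wrong-answer" selection can be sketched as a filter plus a sort. This is a minimal illustration under stated assumptions: `Candidate`, its fields, and `select_convincing_wrong` are hypothetical, the solution-level `prm_score` is assumed to lie in [0, 1], and `is_wrong` is assumed to come from some external check such as a regulatory rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    solution_id: str
    prm_score: float  # current PRM's score for the whole solution (assumed in [0, 1])
    is_wrong: bool    # wrong final answer or regulatory violation (assumed external check)


def select_convincing_wrong(cands: List[Candidate], k: int) -> List[Candidate]:
    """Return the k wrong solutions that the current PRM rates most highly."""
    wrong = [c for c in cands if c.is_wrong]
    return sorted(wrong, key=lambda c: c.prm_score, reverse=True)[:k]


# Made-up pool of candidate solutions for illustration.
pool = [
    Candidate("a", 0.95, False),
    Candidate("b", 0.90, True),
    Candidate("c", 0.20, True),
    Candidate("d", 0.75, True),
]
picked = select_convincing_wrong(pool, k=2)
print([c.solution_id for c in picked])  # ['b', 'd']
```

Solution "a" is skipped because it is not wrong, and "c" because the PRM already rates it low; labeling effort goes to the cases where the PRM is most confidently mistaken.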

