An ORM to check how right each step is? Then finetune on that? I think you could do something smart with test-time inference if you could train a reward model on GeoGuessr (this is so overkill).
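Rough sketch of what that test-time part could look like, best-of-N style: sample a bunch of reasoning chains + location guesses, have the reward model score each one, keep the best. Purely illustrative; `sample_candidate` and `reward_model_score` are made-up stand-ins for the policy model and the hypothetical GeoGuessr-trained reward model, not real APIs.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    reasoning: str                # the model's step-by-step reasoning about the image
    guess: tuple[float, float]    # (lat, lon) prediction


def sample_candidate(image) -> Candidate:
    """Stand-in: sample one reasoning chain + guess from the policy model."""
    lat, lon = random.uniform(-90, 90), random.uniform(-180, 180)
    return Candidate(reasoning="<chain of thought>", guess=(lat, lon))


def reward_model_score(image, candidate: Candidate) -> float:
    """Stand-in: how good the reward model thinks this reasoning/guess is."""
    return random.random()


def best_of_n(image, n: int = 16) -> Candidate:
    """Test-time inference: draw n candidates, return the one the reward model scores highest."""
    candidates = [sample_candidate(image) for _ in range(n)]
    return max(candidates, key=lambda c: reward_model_score(image, c))
```

Same scaffolding would work if the reward model scores individual steps instead of whole guesses; you'd just aggregate per-step scores before the `max`.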
Humans as the ground truth for LLMs, it’s giving a bit lazy. Like the models should have to find ways to ground themselves.