The world of large language models (LLMs) has seen tremendous growth in recent years, with models increasingly trained on complex tasks like generating precise code and multi-step reasoning. However, even state-of-the-art models struggle to master these tasks, in part because reinforcement learning (RL), the standard technique for improving them, is challenging to apply in practice. At Levro, we aim to be the easiest medium for international commerce, which requires tools that bridge the gap between business and banking.
Our mission is to provide a truly 'personal' banker who can advise on nuanced requirements like documentation for anti-money laundering/KYC purposes. To achieve this, we developed agents that handle complex tasks such as technical customer support and structured reasoning. Specifically, our models generate Python code to handle user queries: they call tools and APIs and process structured data to resolve a wide range of inquiries.
Unfortunately, even with RL, our LLMs struggled to improve, because it is hard to give feedback specific enough to fix failures without crushing the parts of an output they already got right. We ran into this problem when we asked our model to make tool calls that generate 1099s for all vendors a business paid more than $600 USD during 2024. In executing this request, the model mistakenly ignored non-USD payments.
Reinforcement learning offers a powerful theoretical framework for teaching models to do "what works." In practice, however, applying it to LLMs is messy. We've hit this problem ourselves at our startup, and we wanted a solution that would let us build agents that handle complex tasks with precision.
One of the challenges with RL is giving your model feedback that is specific enough to improve it without crushing the parts it already does well. This is especially hard with traditional reward models that assign a single score to the entire output.
A popular technique for addressing this challenge is Direct Preference Optimization (DPO), which trains the model on static pairs of good/bad responses and steers it toward the "good" response. Another approach is Group Relative Policy Optimization (GRPO), which samples a group of responses for each prompt and steers the LLM using each response's reward relative to the rest of the group.
However, traditional GRPO assigns a single reward to the entire output, which causes trouble when an output performs well on one metric but poorly on another: the feedback signal gets diluted, making it difficult for the model to learn and improve.
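To make that dilution concrete, here is a minimal sketch, with illustrative names rather than any library's API, of how vanilla GRPO turns one blended score per sampled response into a single group-normalized advantage that every token in that response shares:

```python
import numpy as np

def grpo_sequence_advantages(rewards, eps=1e-8):
    """Vanilla GRPO: one scalar reward per sampled response, normalized
    against the group, so every token in a response shares one advantage."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four sampled responses to the same prompt, each judged with one blended score.
# A response with perfect syntax but a wrong tool call still gets one diluted number.
print(grpo_sequence_advantages([0.9, 0.4, 0.65, 0.2]))
```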
To address this issue, we developed a new technique focused on rewarding the parts of an output that worked while precisely targeting specific failures. We extended the idea of process supervision into a practical, implementable solution: token-level reward attribution.
This approach makes RL training significantly more effective for LLMs tackling complex tasks by combining the memory efficiency and simplicity of GRPO with the token-sensitive rewards of PPO. Token-level rewards work by providing detailed scoring and attribution for each metric, allowing for a multi-faceted, independent evaluation.
Detailed Scoring and Attribution for Each Metric: Instead of just one overall score, for every quality being evaluated (e.g., syntax correctness, correct tool usage, output quality and relevance), a "judge LLM" is asked both to provide a score and to identify the specific parts of the output that most influenced that score.
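As a rough illustration, a judge call might look like the sketch below; `call_judge_llm`, the prompt wording, and the JSON schema are hypothetical stand-ins, not our production prompt:

```python
import json

# Hypothetical prompt: ask the judge for a score on one metric plus the verbatim
# spans of the output that most influenced that score.
JUDGE_PROMPT = """You are grading a model's Python tool-calling output on the
metric "{metric}". Return JSON with:
  "score": a float between 0 and 1,
  "spans": a list of verbatim substrings of the output that most influenced the score.

Output to grade:
{output}
"""

METRICS = ["syntax_correctness", "correct_tool_usage", "output_relevance"]

def judge_metric(output_text, metric, call_judge_llm):
    """call_judge_llm is any caller that sends a prompt to a judge model and
    returns its text completion (provider-specific, not shown here)."""
    raw = call_judge_llm(JUDGE_PROMPT.format(metric=metric, output=output_text))
    parsed = json.loads(raw)
    return float(parsed["score"]), list(parsed["spans"])
```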
Crucially, each quality becomes its own separate reward system. Once scores are given for each metric, they are distributed to the specific tokens the judge LLM attributed them to. Reduced credit is also given to tokens that preceded them, as they helped set up the good or bad parts.
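A sketch of that distribution step follows; the whitespace tokenization, decay factor, and lookback window are illustrative choices, not our exact scheme:

```python
import numpy as np

def token_rewards_for_metric(tokens, spans, score, decay=0.5, lookback=5):
    """Spread one metric's score onto the tokens the judge attributed it to,
    plus geometrically reduced credit for the tokens that set each span up."""
    rewards = np.zeros(len(tokens), dtype=np.float64)
    for span in spans:
        span_toks = span.split()  # naive whitespace tokenization, for illustration
        n = len(span_toks)
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == span_toks:
                rewards[start:start + n] += score            # full credit to attributed tokens
                for back in range(1, lookback + 1):          # decayed credit to setup tokens
                    if start - back >= 0:
                        rewards[start - back] += score * decay ** back
                break  # credit the first occurrence of this span only
    return rewards

# e.g. the judge flags the hard-coded USD filter from the 1099 example as the failure
tokens = "payments = fetch_payments ( vendor , currency = 'USD' )".split()
print(token_rewards_for_metric(tokens, ["currency = 'USD'"], score=-1.0))
```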
This process creates a "heatmap" for each metric, visually showing which tokens were most important or influential for that particular quality. For each individual reward metric, "advantages" are computed, which represent how much each part of the output helped or hurt the overall score relative to expectations.
These advantages are calculated independently for each reward function based on the total reward for that sequence. This independent calculation is vital because it addresses a common problem: if an output performs very well on one metric but poorly on another, a single average reward might dilute the feedback.
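One way to realize this in code is sketched below; the rule for spreading a sequence-level advantage back over the heatmap is our own simplification:

```python
import numpy as np

def metric_token_advantages(metric_token_rewards, eps=1e-8):
    """metric_token_rewards: one 1-D array per sampled response in the group,
    holding a single metric's token-level rewards (its 'heatmap').
    The sequence totals are normalized against the group (GRPO-style), then
    each response's advantage is spread back over its tokens in proportion
    to the heatmap, so every metric keeps its own independent baseline."""
    totals = np.array([r.sum() for r in metric_token_rewards])
    seq_adv = (totals - totals.mean()) / (totals.std() + eps)
    token_adv = []
    for rewards, adv in zip(metric_token_rewards, seq_adv):
        weights = np.abs(rewards) / (np.abs(rewards).sum() + eps)
        token_adv.append(adv * weights)
    return token_adv
```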
Finally, to obtain the overall token-level advantages, the advantages from all individual metrics are summed (with weights where appropriate). These token-level advantages feed directly into the model's optimization or loss function, telling the model exactly where it performed well or poorly.
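Here is a sketch of that final step, assuming a PPO/GRPO-style clipped surrogate objective and PyTorch tensors; the weighting scheme and function names are illustrative:

```python
import torch

def token_level_policy_loss(logprobs, old_logprobs, metric_advantages,
                            metric_weights, clip_eps=0.2):
    """logprobs, old_logprobs: (seq_len,) per-token log-probs for one response.
    metric_advantages: dict mapping metric name -> (seq_len,) token advantages.
    The per-metric advantages are summed with weights, then plugged into a
    clipped policy-gradient objective at token granularity."""
    advantages = sum(metric_weights[m] * adv for m, adv in metric_advantages.items())
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```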
The benefits of this new technique include speeding up learning, ensuring that good partial outputs are preserved, and making RL training practical even for smaller, more specialized models without requiring additional compute resources.
By leveraging per-token rewards, we can transform process supervision from an interesting concept into a practical tool. This approach enables LLMs to learn with unprecedented precision, leading to more efficient training, higher quality outputs, and models that genuinely understand why their outputs are good or bad, token by token.
This method has saved us compute, improved training efficiency, and boosted output quality, without the frustration of watching the model regress on what it already does well.