**OpenAI at QCon AI NYC: Fine Tuning the Enterprise**

At the recent QCon AI NYC 2025 conference, Will Hang from OpenAI presented an overview of Agent RFT (Reinforcement Fine-Tuning), an approach designed to improve the performance of tool-using agents. Hang outlined a pragmatic improvement path that starts with prompt and task optimization before adjusting model weights.

Hang emphasized that these measures can be high-leverage but may plateau on tasks requiring consistent multi-step reasoning across tool interactions. He positioned fine-tuning options as a spectrum, highlighting the importance of selecting the right approach for the specific use case. Supervised fine-tuning was described as effective when there is a predictable mapping from input to output and the goal is to imitate a consistent style or structure.
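To make the "predictable mapping from input to output" concrete, supervised fine-tuning for OpenAI models is typically driven by a JSONL file of chat examples where the assistant turn is the completion to imitate. The sketch below shows one such record as a Python dict; the ticket-triage task and all field values are invented for illustration.

```python
# A minimal, hypothetical supervised fine-tuning record, expressed as a Python
# dict: the assistant turn is the exact output the model should learn to imitate.
import json

record = {
    "messages": [
        {"role": "system", "content": "You are a support-ticket triage assistant."},
        {"role": "user", "content": "App crashes when I upload a photo larger than 10 MB."},
        {"role": "assistant", "content": '{"category": "bug", "priority": "high", "team": "mobile"}'},
    ]
}

# Each training example becomes one line of the JSONL training file.
with open("sft_train.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```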

Preference optimization was presented as a method for shifting outputs towards preferred responses using paired comparisons, and OpenAI's Direct Preference Optimization guide describes it as fine-tuning by comparing model outputs. However, Hang emphasized that this approach is currently limited to text inputs and outputs. Reinforcement fine-tuning, on the other hand, was highlighted as a better fit for tasks where the model needs to discover strategies over longer trajectories rather than reproduce a single demonstrated completion pattern.
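OpenAI's DPO guide describes training data as paired comparisons over the same prompt. The sketch below shows that shape as a Python dict; the field names follow the published guide, but the prompt and both responses are invented, so treat it as an illustrative example rather than an authoritative schema.

```python
# Sketch of one preference-optimization (DPO) training record, expressed as a
# Python dict. Field names follow OpenAI's DPO guide; the content is invented.
import json

pair = {
    "input": {
        "messages": [{"role": "user", "content": "Summarize our refund policy in one sentence."}]
    },
    # The response the model should move toward...
    "preferred_output": [{"role": "assistant", "content": "Refunds are available within 30 days of purchase."}],
    # ...and the response it should move away from.
    "non_preferred_output": [{"role": "assistant", "content": "Our refund policy is described in the policy document."}],
}

with open("dpo_train.jsonl", "a") as f:
    f.write(json.dumps(pair) + "\n")
```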

Hang also cautioned the audience about reward hacking, where the model learns to exploit weaknesses in the grader rather than genuinely solve the task. His advice:

  • Resolve any edge cases in your grader
  • Prefer continuous rewards over binary rewards, since they give the model a smoother learning signal (see the sketch below)
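To illustrate the continuous-versus-binary distinction, the following is a minimal sketch of a hypothetical code-based grader that returns partial credit instead of a hard pass/fail; the scoring rule and tolerance are assumptions, not OpenAI's grader contract.

```python
def grade(expected: float, answer: float, tolerance: float = 0.05) -> float:
    """Hypothetical continuous grader: full credit for an exact-enough answer,
    partial credit that decays with relative error, never a hard 0/1 cliff."""
    if expected == 0:
        return 1.0 if abs(answer) <= tolerance else 0.0
    rel_error = abs(answer - expected) / abs(expected)
    if rel_error <= tolerance:
        return 1.0
    # A smoothly decaying reward keeps a learning signal even for near misses.
    return max(0.0, 1.0 - rel_error)

# A near miss still earns a useful signal (~0.9) instead of a flat zero.
print(grade(expected=100.0, answer=110.0))
```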

**Agent RFT: A Reinforcement Fine-Tuning Approach for Tool-Using Agents**

Agent RFT was presented as a reinforcement fine-tuning approach adapted to tool-using agents, where the model explores different strategies during training rollouts and receives a learning signal from a grader. OpenAI's documentation describes the loop as sampling candidate responses, scoring them with a grader you define, and updating the model based on those scores.
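The documented sample-grade-update cycle can be summarized in a short conceptual sketch. The code below is an outline of that loop, not OpenAI's training implementation; the three callables (sampling a rollout, grading it, and applying the policy update) are hypothetical placeholders.

```python
from typing import Any, Callable, List

def reinforcement_fine_tune(
    sample_rollout: Callable[[Any], Any],                 # draws one trajectory for a prompt
    grade: Callable[[Any, Any], float],                   # scores a trajectory in [0, 1]
    update_policy: Callable[[List, List[float]], None],   # applies the policy update
    prompts: List[Any],
    epochs: int = 3,
    samples_per_prompt: int = 4,
) -> None:
    """Conceptual outline of the sample -> grade -> update loop; the three
    callables are placeholders, not OpenAI's actual training API."""
    for _ in range(epochs):
        for prompt in prompts:
            # 1. Sample several candidate trajectories for the same prompt.
            rollouts = [sample_rollout(prompt) for _ in range(samples_per_prompt)]
            # 2. Score each trajectory with the user-defined grader.
            rewards = [grade(prompt, rollout) for rollout in rollouts]
            # 3. Nudge the model toward the higher-rewarded trajectories.
            update_policy(rollouts, rewards)
```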

Hang emphasized credit assignment across the full trajectory so earlier decisions, including tool selection and tool-call structure, can be reinforced or discouraged based on downstream outcomes. He described an agent as a system that can interact with the outside world through tools, not only respond to a user prompt.
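One way to picture full-trajectory credit assignment is as a list of steps (tool calls, tool outputs, reasoning, final answer) that all share in the episode-level reward. The sketch below is a deliberately simplified illustration of that idea, not the actual Agent RFT mechanism; the step types, tool names, and reward value are invented.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    kind: str       # "tool_call", "tool_output", "reasoning", or "final_answer"
    content: str
    credit: float = 0.0

def assign_credit(trajectory: List[Step], final_reward: float) -> None:
    """Simplified illustration: spread the episode-level reward across every
    step, so earlier tool choices are reinforced (or discouraged) by the
    downstream outcome. Real credit-assignment schemes are more nuanced."""
    for step in trajectory:
        step.credit = final_reward

trajectory = [
    Step("tool_call", "search(query='Q3 revenue')"),
    Step("tool_output", "3 documents found"),
    Step("tool_call", "read_file('q3_report.pdf')"),
    Step("final_answer", "Q3 revenue was $4.2M"),
]
assign_credit(trajectory, final_reward=0.9)
```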

**Tool Examples and Grading Styles**

Hang provided examples of tools, including terminals for coding agents, internal business systems for customer support, and document search or retrieval endpoints. He emphasized that tool outputs flow back into the same context window, so tool calls, tool outputs, reasoning tokens, and the final response form a single multi-step trajectory.
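As an example of the kinds of tools Hang listed, the snippet below sketches a document-search tool in the JSON-schema style commonly used for tool calling with OpenAI models (the exact envelope differs between API surfaces); the tool name and parameters are hypothetical.

```python
# Hypothetical document-search tool definition in the JSON-schema style used
# for function/tool calling; the name and parameters are invented for illustration.
search_tool = {
    "type": "function",
    "function": {
        "name": "search_documents",
        "description": "Search the internal document corpus and return matching passages.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Free-text search query."},
                "top_k": {"type": "integer", "description": "Number of passages to return."},
            },
            "required": ["query"],
        },
    },
}
```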

He also mentioned that graders become a core artifact in the workflow, highlighting multiple grading styles, including simple matchers, model-based judges, code-based graders, endpoint graders, and combinations of graders to jointly optimize accuracy and latency. The session focused on operational properties that are not captured by answer accuracy alone.
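To show what combining graders to jointly optimize accuracy and latency can look like, the sketch below blends an accuracy score with a latency penalty in plain Python; the weights and the 30-second budget are invented assumptions, not values from the talk.

```python
def combined_grade(accuracy: float, latency_seconds: float,
                   latency_budget: float = 30.0,
                   accuracy_weight: float = 0.8) -> float:
    """Hypothetical combined grader: blends answer accuracy with a latency
    term so the model is rewarded for being both correct and fast.
    The weights and the 30-second budget are illustrative assumptions."""
    latency_score = max(0.0, 1.0 - latency_seconds / latency_budget)
    return accuracy_weight * accuracy + (1.0 - accuracy_weight) * latency_score

# A correct answer delivered in 10s outranks the same answer delivered in 60s.
print(combined_grade(accuracy=1.0, latency_seconds=10.0))  # ~0.93
print(combined_grade(accuracy=1.0, latency_seconds=60.0))  # 0.80
```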

**Use Cases and Platform Setup**

Wenjie Zi took over for the latter part of the presentation, covering use cases and platform setup details. She highlighted a finance-oriented example where a model must locate relevant content across a large document corpus under a constrained tool-call budget.

Zi described exposing search, listing, and file-reading tools behind endpoints, with a grader scoring the final answer. She emphasized using a model-based grader even for numeric answers to reduce false negatives caused by superficial formatting differences, units, or small variations.
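Zi's point about model-based grading of numeric answers can be illustrated with a judge prompt that tolerates formatting and unit differences. The sketch below uses the OpenAI Python SDK with an invented judge prompt; the judge model name and the 0-to-1 scoring scale are assumptions for illustration.

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a numeric answer. Reply with only a number
between 0 and 1. Give full credit if the answer matches the reference value,
even if formatting, units, or rounding differ (e.g. "$4.2M" vs "4,200,000 USD")."""

def model_grade(reference: str, answer: str) -> float:
    """Hypothetical model-based grader: asks a judge model to score the answer,
    so '4.2 million dollars' is not marked wrong against '$4,200,000'."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; any capable model could be used
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Reference: {reference}\nAnswer: {answer}"},
        ],
    )
    return float(response.choices[0].message.content.strip())

print(model_grade("$4,200,000", "4.2 million dollars"))  # expected: close to 1.0
```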

**Conclusion**

The presentation concluded with reported outcomes emphasizing improved planning, reduced long-trajectory tails, and in some cases a shift toward parallel tool calls to reduce sequential turns. Developers who want to learn more can review OpenAI's reinforcement fine-tuning and model optimization documentation, and watch infoq.com in the coming weeks for the video of the presentation.
