Extract-and-Evaluate Monitoring Can Significantly Enhance CoT Monitoring Performance
A recent study, CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring, has shown that when an LLM is given a main task and a side task and a monitor is tasked with alerting humans to the side task with a high suspicion score, CoT monitors can be combined with action-only monitors to get hybrid monitors that achieve very high catch rates for non-subtle tasks. However, all monitors perform poorly at catching subtle side tasks.
In this research note, we investigate an extract-and-evaluate monitoring scheme that combines the strengths of both CoT and action-only monitors. Our goal is to improve the performance of CoT monitors in subtle settings, where attackers may attempt to pursue unintended, potentially malicious side goals.
Methodology
In our experiment, we used the BigCodeBench dataset, which consists of 1,140 single-step code generation tasks in Python. We adapted all main tasks from this dataset and instructed an attacker model to complete a side task, simulating a situation where a future AI might pursue unintended, potentially malicious side goals.
We then tested four different monitor setups: CoT+action monitors, hybrid monitors that combine the scores of both types of monitors, and our extract-and-evaluate monitoring scheme. We used GPT 4.1-mini as both the Quote Extractor model and the Judge model in our setup.
Results
We found that our extract-and-evaluate monitoring scheme outperforms the CoT+action monitor in all but one case. The difference is particularly significant in subtle side tasks, where the extract-and-evaluate monitor improves monitor catch rate at 1% FPR by 31.0 percentage points on average.
We also tested a variation of our setup where GPT 4.1 as the smart Judge model only evaluates the quotes extracted by GPT 4.1-mini as the Quote Extractor model. However, we found that this did not lead to significant improvements in catch rates for subtle side tasks.
Conclusion
Our results suggest that extract-and-evaluate monitoring can significantly enhance CoT monitoring performance, particularly in subtle settings. We propose a new monitor scheme that combines the strengths of both CoT and action-only monitors. Future work should focus on improving the Quote Extractor model to catch more attacks at lower FPR.
Appendix
The full code for our experiment is available at https://github.com/Aether-AIS/cot-monitoring. We also provide a detailed appendix with further information about the experiment setting, including a function that calculates the average of the sums of absolute differences between each pair of consecutive numbers for all permutations of a given list.
The CoT+action monitor and Quote Extractor prompt are presented in Appendix B.