This text is a detailed analysis of using proxies to guide the search for better alignment approaches in artificial intelligence. The author explores various aspects, including:
1. **Effectiveness of rejection sampling**: The post-rejection odds are determined by both the likelihood ratio and the base rate. The equation $o(A \mid T, \text{training}) = \frac{P(T \mid A, \text{training})}{P(T \mid \neg A, \text{training})} \, o(A \mid \text{training})$ is provided to calculate the odds after rejection sampling.
2. **Comparison between training regimens**: To determine which post-rejection odds are better between two training regimens, we must establish the net impact of both the likelihood ratio and the base rate.
3. **Goodhart's curse**: The author hopes that this analysis restores uncertainty about what's worth using during training versus keeping in reserve for the final critical training run. They also suggest that Goodhart's curse remains important when using proxies to guide the search for better alignment approaches.
4. **Proximity in parameter space**: If passing and failing models are close in parameter space, Bayes' rule still applies. However, proximity affects the outcomes through its effects on $o(A \mid \text{training})$ and $\frac{P(T \mid A, \text{training})}{P(T \mid \neg A, \text{training})}$. The author argues that this effect will manifest if each of your alignment ideas is already predetermined to produce only aligned (or only misaligned) models.
5. **Impact of training proxies**: Including a proxy as an incentive in the loss function and applying SGD can affect the outcome. However, it's essential to know how the original loss function treats passing versus failing misaligned models throughout training.
6. **Restarting from scratch**: Restarting training from scratch might be more practical than adding another training phase. The author suggests focusing on identifying promising alignment ideas through multiple attempts and using trusted misalignment proxies.
7. **Components of effective approaches**: Any effective approach likely contains components that could be viewed as "training on alignment proxies." This shouldn't cause alarm or prompt their removal from training.
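
As a rough illustration of points 1 and 2, the odds form of Bayes' rule can be sketched in a few lines of Python. The function names and every number below are made up for illustration and do not come from the original analysis. The sketch only shows that a regimen with a higher base rate can still end up with worse post-rejection odds if its likelihood ratio is weaker, and the reverse can equally hold.

```python
def post_rejection_odds(prior_odds, p_pass_given_aligned, p_pass_given_misaligned):
    """Odds form of Bayes' rule: o(A|T) = [P(T|A) / P(T|not A)] * o(A)."""
    if p_pass_given_misaligned <= 0:
        raise ValueError("P(T | not A) must be positive for finite odds")
    likelihood_ratio = p_pass_given_aligned / p_pass_given_misaligned
    return likelihood_ratio * prior_odds


def odds_to_probability(odds):
    """Convert odds o into a probability o / (1 + o)."""
    return odds / (1 + odds)


# Hypothetical regimen 1: the proxy T is held out of training, so it
# discriminates well, but the base rate of alignment is lower.
held_out = post_rejection_odds(
    prior_odds=0.1, p_pass_given_aligned=0.9, p_pass_given_misaligned=0.3
)

# Hypothetical regimen 2: the proxy T is trained on, so the base rate is
# higher, but misaligned models also tend to pass and the test says less.
trained_on = post_rejection_odds(
    prior_odds=0.2, p_pass_given_aligned=0.95, p_pass_given_misaligned=0.8
)

print(f"held out:   odds {held_out:.4f}, probability {odds_to_probability(held_out):.4f}")
print(f"trained on: odds {trained_on:.4f}, probability {odds_to_probability(trained_on):.4f}")
```

With these invented inputs the held-out regimen comes out ahead, but different assumed values for the base rate or the pass probabilities would flip the comparison, which is why both factors have to be estimated together.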
Overall, this text provides a comprehensive analysis of using proxies in the search for better alignment approaches, highlighting both the benefits and potential challenges.