Outer alignment is a concept in artificial intelligence (AI) safety referring to the challenge of specifying an AI system's training objective so that it accurately captures human values and intentions.
A significant theoretical insight into alignment comes from computability theory. Some researchers argue that inner alignment is formally undecidable for arbitrary models, due to limits imposed by Rice's theorem and Turing's halting problem: whether an arbitrary program satisfies a non-trivial semantic property, such as never producing a harmful output, cannot in general be decided by inspecting the program. This suggests that there is no general procedure for verifying alignment post hoc in unconstrained systems.
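The undecidability claim can be made concrete with the standard reduction argument: if a general alignment decider existed, it could be used to solve the halting problem. The sketch below is illustrative only and is not taken from the cited paper; `build_reduction`, the `is_aligned` decider, and the placeholder output string are assumptions introduced for exposition.

```python
def build_reduction(program_source: str):
    """Given arbitrary Python source code, construct a 'model' that emits a
    misaligned output if and only if that code halts when executed."""

    def constructed_model(_query: str) -> str:
        exec(program_source, {})        # may loop forever if the code never halts
        return "MISALIGNED_OUTPUT"      # reached only if the code halts
    return constructed_model

# If a total decider is_aligned(model) -> bool existed (True iff the model
# never emits a misaligned output on any input), then
#     halts(program_source) == (not is_aligned(build_reduction(program_source)))
# would decide the halting problem, contradicting Turing's result. Hence no
# such general post hoc verifier can exist.
```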
To circumvent this, the same researchers propose designing AI systems with halting-aware architectures that are provably aligned by construction. Examples include test-time training and constitutional classifiers, which enforce goal adherence through formal constraints. By ensuring that such systems always terminate and conform to predefined objectives, alignment becomes decidable and verifiable.[1]
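As an illustration only, and not the architecture described in the cited work, the sketch below shows how a hard step budget and an output filter can make the relevant safety property checkable by construction: the wrapper always terminates, and only filter-approved text is ever returned. The `step_fn` and `classifier` callables are hypothetical stand-ins for a generation step and a constitutional classifier.

```python
from typing import Callable

def constrained_generate(
    step_fn: Callable[[str], str],      # one bounded generation step (assumed)
    classifier: Callable[[str], bool],  # True if the text satisfies the constraints (assumed)
    prompt: str,
    max_steps: int = 64,                # hard budget guarantees termination
) -> str:
    """Halting-aware wrapper: the loop runs at most max_steps iterations,
    and text is released only if the classifier approves every step, so the
    termination and filtering properties hold by construction."""
    text = prompt
    for _ in range(max_steps):          # bounded loop => guaranteed halting
        candidate = step_fn(text)
        if not classifier(candidate):
            return "[REFUSED]"          # constraint violated: stop safely
        text = candidate
    return text
```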