This post addresses the corrigibility problem identified by MIRI in 2015. We propose an extended formalism in which the desiderata of corrigible behaviour can be written down, and we provide theoretical solutions, with illustrations of each proposal. The first proposal is to make the agent behave as if the shutdown button does not exist, and the second is to make the agent behave as if the button does not work.
The first section recalls the formalism of MIRI's article Corrigibility, as well as the Big Gamble problem, and introduces corrigibility diagrams for analysing corrigibility proposals. The second section introduces a new formalism that restates the problem mathematically. The next two sections each provide an (incomplete) solution to the corrigibility problem, by making the button nonexistent or ineffective, respectively.