Three thoughts on turning policy calls for ‘interpretable’ AI systems into action
A quick note on policies to incentivize the development of transparent and steerable AI systems
The algorithms that deep learning systems 'learn' in order to perform their tasks are largely unknown and underexplored. These systems are built, trained, and monitored with an eye on how well they perform on the available data, but the training process itself is opaque: only through post hoc probing can researchers assess what a model may have learned and how it generates its output. The post hoc methods that currently exist are not reliable enough to hold up in legal or regulatory settings, and government calls for ‘interpretable’ or ‘explainable’ systems are usually left vague enough to sidestep this fact.
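To make 'post hoc probing' concrete, here is a minimal sketch: a linear probe trained on a model's internal activations to test whether a concept can be decoded from them. Everything here is a stand-in for illustration: the activations are synthetic with a planted signal, and the dimensions and labels are assumptions, not taken from any particular system.

```python
# Minimal post hoc probing sketch: fit a linear "probe" on (synthetic)
# hidden-layer activations to test whether a binary concept is decodable.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical activations (n_examples x hidden_dim) with a planted signal;
# in practice these would be recorded from a trained network.
activations = rng.normal(size=(1000, 64))
labels = (activations[:, :8].sum(axis=1) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")
# High probe accuracy suggests the concept is represented in the activations,
# but not that the model actually relies on it -- one reason post hoc methods
# alone are hard to lean on in legal or regulatory settings.
```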
Large AI firms such as Anthropic are heavily invested in a field called mechanistic interpretability, which seeks to 'reverse engineer' these learned algorithms into human-interpretable explanations of how the systems work. The field is still in its early stages; to support these efforts while also ensuring that regulatory needs are met, I propose three measures that can run concurrently:
1) Through measures such as grants and tax incentives, promote the development of model architectures that are inherently interpretable. Last year, Anthropic ran experiments aimed at making models similar to those in the GPT family more interpretable. They changed the architecture to try to make it clearer what each 'neuron' (an individual unit within the network) is responsible for; for example, a single neuron might track occurrences of the word 'cat' (see the sketch after this list). Industry efforts like these should be supported.
2) Incentivize academics from other disciplines to work on this problem. Making these systems interpretable without sacrificing performance will require input from researchers in applied math and the larger field of computer science. The machine learning community can't do it alone.
3) Engage with the interpretability techniques that currently exist, with an eye towards the future. The machine learning community moves fast. If regulatory bodies engage researchers now, around existing mechanistic interpretability methods and with clear expectations for both near-term and future systems, current methods can improve and new methods may emerge from those interactions.
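The sketch referenced in point 1, a hypothetical single-neuron check: given per-token activations from a model, see whether one neuron fires mostly on a specific word. The tokens, activations, neuron index, and threshold are all illustrative assumptions with a planted signal, not Anthropic's actual setup.

```python
# Hypothetical single-neuron inspection: does neuron 3 fire on 'cat'?
import numpy as np

tokens = ["the", "cat", "sat", "on", "the", "mat", "a", "cat", "slept"]

# Stand-in activations (n_tokens x n_neurons), as if read out of a model.
rng = np.random.default_rng(1)
activations = rng.normal(0.0, 0.1, size=(len(tokens), 16))

# Plant a signal so the sketch has something to find: neuron 3 fires on 'cat'.
neuron_idx = 3
for i, tok in enumerate(tokens):
    if tok == "cat":
        activations[i, neuron_idx] = 1.0

threshold = 0.5
firing_tokens = [tok for i, tok in enumerate(tokens)
                 if activations[i, neuron_idx] > threshold]
print(f"neuron {neuron_idx} fires on: {firing_tokens}")
# If the firing tokens are (mostly) 'cat', the neuron admits a short
# human-readable description -- the property that inherently interpretable
# architectures try to make common by design rather than by accident.
```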
Policy calls for interpretable systems should come with an investment in growing the interpretability field into what it needs to become. These three measures would be non-adversarial efforts towards a public-private collaboration for interpretable and regulated systems.