Analyzing (In)Abilities of SAEs via Formal Languages
Published in MINT@NeurIPS, 2024
We formulate a synthetic testbed, built on formal languages, to stress-test the sparse autoencoder (SAE) approach to interpretability in the text domain.
Authors: Abhinav Menon, Manish Shrivastava, Ekdeep Singh Lubana, David Krueger