SAEs in Formal Languages

Worked with Dr. Ekdeep Singh Lubana and Dr. David Krueger to investigate the properties of sparse autoencoders (SAEs), particularly causality, through the lens of formal languages, specifically probabilistic context-free grammars (PCFGs). We trained SAEs of various paradigms on formal languages of varying complexity and concluded that SAEs need stronger guarantees of robustness and causality than are presently available.

Submitted to MINT (NeurIPS) '24.