SemanticScuttle - klotz.me » klotz: multi-dimensional features+explainability

Not All Language Model Features Are Linear

This paper explores whether some language model representations may be inherently multi-dimensional, in contrast to the linear representation hypothesis. The authors develop a method that uses sparse autoencoders to find multi-dimensional features in GPT-2 and Mistral 7B. They find interpretable examples, such as circular features representing the days of the week and the months of the year, which the models use to solve computational problems involving modular arithmetic.
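To see why a circular feature suits modular arithmetic, here is a toy sketch. It is not the paper's method and does not use any learned model features. It hand-places the seven days on a unit circle in 2-D, so that "add k days" becomes a rotation by k sevenths of a turn and the answer wraps around mod 7 automatically. The functions `embed`, `rotate`, `decode` and `add_days` are hypothetical names chosen for illustration.

```python
import math

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
N = len(DAYS)

def embed(k: int) -> tuple[float, float]:
    """Hypothetical 2-D circular feature: day index k placed on the unit circle."""
    theta = 2 * math.pi * (k % N) / N
    return (math.cos(theta), math.sin(theta))

def rotate(v: tuple[float, float], steps: int) -> tuple[float, float]:
    """Rotate a 2-D point by `steps` sevenths of a full turn (negative = backwards)."""
    phi = 2 * math.pi * steps / N
    x, y = v
    return (x * math.cos(phi) - y * math.sin(phi),
            x * math.sin(phi) + y * math.cos(phi))

def decode(v: tuple[float, float]) -> int:
    """Read out the day whose embedding has the largest dot product with v."""
    return max(range(N), key=lambda k: v[0] * embed(k)[0] + v[1] * embed(k)[1])

def add_days(day: str, offset: int) -> str:
    """Compute (day + offset) mod 7 purely by rotating on the circle."""
    start = DAYS.index(day)
    return DAYS[decode(rotate(embed(start), offset))]

print(add_days("Friday", 3))  # prints Monday
```

A single scalar direction (for example, encoding Monday through Sunday as 0 to 6 along one axis) has no built-in wraparound, because Sunday + 1 simply lands past the end of the range. On the circle, the same rotation carries Sunday back to Monday, which is the intuition behind calling such a feature irreducibly two-dimensional.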

2024-05-24 Tags: llm, explainability, multi-dimensional features, gpt-2, mistral 7b, circular features by klotz
