Tags: explainability* + llm*


  1. This paper explores whether some language model representations may be inherently multi-dimensional, challenging the linear representation hypothesis. The authors develop a method using sparse autoencoders to find multi-dimensional features in GPT-2 and Mistral 7B. They find interpretable examples such as circular features representing days of the week and months of the year, which the models use to solve tasks involving modular arithmetic.
  2. "Scaling sparse autoencoders has been a major priority of the Anthropic interpretability team, and we're pleased to report extracting high-quality features from Claude 3 Sonnet, Anthropic's medium-sized production model.

    We find a diversity of highly abstract features. They both respond to and behaviorally cause abstract behaviors. Examples of features we find include features for famous people, features for countries and cities, and features tracking type signatures in code. Many features are multilingual (responding to the same concept across languages) and multimodal (responding to the same concept in both text and images), as well as encompassing both abstract and concrete instantiations of the same idea (such as code with security vulnerabilities, and abstract discussion of security vulnerabilities)."

