This paper explores the structure of the feature point cloud discovered by sparse autoencoders in large language models, at three scales: atomic, brain, and galaxy. At the atomic scale, the point cloud contains "crystal" structures whose faces are parallelograms or trapezoids, and these become much clearer once distractor dimensions are projected out. At the brain scale, functionally similar features cluster into spatially modular regions, analogous to lobes in a brain. At the galaxy scale, the paper examines the overall shape and clustering of the point cloud.
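A minimal sketch of the atomic-scale test, under my own assumptions rather than the paper's exact recipe: a quadruple of feature vectors (a, b, c, d) forms a parallelogram when a - b is close to c - d, and the check is run again after projecting out distractor directions, which I approximate here with top principal components. The helper names are hypothetical.

```python
import numpy as np

def remove_top_components(vectors, k=1):
    """Project out the top-k principal directions (stand-ins for distractor dimensions)."""
    centered = vectors - vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:k]                                  # (k, dim) directions to discard
    return vectors - vectors @ top.T @ top

def parallelogram_error(a, b, c, d):
    """Relative mismatch between the difference vectors a - b and c - d."""
    diff = (a - b) - (c - d)
    scale = np.linalg.norm(a - b) + np.linalg.norm(c - d)
    return np.linalg.norm(diff) / scale

# Toy usage with random stand-ins for SAE feature vectors.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 64))
a, b, c, d = remove_top_components(features, k=1)
print(parallelogram_error(a, b, c, d))
```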
Last week, Anthropic announced a significant breakthrough in our understanding of how large language models work. The research focused on Claude 3 Sonnet, the mid-sized version of Anthropic's latest frontier model. Anthropic showed that the otherwise inscrutable numeric vectors Claude uses to represent words can be decomposed into sums of 'features': vectors standing for abstract concepts ranging from immunology to coding errors to the Golden Gate Bridge, many of which can be understood by human beings. This research could prove useful for Anthropic and the broader industry, potentially leading to new tools to detect model misbehavior or prevent it altogether.
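As a rough sketch of what that decomposition looks like (toy, untrained weights and assumed sizes, not Anthropic's code): a sparse autoencoder encodes an activation vector into mostly-zero feature coefficients and reconstructs it as a weighted sum of learned feature directions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 512, 4096               # illustrative sizes only

W_enc = rng.normal(scale=0.02, size=(d_model, n_features))
W_dec = rng.normal(scale=0.02, size=(n_features, d_model))
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

def sae_decompose(x):
    """Encode an activation into feature coefficients, then reconstruct it.

    In the real setup the weights are trained with a reconstruction loss plus a
    sparsity penalty, so only a handful of coefficients are nonzero per input.
    """
    f = np.maximum(0.0, x @ W_enc + b_enc)    # ReLU feature activations
    x_hat = f @ W_dec + b_dec                 # weighted sum of feature directions
    return f, x_hat

x = rng.normal(size=d_model)                  # stand-in for one of Claude's activations
f, x_hat = sae_decompose(x)
print(int((f > 0).sum()), "features active out of", n_features)
```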
"...a feature that activates when Claude reads a scam email (this presumably supports the model’s ability to recognize such emails and warn you not to respond to them). Normally, if one asks Claude to generate a scam email, it will refuse to do so. But when we ask the same question with the feature artificially activated sufficiently strongly, this overcomes Claude's harmlessness training and it responds by drafting a scam email."
Fourier features arise in learning systems like neural networks due to the downstream invariance of the learner, which becomes insensitive to certain transformations, e.g., planar translation or rotation.
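One way to see why invariance forces Fourier structure, as a toy numpy check rather than the paper's argument: a linear map that commutes with circular translation is a circulant matrix, and circulant matrices are diagonalized by the discrete Fourier basis, so translation invariance singles out Fourier features.

```python
import numpy as np

n = 8
rng = np.random.default_rng(0)
kernel = rng.normal(size=n)

# A circulant matrix: every row is a circular shift of the same kernel.
C = np.stack([np.roll(kernel, i) for i in range(n)])

# The cyclic shift operator (a 1-D stand-in for planar translation).
S = np.roll(np.eye(n), 1, axis=0)
assert np.allclose(C @ S, S @ C)              # C commutes with translation

# The unitary DFT matrix diagonalizes C: its eigenvectors are Fourier modes.
F = np.fft.fft(np.eye(n)) / np.sqrt(n)
D = F @ C @ F.conj().T
off_diag = D - np.diag(np.diag(D))
print(np.max(np.abs(off_diag)))               # ~1e-15: effectively diagonal
```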
OpenAI's new GPT-4o model is now available for free, but ChatGPT Plus subscribers still get access to more prompts and newer features. This article compares what's available to both free and paid users.
This article discusses cyclical encoding as an alternative to one-hot encoding for time series features in machine learning. Cyclical encoding provides the same information to the model with significantly fewer features.
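A minimal sketch of the contrast for an hour-of-day feature (my own illustrative example, not code from the article): one-hot encoding spends 24 columns and treats 23:00 as maximally different from 00:00, while cyclical sine/cosine encoding uses two columns and keeps adjacent hours adjacent on a circle.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hour": np.arange(24)})

# One-hot encoding: 24 columns, and 23:00 looks unrelated to 00:00.
one_hot = pd.get_dummies(df["hour"], prefix="hour")

# Cyclical encoding: 2 columns, and 23:00 sits next to 00:00 on the circle.
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)

print(one_hot.shape)                           # (24, 24)
print(df[["hour_sin", "hour_cos"]].round(2).head())
```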