The paper "Attention Is All You Need" introduces the Transformer, a novel architecture for sequence transduction that relies entirely on attention mechanisms, dispensing with recurrence and convolutions. Key aspects of the model include:
- Architecture: The Transformer consists of an encoder-decoder structure, with both components utilizing stacked layers of multi-head self-attention mechanisms and feed-forward networks. It avoids recurrence and convolutions, allowing for greater parallelism and faster training.
- Attention Mechanism: The model uses scaled dot-product attention, scaling the dot products by 1/√d_k so that the softmax does not saturate, and multi-head attention, which lets the model attend to information from different representation subspaces at different positions (a minimal sketch follows this list).
- Training and Regularization: The authors use the Adam optimizer with a learning rate schedule that increases the rate linearly over a warmup period and then decays it proportionally to the inverse square root of the step number (see the schedule sketch below). They also regularize the model with dropout and label smoothing.
- Performance: The Transformer achieves state-of-the-art results on machine translation benchmarks (WMT 2014 English-to-German and English-to-French), outperforming previous models with significantly less training time and computational resources.
- Generalization: The model demonstrates strong performance on tasks other than machine translation, such as English constituency parsing, indicating its versatility and ability to learn complex dependencies and structures.
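A minimal sketch of the scaled dot-product attention described above, written in PyTorch; the single-head simplification and the toy tensor shapes are illustrative assumptions, not the authors' reference implementation. Multi-head attention applies h such operations in parallel to learned linear projections of the queries, keys, and values and concatenates the results.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Compute softmax(QK^T / sqrt(d_k)) V for a single attention head."""
    d_k = q.size(-1)
    # Scale the dot products by 1/sqrt(d_k) so the softmax does not saturate.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v

# Toy usage: batch of 2 sequences, length 5, head dimension 64.
q = k = v = torch.randn(2, 5, 64)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 5, 64])
```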
The paper emphasizes the efficiency and scalability of the Transformer, highlighting its potential for various sequence transduction tasks, and provides a foundation for subsequent advancements in natural language processing and beyond.
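The warmup-then-decay schedule mentioned in the training bullet above follows a closed-form expression from the paper; the sketch below assumes the base-model values d_model = 512 and warmup_steps = 4000.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate rises linearly for the first 4000 steps, then decays as 1/sqrt(step).
for s in (1, 1000, 4000, 10000, 100000):
    print(s, round(transformer_lr(s), 6))
```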
This article is part of a series titled ‘LLMs from Scratch’, a complete guide to understanding and building Large Language Models (LLMs). In this article, we discuss the self-attention mechanism and how Transformers use it to create rich, context-aware embeddings.
The Self-Attention mechanism is used to add context to learned embeddings, which are vectors representing each word in the input sequence. The process involves the following steps:
1. Learned Embeddings: These are the initial vector representations of words, learned during the training phase. The weight matrix storing these embeddings is, in effect, the first layer of the Transformer: a lookup table mapping token ids to vectors.
2. Positional Encoding: This step adds positional information to the learned embeddings. Because Transformers process all words in parallel, they would otherwise have no notion of word order; positional encodings restore that information.
3. Self-Attention: The core of the mechanism is to update each learned embedding with context from the surrounding words in the input sequence. Self-attention determines how much each word contributes context to every other word, and this contextual information produces the final contextualized embeddings (a minimal sketch follows this list).
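A minimal sketch of these three steps in PyTorch; the vocabulary size, model dimension, sinusoidal encoding, and the single-head, unprojected attention (Q = K = V = x) are illustrative simplifications rather than a full Transformer layer.

```python
import math
import torch
import torch.nn.functional as F

vocab_size, d_model, seq_len = 1000, 64, 8
token_ids = torch.randint(0, vocab_size, (1, seq_len))

# 1. Learned embeddings: a trainable lookup table mapping token ids to vectors.
embedding = torch.nn.Embedding(vocab_size, d_model)
x = embedding(token_ids)                               # (1, seq_len, d_model)

# 2. Positional encoding: add sinusoidal position information to each vector.
pos = torch.arange(seq_len).unsqueeze(1)
div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(pos * div)
pe[:, 1::2] = torch.cos(pos * div)
x = x + pe

# 3. Self-attention: every position mixes in context from every other position.
scores = x @ x.transpose(-2, -1) / math.sqrt(d_model)  # simplified: Q = K = V = x
contextual = F.softmax(scores, dim=-1) @ x             # context-aware embeddings
print(contextual.shape)                                # torch.Size([1, 8, 64])
```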
In this article, we explore various aspects of BERT, including the landscape at the time of its creation, a detailed breakdown of the model architecture, and the construction of a task-agnostic fine-tuning pipeline, which we demonstrate using sentiment analysis. Despite being one of the earliest LLMs, BERT remains relevant today and continues to find applications in both research and industry.
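As a rough illustration of what such a fine-tuning pipeline can look like (not the article's exact code), the sketch below uses the Hugging Face transformers and datasets libraries; the bert-base-uncased checkpoint, the SST-2 slice, and the hyperparameters are assumptions made for the example.

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

# Assumed setup: binary sentiment classification on a small slice of SST-2.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

dataset = load_dataset("glue", "sst2", split="train[:2000]")
dataset = dataset.map(
    lambda ex: tokenizer(ex["sentence"], truncation=True,
                         padding="max_length", max_length=128),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-sst2", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
)
trainer.train()
```

Swapping in a different dataset and number of labels is all that changes for another classification task, which is what makes the pipeline largely task-agnostic.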
This paper introduces Cross-Layer Attention (CLA), an extension of Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) that reduces the size of the key-value cache in transformer-based autoregressive large language models (LLMs) by sharing key and value heads across adjacent layers. The authors demonstrate that CLA can reduce the cache size by a further 2x while maintaining nearly the same accuracy as unmodified MQA, enabling inference with longer sequence lengths and larger batch sizes.
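The sketch below is a schematic illustration of the cross-layer sharing idea under a sharing factor of 2 (not the paper's implementation): only every other layer computes and caches key/value projections, and the layer above it reuses them, halving what must be stored in the KV cache.

```python
import torch
import torch.nn as nn

class CLABlock(nn.Module):
    """Toy decoder layer: computes fresh queries, but may reuse K/V from the layer below."""
    def __init__(self, d_model, owns_kv):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.owns_kv = owns_kv
        if owns_kv:
            self.k_proj = nn.Linear(d_model, d_model)
            self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x, shared_kv=None):
        q = self.q_proj(x)
        if self.owns_kv:
            # Only this layer's K/V would enter the KV cache.
            shared_kv = (self.k_proj(x), self.v_proj(x))
        k, v = shared_kv  # non-owning layers reuse the K/V computed below them
        attn = torch.softmax(q @ k.transpose(-2, -1) / k.size(-1) ** 0.5, dim=-1)
        return attn @ v, shared_kv

# With a sharing factor of 2, only every other layer owns (and caches) K/V.
layers = nn.ModuleList([CLABlock(64, owns_kv=(i % 2 == 0)) for i in range(4)])
x, kv = torch.randn(1, 10, 64), None
for layer in layers:
    x, kv = layer(x, kv)
print(x.shape)  # torch.Size([1, 10, 64])
```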
Large language models exhibit surprising emergent generalization properties, yet also struggle on many simple reasoning tasks such as arithmetic and parity. This raises the question of if and when Transformer models can learn the true algorithm for solving a task. We study the scope of Transformers' abilities in the specific setting of length generalization on algorithmic tasks. Here, we propose a unifying framework to understand when and how Transformers can exhibit strong length generalization on a given task. Specifically, we leverage RASP (Weiss et al., 2021) -- a programming language designed for the computational model of a Transformer -- and introduce the RASP-Generalization Conjecture: Transformers tend to length generalize on a task if the task can be solved by a short RASP program which works for all input lengths. This simple conjecture remarkably captures most known instances of length generalization on algorithmic tasks. Moreover, we leverage our insights to drastically improve generalization performance on traditionally hard tasks (such as parity and addition). On the theoretical side, we give a simple example where the "min-degree-interpolator" model of learning from Abbe et al. (2023) does not correctly predict Transformers' out-of-distribution behavior, but our conjecture does. Overall, our work provides a novel perspective on the mechanisms of compositional generalization and the algorithmic capabilities of Transformers.
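To make the length-generalization setting concrete, here is a hedged sketch of the usual evaluation protocol for the parity task (train on short bit-strings, test on strictly longer ones); the length ranges and the oracle predictor are placeholders, not the paper's experimental setup.

```python
import random

def parity_example(length):
    """One parity instance: a bit-string and its label (sum of bits mod 2)."""
    bits = [random.randint(0, 1) for _ in range(length)]
    return bits, sum(bits) % 2

# Length generalization setup: train on short inputs, evaluate on strictly longer ones.
train_set = [parity_example(random.randint(1, 20)) for _ in range(10_000)]
test_set  = [parity_example(random.randint(21, 50)) for _ in range(1_000)]

def length_generalization_accuracy(predict):
    """Fraction of longer-than-training inputs the predictor labels correctly."""
    return sum(predict(bits) == label for bits, label in test_set) / len(test_set)

# A model that has learned the true algorithm generalizes to every length.
print(length_generalization_accuracy(lambda bits: sum(bits) % 2))  # 1.0
```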