A recent study finds that a large language model (LLM) can analyze language at a level rivaling human linguistics graduate students. The researchers tested several LLMs on complex linguistic tasks, including recursion and phonological rule inference, and found that OpenAI's o1 significantly outperformed the rest, challenging conventional views on the limits of AI in understanding language.
This article details the process of building a fast vector search system over a large legal dataset (Australian High Court decisions): choosing an embedding provider, benchmarking performance, indexing Isaacus embeddings with USearch, and why API terms of service matter. The focus is on speed and scalability while maintaining reasonable accuracy.
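For flavor, here is a minimal USearch sketch along the lines the article describes: a cosine index over precomputed decision embeddings. The dimension and the random stand-in vectors are placeholders (the article uses Isaacus embeddings), not the author's actual code:

```python
import numpy as np
from usearch.index import Index

# Cosine-metric index; ndim must match the embedding model's output size.
# 1024 is a placeholder dimension, as are the random stand-in vectors below.
dim = 1024
index = Index(ndim=dim, metric="cos", dtype="f32")

embeddings = np.random.rand(20_000, dim).astype(np.float32)  # stand-ins for decision embeddings
index.add(np.arange(len(embeddings)), embeddings)            # integer keys -> vectors

matches = index.search(embeddings[0], 10)  # top-10 nearest decisions
print(matches.keys, matches.distances)
```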
An extremely lightweight universal grammar implementation with provable recursion, based on Chomsky's Minimalist Grammar theory, fitting in under 50kB with zero runtime dependencies. It includes a probabilistic language model extension and formal verification.
This page details the command-line utility for the Embedding Atlas, a tool for exploring large text datasets with metadata. It covers installation, data loading (local and Hugging Face), visualization of embeddings using SentenceTransformers and UMAP, and usage instructions with available options.
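The pipeline the blurb names is the standard embed-then-project one; a rough Python equivalent of what the tool runs under the hood (the model name and toy texts are illustrative assumptions, not the CLI's defaults):

```python
import umap
from sentence_transformers import SentenceTransformer

texts = ["first document", "second document", "third document", "fourth document"]
model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative model choice
embeddings = model.encode(texts)                  # (n_docs, 384) float vectors

# Project to 2-D coordinates for the map view; n_neighbors lowered for this toy corpus.
xy = umap.UMAP(n_components=2, n_neighbors=2).fit_transform(embeddings)
print(xy.shape)  # (n_docs, 2)
```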
Python tutorial on reproducible labeling of cutting-edge topic models with GPT-4o mini. The article details training a FASTopic model and labeling its topics with GPT-4o mini, emphasizing reproducibility and control over the labeling process.
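A condensed sketch of that pipeline, assuming the `fastopic` and `openai` Python packages; the corpus and the prompt wording are placeholders, and `temperature=0` plus a fixed `seed` is the usual lever for reproducible labels:

```python
from fastopic import FASTopic
from openai import OpenAI

docs = ["..."]  # placeholder: substitute a real corpus of documents

# Train the topic model and get each topic's top words.
model = FASTopic(10)
top_words, doc_topic_dist = model.fit_transform(docs)

# Label each topic deterministically: temperature=0 and a fixed seed.
client = OpenAI()
for topic_id, words in enumerate(top_words):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        seed=42,
        messages=[{"role": "user",
                   "content": f"Give a 2-4 word label for a topic whose top words are: {words}"}],
    )
    print(topic_id, resp.choices[0].message.content)
```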
An article discussing the use of subpixel rendering to produce legible text on a very small LCD, achieving a 40-character terminal on a 24 mm × 24 mm screen with a resolution of 240 × 240 pixels.
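The core trick is easy to sketch: render the glyphs at 3× horizontal resolution, then drive each output pixel's R, G, and B channels with three adjacent samples. A minimal Python illustration (it assumes an RGB-ordered horizontal subpixel layout; a real display also wants gamma correction and color-fringe filtering, which are omitted here):

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

W, H = 240, 240  # the display's pixel resolution

# Render text at triple horizontal resolution: one sample per subpixel.
hires = Image.new("L", (W * 3, H), 0)
ImageDraw.Draw(hires).text((0, 100), "subpixel text demo",
                           font=ImageFont.load_default(), fill=255)

# Pack each horizontal triplet of samples into one pixel's R, G, B channels,
# tripling the effective horizontal text resolution.
frame = np.asarray(hires, dtype=np.uint8).reshape(H, W, 3)
Image.fromarray(frame, "RGB").save("subpixel.png")
```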
Minuet: Dance with LLM in Your Code
Multi-class zero-shot embedding classification and error checking. This project improves zero-shot image/text classification using a novel dimensionality reduction technique and pairwise comparison, resulting in increased agreement between text and image classifications.
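As a baseline for what the project improves on, plain zero-shot embedding classification is just nearest-class cosine similarity; this sketch does not reproduce the project's dimensionality reduction or pairwise error check:

```python
import numpy as np

def zero_shot_classify(item_emb, class_embs):
    """Return the index of the class whose (text) embedding is most
    cosine-similar to the item's embedding, plus all scores."""
    item = item_emb / np.linalg.norm(item_emb)
    classes = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    scores = classes @ item
    return int(np.argmax(scores)), scores

# Toy usage with random stand-in embeddings (real ones would come from a
# shared image/text encoder such as CLIP).
rng = np.random.default_rng(0)
label, scores = zero_shot_classify(rng.normal(size=512), rng.normal(size=(5, 512)))
```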
A post with pithy observations and clear conclusions from building complex LLM workflows, covering topics like prompt chaining, data structuring, model limitations, and fine-tuning strategies.
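Of the topics listed, prompt chaining is the easiest to show in miniature: each step's output becomes the next step's input. A generic illustration, not the post's code (the model name is an assumption):

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; swap in your own
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Chain: extract structured facts first, then summarize from those facts.
report = "Q3 revenue rose 12% while churn fell to 3%..."
facts = ask(f"List the key facts in this report as bullet points:\n{report}")
summary = ask(f"Write a one-sentence executive summary from these facts:\n{facts}")
```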
This article details the often-overlooked cost of storing embeddings for RAG systems and shows how quantization techniques (int8 and binary) can significantly reduce storage requirements and speed up retrieval without substantial accuracy loss.
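The arithmetic behind the savings is straightforward: float32 costs 4 bytes per dimension, int8 costs 1 (4× smaller), and binary keeps one bit per dimension (32× smaller). A minimal numpy sketch of both schemes, not tied to any particular RAG stack:

```python
import numpy as np

def quantize_int8(embs: np.ndarray):
    # Symmetric per-dimension scaling into [-127, 127]: 4x smaller than float32.
    scale = np.maximum(np.abs(embs).max(axis=0) / 127.0, 1e-12)
    return np.round(embs / scale).astype(np.int8), scale

def quantize_binary(embs: np.ndarray) -> np.ndarray:
    # Keep only the sign of each dimension, packed 8 dims per byte: 32x smaller.
    return np.packbits(embs > 0, axis=1)

def hamming_search(query_bits: np.ndarray, db_bits: np.ndarray, k: int = 10):
    # XOR then popcount gives Hamming distance; smallest distances rank first.
    dists = np.unpackbits(query_bits ^ db_bits, axis=1).sum(axis=1)
    return np.argsort(dists)[:k]
```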