This article explores the critical architectural decision of where to store conversation history when building AI agents. It examines how different storage strategies impact user experience, privacy, cost, and portability. The author compares service-managed versus client-managed storage models and details how modern APIs support both linear threads and forking/branching capabilities.
Key topics include:
* Service-Managed vs. Client-Managed storage tradeoffs
* Linear (single-threaded) vs. Forking-capable conversation models
* Strategies for context window management and compaction, such as truncation, summarization, and sliding windows (a sliding-window trim appears in the sketch after this list)
* How Microsoft Agent Framework abstracts these patterns using AgentSession and ChatHistoryProvider to ensure provider-agnostic code
* Practical implementation examples for the Responses API in different modes
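To make the two storage modes concrete, here is a minimal sketch against the OpenAI Responses API using the official Python SDK; the model name is a placeholder, and the sliding-window trim at the end illustrates just one of the compaction strategies above.

```
# Sketch: service-managed vs. client-managed conversation history with the
# OpenAI Responses API (model name is a placeholder).
from openai import OpenAI

client = OpenAI()

# Service-managed: the API persists each turn; chain turns by response id.
first = client.responses.create(
    model="gpt-4.1-mini",
    input="What is context compaction?",
    store=True,
)
followup = client.responses.create(
    model="gpt-4.1-mini",
    input="Summarize that in one sentence.",
    previous_response_id=first.id,  # reusing an older id instead forks the thread
)

# Client-managed: keep the transcript yourself and resend it every turn.
history = [{"role": "user", "content": "What is context compaction?"}]
resp = client.responses.create(model="gpt-4.1-mini", input=history, store=False)
history.append({"role": "assistant", "content": resp.output_text})

# Naive sliding-window compaction: keep only the most recent turns.
MAX_TURNS = 20
history = history[-MAX_TURNS:]
```

In the service-managed mode the thread is linear by default, but pointing `previous_response_id` at an earlier response is what enables the forking/branching behavior the article describes.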
A new ETH Zurich study challenges the common practice of using `AGENTS.md` files with AI coding agents. LLM-generated context files decrease performance (3% lower success rate, +20% steps/costs). Human-written files offer small gains (4% higher success rate) but also increase costs. The researchers recommend omitting context files unless they are manually written and contain non-inferable details (tooling, build commands). They tested this using a new dataset, AGENTbench, with four agents.
RAG combines language models with external knowledge. This article explores context and retrieval in RAG, covering search methods (keyword matching, TF-IDF, embeddings with FAISS/Chroma), context-length challenges (compression, re-ranking), and contextual retrieval that incorporates the query and conversation history.
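As a rough illustration of the embedding-based search the article covers, here is a minimal sketch using FAISS with a sentence-transformers encoder; the model name is one common choice, not necessarily what the article uses.

```
# Sketch: embedding search over a toy corpus with FAISS.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = [
    "FAISS performs fast similarity search over vectors.",
    "TF-IDF weighs rare terms more heavily.",
    "Paris is the capital of France.",
]

vecs = model.encode(docs).astype("float32")
faiss.normalize_L2(vecs)                  # normalize so inner product == cosine
index = faiss.IndexFlatIP(vecs.shape[1])  # exact inner-product index
index.add(vecs)

query = model.encode(["How do I search by meaning?"]).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 2)      # top-2 nearest documents
print([docs[i] for i in ids[0]])
```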
This research introduces Doc-to-LoRA (D2L), a method for efficiently processing long documents with Large Language Models (LLMs). D2L creates small, adaptable "LoRA" modules that distill key information from a document, allowing the LLM to answer questions without needing the entire document in memory. This significantly reduces latency and memory usage, enabling LLMs to handle contexts much longer than their original capacity and facilitating faster knowledge updates.
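The paper's training procedure isn't reproduced here, but a hedged sketch of what the inference side of a Doc-to-LoRA-style setup might look like with the `peft` library follows; the base model and adapter path are placeholders, and the distillation step that produces the adapter (the actual D2L contribution) is assumed to have already run.

```
# Sketch: answering about a document by loading its distilled LoRA adapter
# instead of putting the document text into the prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder base model
tok = AutoTokenizer.from_pretrained("gpt2")

# Load the per-document adapter (placeholder path).
model = PeftModel.from_pretrained(base, "adapters/contract_1234")

prompt = "What is the termination clause in this contract?"
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```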
Here's the simplest version, key sentence extraction:
```
def extract_relevant_sentences(document, query, top_k=5):
    """Return the top_k sentences most similar to the query.

    Assumes embed() and cosine_sim() helpers are defined elsewhere.
    """
    sentences = [s.strip() for s in document.split('.') if s.strip()]
    query_embedding = embed(query)
    scored = []
    for sentence in sentences:
        similarity = cosine_sim(query_embedding, embed(sentence))
        scored.append((sentence, similarity))
    scored.sort(key=lambda x: x[1], reverse=True)  # highest similarity first
    return '. '.join(s[0] for s in scored[:top_k])
```
For each sentence, compute its similarity to the query. Keep the top five. Discard the rest.
mcp-cli is a lightweight CLI that enables dynamic discovery of MCP servers, reducing token consumption and making tool interactions more efficient for AI coding agents.
A Python implementation of Recursive Language Models (RLMs) for processing unbounded context lengths: process 100k+ tokens with any LLM by storing the context in variables instead of prompts.
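A hedged sketch of the core idea, with illustrative helper names rather than the repo's actual API:

```
# The long context lives in a Python variable and the model inspects it
# through small tool calls, so no single prompt ever holds the full text.
import re

context = open("huge_document.txt").read()  # 100k+ tokens, kept out of the prompt

def peek(start: int, end: int) -> str:
    """Return a character slice of the stored context."""
    return context[start:end]

def grep(pattern: str, window: int = 200) -> list[str]:
    """Return a snippet of context around each regex match."""
    return [
        context[max(m.start() - window, 0) : m.end() + window]
        for m in re.finditer(pattern, context)
    ]

# An agent loop would expose peek/grep as tools; each LLM call then sees only
# the short snippets it asked for, so any model can work over the full text.
```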
This blog post explains that Large Language Models (LLMs) don't need to understand the Model Context Protocol (MCP) to utilize tools. MCP standardizes tool calling, simplifying agent development for developers while the LLM simply generates tool call suggestions based on provided definitions. The article details tool calling, MCP's function, and how it relates to context engineering.
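A minimal sketch of what the model actually sees and emits, using an OpenAI-style function schema as one common format; the tool and its arguments are made up.

```
# The model only ever sees definitions like `tools` and emits structured
# call suggestions like `suggested_call`; it never needs to know MCP exists.
tools = [{
    "type": "function",
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

# What comes back from the model is just structured text:
suggested_call = {"name": "get_weather", "arguments": '{"city": "Zurich"}'}

# The host application executes the call, not the model. With MCP, the host
# discovers `tools` from an MCP server and routes `suggested_call` to it.
```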
This article discusses the importance of knowledge graphs in providing context for AI agents, highlighting their advantages over traditional retrieval systems in terms of precision, reasoning, and explainability.
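As a toy illustration of the reasoning advantage, a multi-hop question becomes two explicit, citable edge traversals over a graph (all data here is made up):

```
# "Where is the company that Acme acquired headquartered?" as two hops,
# each one an explainable piece of evidence, unlike a fuzzy passage match.
graph = {
    ("Acme", "acquired"): ["BetaCorp"],
    ("BetaCorp", "headquartered_in"): ["Zurich"],
}

def hop(entity: str, relation: str) -> list[str]:
    return graph.get((entity, relation), [])

for company in hop("Acme", "acquired"):
    print(company, "->", hop(company, "headquartered_in"))
```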
>"This document provides a comprehensive overview of the engineering repository, which implements a systematic approach to context engineering for Large Language Models (LLMs). The repository bridges theoretical foundations with practical implementations, using a biological metaphor to organize concepts from simple prompts to complex neural field systems."