SemanticScuttle - klotz.me » Tags: architecture

Timeouts, Retries and Idempotency In Distributed Systems

Sam Newman discusses the three golden rules of distributed computing and how they necessitate robust handling of timeouts, retries, and idempotency. He provides practical, data-driven strategies for implementing these principles, including using request IDs and server-side fingerprinting to create safe, resilient distributed systems.
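As a rough illustration of how request IDs and server-side fingerprinting can make retries safe, here is a minimal in-memory sketch. It is not taken from the talk: `IdempotentServer`, `FlakyTransport`, `call_with_retries`, and the `deposit` operation are hypothetical names, and backoff and expiry of stored responses are left out for brevity.

```python
import hashlib
import json


class IdempotentServer:
    """Toy in-memory server that deduplicates requests by request ID."""

    def __init__(self):
        self._seen = {}  # request_id -> (fingerprint, response)
        self.balance = 0

    @staticmethod
    def _fingerprint(payload):
        # Server-side fingerprint: hash of the canonicalized payload.
        canonical = json.dumps(payload, sort_keys=True).encode("utf-8")
        return hashlib.sha256(canonical).hexdigest()

    def deposit(self, request_id, payload):
        fingerprint = self._fingerprint(payload)
        if request_id in self._seen:
            stored_fingerprint, stored_response = self._seen[request_id]
            if stored_fingerprint != fingerprint:
                # Same ID, different body: treat as a client bug, not a retry.
                raise ValueError("request ID reused with a different payload")
            return stored_response  # replay the original result, no side effect
        self.balance += payload["amount"]
        response = {"status": "ok", "balance": self.balance}
        self._seen[request_id] = (fingerprint, response)
        return response


class FlakyTransport:
    """Applies the request, then loses the first response to mimic a timeout."""

    def __init__(self, server):
        self.server = server
        self.calls = 0

    def send(self, request_id, payload):
        self.calls += 1
        response = self.server.deposit(request_id, payload)
        if self.calls == 1:
            raise TimeoutError("response lost in transit")
        return response


def call_with_retries(send, request_id, payload, max_attempts=3):
    """Retry on TimeoutError, reusing the same request ID on every attempt."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return send(request_id, payload)
        except TimeoutError as exc:
            last_error = exc
    raise last_error


server = IdempotentServer()
transport = FlakyTransport(server)
result = call_with_retries(transport.send, "req-1", {"amount": 50})
```

The first attempt changes the balance but its response is lost; the retry carries the same request ID, so the server replays the stored response instead of applying the deposit twice.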

2025-08-21 Tags: distributed systems, timeouts, retries, idempotency, resilience, microservices, system design, fault tolerance, architecture, production engineering by klotz

How the Ancient Greeks Built Their Magnificent Temples: The Art of Ancient Engineering | Open Culture

This article explores the construction and evolution of ancient Greek temples, highlighting the three classical column styles – Doric, Ionic, and Corinthian – noting that the Corinthian style was most widely adopted in Roman civilization. It details the progression from early mud brick structures to the enduring stone temples, exemplified by sites like Temple C in Selinus, Sicily, and the Temple of Apollo at Didyma, Turkey. The piece emphasizes the Greeks’ innovative use of columns, often inspired by sacred forests, and references related content showcasing reconstructions and replicas of ancient Greek

2025-07-20 Tags: ancient greece, architecture, ancient history, archaeology by klotz

The Big LLM Architecture Comparison

A detailed comparison of the architectures of recent large language models (LLMs) including DeepSeek-V3, OLMo 2, Gemma 3, Mistral Small 3.1, Llama 4, Qwen3, SmolLM3, and Kimi K2, focusing on key design choices and their impact on performance and efficiency.

1. **DeepSeek V3/R1**:
- Uses Multi-Head Latent Attention (MLA) and Mixture-of-Experts (MoE) for efficiency.
- MLA compresses key and value tensors to reduce KV cache memory usage.
- MoE activates only a subset of experts per token, improving inference efficiency.

2. **OLMo 2**:
- Focuses on transparency in training data and code.
- Uses RMSNorm layers placed after attention and feed-forward modules (Post-Norm).
- Introduces QK-Norm, an additional RMSNorm layer applied to queries and keys inside the attention mechanism.

3. **Gemma 3**:
- Employs sliding window attention to reduce memory requirements in the KV cache.
- Uses a 5:1 ratio of sliding window attention to global attention layers.
- Combines Pre-Norm and Post-Norm RMSNorm layers around the attention module.

4. **Mistral Small 3.1**:
- Outperforms Gemma 3 27B on several benchmarks while being faster.
- Uses a standard architecture with a custom tokenizer and reduced KV cache and layer count.

5. **Llama 4**:
- Adopts an MoE approach similar to DeepSeek V3 but with fewer, larger experts.
- Alternates MoE and dense modules, using an MoE module in every other transformer block.

6. **Qwen3**:
- Comes in both dense and MoE variants.
- Dense models are easier to fine-tune and deploy, while MoE models are optimized for scaling inference.

7. **SmolLM3**:
- Uses No Positional Embeddings (NoPE), omitting explicit positional information injection.
- NoPE improves length generalization, meaning performance deteriorates less with increased sequence length.

8. **Kimi K2 and Kimi K2 Thinking**:
- Uses a variant of the Muon optimizer instead of AdamW.
- Kimi K2 Thinking extends the context size to 256k tokens.

9. **GPT-OSS**:
- OpenAI's first open-weight models since GPT-2.
- Uses sliding window attention and a width-versus-depth trade-off.

10. **Grok 2.5**:
- Uses a small number of large experts and a shared expert module.
- Reflects an older trend in MoE architectures.

11. **GLM-4.5**:
- Comes in two variants: a 355-billion-parameter model and a more compact 106-billion-parameter version.
- Uses a shared expert and starts with several dense layers before introducing MoE blocks.

12. **Qwen3-Next**:
- Introduces a Gated DeltaNet + Gated Attention hybrid mechanism.
- Uses Multi-Token Prediction (MTP) for efficiency.

13. **MiniMax-M2**:
- Uses per-layer QK-Norm and partial RoPE.
- More "sparse" than Qwen3, with fewer active experts per token.

14. **Kimi Linear**:
- Modifies the linear attention mechanism with Kimi Delta Attention (KDA).
- Combines Gated DeltaNet with Multi-Head Latent Attention (MLA).

15. **Olmo 3 Thinking**:
- Uses sliding window attention and YaRN for context extension.
- Comes in base, instruct, and reasoning variants.

16. **DeepSeek V3.2**:
- Adds a sparse attention mechanism to improve efficiency.
- On par with GPT-5.1 and Gemini 3.0 Pro on certain benchmarks.

17. **Mistral 3**:
- First MoE model since Mixtral in 2023.
- Partnered with NVIDIA for optimization on Blackwell chips.

18. **Nemotron 3**:
- A Transformer-Mamba hybrid architecture.
- Interleaves Mamba-2 sequence-modeling blocks with sparse MoE feed-forward layers.

19. **Xiaomi MiMo-V2-Flash**:
- Uses sliding window attention in a 5:1 ratio with global attention.
- Employs multi-token prediction (MTP) for efficiency.

20. **Arcee AI Trinity Large**:
- Uses alternating local:global attention layers, NoPE, and gated attention.
- Introduces depth-scaled sandwich norm for training stability.
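
Two ideas recur across the models above: MoE routing that activates only a few experts per token, and sliding window attention interleaved with global attention at a fixed ratio. The sketch below illustrates both in plain Python. It is a simplified illustration, not any model's actual implementation: the function names are hypothetical, the softmax is taken over the selected logits only (real routers differ here), and placing the global layer last in each group of six is an assumption.

```python
import math


def top_k_route(router_logits, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in ranked)  # subtract for stability
    exps = [math.exp(router_logits[i] - peak) for i in ranked]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(ranked, exps)]


def attention_schedule(num_layers, local_per_global=5):
    """Label each layer 'local' (sliding window) or 'global' at a fixed ratio."""
    period = local_per_global + 1
    return ["global" if (layer + 1) % period == 0 else "local"
            for layer in range(num_layers)]


def sliding_window_mask(seq_len, window):
    """Causal mask: query q may attend to key k when q - window < k <= q."""
    return [[(q - window < k <= q) for k in range(seq_len)]
            for q in range(seq_len)]


routes = top_k_route([0.1, 2.0, -1.0, 2.0, 0.5], k=2)
schedule = attention_schedule(12)
mask = sliding_window_mask(4, window=2)
```

With a window of 2, each token attends only to itself and the previous token, which is why the KV cache for local layers can stay small regardless of sequence length.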

2026-01-29 Tags: llm, deep learning, architecture, deepseek, olmo, gemma, mistral, llama, qwen, smollm, kimi, moe, attention, transformers, sebastian raschka by klotz

A Developer’s Guide to Building Scalable AI: Workflows vs Agents

Understanding the architectural trade-offs between autonomous agents and orchestrated workflows — because someone needs to make this decision, and it might as well be you

2025-06-28 Tags: agents, workflows, llm, software, architecture by klotz

David Van Couvering | DVC Consulting, LLC

DVC Consulting offers senior technical leadership services on an ad-hoc basis, focusing on coaching, mentorship, system design, and software development practices. Ideal for organizations seeking expert guidance without the commitment of a full-time hire, and for individual developers looking for career advancement and leadership skills development.

2025-03-20 Tags: dvc consulting, software, architecture, david van couvering by klotz

How to Choose the Architecture for Your GenAI Application

Lak Lakshmanan provides a framework for choosing the architecture of a GenAI (Generative AI) application, balancing creativity and risk. The framework consists of eight patterns:

Generate Each Time: Invoke the LLM API for every request, suitable for high creativity and low-risk tasks like internal tools.

Response/Prompt Caching: Cache past prompts and responses to reduce cost and latency, ideal for medium creativity and low-risk tasks like internal customer support.

Pregenerated Templates: Use pre-vetted templates for repetitive tasks, reducing human review needs. Suitable for medium creativity and low-medium risk tasks.

Small Language Models (SLMs): Use smaller models for low creativity and low-risk tasks, reducing hallucinations and cost.

Assembled Reformat: Use LLMs for reformatting and summarization with pre-generated content, ensuring accuracy.

ML Selection of Template: Use machine learning to select appropriate pre-generated templates based on user context, balancing personalization with risk.

Fine-tune: Fine-tune LLMs to generate desired content while minimizing undesired outputs, addressing specific risks like brand voice or confidentiality.

Guardrails: Implement preprocessing, post-processing, and iterative prompting for high creativity and high-risk tasks, using off-the-shelf or custom-built guardrails.

This framework helps in balancing complexity, fit-for-purpose, risk, cost, and latency for each use case in GenAI applications.
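
To make the Response/Prompt Caching pattern concrete, here is a minimal sketch, assuming an exact-match cache keyed on a normalized prompt. `ResponseCache` and the `generate` callable are hypothetical stand-ins for whatever LLM client an application uses; the article does not prescribe this design, and a production cache would also need eviction and invalidation.

```python
import hashlib


class ResponseCache:
    """Exact-match cache in front of a (hypothetical) LLM call."""

    def __init__(self, generate):
        self._generate = generate  # assumed callable: prompt -> response text
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt):
        # Normalize case and whitespace so trivially different prompts match.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def ask(self, prompt):
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = self._generate(prompt)  # only pay for the LLM on a miss
        self._store[key] = response
        return response


calls = []


def fake_llm(prompt):
    # Stand-in for a real model call; records how often it is invoked.
    calls.append(prompt)
    return "answer " + str(len(calls))


cache = ResponseCache(fake_llm)
first = cache.ask("How do I reset my password?")
second = cache.ask("  how do I   reset my PASSWORD? ")
```

Both prompts normalize to the same key, so the second request is served from the cache without another model call.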

2024-10-04 Tags: llm, architecture, genai, lak lakshmanan by klotz

Diagram as Code

Diagrams is a tool that lets you draw cloud system architecture using Python code, supporting major cloud providers and on-premise nodes.

2024-09-28 Tags: diagrams, python, architecture, tools, iac, production engineering by klotz

How to Easily Draw Neural Network Architecture Diagrams | by Kenneth Leung | Towards Data Science

2023-04-02 Tags: neural network, architecture, diagrams, kenneth leung, computational neuroscience, neural networks by klotz

Stack overflow: PyFlink performance compared to Scala