An evaluation of Google's new multi-modal Gemma 4 model family, testing its performance across various sizes ranging from compact E2B versions to larger mixture-of-experts (MoE) models. The article explores how these models handle vision, audio, reasoning, and code generation tasks on consumer-grade hardware using tools such as LM Studio.
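For context on the LM Studio workflow the article uses: once a model is loaded, LM Studio serves it through an OpenAI-compatible endpoint (default `http://localhost:1234/v1`), so any OpenAI client can query it. A minimal sketch; the model identifier below is a placeholder, not a confirmed name:

```python
# Minimal sketch: querying a locally loaded model through LM Studio's
# OpenAI-compatible server (default http://localhost:1234/v1).
# "gemma-4-e2b" is a hypothetical identifier; use whatever name
# LM Studio reports for the model you actually have loaded.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="gemma-4-e2b",  # hypothetical identifier
    messages=[{"role": "user", "content": "Summarize the attention mechanism in two sentences."}],
    temperature=0.2,
)
print(response.choices[0].message.content)
```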
A comprehensive curated collection of Large Language Model (LLM) architecture figures and technical fact sheets. This gallery provides a visual and data-driven overview of modern model designs, ranging from classic dense architectures like GPT-2 to advanced sparse Mixture-of-Experts (MoE) systems and hybrid attention models. Users can explore detailed specifications including parameter scales, context windows, attention mechanisms, and intelligence indices for various prominent models.
Key features include:
* Detailed architecture fact sheets for a wide array of models such as Llama, DeepSeek, Qwen, Gemma, and Mistral.
* An architecture diff tool to compare two model designs side by side (a minimal sketch of the idea follows this list).
* Comparative analysis across dense, MoE, MLA, and hybrid decoder families.
* Links to original source articles and technical reports for deeper research.
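To illustrate the kind of comparison the diff tool performs, here is a minimal sketch that diffs two fact sheets represented as dictionaries. The field names and values are illustrative, not taken from the gallery's actual data:

```python
# Minimal sketch of an architecture "diff": compare two fact sheets
# field by field, flagging mismatches. Fields and values are illustrative.
SPECS = {
    "gpt2": {"params": "1.5B", "attention": "dense MHA", "context": 1024, "moe": False},
    "deepseek-v3": {"params": "671B", "attention": "MLA", "context": 128000, "moe": True},
}

def diff(a: str, b: str) -> None:
    """Print side-by-side differences between two model fact sheets."""
    left, right = SPECS[a], SPECS[b]
    for field in sorted(set(left) | set(right)):
        lv, rv = left.get(field, "-"), right.get(field, "-")
        marker = "  " if lv == rv else "* "   # "*" marks a difference
        print(f"{marker}{field:<10} {a}: {lv!s:<12} {b}: {rv!s}")

diff("gpt2", "deepseek-v3")
```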
Timer-S1 is a scalable Mixture-of-Experts time series model with 8.3B parameters that uses serial scaling and novel TimeMoE blocks to improve long-term forecasting accuracy.
We introduce Timer-S1, a strong Mixture-of-Experts (MoE) time series foundation model with 8.3B total parameters, 0.75B activated parameters per token, and a context length of 11.5K. To overcome the scalability bottleneck in existing pre-trained time series foundation models, we perform Serial Scaling in three dimensions: model architecture, dataset, and training pipeline. Timer-S1 integrates sparse TimeMoE blocks and generic TimeSTP blocks for Serial-Token Prediction (STP), a training objective that adheres to the serial nature of forecasting. The proposed paradigm introduces serial computations that improve long-term predictions while avoiding the costly rolling-style inference and pronounced error accumulation of standard next-token prediction. In pursuit of a high-quality and unbiased training dataset, we curate TimeBench, a corpus of one trillion time points, and apply meticulous data augmentation to mitigate predictive bias. We further pioneer a post-training stage, including continued pre-training and long-context extension, to enhance short-term and long-context performance. Evaluated on the large-scale GIFT-Eval leaderboard, Timer-S1 achieves state-of-the-art forecasting performance, attaining the best MASE and CRPS scores among pre-trained models. Timer-S1 will be released to facilitate further research.
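The contrast between rolling next-token inference and direct multi-horizon prediction can be sketched in a few lines. This is a generic illustration of the error-accumulation argument, not Timer-S1's actual STP implementation; `model_step` is a stand-in forecaster:

```python
import numpy as np

# Generic illustration of why rolling-style next-token forecasting
# accumulates error, versus emitting the whole horizon in one pass.
# `model_step` is a placeholder forecaster, NOT Timer-S1's architecture.

def model_step(context: np.ndarray) -> float:
    """One-step-ahead prediction (placeholder: mean of last 4 points)."""
    return float(context[-4:].mean())

def rolling_forecast(context: np.ndarray, horizon: int) -> np.ndarray:
    """Next-token style: each prediction is fed back in, so errors compound."""
    ctx = list(context)
    out = []
    for _ in range(horizon):
        y = model_step(np.asarray(ctx))
        out.append(y)
        ctx.append(y)  # predicted (possibly wrong) value re-enters the context
    return np.asarray(out)

def direct_forecast(context: np.ndarray, horizon: int) -> np.ndarray:
    """Multi-step style: the full horizon is produced from real context only."""
    return np.full(horizon, context[-4:].mean())

series = np.sin(np.linspace(0, 8 * np.pi, 128))
print(rolling_forecast(series[:96], 32)[:5])
print(direct_forecast(series[:96], 32)[:5])
```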
Sarvam AI is releasing Sarvam 30B and Sarvam 105B as open-source models, trained from scratch on large-scale, high-quality datasets. These models demonstrate strong reasoning, programming, and agentic capabilities, with optimizations for efficient deployment across various hardware. Sarvam 30B powers Samvaad, while Sarvam 105B powers Indus. The release covers the model architecture, training process, benchmark results, and inference optimizations; the models are available on AI Kosh and Hugging Face, with demonstrated performance in real-world applications such as webpage generation, JEE problem solving, and conversational agents.
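If the weights follow the standard Hugging Face layout, loading them would look like the usual `transformers` pattern. The repository id below is a guess, not taken from the release; check the actual model page:

```python
# Minimal sketch of loading the released weights with Hugging Face
# transformers, assuming a standard causal-LM layout.
# "sarvamai/sarvam-30b" is a hypothetical repo id; use the real one.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sarvamai/sarvam-30b"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Solve: if 2x + 3 = 11, what is x?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```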
By mid-2025 China had become a global leader in open-source large language models (LLMs). According to Chinese state media, by July 2025 China accounted for 1,509 of the world's ~3,755 publicly released LLMs, far more than any other country. This explosion reflects heavy state and industry investment in domestic AI, open licensing (often Apache- or MIT-style), and a strategic pivot by Chinese tech giants and startups toward publicly shared models. The result is a "revival" of open-source AI, with dozens of Chinese LLMs now available for download or use via Hugging Face, GitHub, or cloud APIs. These range from general-purpose foundation models with tens of billions of parameters to specialized chatbots and domain experts, many built on Mixture-of-Experts (MoE) architectures.
An in-depth look at the architecture of OpenAI's GPT-OSS models, detailing tokenization, embeddings, transformer blocks, Mixture of Experts, attention mechanisms (GQA and RoPE), and quantization techniques.
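As a refresher on grouped-query attention (GQA), one of the mechanisms the article covers: several query heads share each key/value head, shrinking the KV cache. A minimal PyTorch sketch; the dimensions are illustrative, not GPT-OSS's actual configuration:

```python
import torch
import torch.nn.functional as F

# Minimal grouped-query attention (GQA) sketch: n_q query heads share
# n_kv key/value heads by repeating the KV heads across groups.
# Dimensions are illustrative, not GPT-OSS's actual sizes.
B, T, n_q, n_kv, d = 1, 8, 8, 2, 16   # batch, seq len, q heads, kv heads, head dim

q = torch.randn(B, n_q, T, d)
k = torch.randn(B, n_kv, T, d)
v = torch.randn(B, n_kv, T, d)

# Each KV head serves n_q // n_kv query heads.
k = k.repeat_interleave(n_q // n_kv, dim=1)   # (B, n_q, T, d)
v = v.repeat_interleave(n_q // n_kv, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 8, 16])
```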
A 120 billion parameter OpenAI model can now run on consumer hardware thanks to its Mixture of Experts (MoE) design, which activates only a fraction of the parameters per token; this cuts per-token compute and memory traffic enough that most of the model can be processed from CPU RAM, with key parts offloaded to a modest GPU.
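One common way to get that CPU/GPU split is partial layer offload. A minimal sketch with `llama-cpp-python`; the GGUF filename and layer count are placeholders to tune for your machine:

```python
# Minimal sketch: keep most weights in CPU RAM and offload only a few
# layers to a modest GPU via llama-cpp-python. The model path and
# n_gpu_layers value are placeholders; tune them to your hardware.
from llama_cpp import Llama

llm = Llama(
    model_path="gpt-oss-120b-Q4.gguf",  # placeholder GGUF file
    n_gpu_layers=12,   # offload only as many layers as fit in VRAM
    n_ctx=4096,
)
result = llm("Explain mixture-of-experts in one sentence.", max_tokens=64)
print(result["choices"][0]["text"])
```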
This article details 7 lessons the author learned while self-hosting Large Language Models (LLMs), covering memory bandwidth, quantization, electricity costs, hardware options beyond Nvidia, prompt engineering, Mixture of Experts models, and the value of starting with simpler tools like LM Studio.
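The memory-bandwidth lesson reduces to simple arithmetic: token generation is bandwidth-bound, so throughput is roughly bandwidth divided by the bytes read per token (the active weights). A back-of-envelope sketch with illustrative round numbers:

```python
# Back-of-envelope: decode speed ~ memory bandwidth / bytes read per
# generated token (the active weights). Numbers are illustrative.

def rough_tokens_per_sec(bandwidth_gb_s: float, active_params_b: float,
                         bytes_per_param: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Dense 70B model at 4-bit (~0.5 bytes/param) on dual-channel DDR5 (~90 GB/s):
print(f"{rough_tokens_per_sec(90, 70, 0.5):.1f} tok/s")  # ~2.6
# MoE with only ~5B active params per token on the same machine:
print(f"{rough_tokens_per_sec(90, 5, 0.5):.1f} tok/s")   # ~36
```

This is also why MoE models punch above their weight on CPUs: only the active parameters per token have to cross the memory bus.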