Learn how to build Llama 3.2-Vision locally in a chat-like mode, and explore its Multimodal skills on a Colab notebook.
This article discusses the development of multimodal Retrieval Augmented Generation (RAG) systems which allow for the processing of various file types using AI. The article provides a beginner-friendly guide with example Python code and explains the three levels of multimodal RAG systems.
SmolVLM is a compact, efficient multimodal model designed for tasks involving text and image inputs, producing text outputs. It is capable of answering questions about images, describing visual content, and functioning as a pure language model without visual inputs. Developed for on-device applications, SmolVLM is lightweight yet performs well in multimodal tasks.
Meta releases Llama 3.2, which features small and medium-sized vision LLMs (11B and 90B) alongside lightweight text-only models (1B and 3B). It also introduces the Llama Stack Distribution.
Explores recent trends in LLM research, including multi-modal LLMs, open-source LLMs, domain-specific LLMs, LLM agents, smaller LLMs, and Non-Transformer LLMs. Mentions examples such as OpenAI's Sora, LLM360, BioGPT, StarCoder, and Mamba.
This article provides a step-by-step guide on fine-tuning the Florence-2 model for object detection tasks, including loading the pre-trained model, fine-tuning with a custom dataset, and evaluating the model's performance.
The Pipe is a multimodal-first tool for feeding files and web pages into vision-language models such as GPT-4V. It is best for LLM and RAG applications that want to support comprehensive textual and visual understanding across a wide range of data sources. The Pipe is available as a 24/7 hosted API at thepi.pe, or it can be set up locally to let you run the compute.