0 bookmark(s) - Sort by: Date ↓ / Title /
An analysis of how well different AI systems perform in describing images and answering questions about them. The article compares ChatGPT, Gemini, Llama, and Claude using four images: a hand, a bottle of wine, a piece of pastry, and a flower.
A Chrome extension using AI (LLaVa) to generate descriptive filenames for images when downloading them.
SmolVLM2 represents a shift in video understanding technology by introducing efficient models that can run on various devices, from phones to servers. The release includes models of three sizes (2.2B, 500M, and 256M) with Python and Swift API support. These models offer video understanding capabilities with reduced memory consumption, supported by a suite of demo applications for practical use.
The Lucid Vision Extension integrates advanced vision models into textgen-webui, enabling contextualized conversations about images and direct communication with vision models.
Qwen2.5-VL-3B-Instruct is the latest addition to the Qwen family of vision-language models by Hugging Face, featuring enhanced capabilities in understanding visual content and generating structured outputs. It is designed to directly interact with tools and use computer and phone functions as a visual agent. Qwen2.5-VL can comprehend videos up to an hour long and localize objects within images using bounding boxes or points. It is available in three sizes: 3, 7, and 72 billion parameters.
This blog post explores scaling ColPali for efficient document retrieval across large collections of PDFs using Vespa's phased retrieval and ranking pipeline, including the use of a hamming-based MaxSim similarity function.
First / Previous / Next / Last
/ Page 1 of 0