Tags: multimodal*


  1. LLMII uses a local LLM to index images and generate metadata, with no reliance on a cloud service or database. A vision-language model runs on your computer to create captions and keywords for the images in a directory tree, and the generated information is then written into each image file's metadata.
  2. This post explores how developers can leverage Gemini 2.5 to build sophisticated robotics applications, focusing on semantic scene understanding, spatial reasoning with code generation, and interactive robotics applications using the Live API. It also highlights safety measures and current applications by trusted testers.
  3. Alibaba Cloud released its Qwen2.5-Omni-7B multimodal AI model, designed for cost-effective AI agents and capable of processing various inputs like text, images, audio, and video.
    2025-03-27 by klotz
  4. Mistral Small 3.1 is an open-source multimodal AI model optimized for consumer hardware, offering strong performance in text and image processing, multilingual capabilities, and a balance between performance and accessibility. While excelling in many areas, it has limitations in long-context tasks and Middle Eastern language support.
  5. Mistral Small 3.1 is a cutting-edge, open-source AI model released by Mistral AI, designed for efficiency and excelling in multimodal and multilingual tasks. It supports a 128k token context window and is optimized for real-time conversational AI and domain-specific fine-tuning.
  6. Qwen2.5-VL, the latest vision-language model from Qwen, showcases enhanced image recognition, agentic behavior, video comprehension, document parsing, and more. It outperforms previous models in various benchmarks and tasks, offering improved efficiency and performance.
  7. Learn how to run Llama 3.2-Vision locally in a chat-like mode, and explore its multimodal skills in a Colab notebook.
  8. This article discusses multimodal Retrieval-Augmented Generation (RAG) systems, which use AI to process various file types. It provides a beginner-friendly guide with example Python code and explains three levels of multimodal RAG systems.
    2024-12-07 by klotz
  9. SmolVLM is a compact, efficient multimodal model designed for tasks involving text and image inputs, producing text outputs. It is capable of answering questions about images, describing visual content, and functioning as a pure language model without visual inputs. Developed for on-device applications, SmolVLM is lightweight yet performs well in multimodal tasks.
    2024-11-28 by klotz
  10. Meta releases Llama 3.2, which features small and medium-sized vision LLMs (11B and 90B) alongside lightweight text-only models (1B and 3B). It also introduces the Llama Stack Distribution.
    2024-09-29 by klotz
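The simplest multimodal RAG level summarized in item 8 (the linked article covers the details) reduces every modality to text: image content becomes a caption, captions and text chunks share one index, and retrieval is plain text similarity. The sketch below illustrates that idea only; the bag-of-words "embedding" and the tiny corpus are toy stand-ins, not the article's actual code, and a real system would use a multimodal embedding model instead.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# One index for all modalities: text chunks and image captions side by side.
corpus = {
    "doc1": "quarterly revenue table for 2024",
    "img1": "caption: bar chart of revenue by region",
    "img2": "caption: photo of the engineering team",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k corpus keys most similar to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(corpus[d])), reverse=True)
    return ranked[:k]

print(retrieve("revenue chart"))  # image caption and text chunk both surface
```

Because captions live in the same index as text, a text query can retrieve an image; the higher RAG "levels" in the article replace this caption detour with embeddings computed directly from the image.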


SemanticScuttle - klotz.me: tagged with "multimodal"

Propulsed by SemanticScuttle