SemanticScuttle - klotz.me » klotz: multimodal+llm

klotz: multimodal* + llm*

Introducing Qwen2.5-VL: Advanced Vision-Language Model Capabilities

Qwen2.5-VL, the latest vision-language model from Qwen, showcases enhanced image recognition, agentic behavior, video comprehension, document parsing, and more. It outperforms previous models on a range of benchmarks and tasks while also improving efficiency.

2025-02-09 Tags: qwen2.5-vl, vision-language model, image recognition, document parsing, ocr, multimodal, llm, machine learning by klotz

Chat with Your Images Using Llama 3.2-Vision Multimodal LLMs

Learn how to build Llama 3.2-Vision locally in a chat-like mode, and explore its multimodal skills in a Colab notebook.

2024-12-08 Tags: llama 3.2-vision, multimodal, llm, vision, machine learning by klotz

Multimodal RAG: Process Any File Type with AI

This article discusses building multimodal Retrieval-Augmented Generation (RAG) systems, which can process a variety of file types. It offers a beginner-friendly guide with example Python code and explains the three levels of multimodal RAG systems.

2024-12-07 Tags: multimodal, rag, llm, python, search by klotz

HuggingFaceTB/SmolVLM-Instruct

SmolVLM is a compact, efficient multimodal model designed for tasks involving text and image inputs, producing text outputs. It is capable of answering questions about images, describing visual content, and functioning as a pure language model without visual inputs. Developed for on-device applications, SmolVLM is lightweight yet performs well in multimodal tasks.

2024-11-28 Tags: smolvlm, multimodal, llm, t, huggingface by klotz

Llama 3.2 Guide: How It Works, Use Cases & More

Meta releases Llama 3.2, which features small and medium-sized vision LLMs (11B and 90B) alongside lightweight text-only models (1B and 3B). It also introduces the Llama Stack Distribution.

2024-09-29 Tags: llama 3.2, multimodal, vision, llm by klotz

The Next Big Trends in Large Language Model (LLM) Research

Explores recent trends in LLM research, including multimodal LLMs, open-source LLMs, domain-specific LLMs, LLM agents, smaller LLMs, and non-Transformer LLMs. Mentions examples such as OpenAI's Sora, LLM360, BioGPT, StarCoder, and Mamba.

2024-07-05 Tags: llm, multimodal, agent, small language models, domain language models by klotz

How to Fine-tune Florence-2 for Object Detection Tasks

This article provides a step-by-step guide on fine-tuning the Florence-2 model for object detection tasks, including loading the pre-trained model, fine-tuning with a custom dataset, and evaluating the model's performance.

2024-06-26 Tags: florence-2, object detection, multimodal, llm, vision, microsoft, fine tuning by klotz

emcf/thepipe: Feed PDFs, URLs, Slides, YouTube, GitHub, and more into Vision-Language models with one line of code ⚡

The Pipe is a multimodal-first tool for feeding files and web pages into vision-language models such as GPT-4V. It is best for LLM and RAG applications that want to support comprehensive textual and visual understanding across a wide range of data sources. The Pipe is available as a 24/7 hosted API at thepi.pe, or it can be set up locally so that you run the compute yourself.

2024-05-04 Tags: github, thepipe, vision-language models, multimodal, llm by klotz
