Docling is a tool that parses documents and exports them to formats such as Markdown and JSON. It supports various document formats and provides advanced PDF understanding, metadata extraction, and integration with LlamaIndex and LangChain for RAG / QA applications.
Docling is a tool that parses documents and exports them to formats such as Markdown and JSON. Supported input formats include PDF, DOCX, PPTX, images, HTML, AsciiDoc, and Markdown.
A new plugin for LLM, llm-jq, generates and executes jq programs based on human-language descriptions, allowing users to manipulate JSON data without needing to write jq syntax.
The author records a screen capture of their Gmail account and uses Google Gemini to extract numeric values from the video.
A comprehensive guide that tests the structured output capabilities of Google Gemini, Anthropic Claude, and OpenAI GPT. It finds that OpenAI GPT-4o offers the most consistent structured outputs out of the box.
Weaviate introduces StructuredRAG, a benchmark to evaluate LLMs' ability to generate reliable JSON outputs. The study finds that while LLMs perform well on simpler tasks, they struggle with more complex outputs.
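To make "reliable JSON output" concrete, here is a minimal sketch of one way such a check could be scored. It is not the benchmark's own code: the function names `follows_format` and `success_rate`, the sample outputs, and the expected schema are all hypothetical and only illustrate the idea of measuring how often a model response parses as JSON with the requested keys and types.

```python
import json


def _type_ok(value, expected_type):
    # In Python, bool is a subclass of int, so reject booleans unless bool was requested.
    if isinstance(value, bool) and expected_type is not bool:
        return False
    return isinstance(value, expected_type)


def follows_format(raw, expected):
    """Return True if `raw` parses as a JSON object with exactly the expected keys and types."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict) or set(parsed) != set(expected):
        return False
    return all(_type_ok(parsed[key], typ) for key, typ in expected.items())


def success_rate(outputs, expected):
    """Fraction of outputs that follow the expected format (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    return sum(follows_format(o, expected) for o in outputs) / len(outputs)


# Hypothetical schema and model responses, for illustration only.
expected = {"answer": str, "confidence": int}
outputs = [
    '{"answer": "Paris", "confidence": 5}',          # valid
    '{"answer": "Paris"}',                           # missing key
    'Sure! {"answer": "Paris", "confidence": 5}',    # extra prose, not pure JSON
    '{"answer": "Paris", "confidence": "high"}',     # wrong value type
]
print(success_rate(outputs, expected))
```

A strict check like this counts any surrounding prose as a failure, which is one reasonable design choice when the output is fed directly to a parser.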
We introduce NuExtract, a lightweight text-to-JSON LLM. NuExtract extracts arbitrarily complex information from text and turns it into structured data.
This article explores NuExtract, a family of Small Language Models (SLMs) for extracting structured data from text. The author, Fabio Matricardi, discusses using NuExtract to process candidate CVs for a database and highlights its benefits for privacy protection and running on less powerful computers.
NuExtract is a fine-tuned version of phi-3-mini for information extraction. It requires a JSON template describing the information to extract and an input text. It is also available in tiny (0.5B) and large (7B) versions.
NuExtract is a 3.8B parameter information extraction model fine-tuned from phi-3, designed to extract structured data from text using a JSON template.
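As an illustration of the "JSON template plus input text" workflow, the sketch below builds a prompt and parses a response using only the standard library. The delimiters (`<|input|>`, `### Template:`, `### Text:`, `<|output|>`, `<|end-output|>`) follow the layout as recalled from the NuExtract model card and should be verified against it before use. The helper names `build_nuextract_prompt` and `parse_output`, the CV template, and the sample text are hypothetical. No model is loaded here; the generated text is simulated.

```python
import json


def build_nuextract_prompt(template, text):
    """Combine a JSON template and an input text into a single prompt string.

    The delimiter layout is an assumption based on the model card.
    """
    schema = json.dumps(template, indent=4)
    return "<|input|>\n### Template:\n" + schema + "\n### Text:\n" + text + "\n<|output|>\n"


def parse_output(generated):
    """Extract the JSON that follows the output marker in the generated text.

    Raises json.JSONDecodeError if the marker is missing or the JSON is invalid.
    """
    _, _, tail = generated.partition("<|output|>")
    body = tail.split("<|end-output|>")[0]
    return json.loads(body)


# Hypothetical template for CV processing: empty strings and lists mark fields to fill.
template = {"name": "", "skills": [], "education": [{"degree": "", "institution": ""}]}
text = "Jane Doe holds an MSc from Example University and works with Python and SQL."

prompt = build_nuextract_prompt(template, text)

# Simulated model completion, standing in for a real generation call.
simulated = prompt + json.dumps({
    "name": "Jane Doe",
    "skills": ["Python", "SQL"],
    "education": [{"degree": "MSc", "institution": "Example University"}],
}) + "<|end-output|>"

record = parse_output(simulated)
print(record["name"], record["skills"])
```

Keeping prompt construction and output parsing in separate functions makes it easy to swap in a real generation step between them.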