Docling parses documents and exports them to formats such as Markdown and JSON. It accepts PDF, DOCX, PPTX, images, HTML, AsciiDoc, and Markdown as input.
A new plugin for LLM, llm-jq, generates and executes jq programs based on human-language descriptions, allowing users to manipulate JSON data without needing to write jq syntax.
The author records a screen capture of their Gmail account and uses Google Gemini to extract numeric values from the video.
A comprehensive guide that tests the structured output capabilities of Google Gemini, Anthropic Claude, and OpenAI GPT. It finds that OpenAI GPT-4o offers the most consistent structured outputs out of the box.
Weaviate introduces StructuredRAG, a benchmark to evaluate LLMs' ability to generate reliable JSON outputs. The study finds that while LLMs perform well on simpler tasks, they struggle with more complex outputs.
We introduce NuExtract, a lightweight text-to-JSON LLM. NuExtract extracts arbitrarily complex information from text and turns it into structured data.
This article explores NuExtract, a family of Small Language Models (SLMs) for extracting structured data from text. The author, Fabio Matricardi, discusses using NuExtract to process candidate CVs for a database and highlights its benefits for privacy protection and running on less powerful computers.
NuExtract is a fine-tuned version of phi-3-mini for information extraction. It requires a JSON template describing the information to extract and an input text. Tiny (0.5B) and large (7B) versions are also available.
NuExtract is a 3.8B parameter information extraction model fine-tuned from phi-3, designed to extract structured data from text using a JSON template.
Tutorial on enforcing JSON output with llama.cpp or the Gemini API for structured data generation from LLMs.
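
Several of the entries above share one idea: hand the model a JSON template, then check that the reply actually follows it. The snippet below is a minimal sketch of that check using only the Python standard library. The template shape (empty strings as leaves, a one-element list as the pattern for repeated items), the helper names `matches_template` and `parse_reply`, and the sample reply are all assumptions made for illustration; none of them come from NuExtract, llama.cpp, or the Gemini API.

```python
import json


def matches_template(template, output):
    """Return True if `output` has the same nested structure as `template`."""
    if isinstance(template, dict):
        # Both sides must be objects with exactly the same keys.
        if not isinstance(output, dict) or set(output) != set(template):
            return False
        return all(matches_template(template[k], output[k]) for k in template)
    if isinstance(template, list):
        if not isinstance(output, list):
            return False
        if not template:
            # An empty template list accepts any list.
            return True
        # The first template element is the pattern for every output item.
        return all(matches_template(template[0], item) for item in output)
    # Leaf: the template holds an empty string, so the output must be a string.
    return isinstance(output, str)


def parse_reply(raw, template):
    """Parse raw model text; return the data, or None if it is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return data if matches_template(template, data) else None


# Hypothetical template for extracting fields from a CV.
template = {
    "name": "",
    "skills": [],
    "experience": [{"company": "", "role": ""}],
}

# Stand-in for text returned by a model; no model is called here.
raw_reply = (
    '{"name": "Ada Lovelace", "skills": ["mathematics"], '
    '"experience": [{"company": "Analytical Engines", "role": "Programmer"}]}'
)

result = parse_reply(raw_reply, template)
```

A reply that is not valid JSON, or that drops or adds keys, comes back as `None`, so the caller can retry or fall back. This only validates after generation; the constrained-decoding approaches in the tutorial entry aim to prevent malformed output during generation instead.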