This article discusses how to apply vision language models (VLMs) to document understanding, covering application areas such as agentic use cases, question answering, classification, and information extraction, as well as limitations such as cost and the handling of long documents.
The article proposes a new framework, LongRAG, that aims to improve the performance of Retrieval-Augmented Generation (RAG) by using a long retriever and a long reader. LongRAG groups Wikipedia passages into larger units of roughly 4K tokens each, reducing the corpus from 22M units to 600K and thus lightening the retriever's burden. The top-k retrieved units (≈30K tokens in total) are then fed to a long-context language model for zero-shot answer extraction. LongRAG achieves an exact-match (EM) score of 62.7% on NQ and 64.3% on HotpotQA (full-wiki), on par with state-of-the-art models.
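The grouping step can be illustrated with a minimal sketch. This is not the authors' implementation: the function name is hypothetical, and token counts are approximated by whitespace word count, whereas a real pipeline would use the retriever's tokenizer.

```python
# Hypothetical sketch of LongRAG-style corpus grouping: greedily pack
# adjacent passages into larger retrieval units capped at ~4K tokens.
# Token counts are approximated by whitespace word count here.

def group_passages(passages, max_tokens=4096):
    """Greedily pack passages into units of at most max_tokens tokens."""
    units, current, current_len = [], [], 0
    for text in passages:
        n = len(text.split())  # crude token estimate
        if current and current_len + n > max_tokens:
            units.append(" ".join(current))  # flush the full unit
            current, current_len = [], 0
        current.append(text)
        current_len += n
    if current:
        units.append(" ".join(current))
    return units

# Three passages (~2000, ~2000, and 2 "tokens") fit in one ~4K-token unit.
corpus = ["passage one " * 1000, "passage two " * 1000, "short tail"]
print(len(group_passages(corpus)))  # → 1
```

Packing many short passages into one unit is what shrinks the corpus from 22M to 600K units: the retriever scores far fewer candidates, and each hit hands the reader a larger, more self-contained context.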