This blog post shows how to build a reusable retrieval evaluation dataset by using an LLM to judge query-document pairs. The process has three steps: build a small labeled dataset, align the LLM's judgments with human preferences on that dataset, and then use the LLM to judge a large set of queries and documents.
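One way to check how well the LLM's judgments line up with human preferences on the small labeled dataset is to compute an agreement statistic before scaling up. The sketch below uses Cohen's kappa, which is an assumption made for illustration and not necessarily the measure the post uses. The function name `cohen_kappa`, the binary relevance labels, and the sample data are all hypothetical.

```python
from collections import Counter


def cohen_kappa(human_labels, llm_labels):
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(human_labels) != len(llm_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")

    n = len(human_labels)

    # Fraction of query-document pairs where the two judges agree.
    observed = sum(h == m for h, m in zip(human_labels, llm_labels)) / n

    # Agreement expected by chance, from each judge's label frequencies.
    human_counts = Counter(human_labels)
    llm_counts = Counter(llm_labels)
    expected = sum(human_counts[c] * llm_counts[c] for c in human_counts) / (n * n)

    # Both judges always emit the same single label: treat as perfect agreement.
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for eight query-document pairs (1 = relevant, 0 = not).
human = [1, 1, 0, 0, 1, 0, 1, 0]
llm = [1, 1, 0, 1, 1, 0, 0, 0]
kappa = cohen_kappa(human, llm)
print(f"kappa = {kappa:.2f}")
```

A low kappa on the labeled set would suggest revising the judging prompt before trusting the LLM's labels on the larger set; what counts as "high enough" is a judgment call that depends on the task.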