AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or keep error severity bounded.
Key contributions:
> 1. A formal taxonomy and metric suite: We translate qualitative safety-critical principles into computable metrics, enabling evaluation of agent reliability independently of task success (see the sketch after this list).
> 2. A comprehensive reliability profile of modern agents: A detailed mapping of where state-of-the-art agentic models succeed and fail, isolating consistency and predictability as the dimensions requiring the most immediate research focus.
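To make "computable reliability metrics" concrete, here is a minimal Python sketch of two such metrics, run-to-run consistency and worst-case error severity. These definitions are illustrative assumptions for this digest, not the paper's actual formulations.

```python
from collections import Counter
from typing import Sequence

def consistency(outcomes: Sequence[bool]) -> float:
    """Fraction of repeated runs that agree with the majority outcome.

    1.0 means the agent behaves identically across runs; values near
    0.5 mean success on this task is essentially a coin flip.
    (Illustrative definition, not the paper's.)
    """
    counts = Counter(outcomes)
    return counts.most_common(1)[0][1] / len(outcomes)

def worst_case_severity(severities: Sequence[float]) -> float:
    """Maximum observed error severity across runs (0.0 = harmless).

    An agent with bounded error severity keeps this below a fixed
    threshold even on runs where it fails.
    """
    return max(severities, default=0.0)

# Ten repeated runs of one task: 7 successes and 3 failures, with a
# hypothetical 0-1 severity score attached to each run.
outcomes = [True] * 7 + [False] * 3
severities = [0.0] * 7 + [0.2, 0.9, 0.4]

print(f"consistency: {consistency(outcomes):.2f}")                    # 0.70
print(f"worst-case severity: {worst_case_severity(severities):.2f}")  # 0.90
```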
A frozen-in-time version of our Paper Finder agent for reproducing evaluation results. This repo contains the code for the standalone Paper Finder agent, our paper-seeking agent intended to assist in locating sets of papers according to content-based and metadata criteria.
MCP-Universe is a comprehensive benchmark designed to evaluate LLMs on realistic tasks through interaction with real-world MCP servers, spanning 6 core domains and 231 tasks. It highlights the challenges of long-context reasoning, unfamiliar tool spaces, and cross-domain variation in LLM performance.
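For a sense of what interacting with an MCP server involves, here is a minimal client sketch using the official `mcp` Python SDK's stdio transport. The server command, tool name, and arguments are hypothetical placeholders, not part of the MCP-Universe benchmark itself.

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Hypothetical local MCP server launched over stdio; benchmark tasks
# would instead point at the domain-specific servers under test.
server_params = StdioServerParameters(command="python", args=["example_server.py"])

async def main() -> None:
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Discover the (possibly unfamiliar) tool space the server exposes.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])
            # Invoke one tool; the name and arguments are placeholders.
            result = await session.call_tool("search", arguments={"query": "example"})
            print(result.content)

asyncio.run(main())
```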
This article discusses methods to measure and improve the accuracy of Large Language Model (LLM) applications, focusing on building an SQL Agent where precision is crucial. It covers setting up the environment, creating a prototype, evaluating accuracy, and using techniques like self-reflection and retrieval-augmented generation (RAG) to enhance performance.
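The self-reflection technique mentioned above amounts to a retry loop that feeds execution errors back into the model's prompt. A minimal sketch, assuming a hypothetical `generate_sql` function standing in for the actual LLM call:

```python
import sqlite3

def generate_sql(question: str, feedback: str | None = None) -> str:
    """Placeholder for the LLM call; a real agent would prompt the model
    with the schema, the question, and any feedback from prior attempts."""
    raise NotImplementedError

def run_with_self_reflection(conn: sqlite3.Connection, question: str,
                             max_attempts: int = 3):
    """Execute generated SQL, feeding execution errors back to the model
    so it can correct itself on the next attempt."""
    feedback = None
    for _ in range(max_attempts):
        query = generate_sql(question, feedback)
        try:
            return conn.execute(query).fetchall()
        except sqlite3.Error as exc:
            # Self-reflection: the error message becomes context for the retry.
            feedback = f"Query failed with: {exc}. Query was: {query}"
    raise RuntimeError(f"No valid query after {max_attempts} attempts")
```

RAG would slot into `generate_sql`, retrieving relevant schema snippets or example queries to ground the model before each attempt.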