LLM EvalKit is a streamlined framework that helps developers design, test, and refine prompt‑engineering pipelines for Large Language Models (LLMs). It encompasses prompt management, dataset handling, evaluation, and automated optimization, all wrapped in a Streamlit web UI.
Key capabilities:
| Stage | What it does | Typical workflow |
|-------|-------------|------------------|
| **Prompt Management** | Create, edit, version, and test prompts (name, text, model, system instructions). | Define a prompt, load/edit existing ones, run quick generation tests, and maintain version history. |
| **Dataset Creation** | Organize data for evaluation; loads CSV, JSON, or JSONL files into GCS buckets. | Create dataset folders, upload files, preview items (see the dataset sketch after this table). |
| **Evaluation** | Run model‑based or human‑in‑the‑loop metrics; compare outcomes across prompt versions. | Choose prompt + dataset, generate responses, score with metrics like “question‑answering‑quality”, save baseline results to a leaderboard. |
| **Optimization** | Leverages Vertex AI’s prompt‑optimization job to automatically search for better prompts. | Configure the job (model, dataset, prompt), launch it, and monitor progress in the Vertex AI console. |
| **Results & Records** | Visualize optimization outcomes, compare versions, and maintain a record of performance over time. | View leaderboard, select best optimized prompt, paste new instructions, re‑evaluate, and track progress. |
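As a concrete illustration of the Dataset Creation stage, the sketch below writes a small JSONL dataset and uploads it into a GCS bucket folder. It is a minimal sketch under assumed names: the bucket name, the `datasets/` prefix, and the record fields are placeholders for illustration, not the framework's fixed layout; the UI performs the equivalent steps for you.

```python
# Hypothetical sketch: create a dataset "folder" in GCS and upload a JSONL file,
# mirroring what the Dataset Creation page does through the UI.
# BUCKET_NAME, DATASET_NAME, and the "datasets/" prefix are assumptions.
import json
from google.cloud import storage

BUCKET_NAME = "your-evalkit-bucket"   # same value as BUCKET_NAME in src/.env
DATASET_NAME = "math-vqa"             # hypothetical dataset folder name

records = [
    {
        "query": "What is 2 + 2?",
        "choices": ["3", "4", "5"],
        "image": "gs://your-evalkit-bucket/images/problem_001.jpg",
        "target": "4",
    },
]

# Write the records as JSONL locally, then push the file to the bucket.
local_path = f"{DATASET_NAME}.jsonl"
with open(local_path, "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

client = storage.Client()  # picks up the project from the environment / ADC
bucket = client.bucket(BUCKET_NAME)
blob = bucket.blob(f"datasets/{DATASET_NAME}/{local_path}")
blob.upload_from_filename(local_path)
print(f"Uploaded to gs://{BUCKET_NAME}/{blob.name}")
```

Once the file is in the bucket, the Dataset Creation page can preview the items and the Evaluation page can pair the dataset with a prompt version.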
**Getting Started**
1. Clone the repo, set up a virtual environment, and install the dependencies.
2. Configure `src/.env` with `BUCKET_NAME` and `PROJECT_ID` (a sketch of how these values might be consumed follows this list).
3. Launch the UI with `streamlit run index.py`.
4. Use the UI to create/edit prompts and datasets, and launch evaluations/optimizations as described in the tutorial steps.
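For step 2, the snippet below shows one way the two settings could be read at startup. It is a minimal sketch assuming a `python-dotenv`-style `.env` file and a `us-central1` location; the repo's actual loading code may differ.

```python
# Minimal sketch, assuming python-dotenv: load BUCKET_NAME and PROJECT_ID from
# src/.env and initialize the Google Cloud clients the app relies on.
import os

from dotenv import load_dotenv
from google.cloud import storage
import vertexai

load_dotenv("src/.env")  # expects lines like BUCKET_NAME=... and PROJECT_ID=...

project_id = os.environ["PROJECT_ID"]
bucket_name = os.environ["BUCKET_NAME"]

vertexai.init(project=project_id, location="us-central1")  # location is an assumption
bucket = storage.Client(project=project_id).bucket(bucket_name)
print(f"Using project {project_id} and bucket gs://{bucket_name}")
```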
**Token Use‑Case**
- **Prompt**: `Problem: {{query}}\nImage: {{image}} @@@image/jpeg\nAnswer: {{target}}` (see the sketch after this list)
- **Example input JSON**: query, choices, image URL, target answer.
- **Model**: `gemini-2.0-flash-001`.
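The sketch below shows what a single generation call for this use case might look like with the Vertex AI SDK: the text part is filled from one dataset record and the image is attached by URI. The project, location, GCS paths, and record values are placeholders, and the template rendering is simplified relative to whatever the app does internally.

```python
# Hypothetical sketch of one generation call for the Token use case with the
# Vertex AI SDK. Project, location, and record values are placeholder assumptions.
import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project="your-project-id", location="us-central1")

# One example dataset record: query, choices, image URL, target answer.
record = {
    "query": "Which shape has the largest area?",
    "choices": ["circle", "square", "triangle"],
    "image": "gs://your-evalkit-bucket/images/shapes_001.jpg",
    "target": "circle",
}

# Simplified rendering of the template: the text part carries the problem, and the
# "{{image}} @@@image/jpeg" placeholder becomes a multimodal image part by URI.
# {{target}} holds the ground-truth answer used for scoring, so it is not sent here.
prompt_text = f"Problem: {record['query']}\nAnswer:"

model = GenerativeModel("gemini-2.0-flash-001")
response = model.generate_content(
    [prompt_text, Part.from_uri(record["image"], mime_type="image/jpeg")]
)
print(response.text)
```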
**License** – Apache 2.0.