SemanticScuttle - klotz.me » Tags: evaluation

Tags: evaluation*


Arize Phoenix is an open-source observability library for AI experimentation, evaluation, and troubleshooting, built by Arize AI.

2025-02-08 Tags: arize phoenix, ai, observability, experiments, evaluation, troubleshooting, visualization, opentelemetry, openinference, production engineering, data engineering by klotz

From Prototype to Production: Enhancing LLM Accuracy

This article discusses methods to measure and improve the accuracy of Large Language Model (LLM) applications, focusing on building an SQL Agent where precision is crucial. It covers setting up the environment, creating a prototype, evaluating accuracy, and using techniques like self-reflection and retrieval-augmented generation (RAG) to enhance performance.

2024-12-20 Tags: llm, accuracy, evaluation, sql, agent, rag by klotz
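The article's own code is not reproduced here. As a minimal sketch of one common way to score an SQL agent, execution accuracy, the snippet below runs each generated query and its reference query against the same database and counts a hit when the returned rows match. It assumes SQLite through Python's standard `sqlite3` module; `execution_match` and `execution_accuracy` are hypothetical helper names, not taken from the article.

```python
import sqlite3
from collections import Counter


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """Return True if both queries return the same multiset of rows."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A generated query that fails to run counts as incorrect.
        return False
    gold = conn.execute(gold_sql).fetchall()
    # Compare as multisets so that row order does not matter.
    return Counter(predicted) == Counter(gold)


def execution_accuracy(conn: sqlite3.Connection, pairs: list[tuple[str, str]]) -> float:
    """Fraction of (predicted_sql, gold_sql) pairs whose results match."""
    if not pairs:
        return 0.0
    hits = sum(execution_match(conn, p, g) for p, g in pairs)
    return hits / len(pairs)


if __name__ == "__main__":
    # Hypothetical toy table for illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.5), (3, 4.5)])
    pairs = [
        ("SELECT SUM(amount) FROM orders", "SELECT SUM(amount) FROM orders"),
        ("SELECT id FROM orders WHERE amount > 5 ORDER BY id DESC",
         "SELECT id FROM orders WHERE amount > 5"),
        ("SELECT COUNT(*) FROM order", "SELECT COUNT(*) FROM orders"),  # broken table name
    ]
    print(execution_accuracy(conn, pairs))  # two of three pairs match
```

Comparing rows as a multiset ignores ordering, which is a deliberate choice: a reference query whose `ORDER BY` is part of the requirement would need an ordered comparison instead.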

We Ran Over Half a Million Evaluations on Quantized LLMs: Here's What We Found

This article discusses Neural Magic's extensive evaluation of quantized large language models (LLMs), which found that quantized LLMs stay competitive with their full-precision counterparts in both accuracy and efficiency.

Quantization Schemes: Three different quantization schemes were tested: W8A8-INT, W8A8-FP, and W4A16-INT, each optimized for different hardware and deployment scenarios.
Accuracy Recovery: The quantized models demonstrated high accuracy recovery, often reaching over 99%, across a range of benchmarks, including OpenLLM Leaderboard v1 and v2, Arena-Hard, and HumanEval.
Text Similarity: Text generated by quantized models was found to be highly similar to that generated by full-precision models, maintaining semantic and structural consistency.

2025-02-27 Tags: quantization, llm, evaluation, neural magic by klotz
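Accuracy recovery, as used in the summary above, can be read as the quantized model's benchmark score expressed as a percentage of the full-precision model's score. A minimal sketch of that arithmetic, assuming per-benchmark scores are already available; the function names and sample numbers are hypothetical, not results from the article.

```python
def accuracy_recovery(quantized_score: float, baseline_score: float) -> float:
    """Return the quantized score as a percentage of the full-precision score."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return 100.0 * quantized_score / baseline_score


def mean_recovery(results: dict[str, tuple[float, float]]) -> float:
    """Average recovery over {benchmark: (quantized_score, baseline_score)}."""
    if not results:
        raise ValueError("results must not be empty")
    recoveries = [accuracy_recovery(q, b) for q, b in results.values()]
    return sum(recoveries) / len(recoveries)


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    scores = {
        "benchmark_a": (0.792, 0.800),
        "benchmark_b": (0.650, 0.655),
    }
    for name, (q, b) in scores.items():
        print(f"{name}: {accuracy_recovery(q, b):.2f}% recovered")
    print(f"mean: {mean_recovery(scores):.2f}%")
```

Recovery can exceed 100% when a quantized model happens to score slightly higher on a benchmark, so an average across many benchmarks says more than any single value.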

Metrics to Evaluate a Classification Machine Learning Model

This article explores various metrics used to evaluate the performance of classification machine learning models, including precision, recall, F1-score, accuracy, and alert rate. It explains how these metrics are calculated and provides insights into their application in real-world scenarios, particularly in fraud detection.

2024-08-01 Tags: machine learning, classification, metrics, evaluation, precision, recall, f1-score, accuracy, alert rate, fraud detection, llm by klotz
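As a rough illustration of how these metrics relate, the sketch below computes them from the four confusion-matrix counts. It assumes binary labels and takes alert rate to be the share of all cases the model flags as positive, (TP + FP) / total; the article may define it slightly differently. The fraud counts in the example are made up.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # fraud correctly flagged
    fp: int  # legitimate cases flagged as fraud
    tn: int  # legitimate cases correctly passed
    fn: int  # fraud that was missed


def _ratio(numerator: float, denominator: float) -> float:
    # Return 0.0 instead of raising when a metric is undefined.
    return numerator / denominator if denominator else 0.0


def classification_metrics(c: ConfusionCounts) -> dict[str, float]:
    total = c.tp + c.fp + c.tn + c.fn
    precision = _ratio(c.tp, c.tp + c.fp)
    recall = _ratio(c.tp, c.tp + c.fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": _ratio(2 * precision * recall, precision + recall),
        "accuracy": _ratio(c.tp + c.tn, total),
        # Assumed definition: fraction of all cases that raise an alert.
        "alert_rate": _ratio(c.tp + c.fp, total),
    }


if __name__ == "__main__":
    # Hypothetical: 1,000 transactions, 60 of them fraudulent.
    counts = ConfusionCounts(tp=40, fp=10, tn=930, fn=20)
    for name, value in classification_metrics(counts).items():
        print(f"{name}: {value:.3f}")
```

With these made-up counts accuracy is 0.97 even though a third of the fraud goes undetected (recall about 0.667), which is why imbalanced problems such as fraud detection lean on precision, recall, and alert rate rather than accuracy alone.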

End-to-end LLM Workflows Guide

This guide demonstrates how to execute end-to-end LLM workflows for developing and productionizing LLMs at scale. It covers data preprocessing, fine-tuning, evaluation, and serving.

2024-06-21 Tags: llm, workflows, data preprocessing, fine-tuning, evaluation, serving, ray, anyscale by klotz

New Trends in LLM Architecture

Discusses trends in Large Language Model (LLM) architecture, including the push toward more GPUs, more weights, and more tokens; energy-efficient implementations; the role of LLM routers; and the need for better evaluation metrics, faster fine-tuning, and self-tuning.

2024-06-01 Tags: llm, machine learning, deep learning, transformers, self-tuning, evaluation by klotz

Langfuse - Open Source LLM Engineering Platform

Langfuse is an open-source LLM engineering platform that offers tracing, prompt management, evaluation, datasets, metrics, and a playground for debugging and improving LLM applications. It is backed by several well-known companies and has won multiple awards. Langfuse is built with security in mind, holding SOC 2 Type II and ISO 27001 certifications and complying with GDPR.

2024-05-23 Tags: langfuse, llm, prompt engineering, evaluation, datasets, metrics, observability by klotz

Evaluate anything you want | Creating advanced evaluators with LLMs

Discover how to build custom LLM evaluators for specific real-world needs.

2024-04-20 Tags: llm, evaluation by klotz
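One common pattern for a custom evaluator is "LLM as a judge": format a grading prompt for a chosen criterion, send it to a model, and parse a score out of the reply. The sketch below assumes no particular provider; `call_llm` is a hypothetical callable that takes a prompt string and returns the model's text, and the 1 to 5 scale and `SCORE:` reply format are illustrative choices, not the article's.

```python
import re
from typing import Callable, Optional

# Illustrative grading prompt; the criterion is filled in per evaluator.
JUDGE_PROMPT = (
    "You are grading an answer for {criterion}.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    'Reply with "SCORE: <1-5>" followed by a one-sentence reason.'
)
SCORE_PATTERN = re.compile(r"SCORE:\s*([1-5])")


def judge_answer(call_llm: Callable[[str], str], criterion: str,
                 question: str, answer: str) -> Optional[int]:
    """Ask the judge model for a 1-5 score; return None if no score is found."""
    prompt = JUDGE_PROMPT.format(criterion=criterion, question=question, answer=answer)
    match = SCORE_PATTERN.search(call_llm(prompt))
    return int(match.group(1)) if match else None


def evaluate(call_llm: Callable[[str], str], criterion: str,
             examples: list[tuple[str, str]]) -> dict:
    """Average judge scores over (question, answer) pairs, counting unparsable replies."""
    scores = [judge_answer(call_llm, criterion, q, a) for q, a in examples]
    valid = [s for s in scores if s is not None]
    return {
        "mean_score": sum(valid) / len(valid) if valid else None,
        "unparsed": len(scores) - len(valid),
    }


if __name__ == "__main__":
    def fake_llm(prompt: str) -> str:
        # Hypothetical stand-in for a real model call: rewards answers mentioning Paris.
        answer = prompt.split("Answer:")[1]
        return "SCORE: 5 correct" if "Paris" in answer else "SCORE: 1 incorrect"

    data = [("Capital of France?", "Paris"), ("Capital of France?", "Lyon")]
    print(evaluate(fake_llm, "factual correctness", data))
```

Returning `None` for unparsable replies, and reporting how many there were, keeps formatting failures from silently dragging the average down.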

Evaluating Classification Models: Understanding the Confusion Matrix and ROC Curves

Learn about the importance of evaluating classification models and how to use the confusion matrix and ROC curves to assess model performance. This post covers the basics of both methods, their components, calculations, and how to visualize the results using Python.

2024-04-08 Tags: machine learning, classification models, confusion matrix, roc curves, evaluation, true positives, true negatives, false positives, false negatives, accuracy, precision, recall, specificity, f1 score, balanced accuracy by klotz
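As a library-free sketch of the underlying calculations (not the post's code), the snippet below counts the confusion matrix, sweeps a decision threshold over the model's scores to get ROC points, and integrates them with the trapezoidal rule to get the AUC. It assumes binary 0/1 labels with both classes present; the helper names are hypothetical. scikit-learn provides `confusion_matrix`, `roc_curve`, and `roc_auc_score` for the same job.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def balanced_accuracy(tp, fp, tn, fn):
    """Mean of recall (true positive rate) and specificity (true negative rate)."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (recall + specificity) / 2


def roc_points(y_true, scores):
    """(FPR, TPR) pairs from sweeping the threshold over every distinct score."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("ROC needs both positive and negative examples")
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        y_pred = [1 if s >= threshold else 0 for s in scores]
        tp, fp, _, _ = confusion_counts(y_true, y_pred)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


if __name__ == "__main__":
    y_true = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    points = roc_points(y_true, scores)
    print(points)
    print(auc(points))  # 0.75
```

Sweeping over distinct scores means tied examples cross the threshold together, so ties are handled without special cases.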

how to compare a classification model to a baseline

A ready-to-run Python and scikit-learn tutorial on evaluating a classification model against a baseline model.

2024-02-22 Tags: classification, evaluation, baseline, machine learning, roc by klotz
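A model is only useful if it beats a trivial predictor. As a dependency-free sketch (not the tutorial's code), the snippet below compares a model's test predictions with a majority-class baseline on plain and balanced accuracy; in scikit-learn the baseline corresponds to `DummyClassifier(strategy="most_frequent")` and the second metric to `balanced_accuracy_score`. The labels and predictions in the example are hypothetical.

```python
from collections import Counter


def majority_baseline(y_train, n):
    """Predict the most frequent training label for all n test cases."""
    majority = Counter(y_train).most_common(1)[0][0]
    return [majority] * n


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Average per-class recall, so always predicting one class scores at most 1 / n_classes."""
    recalls = []
    for label in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in indices) / len(indices))
    return sum(recalls) / len(recalls)


def compare_to_baseline(y_train, y_test, model_pred):
    baseline_pred = majority_baseline(y_train, len(y_test))
    return {
        name: {"accuracy": accuracy(y_test, pred),
               "balanced_accuracy": balanced_accuracy(y_test, pred)}
        for name, pred in (("model", model_pred), ("baseline", baseline_pred))
    }


if __name__ == "__main__":
    # Hypothetical imbalanced labels: 80% class 0, 20% class 1.
    y_train = [0] * 8 + [1] * 2
    y_test = [0] * 8 + [1] * 2
    model_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
    print(compare_to_baseline(y_train, y_test, model_pred))
```

Here the model and the baseline tie on accuracy (0.8 each), but balanced accuracy (0.6875 vs 0.5) shows the model does catch some of the minority class, which is the kind of gap a baseline comparison is meant to expose.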
