LLM Evaluation

Evaluating Multi-Step LLM-Generated Content: Why Customer Journeys Require Structural Metrics

generate customer journeys that appear smooth and engaging, but evaluating whether these journeys are structurally sound remains difficult for current methods. This article introduces Continuity, Deepening, and Progression (CDP): three deterministic, content-structure-based metrics for evaluating...

When Does Adding Fancy RAG Features Work?

an article about overengineering a RAG system, adding fancy features like query optimization, detailed chunking with neighbors and keys, along with expanding the context. The argument against this type of work is that for a...

Measuring What Matters with NeMo Agent Toolkit

a decade working in analytics, I firmly believe that observability and evaluation are essential for any LLM application running in production. Monitoring and metrics aren’t just nice-to-haves. They ensure your product is functioning...

How to Do Evals on a Bloated RAG Pipeline

to Building an Overengineered Retrieval System. That one was about building the whole system. This one is about doing the evals for it. In the previous article, I went through different parts of a RAG...

Why AI Alignment Starts With Better Evaluation

at IBM TechXchange, I spent a lot of time around teams who were already running LLM systems in production. One conversation that stayed with me came from LangSmith, the folks who build tooling...

LLM-as-a-Judge: What It Is, Why It Works, and How to Use It to Evaluate AI Models

about the idea of using AI to judge AI, also known as “LLM-as-a-Judge,” my response was: We live in a world where even toilet paper is marketed as “AI-powered.” I thought this was just...

How to Evaluate Retrieval Quality in RAG Pipelines (Part 3): DCG@k and NDCG@k

In the previous parts of my post series on retrieval evaluation measures for RAG pipelines, we took a detailed look at the binary retrieval evaluation metrics. More specifically, in Part 1, we went...
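For quick reference, here is a minimal sketch of the two measures named in the title, using the standard textbook formulation rather than code from the post: DCG@k sums graded relevance scores discounted by the log of the rank, and NDCG@k divides that by the DCG of the ideal (descending) ordering.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted Cumulative Gain over the top-k results.

    `relevances` are graded relevance scores in ranked order,
    e.g. [3, 2, 0, 1] for the top-4 retrieved chunks.
    """
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG@k normalized by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Example: a retriever that ranks a mildly relevant chunk first
print(round(ndcg_at_k([1, 3, 2, 0], k=4), 3))  # 0.817, vs. 1.0 for the ideal ordering [3, 2, 1, 0]
```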

Notes on LLM Evaluation

one could argue that the majority of the work resembles traditional software development more than ML or Data Science, considering we often use off-the-shelf foundation models instead of training them ourselves...
