Evaluating LLM summaries at scale: What we tried and what worked

A practical account of building a hallucination detection pipeline for financial news summarisation, and what each metric actually caught.

The client, a financial news and trading platform serving over 100,000 daily users, had a tradable universe of 3 million assets with thousands of new reports arriving every day. Covering that volume with a team of human analysts was not feasible; LLMs made it possible. But in a regulated industry where a bad summary can inform a bad trade, we needed to know when to trust the output.

The research we reviewed suggested that factual errors appear in roughly 10-20% of LLM-generated summaries, and our own measurements on a production pipeline confirmed the same range: incorrect numbers, invented dates, reversed cause-and-effect, misattributed quotes.

We tried several evaluation metrics, each of which caught different failure modes, and ended up with a layered pipeline rather than a single score. The approach followed the same iterate-and-measure cycle used in prompt engineering more broadly: set a goal, write a prompt, evaluate, apply a technique, re-evaluate.

Why this is hard

A language model is not a retrieval system; it does not look up facts, it predicts the next token. When the context is ambiguous or two plausible figures compete, it picks whichever continuation fits the probability distribution it learned during training. The result reads fine, but the numbers might not be.

Fluency and correctness are orthogonal: a summary can be grammatically perfect and factually wrong. The goal of hallucination detection is to measure faithfulness: does every claim in the summary follow from the source? No single metric answers that, which is why we ran several in sequence.

Surface-level metrics: ROUGE, BLEU, BERTScore, EMD

When you are evaluating summaries across millions of assets, the first filter needs to be cheap. These four metrics are fast enough to run on every output.

ROUGE measures n-gram recall: how much of the reference appears in the candidate.

BLEU measures precision: how much of the candidate appears in the reference. Both are fast, interpretable, and already integrated into most evaluation tooling. The limitation is just as clear. "Revenue fell to €4M" and "Revenue grew to €4M" are one word apart. On this four-word pair alone ROUGE-1 is 0.75, and inside a full-length summary a single flipped verb barely moves any n-gram score, so ROUGE rates the two as nearly identical: the words overlap even when the meaning does not.
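To make the limitation concrete, here is a minimal sketch of unigram recall, the idea behind ROUGE-1. It assumes plain whitespace tokenisation with no stemming, so it is an illustration rather than the scoring of any particular ROUGE package; the function name ngram_recall is ours.

```python
from collections import Counter

def ngram_recall(reference: str, candidate: str, n: int = 1) -> float:
    """Fraction of reference n-grams that also appear in the candidate (clipped counts)."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Counter returns 0 for n-grams the candidate does not contain
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

source = "Revenue fell to €4M"
summary = "Revenue grew to €4M"
print(ngram_recall(source, summary, n=1))  # 0.75: three of the four words match
```

The verb that reverses the meaning costs a quarter of the score here, and proportionally far less in a summary of a few hundred words.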

BERTScore moves from token overlap to semantic similarity. It embeds each token using a pre-trained language model, computes pairwise cosine similarities, and takes a greedy-matched F1. This handles paraphrasing: "constructed" and "built" score well against each other. But two texts can still have high BERTScore while disagreeing on a specific number. "The bridge is 200m long" and "the bridge is 800m long" embed similarly. Semantic proximity is not fact verification.
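The greedy matching step can be sketched in a few lines. This is a simplified illustration, assuming the token embeddings are already computed; the two-dimensional vectors below are made up, whereas real BERTScore uses contextual embeddings from a pre-trained model and optional importance weighting.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def greedy_f1(ref_vecs, cand_vecs):
    """Match every token to its most similar token on the other side, then combine as F1."""
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy embeddings (hypothetical): the second candidate token is close to, but not
# identical with, the second reference token.
reference_tokens = [[1.0, 0.0], [0.0, 1.0]]
candidate_tokens = [[1.0, 0.0], [0.6, 0.8]]
print(greedy_f1(reference_tokens, candidate_tokens))  # approximately 0.9
```

Nothing in this computation looks at what a token denotes, which is why "200m" and "800m" can match each other almost perfectly.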

For a more granular signal, we computed dense embeddings at sentence and paragraph level and compared them using Earth Mover's Distance (EMD). EMD treats each set of chunk embeddings as a distribution and measures the minimum cost of rearranging one into the other. A faithful summary produces embeddings that cluster near their source counterparts: low transport cost. A hallucinated sentence lands far away, contributing a large spike even if the rest of the summary is fine. We ran EMD at two chunk sizes (sentence-level and paragraph-level, with 50% sliding-window overlap) and took the maximum score. The two granularities capture different things: a sentence embedding represents a specific claim ("revenue fell 8%"), while a paragraph embedding represents the broader argument ("the company underperformed"). A hallucinated detail can sit comfortably inside a faithful paragraph, or a faithful sentence can appear inside a paragraph whose overall meaning has drifted. Checking both and taking the worst score caught more failures than either alone.
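A rough sketch of the two-granularity check follows. It assumes the text is already split into sentences and that a distance callable (hypothetical here) embeds two lists of chunks and returns their EMD; the paragraph size of four sentences is also an assumption, as is applying the 50% overlap to the paragraph windows.

```python
from typing import Callable, List, Sequence

def sliding_chunks(sentences: Sequence[str], size: int) -> List[str]:
    """Join `size` consecutive sentences per chunk, stepping by half a window (50% overlap)."""
    step = max(1, size // 2)
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + size]))
        if start + size >= len(sentences):
            break  # the last window already reaches the end of the text
    return chunks

def worst_case_distance(
    source_sents: Sequence[str],
    summary_sents: Sequence[str],
    distance: Callable[[List[str], List[str]], float],
    paragraph_size: int = 4,
) -> float:
    """Score at sentence and paragraph granularity and keep the larger (worse) distance."""
    sentence_level = distance(list(source_sents), list(summary_sents))
    paragraph_level = distance(
        sliding_chunks(source_sents, paragraph_size),
        sliding_chunks(summary_sents, paragraph_size),
    )
    return max(sentence_level, paragraph_level)

print(sliding_chunks(["a", "b", "c", "d", "e", "f"], size=4))
# ['a b c d', 'c d e f']
```

Taking the maximum means a summary only passes if it looks faithful at both granularities.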

These metrics stayed in the pipeline as cheap first-pass filters. A very low ROUGE-1 score (below ~0.3) was worth investigating, but a high score told us almost nothing about factual accuracy.

NER cross-referencing

All of the metrics above treat tokens equally. In financial summarisation, not all tokens matter equally: a wrong company name or a wrong number can each lead to a bad trade.

We used spaCy to extract named entities from both the source and the summary, then checked whether every entity in the summary had an antecedent in the source:

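import spacy

# Label names depend on the model. The English OntoNotes models tag PERSON, ORG, GPE,
# CARDINAL, MONEY, PERCENT and DATE. The German de_core_news_* models only tag
# PER, ORG, LOC and MISC, so the number and date sets would always be empty there.
# This version assumes English-language sources.
nlp = spacy.load("en_core_web_lg")

def extract_entities(text):
    """Group the named entities found in `text` by the kind of fact they carry."""
    doc = nlp(text)
    return {
        "names": {ent.text for ent in doc.ents if ent.label_ in ("PERSON", "ORG", "GPE")},
        "numbers": {ent.text for ent in doc.ents if ent.label_ in ("CARDINAL", "MONEY", "PERCENT")},
        "dates": {ent.text for ent in doc.ents if ent.label_ == "DATE"},
    }

# source_text is the original article, summary_text the generated summary
source_entities = extract_entities(source_text)
summary_entities = extract_entities(summary_text)

# flag any number in the summary that does not appear verbatim in the source
hallucinated_numbers = summary_entities["numbers"] - source_entities["numbers"]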

A simple rule (flag any number in the summary that does not appear verbatim in the source) caught a high fraction of numeric errors. LLMs occasionally fabricate statistics outright, but the more common failures are subtler: rounding a figure, converting units, computing a percentage the source never stated, or rephrasing a sentence in a way that changes the number. Not all of these are errors. If the source says "EUR 4.2 million" and the summary says "EUR 4.2M", the mismatch is cosmetic. But we could not distinguish cosmetic mismatches from real ones automatically, so we needed to keep the false-positive rate low enough that flagged items were worth reviewing.

We added an explicit prompt instruction to reduce the problem at the source:

Do not perform arithmetic on the source data. If the source states that revenue grew from €4M to €6M, write exactly that. Do not write "revenue grew by 50%". Even if the arithmetic is correct, the derived figure is one more thing that can go wrong.

This is an example of a negative output-quality constraint: a rule that tells the model what not to do, targeting a specific failure mode observed in earlier outputs. Negative constraints like this tend to be effective when they address a concrete, recurring problem rather than a vague concern.

One edge case that cost almost nothing to guard against: the model outputting in the wrong language. In our pipeline this happened roughly once in every 5,000 summaries, which is rare but enough to reach a user. A langdetect check on every output caught it.
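A minimal sketch of such a check, assuming the langdetect package and a known expected language code. The function name and the injectable detect parameter are ours, added so the logic can be exercised without the package installed.

```python
from typing import Callable, Optional

def is_wrong_language(
    summary: str,
    expected: str,
    detect: Optional[Callable[[str], str]] = None,
) -> bool:
    """Return True when the detected language code differs from the expected one."""
    if detect is None:
        # Third-party dependency, imported lazily; langdetect.detect returns
        # an ISO 639-1 code such as "en" or "de".
        import langdetect
        detect = langdetect.detect
    return detect(summary) != expected

# Usage with a stub detector standing in for langdetect
print(is_wrong_language("Der Umsatz stieg um 8%.", "en", detect=lambda text: "de"))  # True
```

Note that langdetect raises an exception on text with no detectable features (for example an empty string), so a production version would need to decide how to treat that case.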

Information density

The NER extraction checks which entities appear in the summary, but not whether the summary is doing useful work with them. A summary that mentions the right names and numbers can still be padded with filler.

Information density is facts per word, where a fact is any noun, number, or known entity, the same items spaCy already tags. Compute it on both the source and the summary. A summary compresses text (fewer words, same facts), so if it is doing its job, density goes up.

In our dataset, source articles averaged around 0.3 facts per word, and good summaries came in around 0.4. The summary was shorter but carried the same factual content, so each word was doing more work.

When the summary is less dense than the source, something has gone wrong. The source article is already prose with its own connective tissue and filler. If the summary has more padding per fact than that, it’s diluting. We saw two patterns:

Padding. The model generates fluent but content-free sentences. Word count grows, fact count stays flat.
Vague generalisation. The model replaces specific claims with hand-wavy language. "Operating costs rose 8% year-on-year, driven by increased headcount" becomes "costs increased", which covers the same topic but carries fewer facts.
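The comparison itself is simple arithmetic. The sketch below assumes fact and word counts have already been produced by the tagging step; the function names and the example counts are illustrative, chosen only to match the averages quoted above.

```python
def density(fact_count: int, word_count: int) -> float:
    """Facts per word; an empty text has density 0."""
    return fact_count / word_count if word_count else 0.0

def is_diluting(source_facts: int, source_words: int,
                summary_facts: int, summary_words: int) -> bool:
    """A summary dilutes when it carries fewer facts per word than its source."""
    return density(summary_facts, summary_words) < density(source_facts, source_words)

# Hypothetical article: 300 facts in 1,000 words (0.3 facts per word)
print(is_diluting(300, 1000, 80, 200))  # False: 0.4 facts per word, compressing
print(is_diluting(300, 1000, 50, 250))  # True: 0.2 facts per word, diluting
```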

The density comparison also surfaced a product question we had not anticipated. Some sources on the platform were ticker-style feeds, close to raw data: "TSLA 1% up new tesla model / google 2% up due to layoffs". These hit around 0.9 facts per word, meaning almost every word is a fact, and any LLM summary of this content necessarily reduces density, because the model adds articles, connective words, and sentence structure. Should high-density sources be passed through as-is, summarised despite the inevitable density loss, or expanded into more readable prose? The density score could have served as a pre-summarisation gate, but the product team's priority was readability, so everything went through the summariser regardless.

Post-summarisation, the source-vs-summary density comparison still earned its place: it flagged summaries that were diluting rather than compressing.

QAEval

This was the metric that gave us the most actionable feedback. Where BERTScore and ROUGE produce a scalar, QAEval tells you which specific claims in a summary are unsupported.

The approach is based on QAEval (Deutsch et al., 2021), which uses noun-phrase extraction and a fine-tuned BART model (Lewis et al., 2020) for question generation and answering. We replaced these with LLM prompting throughout and adapted the method to measure faithfulness rather than completeness: questions are generated from the source document, not from a reference summary. The result turns faithfulness into a comprehension test. Generate questions from the source, then check whether a model can answer them using only the summary.

Generating questions

We prompted an LLM to generate single-choice questions targeting the most fact-rich parts of the source:

Given the following financial article, generate 10 single-choice
comprehension questions. Each question should:

- Target a specific, verifiable claim (a number, date, name, or causal relationship)
- Have exactly 4 options: one correct answer, two plausible distractors, and "I don't know"
- Randomise the position of the correct answer across questions

Do not generate questions about the overall topic or theme. Focus
on claims that could be verified or falsified.

A generated question might look like:

Q: What was the year-on-year change in operating costs reported
   by the company?

A) Decreased by 12%
B) Increased by 8%
C) Remained unchanged
D) I don't know

Correct: B
Source sentence: "Operating costs rose 8% year-on-year, driven
primarily by increased headcount."

Running the test

The whole flow reduces to four steps:

summary = summarise(llmModel, referenceTexts)

(questionnaire, solution) = createQuestionnaire(llmModel, referenceTexts)

answers = takeTest(llmModel, questionnaire, summary)

sectionsAndTheirScores = calculateScore(answers, solution)

A correct answer with a cited sentence means that sentence is likely faithful. A correct answer without a plausible citation means the model may be relying on world knowledge rather than the summary. A wrong answer means the summary may be missing or contradicting the relevant fact. An "I don't know" response means the summary does not cover that fact.

The quality of QAEval depends on the quality of the question-generation prompt. The same principles that improve any LLM output (clear instructions, specific constraints, worked examples) apply here too: the evaluation prompt itself needs engineering.

The single-choice format with four options means scoring is deterministic, with no need for another LLM to judge whether a free-form answer is "close enough". The "I don't know" option is equally important, because without it the model attempts an answer using background knowledge, and a correct guess masks a gap in the summary.
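Because the answers are option letters, the scoring step needs no model at all. The following is a sketch of what a calculateScore step could look like, not the production implementation. It assumes answers and solutions are dictionaries keyed by question id, that the "I don't know" option is always labelled D as in the example question, and that an unanswered question counts as "I don't know".

```python
from collections import Counter
from typing import Dict

IDK = "D"  # assumption: "I don't know" is always option D

def calculate_score(answers: Dict[str, str], solution: Dict[str, str]) -> Dict[str, object]:
    """Classify each question as correct, wrong or not_covered and compute accuracy."""
    outcomes = {}
    for question_id, correct_option in solution.items():
        given = answers.get(question_id, IDK)  # missing answer treated as "I don't know"
        if given == IDK:
            outcomes[question_id] = "not_covered"
        elif given == correct_option:
            outcomes[question_id] = "correct"
        else:
            outcomes[question_id] = "wrong"
    counts = Counter(outcomes.values())
    accuracy = counts["correct"] / len(solution) if solution else 0.0
    return {"outcomes": outcomes, "counts": dict(counts), "accuracy": accuracy}

solution = {"q1": "B", "q2": "A", "q3": "C", "q4": "A"}
answers = {"q1": "B", "q2": "D", "q3": "A", "q4": "A"}
result = calculate_score(answers, solution)
print(result["outcomes"])
# {'q1': 'correct', 'q2': 'not_covered', 'q3': 'wrong', 'q4': 'correct'}
print(result["accuracy"])  # 0.5
```

Keeping "wrong" and "not_covered" separate matters: the first points at a possible contradiction in the summary, the second at a gap in coverage.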

QAEval worked both as a development tool and as a production quality gate. Inspecting which questions failed, and which summary sentences were implicated, was the fastest way to diagnose prompt problems during development. In production, across a sample of around 1,000 summaries with roughly 16 bullet points each, QAEval filtered out 1-2 incorrect statements per summary that the cheaper metrics had missed. The false-positive rate was high enough that flagged items still needed review, but low enough that reviewing them was worth the time.

LLM-as-judge

We tried prompting a model to rate faithfulness on a numerical scale or produce a binary pass/fail verdict. The problem is circularity, because there is no independent way to verify whether the judge itself is correct. Larger models judge more reliably than smaller ones, but larger models also produce better summaries to begin with. At some point you are paying for a better summariser, not a faithfulness check. Asking the judge for reasoning alongside its score improved consistency but did not change the problem.

For style evaluation ("does this read well?"), LLM-as-judge is reasonable. For factual correctness, where the model needs to do the thing you are trying to verify, we were not convinced. It stayed in the pipeline as a final check on outputs that had already passed the other gates.

The pipeline

The metrics did not replace each other; they were layered, ordered by cost per call:

Language detection: near-zero cost, catches wrong-language output
ROUGE/BLEU: fast, flags severe surface-level deviation
NER cross-referencing: targets numbers and entities specifically
BERTScore + EMD: semantic checks at multiple granularities
QAEval: run only on summaries that passed the cheaper gates

The first four gates are programmatic: deterministic checks in code, fast and cheap to run. QAEval is model-based, requiring LLM inference, which makes it flexible but expensive. Running it on every summary across 3 million assets would have been prohibitively expensive, but running it only on summaries that cleared the cheaper filters made it tractable. The cheaper metrics did not need to be precise; they just needed to eliminate the obvious failures before the expensive check ran.
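The gating logic can be sketched as a short-circuiting loop. The gate names and the stub checks below are hypothetical placeholders for the real metrics; the point is only that an expensive gate never runs on a summary a cheaper gate has already rejected.

```python
from typing import Callable, List, Optional, Tuple

# A gate is a name plus a check that returns True when the summary passes.
Gate = Tuple[str, Callable[[str, str], bool]]

def run_gates(source: str, summary: str, gates: List[Gate]) -> Optional[str]:
    """Run gates in order (cheapest first); return the first failing gate's name, or None."""
    for name, passes in gates:
        if not passes(source, summary):
            return name
    return None

calls = []

def cheap_check(source: str, summary: str) -> bool:
    calls.append("cheap")
    return len(summary) > 0  # stand-in for language detection, ROUGE, NER, ...

def expensive_check(source: str, summary: str) -> bool:
    calls.append("expensive")
    return True  # stand-in for QAEval

gates = [("cheap", cheap_check), ("expensive", expensive_check)]
print(run_gates("some article", "", gates))  # cheap
print(calls)  # ['cheap']: the expensive gate never ran
```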

Conclusion

LLM fact-checking remains an open problem. Both our results and recent research agree: there is no single reliable metric, and even strong approaches only capture parts of the failure space. Current methods still struggle with subtle inconsistencies, domain shifts, and even basic evaluation, largely because existing metrics measure similarity rather than true factual consistency. A layered approach is still the only viable strategy.

The reality is that the right approach depends heavily on the use case, and this is where we can come in: helping our customers navigate the plethora of tools and techniques and identify the best-fit solution.

In this case, we did not require completeness and could prune aggressively: it was often preferable to drop a summary than risk passing a subtly incorrect one. This "better safe than sorry" bias turns imperfect signals into a useful system; cheap filters remove obvious failures, and more expensive checks focus on the remainder.

The pipeline makes errors rarer and easier to detect.

If you'd like to discuss your use case or explore where AI fits into your product, reach out.
