Hughes Hallucination Evaluation Model (HHEM) Leaderboard: Artificial Intelligence Trends

I stumbled upon the Hughes Hallucination Evaluation Model (HHEM) Leaderboard, which tracks the hallucination rates of various LLMs.

The Hughes Hallucination Evaluation Model (HHEM) Leaderboard is a public Large Language Model (LLM) leaderboard computed using Vectara’s Hallucination Evaluation Model. It evaluates how often an LLM introduces hallucinations when summarizing a document.

How does it work?

Using Vectara’s HHEM, they measure the occurrence of hallucinations in generated summaries. Given a source document and a summary generated by an LLM, HHEM outputs a hallucination score between 0 and 1, with 0 indicating complete hallucination and 1 representing perfect factual consistency. The model card for HHEM can be found here (P.S.: don’t ask me to explain it 😉).
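As a rough illustration (and not Vectara’s own evaluation code), here is a minimal sketch of scoring a single document/summary pair with the open-source HHEM model, assuming it loads as a cross-encoder through the sentence-transformers library as its public model card described at the time; the example texts are made up.

```python
# Minimal sketch: scoring one (source document, summary) pair with HHEM.
# Assumes the model loads as a cross-encoder via sentence-transformers,
# per the public model card at the time of writing.
from sentence_transformers import CrossEncoder

model = CrossEncoder("vectara/hallucination_evaluation_model")

source = "The city-owned plant was shut down in 2019 after a fire damaged its main turbine."
summary = "The plant, owned by the city, closed in 2019 following a fire."

# predict() returns one score per pair: values near 0 indicate hallucination,
# values near 1 indicate the summary is fully supported by the source.
score = model.predict([[source, summary]])[0]
print(f"Factual consistency score: {score:.3f}")
```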

Their evaluation dataset consists of 1,006 documents from multiple public datasets, primarily the CNN/Daily Mail Corpus. They generate a summary for each of these documents using the submitted LLMs and compute a hallucination score for each document/summary pair. Vectara has also provided information on their prior research, methodology, and the prompt used here (which is below the latest iteration of the leaderboard), as well as API integration details and an FAQ.
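To make the roll-up concrete, here is a hypothetical sketch of how per-summary scores could be aggregated into leaderboard-style numbers. The 0.5 cutoff, the helper names, and the example scores are illustrative assumptions on my part, not Vectara’s published code.

```python
# Hypothetical aggregation of per-summary HHEM scores into leaderboard-style metrics.
# The 0.5 threshold, function names, and sample scores are illustrative assumptions.
from typing import List

def hallucination_rate(scores: List[float], threshold: float = 0.5) -> float:
    """Fraction of summaries whose factual-consistency score falls below the threshold."""
    return sum(1 for s in scores if s < threshold) / len(scores)

def answer_rate(num_summarized: int, num_documents: int = 1006) -> float:
    """Fraction of the evaluation documents the LLM actually agreed to summarize."""
    return num_summarized / num_documents

# Made-up scores for one model, just to show the calculation:
scores = [0.92, 0.88, 0.41, 0.97, 0.73]
print(f"Hallucination rate: {hallucination_rate(scores):.1%}")
print(f"Answer rate: {answer_rate(len(scores)):.2%}")
```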

The prompt used (according to the link above) is:

You are a chat bot answering questions using data. You must stick to the answers provided solely by the text in the passage provided. You are asked the question ‘Provide a concise summary of the following passage, covering the core pieces of information described.’ <PASSAGE>

When calling the API, the <PASSAGE> token was then replaced with the source document (see the ‘source’ column in leaderboard-summaries.csv).
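For illustration, here is roughly what that substitution could look like in code. The prompt wording comes from the documentation quoted above; build_prompt is a hypothetical helper, and the actual LLM call is left out.

```python
# Illustrative sketch of filling the leaderboard prompt template for one document.
# The prompt wording is from the documentation quoted above; build_prompt is a
# hypothetical helper name, and the LLM call itself is omitted.
PROMPT_TEMPLATE = (
    "You are a chat bot answering questions using data. You must stick to the "
    "answers provided solely by the text in the passage provided. You are asked "
    "the question 'Provide a concise summary of the following passage, covering "
    "the core pieces of information described.' <PASSAGE>"
)

def build_prompt(source_document: str) -> str:
    # Replace the <PASSAGE> token with the source document text
    # (the 'source' column in leaderboard-summaries.csv).
    return PROMPT_TEMPLATE.replace("<PASSAGE>", source_document)

prompt = build_prompt("Example source document text goes here.")
# The filled prompt would then be sent to the LLM under evaluation,
# and the returned summary scored with HHEM as shown earlier.
print(prompt)
```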

The Hughes Hallucination Evaluation Model (HHEM) Leaderboard is available here in a static version on GitHub, with the full version here on Hugging Face.

The current leader as of yesterday? GPT-4 and GPT-4 Turbo, which both hallucinate about 3 percent of the time. The leaderboard also notes this: “While the above figures show it to be comparable to GPT4, this is due to us filtering out some documents that some of the models refuse to summarize. When comparing to GPT 4 on all summaries (both GPT4 models summarize all documents) the turbo model is around 0.3% worse than GPT4, but still better than GPT 3.5 Turbo.”

There you have it! GPT-4 hallucinates the least (for now), based on their model which (hopefully) hasn’t hallucinated the results! 😀

So, what do you think? How concerned are you about AI hallucinations? Please share any comments you might have, or let me know if you’d like to hear more about a particular topic.

Image created using GPT-4’s Image Creator Powered by DALL-E, using the term “robot experiencing a visual hallucination”.

Disclaimer: The views represented herein are exclusively the views of the authors and speakers themselves, and do not necessarily represent the views held by my employer, my partners or my clients. eDiscovery Today is made available solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Today should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

