Galileo Figaro, magnifico! 🎵 Which AI agent is the best? A new “Agent Leaderboard” from Galileo AI evaluates several models and ranks them!
Last week, Galileo launched a new Agent Leaderboard on Hugging Face, an open-source AI platform where users can build, train, access, and deploy AI models. The leaderboard is meant to help people learn how AI agents perform in real-world business applications and help teams determine which agent best fits their needs.
As discussed in this article on ZDNet, you can find information about each model’s performance on the leaderboard, including its rank and score. At a glance, you can also see more basic information about the model, including vendor, cost, and whether it’s open source or private.
The leaderboard currently features “the 17 leading LLMs,” including models from Google, OpenAI, Mistral, Anthropic, and Meta. It is updated monthly to keep up with ongoing releases, which have been occurring frequently. So, where’s DeepSeek? In a note, the leaderboard says: “DeepSeek V3 and R1 were excluded from rankings due to limited function support.”
To determine the results, Galileo uses benchmarking datasets, including BFCL (Berkeley Function Calling Leaderboard), τ-bench (Tau benchmark), xLAM, and ToolACE, which test different agent capabilities. The leaderboard then turns this data into an evaluation framework that covers real-world use cases.
The models are ranked by Galileo’s Tool Selection Quality (TSQ) metric, which was developed (per Galileo’s blog post introducing the leaderboard) “to assess agents’ tool call performance, evaluating tool selection accuracy and effectiveness of parameter usage. This framework determines whether agents appropriately utilize tools for tasks while also identifying situations where tool usage is unnecessary.”
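If you’re curious what that kind of evaluation might look like in practice, here’s a minimal sketch (my own illustration in Python, not Galileo’s actual TSQ code) of how a tool-selection check could compare an agent’s tool call against an expected one:

```python
# Illustrative sketch only -- NOT Galileo's actual TSQ implementation.
# It compares an agent's tool call against an expected ("gold") call,
# crediting correct tool selection and correct parameter usage.

def score_tool_call(expected: dict | None, actual: dict | None) -> float:
    """Return a score in [0, 1] for a single tool-calling turn."""
    if expected is None:
        # The correct behavior was NOT to call a tool at all.
        return 1.0 if actual is None else 0.0
    if actual is None or actual.get("tool") != expected["tool"]:
        # Wrong tool (or no call at all) earns no credit.
        return 0.0
    # Partial credit for parameters: fraction of expected args matched exactly.
    exp_args, act_args = expected.get("args", {}), actual.get("args", {})
    if not exp_args:
        return 1.0
    matched = sum(1 for k, v in exp_args.items() if act_args.get(k) == v)
    return 0.5 + 0.5 * (matched / len(exp_args))  # 0.5 for tool, 0.5 for args

# Example: the agent picked the right tool but got one of two parameters wrong.
expected = {"tool": "get_weather", "args": {"city": "Boston", "unit": "F"}}
actual = {"tool": "get_weather", "args": {"city": "Boston", "unit": "C"}}
print(score_tool_call(expected, actual))  # 0.75
```

The function name and the 50/50 split between tool choice and parameter accuracy are assumptions for illustration; Galileo’s actual metric is undoubtedly more sophisticated, but the idea – reward picking the right tool with the right parameters, and reward knowing when no tool is needed – is the same.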
For each model, the leaderboard shows the Rank, Model name, Type (Private or Open Source), Vendor, Cost (I/O), and Avg Category Score (TSQ).
So, which AI agent is the best? According to Galileo’s leaderboard, it’s Google’s gemini-2.0-flash-001 with a TSQ of 0.938! OpenAI’s gpt-4o-2024-11-20 is second with a TSQ of 0.900. Both models received what Galileo calls “Elite Tier Performance” status, which is given to models with a score of 0.9 or higher. They are followed by two more Google models – gemini-1.5-flash and gemini-1.5-pro – in third and fourth, respectively. OpenAI models o1-2024-12-17 and o3-mini-2025-01-31 are fifth and sixth, respectively, with Mistral’s mistral-small-2501 in seventh as the highest-ranked open source model.
Notably, Google’s Gemini 2.0 was consistent across all of the evaluation categories, balancing that impressive performance with cost-effectiveness at $0.15/$0.60 per million tokens, according to the post. Although GPT-4o was a close second, it comes at a much higher price point of $2.50/$10 per million tokens.
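To put that price difference in perspective, here’s a quick back-of-the-envelope calculation using the quoted per-million-token prices; the monthly token volumes below are purely hypothetical:

```python
# Back-of-the-envelope cost comparison using the leaderboard's quoted prices
# (USD per million input/output tokens). The workload figures are hypothetical.
PRICES = {
    "gemini-2.0-flash-001": (0.15, 0.60),
    "gpt-4o-2024-11-20": (2.50, 10.00),
}

input_tokens, output_tokens = 50_000_000, 10_000_000  # assumed monthly workload

for model, (in_price, out_price) in PRICES.items():
    cost = (input_tokens / 1_000_000) * in_price + (output_tokens / 1_000_000) * out_price
    print(f"{model}: ${cost:,.2f}")

# gemini-2.0-flash-001: $13.50
# gpt-4o-2024-11-20: $225.00
```

Your mileage will vary with your actual token volumes, but the roughly order-of-magnitude gap in price holds regardless of workload size.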
You can sort the leaderboard by cost as well, which shows Mistral’s ministral-8b-2410 as the least expensive model at $0.10/$0.10 per million tokens. Google’s gemini-1.5-flash is second at $0.07/$0.30 per million tokens – which, combined with its third-place ranking in performance, makes it an interesting choice for both high performance and low cost! Oh, mamma mia, mamma mia! 🤣
So, what do you think? Do you agree with Galileo’s determination of which AI agent is the best? Please share any comments you might have or if you’d like to know more about a particular topic.
Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by my employer, my partners or my clients. eDiscovery Today is made available solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Today should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.