
Published: Dec 15, 2025

How to evaluate AI systems: A practical guide to toolkits, gaps and governance


AI adoption is accelerating across every sector, but so are the risks. Organisations now face increasing pressure to evaluate AI systems against evolving regulations, emerging threat vectors, and rising expectations of transparency and safety. However, today’s evaluation landscape is fragmented: toolkits, benchmarks, and red-teaming frameworks measure different things in different ways, with varying degrees of usability.

This article summarises the current landscape, highlights the gaps that make AI evaluation difficult in practice, and outlines a pragmatic approach enterprises can take. A downloadable infographic accompanies this article. It provides an overview of the key toolkits, where they fall short and why a lifecycle approach to AI evaluation is optimal. 
 

Key takeaways

  • AI evaluation today is fragmented, inconsistent, and difficult to operationalise across the AI lifecycle.
  • Existing toolkits tend to specialise in narrow principles, forcing organisations to stitch together multiple tools for meaningful coverage.
  • Regulatory expectations are increasing globally, making structured lifecycle-based evaluation essential.
  • NCS recommends a practical model: identify the lifecycle stage, define the right metrics, then select (and combine) the most appropriate toolkits.
     

The current landscape of AI system evaluation


Why organisations are struggling with AI risks

As AI systems become more deeply embedded in products, workflows, and customer-facing experiences, the risks have grown more visible. Incidents involving hallucinations, flawed code generation, misinformation, and misaligned outputs have shown how unpredictable AI can be in real-world environments.

These challenges are not theoretical. Industry examples across sectors show how models can inadvertently reveal sensitive information, generate misleading outputs, or create vulnerabilities that can be exploited. Attackers are also leveraging AI to scale fraud, impersonation, and content manipulation, including deepfake-enabled social engineering attempts.

To build trust, organisations need greater clarity, stronger controls, and consistent methods for evaluating AI behaviour. This is why many have turned to emerging toolkits and frameworks to benchmark model performance, test for bias, assess robustness, or simulate adversarial attacks.
 

A diverse but fragmented toolkit ecosystem

The market now includes government-led initiatives, vendor-backed open-source packages, and academic benchmarking suites. More than 80 percent of the toolkits surveyed are open-source and focus on principles such as fairness, explainability, safety, and security. Examples include:

  • AI Verify – An AI governance testing framework and software toolkit with a web-based interface, developed by Singapore’s Infocomm Media Development Authority (IMDA) to evaluate AI systems along multiple dimensions.
  • Project Moonshot – A benchmarking and red-teaming toolkit for Large Language Models.
  • Veritas Toolkit – A Monetary Authority of Singapore (MAS) initiative to support fairness and transparency in financial AI systems.
  • MLCommons Safety Benchmarks – Industry benchmarking for safety-related model performance. 

Beyond these government-led initiatives, several widely used open-source toolkits address specific evaluation needs. Microsoft’s Fairlearn helps teams measure and mitigate model fairness issues, while IBM’s AIF360 and AIX360 provide comprehensive libraries for fairness testing and explainability. Meta’s Cyberseceval focuses on cybersecurity risks in LLMs, and Microsoft’s PyRIT enables automated red-teaming for generative AI systems. Academic tools such as HELM and AIR-Bench benchmark model safety and performance using standardised datasets, while EleutherAI’s LM Eval provides a unified harness for testing generative models across multiple tasks.
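
As an illustration of how these libraries are typically used, the minimal sketch below uses Fairlearn’s MetricFrame to disaggregate accuracy and selection rate by a sensitive feature; the toy labels, predictions and group values are invented purely for illustration.

```python
# A minimal Fairlearn sketch: disaggregate standard metrics by a sensitive
# feature to see how a classifier behaves across groups. The toy data below
# is invented purely for illustration.
from fairlearn.metrics import MetricFrame, selection_rate
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]                   # ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                   # model predictions
group  = ["A", "A", "A", "A", "B", "B", "B", "B"]   # sensitive feature

mf = MetricFrame(
    metrics={"accuracy": accuracy_score, "selection_rate": selection_rate},
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=group,
)

print(mf.overall)    # aggregate metrics over the whole dataset
print(mf.by_group)   # the same metrics broken down per group
```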

This diversity of tools is useful but can also be overwhelming. Each toolkit approaches AI evaluation differently, uses different metrics, and is built for different stages of the AI lifecycle.
 

Regulatory expectations continue to rise

Governments are moving beyond voluntary guidelines into enforceable regulation.

As regulatory expectations tighten, organisations need clarity on how to measure compliance across the AI lifecycle, from data management to model training, deployment, monitoring, and incident response.
 

Gaps in today’s AI evaluation tools

Despite the growing number of frameworks, toolkits and benchmarks available, most organisations still struggle to obtain a clear, comprehensive view of AI risk. Several structural gaps explain why.

1. Difficulty interpreting results

Many toolkits generate scores or metrics that can be hard to interpret without deep technical expertise.

For example, fairness libraries often offer multiple statistical measures, each with different implications. Benchmarking tools may produce performance or risk scores but provide limited clarity on what those scores actually mean for a specific use case.

Tools such as Moonshot, AIR-Bench and LM Eval often generate capability or safety scores without accompanying guidance, making it hard for teams to interpret real-world implications or translate results into operational decisions.
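
For example, a single run of the LM Eval harness returns raw task scores with no built-in interpretation. The minimal sketch below assumes the harness exposes the lm_eval.simple_evaluate entry point, and the model and task names are placeholders chosen purely for illustration.

```python
# A minimal sketch of running the EleutherAI LM Eval harness from Python.
# Assumes the `lm_eval` package (lm-evaluation-harness) is installed; the
# model name and task are placeholders chosen for illustration.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                     # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    num_fewshot=0,
)

# The harness returns nested dictionaries of raw scores (e.g. accuracy),
# but deciding whether those numbers are acceptable for a given use case
# is left entirely to the evaluating team.
print(results["results"]["hellaswag"])
```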

The result is that business and compliance stakeholders often receive outputs they cannot easily translate into operational decisions.

2. Narrow or incomplete evaluation coverage

Many tools were designed for depth rather than breadth.

  • Some focus only on fairness.
  • Some specialise in explainability.
  • Some only perform adversarial red-teaming.
  • Some benchmark capabilities but not safety.

This pattern appears across the broader ecosystem. Fairlearn focuses primarily on fairness metrics, AIX360 on explainability, while AIF360 offers depth in statistical fairness but limited coverage of other principles. Benchmarking tools such as HELM, AIR-Bench, and LM Eval measure capability or safety but do not assess areas like privacy or transparency. Conversely, red-teaming tools such as PyRIT, Giskard, or Cyberseceval identify harmful behaviour or cybersecurity risk but do not provide broader lifecycle evaluation.

Even broader tools like AI Verify offer multiple dimensions but may provide fewer specialised metrics than focused libraries. This makes a “single toolkit” approach unrealistic; organisations must combine tools to achieve full lifecycle coverage.

3. Inconsistent methodologies across tools

Different toolkits measure similar principles—fairness, robustness, safety—using different datasets, thresholds, or mathematical definitions.

This inconsistency can lead to contradictory conclusions when evaluating the same model. Without a harmonised evaluation standard, organisations struggle to compare results across toolkits in a meaningful way.
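
To make this concrete, the minimal sketch below applies two common fairness definitions from Fairlearn to the same toy predictions (the data is invented for illustration). The model can look fair under demographic parity yet clearly unfair under equalized odds, which is precisely the kind of disagreement that arises when different toolkits default to different definitions.

```python
# Two mathematically different fairness definitions applied to the same
# predictions. The toy data is invented; the point is that the two summary
# metrics need not agree, so tools that default to different definitions
# can reach different conclusions about the same model.
from fairlearn.metrics import (
    demographic_parity_difference,
    equalized_odds_difference,
)

y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
group  = ["A", "A", "A", "A", "B", "B", "B", "B"]

dpd = demographic_parity_difference(y_true, y_pred, sensitive_features=group)
eod = equalized_odds_difference(y_true, y_pred, sensitive_features=group)

print(f"demographic parity difference: {dpd:.2f}")   # 0.00 on this toy data
print(f"equalized odds difference:     {eod:.2f}")   # 0.50 on this toy data
```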

4. Technical complexity and usability issues

Some toolkits require advanced configuration, dependency management or specialised knowledge to deploy effectively. Examples include:

  • Python package dependencies
  • Command-line setup
  • Limited documentation for niche tools
  • Dataset formatting requirements

HELM, for example, can require complex environment setup. Other tools face similar usability challenges. For instance, PyRIT and Cyberseceval require familiarity with security-oriented workflows, Giskard relies on automated test suite configuration, and libraries like AIF360 and Fairlearn demand expertise in selecting and interpreting statistical fairness metrics. These factors can increase the learning curve for teams without deep technical backgrounds.

5. No guidance on “what to do next”

Even when an organisation successfully evaluates an AI model, most toolkits stop at presenting the results. They do not advise on:

  • Prioritising risks
  • Selecting mitigation strategies
  • Incorporating findings into governance workflows
  • Aligning with regulatory frameworks
  • Deciding which lifecycle areas require re-testing

This leaves AI teams with a gap between measurement and action, thereby increasing the need for expert interpretation and structured governance.
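
One pragmatic way to close that gap is to translate each finding into a structured record that governance processes can act on. The sketch below is a hypothetical risk-register entry; the field names, thresholds and example values are invented for illustration and are not part of any of the toolkits discussed.

```python
# A hypothetical structure for turning a raw evaluation finding into an
# actionable risk-register entry. None of this comes from the toolkits
# above; field names, thresholds and example values are illustrative.
from dataclasses import dataclass, field


@dataclass
class RiskRegisterEntry:
    model_name: str
    lifecycle_stage: str         # e.g. "evaluation", "deployment"
    metric: str                  # which measurement triggered the entry
    observed_value: float
    threshold: float             # acceptance threshold agreed with governance
    severity: str                # e.g. "low", "medium", "high"
    mitigation: str              # the agreed next action
    retest_required: bool = True
    frameworks: list[str] = field(default_factory=list)  # e.g. ["NIST AI RMF"]


entry = RiskRegisterEntry(
    model_name="credit-scoring-v3",
    lifecycle_stage="evaluation",
    metric="equalized_odds_difference",
    observed_value=0.50,
    threshold=0.10,
    severity="high",
    mitigation="Re-balance training data and re-run fairness tests",
    frameworks=["NIST AI RMF", "ISO/IEC 42001"],
)
print(entry)
```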

 
How to choose the right AI evaluation tools: a practical, lifecycle-based approach

Choosing the right AI evaluation tools starts with understanding what you need to test and when you need to test it. No single toolkit covers every principle: capability, fairness, robustness, safety, and security. The key is matching the right tool to the right stage of the AI lifecycle. The steps below show how organisations can confidently select the appropriate benchmarks, principle-based tests, and red-teaming tools based on the AI model’s maturity, the risks involved, and the outcomes they need to validate.

1. Identify the AI lifecycle stage

Start with clarity on where the model sits:

  • Data preparation
  • Model development
  • Evaluation
  • Deployment
  • Monitoring
  • Incident response

Different stages require different tests and controls.

2. Define the right metrics for that stage

Each lifecycle stage has its own relevant principles:

  • Fairness during data processing
  • Explainability during model development
  • Robustness before deployment
  • Red-teaming during pre-production testing
  • Safety monitoring post-deployment

Without clear metrics, tool selection becomes arbitrary.
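
One lightweight way to keep this mapping explicit is a shared evaluation plan that records which principle and candidate metrics apply at each stage. The sketch below is hypothetical: the stages follow the lifecycle list above, while the metric choices are illustrative rather than prescriptive.

```python
# A hypothetical stage-to-metrics map. The stages follow the lifecycle list
# above; the principle and metric choices are illustrative, not prescriptive.
EVALUATION_PLAN = {
    "data preparation":  {"principle": "fairness",
                          "metrics": ["group representation", "label balance"]},
    "model development": {"principle": "explainability",
                          "metrics": ["feature importance stability"]},
    "evaluation":        {"principle": "robustness",
                          "metrics": ["accuracy under perturbation"]},
    "deployment":        {"principle": "safety / red-teaming",
                          "metrics": ["jailbreak success rate"]},
    "monitoring":        {"principle": "safety",
                          "metrics": ["drift alerts", "harmful-output rate"]},
}

for stage, plan in EVALUATION_PLAN.items():
    print(f"{stage}: test {plan['principle']} via {', '.join(plan['metrics'])}")
```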

3. Select the most suitable toolkit(s) for each goal

Rather than forcing one toolkit to cover everything, organisations should adopt a modular approach:

  • Benchmarking tools to understand model capability
  • Principle-based tests for fairness, explainability and robustness
  • Red-teaming tools for stress testing and safety evaluation

Most comprehensive evaluations require combining 2–4 toolkits. For example, organisations may combine Moonshot or AIR-Bench for benchmarking, AI Verify, Fairlearn, or AIF360 for principle-based testing, and PyRIT, Giskard, or Cyberseceval for red-teaming and adversarial risk assessment. No single toolkit provides complete coverage, but together they offer a more holistic evaluation across capability, fairness, robustness, safety, and security dimensions.
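
In practice, each toolkit typically produces its own report, and the evaluation team merges them. The sketch below assumes each tool has exported a JSON results file; the file names and schemas are hypothetical and will differ by tool and version.

```python
# A hypothetical aggregation step: each toolkit exports its own results file,
# and the team merges them into one summary. File names and keys below are
# invented for illustration; real exports differ per tool and version.
import json
from pathlib import Path

REPORTS = {
    "benchmarking": Path("reports/moonshot_results.json"),
    "fairness":     Path("reports/fairlearn_results.json"),
    "red_teaming":  Path("reports/pyrit_findings.json"),
}

summary, found = {}, 0
for area, path in REPORTS.items():
    if path.exists():
        summary[area] = json.loads(path.read_text())
        found += 1
    else:
        summary[area] = {"status": "not yet run"}

Path("reports").mkdir(parents=True, exist_ok=True)
Path("reports/combined_evaluation.json").write_text(json.dumps(summary, indent=2))
print(f"Merged {found} of {len(REPORTS)} tool reports into reports/combined_evaluation.json")
```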

4. Combine results into an integrated governance workflow

This “multi-tool” approach must be tied back to organisational governance:

  • Evaluation documentation
  • Risk registers
  • Model cards
  • Compliance mapping to the NIST AI RMF, ISO/IEC 42001 and IMDA frameworks
  • Continuous monitoring dashboards

This ensures evaluation insights become operational safeguards.
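
As one example of keeping these artefacts machine-readable, the sketch below shows a hypothetical, minimal model-card record that links evaluation outputs to the frameworks being mapped against; the structure, field names and values are invented for illustration rather than drawn from any specific standard.

```python
# A hypothetical, minimal model-card record tying evaluation outputs to the
# governance artefacts listed above. Structure and values are illustrative.
import json

MODEL_CARD = {
    "model": "support-chatbot-v2",
    "intended_use": "Customer support triage; not for financial advice",
    "evaluations": [
        {"tool": "Project Moonshot", "area": "benchmarking",
         "report": "reports/moonshot_results.json"},
        {"tool": "Fairlearn", "area": "fairness",
         "report": "reports/fairlearn_results.json"},
        {"tool": "PyRIT", "area": "red-teaming",
         "report": "reports/pyrit_findings.json"},
    ],
    "compliance_mapping": {
        "NIST AI RMF": ["MEASURE", "MANAGE"],   # RMF functions exercised by these tests
        "ISO/IEC 42001": ["risk assessment", "performance evaluation"],
    },
    "monitoring": {"dashboard": "internal monitoring dashboard",
                   "review_cadence": "quarterly"},
}

print(json.dumps(MODEL_CARD, indent=2))
```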
 

Why NCS?

NCS consultants specialise in AI governance, safety and risk assessment. Our teams are trained across leading global frameworks, including NIST AI RMF, ISO 42001, Singapore’s Model AI Governance Frameworks, and sector-specific guidelines.

Combined with deep experience across the major open-source and commercial toolkits, we help organisations:

  • Select the right mix of evaluation tools
  • Interpret results clearly for business stakeholders
  • Build lifecycle-based AI governance workflows
  • Strengthen risk management at data, model and system levels
  • Monitor emerging vulnerabilities in today’s fast-evolving AI landscape

Our ongoing R&D collaboration with NUS also enables us to advise on emerging threats, attack vectors and new evaluation methodologies before they reach mainstream adoption.
 

A clearer path to evaluating AI safely and responsibly

The AI evaluation landscape is complex and fragmented; no single toolkit can provide complete coverage across the AI lifecycle. By adopting a structured, lifecycle-based approach and combining the right tools for the right purpose, organisations can build more trusted, resilient and compliant AI systems.

The accompanying infographic provides a condensed visual comparison of the major toolkits and highlights key gaps to be aware of.
 

See the AI toolkit landscape at a glance

A simple, visual overview of today’s AI evaluation tools, what they check, what they miss and how to use them together for safer AI.



References

AI Verify Foundation. (2024). Project Moonshot: An LLM evaluation toolkit. https://aiverifyfoundation.sg/project-moonshot/

AI Verify Foundation. (2024). Proposed model governance framework for generative AI. https://aiverifyfoundation.sg/resources/mgf-gen-ai/#proposed-model-governance-framework-for-generative-ai

AI Verify Foundation. (2024). What is AI Verify? https://aiverifyfoundation.sg/what-is-ai-verify/

AI Verify Foundation & Infocomm Media Development Authority. (2024). Model AI governance framework for generative AI (May 2024). https://aiverifyfoundation.sg/wp-content/uploads/2024/05/Model-AI-Governance-Framework-for-Generative-AI-May-2024-1-1.pdf

Curry, R. (2024, May 16). The biggest risk corporations see in gen AI usage isn’t hallucinations. CNBC. https://www.cnbc.com/2024/05/16/the-no-1-risk-companies-see-in-gen-ai-usage-isnt-hallucinations.html

Cyber Security Agency of Singapore. (2024). Guidelines and companion guide on securing AI systems. https://www.csa.gov.sg/resources/publications/guidelines-and-companion-guide-on-securing-ai-systems/

EleutherAI. (n.d.). EleutherAI research and open-source LLM projects. https://www.eleuther.ai/

Giskard. (n.d.). Giskard: The AI testing framework. https://github.com/Giskard-AI/giskard

IBM. (n.d.). AI Explainability 360 (AIX360). https://github.com/Trusted-AI/AIX360

IBM. (n.d.). AI Fairness 360 (AIF360). https://github.com/Trusted-AI/AIF360

International Organization for Standardization. (2023). ISO/IEC 42001: Artificial intelligence—Management system. https://www.iso.org/standard/81230.html

Meta. (n.d.). Cyberseceval: LLM cybersecurity benchmarks. https://github.com/meta-llama/PurpleLlama

Microsoft. (n.d.). Fairlearn. https://fairlearn.org/

Microsoft. (n.d.). PyRIT: Python Risk Identification Tool for generative AI. https://github.com/Azure/PyRIT

MLCommons. (n.d.). AI safety benchmarks. https://mlcommons.org/benchmarks/

Monetary Authority of Singapore. (2023). Toolkit for responsible use of AI in the financial sector. https://www.mas.gov.sg/news/media-releases/2023/toolkit-for-responsible-use-of-ai-in-the-financial-sector

National Institute of Standards and Technology. (2023). AI risk management framework. https://www.nist.gov/itl/ai-risk-management-framework

Personal Data Protection Commission (PDPC). (2022). Model AI governance framework. https://www.pdpc.gov.sg/help-and-resources/2020/01/model-ai-governance-framework/

Stanford Center for Research on Foundation Models. (n.d.). HELM: AIR-Bench. https://crfm.stanford.edu/helm/air-bench/latest/



Speak to our AI governance specialists

They can assess your organisation’s current evaluation approach and identify the right mix of tools, controls and lifecycle processes to strengthen the safety and trustworthiness of your AI systems.

Contact us
