As AI systems increasingly influence real-world outcomes, the stakes for getting them right have never been higher. Yet generative AI introduces unpredictable failures that traditional testing approaches struggle to catch. At the same time, regulators are raising the bar for accountability. Leaders must therefore rethink how they validate and govern AI systems. This article introduces a practical framework for assessing trustworthy AI, and what organisations must do to detect, manage, and mitigate AI failures before they impact users.
Key takeaways
- Trustworthy AI is now a business and regulatory imperative.
- Traditional software testing approaches are insufficient for generative AI because they don’t fail in predictable ways.
- (What are we introducing) The risk-based AI assurance framework can provide better coverage/insight onto a trustworthy AI system while also addressing regulatory requirements.
- AI assurance must be continuous and embedded across the AI lifecycle.
Why traditional testing is no longer enough
Regulators are looking for independent, verifiable evidence that a specific system is safe and appropriate for its intended use.
- Vendor claims and subjective satisfaction are not enough because they cannot be audited or held accountable.
- Traditional software testing methods are inadequate – they assume deterministic behaviour: same input, same output.
- Gen AI is fundamentally non-deterministic: a chatbot may answer correctly 95 times and hallucinate on the 96th, with no change in tone or confidence.
- Point-in-time testing cannot capture or mitigate these failures, living critical risks undetected
Given these limitations, it’s clear that a new approach is needed. That’s where a risk-based AI Assurance Framework comes in to provide the independent, structured, and ongoing oversight required to ensure trustworthy and accountable AI deployments
The risk-based AI assurance framework
There are 3 essential capabilities in risk-based AI assurance that conventional testing lacks:
- Independence. The team that built the system should not be the sole judge of whether it is safe. Independent testers focus on uncovering failures rather than confirming expected outcomes.
- Risk-driven prioritisation. Not every error has the same impact. A chatbot using an overly informal tone is far less serious than one fabricating government programme details. Assurance efforts should prioritise areas where the risks are highest.
- Evidence consolidation. Risk-based assurance generates clear, auditable records—such as risk registers, test results, and evidence packs—that give deployment owners, governance teams, and regulators insight. Without these, testing results remain hidden from decision-makers.
The framework in action: operationalising trust
To prove this works beyond theory, NCS and AIQURIS co-developed an independent AI assurance framework under IMDA’s AI Verify Foundation Assurance Sandbox. We stress-tested this framework on a high-impact career guidance chatbot in Singapore. Because the AI influences critical employment decisions for graduates and displaced workers, we applied rigorous governance to its complex processing to ensure its recommendations remain safe and accurate.
To prove this works beyond theory, NCS and AIQURIS co-developed an independent AI assurance framework under IMDA’s AI Verify Foundation Assurance Sandbox. We stress-tested this framework on a high-impact career guidance chatbot in Singapore. Because the AI influences critical employment decisions for graduates and displaced workers, we applied rigorous governance to its complex processing to ensure its recommendations remain safe and accurate.
How it works
The idea is simple – a closed-loop, risk-based and evidence-backed assurance process.

Figure 1. The assurance process
- Define the scope and identify risks: The team documents application objectives, user populations, and constraints. Potential failures modes are identified and ranked by impact.
- Test under production-like conditions: We use NCS SPAR-based simulations to generate multi-turn conversations that probe risk scenarios.
- Evaluate: LLM-as-a-judge evaluation scores responses against structured rubrics. (See Fig 2)
- Results: The results are consolidated into evidence packs for regulators and deployment owners, which will then inform deployment decisions. (See Fig 3)

Figure 2. NCS Simulation based evaluation and risk scoring methodology

Figure 3. Example of a Risk Profile generated by AIQURIS
The assurance framework establishes a reusable deployer-assurer governance model for high-impact conversational AI. It gives deployment teams auditable evidence to support high-stakes decisions, reducing risks and increasing trust and safety in citizen-facing AI systems.
Future-proof your AI
Future-proofing AI requires better guardrails, not just better models. Organisations that invest in structured AI assurance now will navigate the shifting regulatory landscape with ease. The question is no longer if independent assurance is necessary, but whether your organisations will have the framework and capability in place before your next deployment demands it.
Download