Introduction 

Artificial Intelligence is transforming how organisations operate, reshaping processes, accelerating analysis, and enabling new forms of insight generation. As AI systems move from experimentation into core infrastructure, institutions must establish robust risk management frameworks to ensure effective oversight of model performance through validation, monitoring, and accountability. Existing model risk practices need to evolve to address the behavioural and operational characteristics of both Generative and Agentic AI models, while maintaining trust and regulatory compliance to support sound decision-making.

Regulatory expectations are converging across EU and UK jurisdictions, requiring AI systems to meet standards of explainability, traceability, governance, and human oversight. Here we propose a model risk framework designed to meet these expectations by embedding behavioural assurance, auditability, and control mechanisms, enabling institutions to scale AI models safely and with confidence.

The framework presented here follows a structured, end-to-end validation approach covering data quality and safety, behavioural testing, output evaluation, human-in-the-loop, continuous monitoring, and remediation mechanisms.

The New Generation of AI Systems

Machine learning models have traditionally focused on statistical prediction, estimating probabilities, classifying outcomes, or detecting patterns in structured data. These models typically produce numerical outputs and are evaluated primarily through quantitative performance metrics. Modern AI systems, however, extend beyond predictive modelling. They combine large language models (LLMs) with capabilities such as reasoning, retrieval of information, planning, and interaction with external tools. As a result, these systems behave more like analytical assistants than traditional statistical models. For the purposes of this article, we focus on these AI systems, which typically consist of two key components: 

  • Generative AI: systems used to produce narrative reasoning and explanation; and
  • Agentic AI: systems that can pursue goals, make decisions over multiple steps, use tools, retrieve information, and (in some cases) act with a degree of autonomy to accomplish tasks.

Generative AI

Generative AI represents the foundational layer of modern AI capabilities. These models excel at producing coherent, high-quality narratives: summarising documents, reorganising evidence, rewriting analyst notes, and structuring complex behaviours into clear explanations. Their strengths lie in interpretation and documentation, not in executing tasks such as performing calculations or inferring missing data. They enhance human efficiency but do not replace human judgment.

Agentic AI 

Agentic AI expands on Generative AI by introducing the ability to design structured plans, identify and utilise supporting tools, and thus replace human input in executing sequential tasks and inferring related conclusions. This introduces additional complexities from a model risk management perspective, including setting boundaries on what the model can do, governing what tools should be made available, and checking outcomes for accuracy. 

Thus, while Generative AI and Agentic AI are distinct system types, for validation purposes Generative AI can be treated as the zero-autonomy baseline to which agentic decision-making and execution capabilities are added.

Several components of agentic systems commonly appear in financial services workflows, each bringing distinct behaviours and risks that must be validated. These include:

  • Planning agents which break tasks into structured steps and sequence actions.
  • Retrieval agents that locate information from documents or databases.
  • Tool-using agents that interact with calculators, APIs, or internal systems to perform actions.
  • Routing or orchestration agents which decide which tools or workflows to invoke.
  • Verification agents which check outputs and reasoning. 

In combination, these agents can execute multi-step processes end to end, explicitly carrying forward context, intermediate outputs, and decisions from one step to the next so the overall workflow remains coherent and traceable.
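
To make the notion of carried-forward context concrete, the sketch below shows one way an orchestration layer could record each step's inputs, decisions, and intermediate outputs so the overall workflow stays traceable. It is a minimal illustration only; the agent names, field names, and WorkflowState structure are assumptions, not a reference to any particular framework.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class StepRecord:
    """One entry in the workflow audit trail (illustrative structure)."""
    agent: str                    # e.g. "planner", "retriever", "tool_user"
    action: str                   # what the agent decided or did
    inputs: dict[str, Any]        # context the agent saw at this step
    outputs: dict[str, Any]       # intermediate results carried forward
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

@dataclass
class WorkflowState:
    """Shared state threaded through a multi-step agentic workflow."""
    objective: str
    context: dict[str, Any] = field(default_factory=dict)
    trail: list[StepRecord] = field(default_factory=list)

    def record(self, agent: str, action: str, outputs: dict[str, Any]) -> None:
        # Snapshot the context before the step so the trail shows what each agent saw.
        self.trail.append(StepRecord(agent, action, dict(self.context), outputs))
        self.context.update(outputs)  # carry intermediate outputs forward to the next step

# Example: a planner hands structured steps to a retriever, which grounds the next step.
state = WorkflowState(objective="Summarise counterparty exposure")
state.record("planner", "produced 3-step plan", {"plan": ["retrieve", "calculate", "draft"]})
state.record("retriever", "fetched exposure report", {"evidence": "doc-123 section 4.2"})
```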

The AI Risk Landscape

Modern AI systems introduce a range of risks that depend on how they generate outputs and, in some cases, how they are designed to act with autonomy. To make these risks easier to assess and control, it is useful to group them into categories that best reflect where failures are most likely to arise. The risk types below provide a practical way to frame validation and ensure testing remains proportionate to how the system is built, how it behaves, and how it is used.

Design and implementation risk

Design and implementation risk describes situations where weaknesses in how the system is built or configured manifest as unsafe or unintended behaviour at runtime. This includes inappropriate autonomy settings, flawed workflow design, unsuitable tooling, architectural mis-design, or inadequate prompt and guardrail configuration, all of which can lead to outcomes that fall outside intended or policy-compliant behaviour.

Core risk 

Core risks are common to both Generative and Agentic AI models and reflect fundamental failure modes inherent to probabilistic, data driven systems. These risks arise regardless of system autonomy or tool use and therefore form the baseline risk layer applicable to all AI deployments.

The core risk categories typically include:

  • Factual Integrity – Risk of unsupported, unverifiable, false, or invented statements.
  • Reasoning Integrity – Risk of causal gaps, flawed logic, missing steps, or incoherent reasoning.
  • Consistency – Risk of contradictory or internally inconsistent outputs across answers or runs.
  • Stability & Drift – Risk that behaviour changes across runs, model updates, or small input variations.
  • User Overreliance (governance) – Risk that fluent narrative leads users to trust AI outputs without adequate HITL.

Agentic specific risk

Agentic AI introduces additional risks due to its workflow-driven, goal-oriented nature and its ability to act with increased autonomy. Unlike purely generative systems, these risks arise from the system’s capacity to plan, make intermediate decisions, invoke tools, and execute actions with limited human intervention.

As a result, Agentic AI gives rise to additional risk categories, including:

  • Planning Integrity – Risk of invented, irrelevant, or unsafe steps in generated plans.
  • Workflow Coherence – Risk of incorrect sequencing, dependency errors, or broken multi-step logic.
  • Tool-Use Safety – Risk of unsafe or incorrect tool/API selection, parameters, or misuse.
  • State Integrity – Risk of corrupted, lost, or contaminated intermediate state across steps.
  • Retrieval Integrity – Risk of wrong source selection, mis-grounding, or unstable retrieval behaviour.
  • Auditability & Traceability – Risk that plans, reasoning, or tool interactions cannot be reproduced or traced.
  • Guardrail & Autonomy – Risk that the agent exceeds permitted autonomy, bypasses constraints, or performs unsafe actions.

The scale and complexity of these risks, versus those of a traditional predictive model, require the design of an enhanced model validation framework.

An AI validation framework 

The complexities of AI systems introduce behavioural risks that traditional validation frameworks are not designed to address. For this reason, AI assurance frameworks require a set of additional complementary validation components:

  1. Data quality and safety – ensures the AI receives safe, complete, and policy-compliant inputs before any validation begins. For Generative AI and Agentic systems, inputs include prompts, conversation history, retrieved evidence, and system instructions. 
  2. Behavioural testing – assesses whether Generative AI and Agentic AI systems behave with appropriate discipline and control in practice. This includes how consistently the system reasons, how reliably it grounds outputs in available evidence, how it responds when information is missing or contradictory, and whether guardrails remain effective over time. For systems with agentic capabilities, behavioural testing also considers autonomy boundaries, routing decisions between components, and the safe use of tools.
  3. Output evaluation – evaluates the quality of what the model produces: relevance, completeness, factual accuracy, clarity, tone, and the degree of human refinement required. 
  4. Human-in-the-Loop (HITL) – applied after output evaluation to provide human judgment and accountability for high-impact outputs where responsibility cannot be delegated to AI.
  5. Continuous monitoring – provides ongoing tracking of drift, hallucination patterns, retrieval failures, instability in planning, and other behavioural changes over time.
  6. Remediation mechanism – even with strong controls, Generative and Agentic AI systems require continuous remediation due to their dynamic nature. Issues can arise at any stage, so remediation acts as an ongoing feedback loop where weaknesses trigger targeted adjustments such as prompt refinement, model tuning, and guardrail updates, ensuring the system remains stable, safe, and aligned with validation expectations.

All framework components apply to both Generative and Agentic AI systems. Where a system introduces autonomy or tool use, the behavioural testing component is applied more stringently, with additional checks to address the elevated risks these capabilities introduce. The same assurance workflow applies consistently across the lifecycle, with remediation triggered whenever validation findings, output issues, or monitoring signals indicate the need for corrective action.
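
As a rough illustration of how these components could be sequenced in tooling, the sketch below wires placeholder stages into a single assurance pipeline, with remediation modelled as a feedback record rather than a terminal failure. All function and stage names are hypothetical; a real implementation would substitute institution-specific checks and richer result objects.

```python
from typing import Callable

# Each stage is a placeholder callable returning (passed, findings); names are illustrative.
Stage = Callable[[dict], tuple[bool, list[str]]]

def run_validation_pipeline(candidate: dict, stages: dict[str, Stage]) -> dict:
    """Run the framework components in sequence; failures feed the remediation loop."""
    findings: dict[str, list[str]] = {}
    for name in ["data_quality_and_safety", "behavioural_testing",
                 "output_evaluation", "hitl_review", "continuous_monitoring"]:
        passed, issues = stages[name](candidate)
        findings[name] = issues
        if not passed:
            # Remediation is a feedback loop, not a terminal state: record what triggered it.
            findings.setdefault("remediation_triggers", []).append(name)
    return findings

def toy_data_check(candidate: dict) -> tuple[bool, list[str]]:
    """Toy stage implementation: flag an empty prompt (illustrative only)."""
    ok = bool(candidate.get("prompt"))
    return ok, ([] if ok else ["empty prompt"])

# Example wiring with the same toy check standing in for every stage.
stages = {name: toy_data_check for name in [
    "data_quality_and_safety", "behavioural_testing", "output_evaluation",
    "hitl_review", "continuous_monitoring"]}
report = run_validation_pipeline({"prompt": "Summarise counterparty exposure"}, stages)
```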

[Figure 1 depicts the six framework stages: data quality and safety (input completeness, safety, and policy compliance); behavioural testing (core tests, retrieval tests, agentic tests, and strengthening tests); output evaluation (clarity, relevance, completeness, tone, and usefulness); HITL (human review and escalation); continuous monitoring (drift, stability, and post-deployment signals); and remediation.]

Figure 1. End-to-end validation framework for Generative and Agentic AI, spanning input data quality and safety, behavioural testing, output evaluation, human-in-the-loop oversight, continuous monitoring, and risk-driven remediation across the lifecycle.

1. Data quality and safety

Data Quality & Safety checks assess whether inputs are complete, well formed, relevant to the intended task, and compliant with internal policy and usage constraints, ensuring that inputs do not contain prohibited, unsafe, or inappropriate content, or request actions or access outside the system’s permitted scope. 
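
A minimal sketch of such input gating is shown below, assuming a simple completeness check over the expected input components and a keyword-based policy screen. The field names and prohibited patterns are illustrative placeholders; production systems would rely on maintained policy lists or classifiers rather than regular expressions.

```python
import re

# Illustrative policy terms and input components; not a prescribed policy set.
PROHIBITED_PATTERNS = [r"\bexecute trade\b", r"\bdelete records\b"]
REQUIRED_FIELDS = ["prompt", "conversation_history", "retrieved_evidence", "system_instructions"]

def check_input_packet(packet: dict) -> list[str]:
    """Return data quality and safety findings for one input packet."""
    findings = []
    # Completeness: all expected input components are present and non-empty.
    for name in REQUIRED_FIELDS:
        if not packet.get(name):
            findings.append(f"missing or empty input field: {name}")
    # Safety / policy: inputs must not request actions outside the permitted scope.
    text = " ".join(str(packet.get(name, "")) for name in REQUIRED_FIELDS)
    for pattern in PROHIBITED_PATTERNS:
        if re.search(pattern, text, flags=re.IGNORECASE):
            findings.append(f"policy violation: matched '{pattern}'")
    return findings

# Example: an incomplete packet containing an out-of-scope request.
issues = check_input_packet({"prompt": "Please execute trade for client X"})
```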

2. Behavioural testing

Behavioural testing focuses on whether an AI system behaves safely, predictably, and consistently across different conditions, rather than assessing the quality of any single output in isolation. This includes assessing its reasoning, the reliability of its grounding in available evidence, the consistency of refusal behaviour when information is missing or contradictory, and, where agentic capabilities are present, how the system plans, sequences actions, and uses tools to progress towards defined objectives.

Behavioural testing is applied under a range of controlled stress conditions, such as incomplete information, contradictory evidence, repeated executions, or adversarial pressure. These conditions do not define pass or fail outcomes themselves. Instead, they are used to surface behavioural weaknesses and to distinguish isolated output issues from systematic behavioural risks that may only emerge under stress.
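
One simple way to operationalise the repeated-execution condition is to re-run the same prompt several times and measure how far the outputs diverge. The sketch below uses a crude token-overlap score as the stability proxy, and call_model stands in for whatever inference interface the institution uses; both are assumptions for illustration rather than a prescribed metric.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Crude token-overlap proxy for how similar two narrative outputs are."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def stability_under_repetition(call_model, prompt: str, runs: int = 5) -> float:
    """Return the worst pairwise similarity across repeated runs of the same prompt."""
    outputs = [call_model(prompt) for _ in range(runs)]
    worst = 1.0
    for i in range(len(outputs)):
        for j in range(i + 1, len(outputs)):
            worst = min(worst, jaccard_similarity(outputs[i], outputs[j]))
    return worst  # compared against a stability threshold set by risk appetite

# Example with a deterministic stand-in model (illustrative only).
score = stability_under_repetition(lambda p: f"Exposure is within limits for: {p}", "counterparty X")
```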

In more complex architectures, behavioural risk may arise not only within a single decision flow, but also from interactions between multiple agents. Where multi-agent systems are used, behavioural testing extends to assessing agent hand-offs, routing decisions, coordination between agents, and the stability of outcomes across shared workflows.

Implementation choices such as document chunking strategies, metadata design, and access controls are not treated as separate validation pillars. Their relevance arises through their behavioural impact. Where these design choices materially affect performance, they are explicitly assessed through behavioural testing and output evaluation. 

Scaling behavioural testing with system complexity

Behavioural testing is applied proportionately rather than uniformly. The depth and breadth of behavioural testing scale with the system’s autonomy and risk profile:

  • Core tests apply to all Generative and Agentic AI systems. These confirm that reasoning is logical and evidence-based, detect hallucinations and behavioural drift, assess stability across repeated runs, and verify that guardrails trigger safe refusals when inputs are incomplete, contradictory, or out of scope.
  • Retrieval-dependent tests apply when Retrieval-Augmented Generation is used. These assess retrieval integrity, ensuring that correct sources are selected, cited faithfully, used without invention, and that retrieval behaviour remains stable across executions.
  • Agentic tests assess whether the system selects and invokes tools appropriately and within permitted boundaries, follows correct routing and escalation paths, detects invented or irrelevant steps in plans, and maintains coherent multi-step workflows. 
  • Strengthening tests are introduced for higher-risk or autonomy-capable systems. These include adversarial stress testing, regulatory alignment checks, causal coherence probes that test the stability of explanations, and confidentiality checks to ensure sensitive information is not disclosed under pressure.
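
This tiering can be made explicit in test tooling, for example by deriving the applicable test packs from declared system capabilities. The sketch below is one possible encoding; the capability flags and pack names are illustrative.

```python
def select_test_packs(uses_rag: bool, is_agentic: bool, high_risk: bool) -> list[str]:
    """Map declared system capabilities to the behavioural test packs to run."""
    packs = ["core"]                   # always applied
    if uses_rag:
        packs.append("retrieval")      # retrieval-dependent tests
    if is_agentic:
        packs.append("agentic")        # tool use, routing, planning coherence
    if high_risk or is_agentic:
        packs.append("strengthening")  # adversarial, regulatory, confidentiality checks
    return packs

# Example: an agentic system using RAG triggers all four packs.
assert select_test_packs(uses_rag=True, is_agentic=True, high_risk=False) == \
    ["core", "retrieval", "agentic", "strengthening"]
```
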
3. Output evaluation

Output Evaluation focuses on the quality, grounding, completeness, and professional suitability of individual outputs.

For Agentic AI systems, evaluation also includes the safety and correctness of proposed actions or workflows. The level of human refinement required serves as a practical indicator of output reliability.

To ensure that AI generated narrative is not only behaviourally safe but also analytically usable, each output must undergo a set of targeted quality checks that assess its relevance, clarity, accuracy, and professional readiness:

  • Relevance assessment – Confirms that the narrative directly addresses the monitoring objective, question, or analytical requirement. Irrelevant or off-scope text is indicative of reasoning drift.
  • Clarity and structural coherence check – Evaluates whether the output is easy to follow, logically ordered, and free from ambiguous or overly complex phrasing.
  • Factual accuracy review – Ensures all statements are correct, verifiable, and grounded in the provided evidence. Any unsupported claim indicates a grounding failure.
  • Completeness scan – Checks whether the narrative covers all required elements of the task without omissions.
  • Tone and professionalism check – Confirms the output is written in a neutral, regulator-ready tone suitable for senior stakeholders.
  • Editing effort score – Measures how much human correction is required before the output can be finalised; high editing effort highlights quality or reasoning issues.

In practice, institutions conduct Output Evaluation through a combination of automated routines and structured human review, with a clear distinction between mechanical checks and those that require judgement.

Mechanical checks are used where objective comparison is possible. For example, verifying whether factual claims are supported by retrieved evidence, checking consistency against known reference data, confirming that required sections are present, or detecting obvious off scope content can be performed automatically and consistently at scale.
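
A minimal sketch of such mechanical checks is shown below: required-section presence, a naive grounding flag based on token overlap with the retrieved evidence, and an editing-effort proxy derived from the difference between the draft and the signed-off version. The section template, overlap threshold, and heuristics are illustrative assumptions, not a prescribed method.

```python
import difflib

REQUIRED_SECTIONS = ["Summary", "Evidence", "Conclusion"]  # illustrative template

def missing_sections(output: str) -> list[str]:
    """Check that every required section heading appears in the narrative."""
    return [s for s in REQUIRED_SECTIONS if s.lower() not in output.lower()]

def unsupported_sentences(output: str, evidence: str, min_overlap: float = 0.2) -> list[str]:
    """Naive grounding check: flag sentences sharing few tokens with the evidence."""
    evidence_tokens = set(evidence.lower().split())
    flagged = []
    for sentence in filter(None, (s.strip() for s in output.split("."))):
        tokens = set(sentence.lower().split())
        overlap = len(tokens & evidence_tokens) / len(tokens) if tokens else 1.0
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged

def editing_effort(draft: str, final: str) -> float:
    """Proxy for human refinement: share of the draft changed before sign-off."""
    return 1.0 - difflib.SequenceMatcher(None, draft, final).ratio()
```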

Human judgement is applied where assessment depends on context, nuance, or intended use. This includes evaluating whether the reasoning is sufficiently clear and persuasive, whether the narrative appropriately addresses conflicting evidence, whether tone and framing are suitable for regulatory or senior management audiences, and whether the output is fit for its intended analytical or supervisory purpose.

4. Human-in-the-Loop (HITL)

HITL introduces explicit human judgement as a formal assurance checkpoint before outputs are relied upon, ensuring accountability for high impact decisions remains with experts rather than the AI system.

HITL review is not applied by default. It is triggered only where outputs are considered materially impactful or regulatory-sensitive, where predefined risk thresholds are breached, or where ambiguity remains unresolved following automated and bounded machine-assisted checks. Typical examples include outputs that influence material financial decisions, regulatory reports, senior management decisions, or changes to key risk inputs.
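
The triggering logic can be captured as an explicit, auditable rule rather than left to convention. The sketch below is one possible formulation; the materiality labels and the risk threshold are placeholder values an institution would calibrate itself.

```python
def requires_hitl(materiality: str, risk_score: float, unresolved_ambiguity: bool,
                  risk_threshold: float = 0.7) -> bool:
    """Decide whether an output must be routed to human review before it is relied upon."""
    high_impact = materiality in {"financial_decision", "regulatory_report",
                                  "senior_management", "key_risk_input"}
    return high_impact or risk_score >= risk_threshold or unresolved_ambiguity

# Example: a regulatory report is always escalated, regardless of its risk score.
assert requires_hitl("regulatory_report", risk_score=0.1, unresolved_ambiguity=False)
```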

The objective of HITL is to maintain clear human accountability, prevent over-reliance on AI in material decisions, and provide a safeguard against residual reasoning or grounding errors before outputs are formally adopted, while allowing monitoring and remediation activities to continue across the wider AI lifecycle.

5. Continuous monitoring

Generative and Agentic AI systems operate in dynamic environments where inputs, usage patterns, and execution context evolve over time. Continuous Monitoring provides ongoing oversight to ensure that system behaviour remains within the bounds established during validation. It acts as a safeguard that complements, rather than replaces, formal validation by detecting behavioural drift under real operating conditions.

In practice, monitoring tracks a defined set of behavioural metrics, such as rates of unsupported claims, changes in reasoning patterns, retrieval stability, and refusal behaviour under incomplete or contradictory inputs. These metrics are evaluated against predefined ranges and thresholds that reflect the institution’s risk appetite, with clear distinctions between acceptable behaviour, emerging concern, and unacceptable deviation.
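
A minimal sketch of threshold-based classification is shown below; the metric names and numeric ranges are placeholders for values calibrated to the institution's own risk appetite.

```python
# (green upper bound, amber upper bound); beyond amber is a breach. Values are illustrative.
THRESHOLDS = {
    "unsupported_claim_rate":  (0.02, 0.05),
    "refusal_failure_rate":    (0.01, 0.03),
    "retrieval_mismatch_rate": (0.05, 0.10),
}

def classify_metric(name: str, value: float) -> str:
    """Map an observed behavioural metric to acceptable / emerging concern / breach."""
    green, amber = THRESHOLDS[name]
    if value <= green:
        return "acceptable"
    if value <= amber:
        return "emerging concern"
    return "unacceptable deviation"  # triggers the remediation workflow

# Example: a rise in unsupported claims lands in the amber zone.
status = classify_metric("unsupported_claim_rate", 0.04)
```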

Continuous monitoring aims to detect when behaviour begins to drift outside validated ranges under real operating conditions. Monitoring is performed at defined intervals and following material changes to prompts, underlying models, retrieval configuration, autonomy settings, or execution context.

Continuous Monitoring extends established model risk management practices to account for the dynamic and adaptive nature of modern AI systems. It provides confidence that validation conclusions remain reliable over time, while ensuring that behavioural changes are detected early and addressed before they lead to material impact.

6. Remediation mechanism

Even with strong controls, Generative and Agentic AI systems will require periodic correction. Behavioural variability, retrieval dependencies, and autonomous execution paths mean that issues can surface at any point in the lifecycle. A remediation mechanism therefore operates as a continuous feedback loop, ensuring that every weakness identified from data intake to post deployment monitoring leads to targeted, traceable adjustments.

  • At the input stage, failed Data Quality & Safety checks (e.g., unsafe content, incomplete inputs, inconsistent retrieval evidence) trigger remediation through updated prompt constraints, revised grounding rules, or improved retrieval configuration to ensure safe, policy-aligned inputs before validation proceeds.
  • During model validation, behavioural findings map directly to corrective actions. Hallucination or drift signals require prompt refinement; stability issues may require model choice adjustments; retrieval failures lead to improved chunking or scoring; and unsafe tool or autonomy behaviours are corrected through revised tool permissions, fallback paths, or step-limit settings. These adjustments restore predictable and auditable AI behaviour.
  • Following output evaluation, remediation focuses on improving narrative quality. High editing effort, missing reasoning steps, or unclear structure are addressed by refining examples, audience instructions, and persona settings within prompts, ensuring outputs meet analytical and supervisory expectations before HITL review.
  • Within HITL, repeated human corrections become explicit remediation signals. Patterns of overrides or escalations inform updates to prompts, guardrails, or autonomy limits so that issues addressed manually do not re-emerge in future outputs.
  • In continuous monitoring, drift alerts, retrieval instability, or changes introduced by model updates automatically trigger remediation workflows. These include meta-prompt updates, iterative regression testing, and realignment of refusal logic and grounding rules. 

Across all stages, remediation relies on a consistent set of control levers, including prompt adjustments, model selection and tuning, retrieval improvements, guardrail updates, tool use configuration, and refinements to autonomy rules.

Each failed test or observed anomaly is mapped to one or more of these controls and addressed through targeted corrective action.
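
That mapping can itself be maintained as a simple, auditable lookup from finding type to control levers, as sketched below. The finding categories and lever names echo those discussed above but are illustrative rather than prescriptive.

```python
# Illustrative finding-to-control mapping; each entry records which levers are pulled.
REMEDIATION_MAP = {
    "hallucination":     ["prompt adjustment", "grounding rule update"],
    "behavioural_drift": ["prompt adjustment", "regression re-testing"],
    "stability_issue":   ["model selection and tuning"],
    "retrieval_failure": ["retrieval improvement (chunking, scoring)"],
    "unsafe_tool_use":   ["tool-use configuration", "guardrail update"],
    "autonomy_breach":   ["autonomy rule refinement", "guardrail update"],
}

def plan_remediation(findings: list[str]) -> dict[str, list[str]]:
    """Return the control levers to apply for each observed finding."""
    return {f: REMEDIATION_MAP.get(f, ["escalate for manual triage"]) for f in findings}

# Example: monitoring flags drift and a retrieval failure in the same cycle.
actions = plan_remediation(["behavioural_drift", "retrieval_failure"])
```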

Remediation is explicitly risk-driven and evidence-based, rather than scenario- or judgement-based. Issues are not considered resolved through manual override or subjective sign-off alone. Closure occurs only once corrective actions have been implemented and re-validation confirms that the underlying behavioural risk no longer reproduces within defined thresholds. This approach ensures that remediation strengthens the system in a durable way, prevents recurrence under similar conditions, and maintains a clear audit trail linking findings, actions, and outcomes.

Next steps for institutions

As Generative and Agentic AI systems become embedded in decision-making and reporting, institutions must ensure they behave predictably, within scope, and produce verifiable outputs. Implementing a Model Risk Management framework with a well-calibrated AI validation approach will support the development of more robust, efficient, and reliable AI systems.

Firms that invest in strong governance, accountability, and independent challenge gain clear advantages: better performance, explainability, audit readiness, and fewer incidents. Embedding behavioural testing, HITL oversight, continuous monitoring, and robust remediation mechanisms enables safe, scalable AI adoption and helps future-proof business models.