
SR 11-7 Model Risk Management for Generative AI: A Working Framework

By Basel Ismail · May 5, 2026

SR 11-7 turned fifteen this year. The Federal Reserve and OCC issued it in April 2011 to formalize how banks should think about model risk, and it remains the governing framework for model risk management (MRM) at supervised institutions. It was written with linear regression, VaR models, and credit scorecards in mind. But the language is deliberately model-agnostic, and the Fed has been clear that it expects institutions to apply SR 11-7 to any model that informs a business decision, generates reports, or produces outputs that influence risk-taking.

If your institution is deploying a large language model for anything beyond internal experimentation, SR 11-7 applies. Full stop. The OCC reinforced this in its September 2023 Semiannual Risk Perspective, flagging generative AI as an emerging risk requiring "sound risk management practices consistent with existing guidance." Translation: we are not writing new rules for you; use the ones you already have.

The challenge is that SR 11-7's framework was designed around models with quantifiable inputs, deterministic outputs, and backtestable performance. LLMs have none of those properties in the traditional sense. So how do you document a generative AI system in MRM terms without either trivializing the framework or making it unworkable?

Start With the Definition of "Model"

SR 11-7 defines a model as "a quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories, techniques, and assumptions to process input data into quantitative estimates." The key question for your MRM team is whether an LLM fits this definition. It does, but you need to articulate why for your specific use case.

An LLM used to summarize regulatory filings is processing input data and producing estimates (of what a good summary looks like, essentially). An LLM used to draft customer communications in a lending context is producing outputs that directly affect consumer-facing decisions. An LLM classifying transaction narratives for BSA/AML monitoring is functioning as a component model within a broader quantitative system. Each of these triggers SR 11-7.

Where it gets interesting is the "quantitative estimates" language. Some compliance teams have tried to argue that because LLM outputs are text, not numbers, they fall outside the definition. This argument will not survive an exam. The Fed's 2011 guidance explicitly extends the definition to approaches whose inputs are partially or wholly qualitative or based on expert judgment, and the OCC's 2023 commentary makes clear that the substance of the output matters more than its format.

The Three Pillars, Applied to LLMs

SR 11-7 organizes model risk management around three activities: model development and implementation, model validation, and model use and governance. Here is how each one maps to a generative AI deployment.

1. Model Development, Implementation, and Documentation

For a traditional model, your model development documentation would include the theoretical basis, variable selection rationale, estimation methodology, and limitations. For an LLM, you need analogous documentation even though you did not build the foundation model yourself.

  • Model card or equivalent: Document the foundation model (GPT-4, Claude, Llama, etc.), its version, its known capabilities and limitations, and any fine-tuning or retrieval-augmented generation (RAG) architecture layered on top.
  • Prompt engineering as methodology: Your system prompts, few-shot examples, and prompt templates are your "model specification." Version-control them the same way you would version-control model code. Changes to prompts should go through change management; a minimal sketch of this follows the list.
  • Data lineage: If you are using RAG, document the corpus. What documents feed the retrieval layer? How current are they? Who maintains them? This is your equivalent of training data documentation.
  • Limitations and assumptions: Be explicit about hallucination risk, context window constraints, sensitivity to prompt phrasing, and the model's tendency to produce confident-sounding but incorrect outputs. Examiners will want to see that you understand these failure modes.
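
To make the versioning point concrete, here is a minimal sketch of treating a prompt template as a versioned, hashable artifact. This is an illustration under stated assumptions, not a prescribed schema: the PromptArtifact class, its fields, and the use-case names are hypothetical, and a real implementation would live in your existing change-management tooling. RAG corpus documents could carry similar lineage metadata.

```python
# Illustrative sketch: a prompt template treated as a versioned model artifact.
# The class, fields, and use-case names are hypothetical, not a standard schema.
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class PromptArtifact:
    use_case: str      # the model inventory entry this prompt belongs to
    template: str      # the system prompt / template text itself
    model_id: str      # pin an explicit version, never a floating "latest" alias
    version: str       # bumped through change management, like model code
    approved_by: str   # second-line sign-off recorded with the artifact

    def fingerprint(self) -> str:
        """Content hash, so drift from the approved prompt is detectable."""
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

artifact = PromptArtifact(
    use_case="bsa-aml-narrative-classification",
    template="Classify the transaction narrative into one of: {labels}.",
    model_id="gpt-4-turbo-2024-04-09",
    version="1.3.0",
    approved_by="mrm-validation",
)
print(artifact.version, artifact.fingerprint()[:12])
```

The hash gives monitoring a cheap check that the prompt actually running in production is the one validation approved.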

2. Model Validation

This is where most institutions get stuck. Traditional validation involves backtesting, benchmarking, sensitivity analysis, and outcomes analysis. LLMs do not lend themselves to backtesting in the classical sense, but the principles still apply.

  • Outcomes testing: Define what a "correct" output looks like for your use case and measure against it. If the LLM is classifying transactions, you can compute precision, recall, and F1 scores against human-labeled data (see the sketch after this list). If it is summarizing documents, you need a structured human review process with inter-rater reliability metrics.
  • Sensitivity analysis: Test how outputs change when you rephrase inputs, alter prompt structures, or swap document contexts. LLMs can be surprisingly brittle to minor input variations, and your validation team needs to characterize that instability.
  • Benchmarking: Compare LLM outputs against a challenger approach. This could be a different model, a rules-based system, or human performance. The point is to demonstrate that you evaluated alternatives.
  • Ongoing monitoring: Foundation model providers update their models, sometimes without notice. OpenAI's January 2024 update to GPT-4 Turbo changed output behavior in ways that affected downstream applications. Your monitoring framework needs to detect performance drift even when you have not changed anything on your end.
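
As one concrete example of the outcomes-testing bullet above, the sketch below scores an LLM transaction classifier against human-labeled data using scikit-learn. The sample data and the tolerance are invented for illustration; the real holdout set, metrics, and thresholds belong in the validation report.

```python
# Outcomes-testing sketch for an LLM classification use case.
# Sample data and the tolerance below are illustrative only.
from sklearn.metrics import precision_recall_fscore_support

# Ground truth from human reviewers vs. the LLM's classifications
# on a held-out, human-labeled sample.
human_labels = ["suspicious", "benign", "benign", "suspicious", "benign"]
llm_labels   = ["suspicious", "benign", "suspicious", "suspicious", "benign"]

precision, recall, f1, _ = precision_recall_fscore_support(
    human_labels, llm_labels, pos_label="suspicious", average="binary"
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Ongoing monitoring would recompute these on a schedule and compare
# against the tolerances documented for the model tier.
MIN_RECALL = 0.90  # illustrative tolerance set by the validation team
if recall < MIN_RECALL:
    print("Recall below documented tolerance: escalate per the monitoring plan")
```

Rerunning the same suite after every provider-side model update is what catches the kind of silent behavior change described in the monitoring bullet.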

One practical note: SR 11-7 requires "effective challenge" from a validation team independent of the development team. For LLM deployments, this means your second-line MRM validators need enough technical fluency to actually interrogate the system. If your validation team has never worked with an API-based model or does not understand tokenization and temperature settings, you have a staffing gap that examiners will notice.
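
One small example of that technical fluency: a validator rerunning test cases against an API-based model should pin every setting that affects output variability. The sketch below uses the OpenAI Python SDK as one possible example; the model name is illustrative, and support for seed and system_fingerprint varies by provider and model version.

```python
# Sketch: pinning generation settings for repeatable validation runs.
# Uses the OpenAI Python SDK as one example; parameter support varies
# by provider and model, so treat this as illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-turbo-2024-04-09",  # explicit version, never a floating alias
    temperature=0,                   # minimize sampling variance across reruns
    seed=1234,                       # best-effort determinism where supported
    messages=[
        {"role": "system", "content": "Summarize the filing in three sentences."},
        {"role": "user", "content": "..."},  # a test case from the validation set
    ],
)
print(response.choices[0].message.content)
print(response.system_fingerprint)  # worth logging: changes when the backend changes
```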

3. Governance and Controls

SR 11-7 requires a model inventory, tiered risk ratings, usage controls, and board-level reporting. For LLMs, a few specific governance considerations stand out.

  • Tiering: Not every LLM use case is high risk. An internal tool that helps analysts draft first versions of research summaries is different from an LLM that generates consumer disclosures. Tier accordingly, and allocate validation resources based on the tier.
  • Usage boundaries: Document explicitly what the model is approved to do and, just as importantly, what it is not approved to do. If the LLM is approved for internal summarization but not for customer-facing output, enforce that boundary technically, not just through policy (a sketch of technical enforcement follows this list).
  • Vendor risk: If you are using a third-party API, SR 11-7's expectations do not diminish because you outsourced the model. The OCC's third-party risk management guidance (originally OCC Bulletin 2013-29, superseded in 2023 by the interagency third-party guidance) layers on top. You are responsible for understanding and managing the model risk even if you did not train the model.
  • Audit trail: Log inputs, outputs, and any human overrides. For regulated decisions, you need to be able to reconstruct why a particular output was generated. This is both an SR 11-7 requirement and a practical necessity for fair lending and UDAAP compliance.
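
Putting the usage-boundary and audit-trail bullets together, here is a minimal sketch of enforcing approved uses in code and writing an audit record for every call. The registry, use-case names, and log schema are invented for illustration; a production version would pull approved uses from the model inventory and write to durable, tamper-evident storage.

```python
# Sketch: technical enforcement of approved uses plus an audit trail.
# The registry, use-case names, and log schema are illustrative only.
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("llm.audit")

# Approved uses should mirror the model inventory entry, not tribal knowledge.
APPROVED_USES = {"internal-summarization", "bsa-aml-narrative-classification"}

def call_llm(use_case: str, prompt: str, model_fn) -> str:
    """Route a request to the model only if the use case is approved,
    and write an audit record either way."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "use_case": use_case,
        "prompt": prompt,
    }
    if use_case not in APPROVED_USES:
        record["outcome"] = "blocked"
        audit_log.info(json.dumps(record))
        raise PermissionError(f"Use case {use_case!r} is not approved for this model")
    output = model_fn(prompt)  # the actual model call, injected for the sketch
    record["outcome"] = "completed"
    record["output"] = output
    audit_log.info(json.dumps(record))
    return output

stub_model = lambda p: "stub output"
print(call_llm("internal-summarization", "Summarize the Q3 risk memo.", stub_model))
# call_llm("consumer-disclosure-drafting", "...", stub_model)  # raises PermissionError
```

Blocking in code rather than policy means the boundary holds even when a well-meaning analyst reaches for the tool outside its approved scope.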

What Examiners Are Actually Looking For

Based on recent OCC and Fed examination priorities, the bar is not "did you deploy AI responsibly in some abstract sense." The bar is "can you show me the model inventory entry, the validation report, the ongoing monitoring metrics, and the board reporting for this specific system." Examiners have been trained on SR 11-7 for over a decade. They know what a complete MRM file looks like, and they will expect the same rigor for your LLM deployment as they would for your CECL model or your interest rate risk model.

The institutions getting ahead of this are the ones treating LLM governance as an MRM problem from day one, not retrofitting compliance after deployment. The $400 million penalty the OCC imposed on Citibank in 2020 for deficient risk management and internal controls (unrelated to AI) is a useful reminder that regulators take SR 11-7 seriously and have real enforcement tools.

How FirmAdapt Addresses This

FirmAdapt's architecture was built around the assumption that every AI output in a regulated environment needs to be auditable, version-controlled, and explainable. The platform maintains complete logging of inputs, outputs, prompt versions, and model configurations, which maps directly to SR 11-7's documentation and audit trail requirements. Retrieval sources are tracked and timestamped, giving MRM teams the data lineage documentation they need for validation.

For institutions building their LLM model inventory and validation framework, FirmAdapt provides the infrastructure to generate the artifacts examiners expect: versioned model documentation, ongoing performance metrics, and usage boundary enforcement at the technical layer. The goal is to make SR 11-7 compliance a built-in feature of the deployment rather than a separate workstream that runs in parallel and inevitably falls behind.

Ready to uncover operational inefficiencies and learn how to fix them with AI?
Try FirmAdapt free with 10 analysis credits. No credit card required.