When we first added AI workloads to our regulated infrastructure, the audit conversation was harder than the technical deployment. Auditors had questions we had not anticipated. Some questions we answered well. Some questions exposed gaps in our documentation. A few questions led to remediation projects that took months.
This article documents the questions that came up across multiple audit cycles — PCI DSS, ISO 27001, and regulatory inspections specific to financial services. The patterns generalize beyond banking, but my context is regulated fintech operations.
I am writing this from the auditee side, not the auditor side: I am the person responsible for explaining the environment to auditors, providing evidence, and remediating findings. The perspective matters because what auditors ask and what auditees expect to be asked are often different. Bridging that gap is most of the work.
What follows is structured around the actual questions we received, organized by audit area, with the answers that worked and the documentation that supported them. Names, dates, and specific findings are anonymized. The patterns are real.
Why AI infrastructure triggers audit attention
Before getting to the questions, context on why AI workloads receive elevated audit scrutiny in regulated environments.
Auditors care about predictability and controllability. Traditional enterprise workloads (databases, application servers, VDI) have decades of audit precedent. Auditors know what questions to ask, what evidence looks good, and what findings are acceptable.
AI workloads are different in several ways auditors notice:
- New attack surface: GPU drivers, AI frameworks, model serving infrastructure — all new code paths in production
- Different data flows: Training datasets, model artifacts, inference logs — new data classes with different handling requirements
- Vendor concentration: NVIDIA's CUDA, drivers, frameworks create supply chain dependency
- Compute power: Large GPU clusters are valuable targets and have specific physical security implications
- Output verification: AI inference outputs may affect business decisions, raising integrity questions
- Regulatory uncertainty: AI-specific regulations (EU AI Act, sector-specific guidance) are evolving
Auditors recognize these as new risk surfaces and probe accordingly. The questions get harder when traditional control frameworks don't map cleanly to AI infrastructure.
The good news: most questions can be answered with disciplined documentation and architectural choices. The teams that struggle are usually those that deployed AI without integrating it into existing compliance frameworks.
Pre-deployment: what they asked before we built anything
The first audit conversation happened before any AI hardware was racked. This was an architecture review with our internal compliance team and external auditor representatives.
Question 1: "What is the business case, and what regulated data will be involved?"
This question seems administrative but is critical. It scopes everything that follows.
Our answer: "AI workloads will support fraud detection, customer service automation, and operational efficiency. Training data includes transaction patterns (regulated under PCI DSS), customer communication logs (regulated under privacy laws), and operational telemetry (less sensitive). Production inference will not modify customer-facing data directly — outputs are advisory to existing systems."
What worked: clear separation of data classes upfront. Auditors understood from day one which data flows would touch regulated systems.
What we should have done better: defined "advisory to existing systems" more precisely. We later spent time clarifying what "advisory" means in practice — is the AI output a recommendation a human reviews, or does it trigger automated actions? Different answers have different control implications.
Question 2: "How does AI infrastructure integrate with your existing compliance architecture?"
Auditors wanted to understand whether we were creating a parallel environment or extending existing controls.
Our answer: "AI workloads will run on the same infrastructure platform as banking workloads, with storage policy and network isolation enforcing separation. This extends our existing controls rather than creating parallel ones. Audit logging, access controls, change management, and incident response procedures all apply uniformly."
What worked: making the integration-versus-separation decision explicitly, because it is a binary choice with major audit implications. We chose integration with explicit isolation controls. The alternative (a fully separate AI environment with its own controls) would have been simpler architecturally but more expensive to operate and audit.
What we should have done better: prepared more detailed control mapping. Showing exactly which existing controls applied to AI workloads, with examples, would have shortened the architecture review by weeks.
Question 3: "What is your data classification approach for AI training data?"
This question was harder than expected. Our existing data classification was built around traditional banking data flows. AI training data created new questions.
Our answer evolved over several conversations:
- Training datasets that contain customer transaction data → classified at same level as the source data
- Aggregated/anonymized training data → classified one tier lower than source
- Synthetic training data → classified as internal
- Model artifacts derived from regulated data → classified at the highest tier among their training inputs
- Inference logs → classified based on input data class
What worked: deriving classification rules from data lineage rather than treating "AI data" as a single category. The granularity made handling rules clearer.
What we should have done better: documented these rules formally before AI deployment, not during. We had to retrofit classification labels to existing training datasets, which took meaningful operations time.
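The lineage rules above can be written down as a small function, which also makes them testable. This is a minimal sketch under stated assumptions: the four-tier scheme, the tier names, the `Dataset` fields, and the floor at `INTERNAL` for aggregated data are all illustrative and not our actual classification standard.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    # Hypothetical four-tier scheme; a higher value means more sensitive.
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


@dataclass(frozen=True)
class Dataset:
    name: str
    kind: str  # "raw", "aggregated", or "synthetic"
    source_tier: Tier = Tier.INTERNAL


def classify_dataset(ds: Dataset) -> Tier:
    """Derive a dataset's tier from its lineage."""
    if ds.kind == "raw":
        # Raw training data inherits the tier of its source.
        return ds.source_tier
    if ds.kind == "aggregated":
        # One tier below the source, floored at INTERNAL (assumption).
        return Tier(max(ds.source_tier - 1, Tier.INTERNAL))
    if ds.kind == "synthetic":
        return Tier.INTERNAL
    raise ValueError(f"unknown dataset kind: {ds.kind}")


def classify_model(training_inputs: list[Dataset]) -> Tier:
    """A model artifact takes the highest tier among its training inputs."""
    if not training_inputs:
        raise ValueError("a model needs at least one training input")
    return max(classify_dataset(ds) for ds in training_inputs)


def classify_inference_log(input_tier: Tier) -> Tier:
    """Inference logs follow the class of the data they were computed from."""
    return input_tier
```

Encoding the rules this way means a classification label can be recomputed from lineage metadata instead of being assigned by hand, which is the part we had to retrofit.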
Question 4: "Who has authority to approve AI workload deployments?"
Standard change management question, but with AI-specific implications.
Our answer: "Standard change management applies. AI workload deployments require: technical review (infrastructure team), security review (security team), data review (data governance), and business approval (workload owner). Production deployment requires Change Advisory Board approval."
What worked: AI did not get special expedited paths. Same approval process as other infrastructure changes.
What we should have done better: we initially had a separate "AI approval" track that was faster than standard CAB. This was flagged as a control gap, because giving faster approvals to higher-risk workloads inverts typical practice. We consolidated to standard CAB and accepted the longer deployment timelines.
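A single approval gate with no AI-specific shortcut can be expressed in a few lines. The sketch below is illustrative only: the approval keys and function names are hypothetical, and a real implementation would live inside the change management tool rather than in a script.

```python
# Review types from our change management answer; the string keys are illustrative.
REQUIRED_REVIEWS = frozenset({"technical", "security", "data", "business"})
CAB_APPROVAL = "cab"


def missing_approvals(approvals: set[str], production: bool) -> set[str]:
    """Return the approvals still outstanding for a deployment request."""
    required = set(REQUIRED_REVIEWS)
    if production:
        # Production changes additionally need Change Advisory Board sign-off.
        required.add(CAB_APPROVAL)
    return required - set(approvals)


def may_deploy(approvals: set[str], production: bool) -> bool:
    """A deployment proceeds only when nothing is outstanding."""
    return not missing_approvals(approvals, production)
```

The point of the design is that there is one code path: an AI workload and a database upgrade go through the same function with the same required set.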
Network architecture questions
Network design is where the audit conversation gets technically detailed. Auditors trace data flows and ask about isolation enforcement at each hop.
Question 5: "Show me the network path from a banking transaction to AI inference and back. What boundaries does it cross, and how are they enforced?"
This is the textbook trace-the-flow question. Auditors expect a diagram.
Our diagram showed:
- Banking transaction originates in PCI scope
- Transaction event published to message queue (within PCI scope)
- AI inference service consumes event (within PCI scope, on isolated VLAN)
- Inference output published to separate result queue
- Banking system consumes result, applies business logic
- Audit log captures all steps
Each VLAN transition, each ACL rule, each authentication boundary was documented. Auditors asked specifically about:
- "What prevents the inference service from accessing customer accounts directly?"
- "Is the result queue authenticated, or can any service write to it?"
- "If the inference service is compromised, what can the attacker reach?"
Our answers depended on specific isolation controls being documented and tested. We provided:
- Network configuration showing VLAN definitions
- Firewall rules documenting allowed flows
- Authentication evidence for service-to-service communication
- Privilege analysis showing what AI workload accounts could and could not access
- Penetration test results validating isolation
What worked: comprehensive documentation prepared specifically for this question. We knew it would come, so we had answers ready.
What didn't work initially: our first diagram was at too high a level. Auditors wanted packet-flow detail, not architecture overview. We rebuilt the diagram with much more detail before the next audit.
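One way to prepare for the "what can the attacker reach" question is to compute reachability over the documented allow rules. The sketch below is a simplified illustration: the zone names and flows are hypothetical, and chaining flows over-approximates real exposure because it ignores authentication and which side initiates each connection.

```python
from collections import deque

# Hypothetical allowed flows as (source zone, destination zone) pairs.
ALLOWED_FLOWS = {
    ("banking-app", "event-queue"),
    ("event-queue", "inference"),
    ("inference", "result-queue"),
    ("result-queue", "banking-app"),
    ("banking-app", "accounts-db"),
}


def direct_targets(start, flows):
    """Zones the start zone may talk to in a single hop."""
    return {dst for src, dst in flows if src == start}


def reachable_from(start, flows):
    """Zones reachable from start by chaining allowed flows (breadth-first)."""
    graph = {}
    for src, dst in flows:
        graph.setdefault(src, set()).add(dst)
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

With these example rules, `direct_targets("inference", ALLOWED_FLOWS)` contains only the result queue, while `reachable_from` also reports the accounts database through a multi-hop path. That difference is the conversation auditors want to have: the single-hop answer shows direct access is blocked, and the multi-hop answer lists the pivots that the authentication evidence has to cover.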
Question 6: "How do you prevent AI workloads from accessing the internet for model downloads or framework updates?"
This question surprised us initially. The auditor was concerned about supply chain risk — AI frameworks pulling unverified updates from upstream sources.
Our answer: "AI workloads do not have direct internet access. All container images and model artifacts come from internal registries that mirror external sources after security review. Driver and framework updates follow our patch management process with full validation before production deployment."
The follow-up: "How do you ensure the internal mirror is current with security patches but doesn't pull in unreviewed changes?"
This required documenting our review process for updates: when does an external CVE trigger an internal update cycle, who reviews the changes, how are differences from upstream documented.
What worked: existing supply chain controls extended to AI artifacts. We did not need new processes, just explicit application of existing ones.
What needed work: documentation of the review process. We knew how it worked operationally but had not formalized it in writing. We documented the process formally during the audit cycle.
Question 7: "What about GPU firmware updates? How are those reviewed?"
Most audit teams have well-established processes for OS and application patches. GPU firmware is unfamiliar territory.
Our answer: GPU firmware (vBIOS, NVIDIA driver firmware components) follows the same patch management as server firmware:
- Updates trigger from vendor security advisories
- Test environment validation (minimum 2 weeks)
- Production deployment in maintenance windows
- Rollback procedures documented and tested
- All actions logged in change management system
What worked: applying existing firmware management process to GPU components rather than creating new procedures.
What we learned: GPU firmware updates have some specific quirks (driver version dependencies, container runtime compatibility) that the operations team needs to track. We added a GPU-specific firmware compatibility matrix to our patch management documentation.
Identity and access management questions
IAM is always heavily audited. AI workloads added new categories of users and services to consider.
Question 8: "Who has administrative access to GPU resources, and how is that access controlled?"
The audit team wanted to understand the GPU operations team's privileges.
Our answer required careful documentation:
- GPU infrastructure team has admin access to NVIDIA GPU Operator, DCGM, vGPU configuration
- AI engineering team has user access to provisioned GPU resources via Kubernetes
- Application teams have workload-scoped access to specific GPU pools
- No team has admin access to both GPU infrastructure and the data flowing through it
The principle: separation of duties between platform operators (who run the infrastructure) and workload operators (who use the infrastructure).
Documentation provided:
- Role definitions for each team
- Privilege matrix showing what each role can access
- Quarterly access reviews
- Just-in-time access procedures for elevated privileges
- Privileged access workstation requirements for admin actions
What worked: leveraging existing IAM patterns. We did not invent AI-specific access models. Auditors recognized standard role separation patterns.
What needed work: we had not formalized the GPU operations team's role in our identity management system. Their access was implicit through general infrastructure team membership. We created explicit role definitions during the audit cycle.
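The separation-of-duties rule lends itself to an automated check over the privilege matrix. This is a minimal sketch with assumed names: the roles, privilege strings, and conflict pair are hypothetical stand-ins for whatever the identity management system exports.

```python
# Hypothetical privilege matrix: role -> privileges it grants.
ROLE_PRIVILEGES = {
    "gpu-infrastructure": {"gpu_platform_admin"},
    "ai-engineering": {"gpu_use", "training_data_read"},
    "application-team": {"gpu_pool_use"},
}

# Privilege pairs that no single identity should hold together.
CONFLICTS = [("gpu_platform_admin", "training_data_read")]


def effective_privileges(roles):
    """Union of the privileges granted by every role a user holds."""
    privileges = set()
    for role in roles:
        privileges |= ROLE_PRIVILEGES.get(role, set())
    return privileges


def sod_violations(user_roles):
    """Map each violating user to the conflicting pairs they hold."""
    violations = {}
    for user, roles in user_roles.items():
        held = effective_privileges(roles)
        hits = [pair for pair in CONFLICTS if set(pair) <= held]
        if hits:
            violations[user] = hits
    return violations
```

Running a check like this as part of the quarterly access review turns "no team has both" from a policy statement into evidence.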
Question 9: "How do AI engineers access training data, and is that access logged for compliance review?"
Training data access is a specific audit concern for two reasons: training data may include regulated information, and AI engineers often need broad access patterns that look concerning from a compliance perspective.
Our answer: "AI engineers access training data through a controlled data lake interface. Access is logged at the query level. Datasets that contain regulated data require dataset-level approval before access is granted. Engineers cannot directly access source systems."
The follow-up: "Show me an example of an AI engineer's access request, the approval flow, and the resulting access log."
We provided sanitized examples of:
- Initial access request specifying the dataset and business purpose
- Data governance review of the request
- Approval workflow with timestamps and approvers
- Access provisioning notification
- First-day access logs showing the engineer using the access as approved
What worked: end-to-end paper trail for every access grant. Auditors could verify the process worked as documented.
What needed work: we had access logs but had not built a workflow for the compliance team to review them periodically. Quarterly review now happens with documented evidence.
Question 10: "What happens to AI engineer access when they change roles or leave?"
Standard offboarding question with AI-specific implications.
Our answer: "Standard role change and termination procedures apply. AI-specific resources (model registry access, GPU cluster access, training data access) are integrated into our centralized identity management system. Access is removed automatically when the underlying role changes."
Auditors verified by sampling: pick a random terminated employee from the prior year, verify all AI-related accesses were removed within standard SLA.
What worked: centralized identity management. AI resources did not have independent access systems that could be missed during offboarding.
What needed work: training data access via temporary data shares was originally managed in a different system. Some shares persisted past role changes. We consolidated to a single access management system during the audit cycle.
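The sampling check auditors performed can also be run continuously across all grants, including temporary data shares. The sketch below assumes a simple export format and a 24-hour revocation SLA; both are hypothetical, since the actual SLA and record shape depend on the identity platform.

```python
from datetime import datetime, timedelta


def lingering_grants(terminations, grants, now, sla=timedelta(hours=24)):
    """Return (user, resource) pairs not revoked within the SLA.

    terminations: dict of user -> termination datetime
    grants: iterable of dicts with keys "user", "resource", "revoked_at"
            (revoked_at is None while the grant is still active)
    """
    findings = []
    for grant in grants:
        terminated_at = terminations.get(grant["user"])
        if terminated_at is None:
            continue  # user has not been terminated; out of scope here
        deadline = terminated_at + sla
        revoked_at = grant["revoked_at"]
        if revoked_at is None:
            late = now > deadline  # still active past the deadline
        else:
            late = revoked_at > deadline  # revoked, but too late
        if late:
            findings.append((grant["user"], grant["resource"]))
    return findings
```

A check like this only works if every access system feeds the same grant list, which is why consolidating the temporary data shares mattered.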
Data protection questions
Data protection questions cut across encryption, retention, and lifecycle management.
Question 11: "How is training data encrypted at rest, and how is the encryption key managed?"
Standard encryption question, but with multiple layers in AI infrastructure.
Our answer covered:
- Training data on vSAN ESA uses storage-level encryption with per-policy keys
- Keys managed via external HSM with documented access controls
- Backup data encrypted independently with separate keys
- Key rotation annually, with rotation events logged
The follow-up: "Show me the key inventory. For each key, who has access and what is logged when that key is used."
This required pulling reports from our HSM. Sanitized examples showed:
- Key name, creation date, last rotation date, next expected rotation
- Roles authorized to use the key
- Sample audit log showing key usage
- Procedures for emergency key revocation
What worked: HSM-managed keys with comprehensive logging. Auditors could trace any encryption operation back to authorized usage.
What needed work: documentation of key lifecycle decisions. We rotated keys annually but had not documented why annual was the right cadence for our risk profile. We added formal key management policy documentation.
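An overdue-rotation report over the key inventory is straightforward to generate. This is a sketch only: it assumes the inventory has been exported as key name to last rotation date, and it does not call any HSM API. The key names in the usage are hypothetical.

```python
from datetime import date, timedelta

ROTATION_PERIOD = timedelta(days=365)  # annual cadence from our policy


def next_rotation(last_rotated: date) -> date:
    """Date by which the key must be rotated again."""
    return last_rotated + ROTATION_PERIOD


def overdue_keys(inventory: dict[str, date], today: date) -> list[str]:
    """Names of keys whose next rotation date has passed, sorted."""
    return sorted(
        name
        for name, last_rotated in inventory.items()
        if today > next_rotation(last_rotated)
    )
```

For example, calling `overdue_keys({"training-data-key": date(2023, 1, 10)}, date(2024, 2, 1))` flags that key, because its rotation was due in January 2024.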
Question 12: "How are model artifacts protected? Models trained on regulated data have business value and may also contain training data fingerprints."
This question opened a complex conversation about model security.
Our answer: "Model artifacts are stored in encrypted artifact registries. Access to download models is logged and requires approval for production models. We classify models trained on regulated data at the highest level of training input."
The auditor asked: "How do you prevent extraction attacks, where an attacker queries the inference API enough times to reconstruct the model or recover information about its training data?"
This was a question we had thought about but not formally documented. Our answer:
- Rate limiting on inference APIs
- Query pattern monitoring (looking for systematic exploration)
- Differential privacy techniques applied to models trained on highly sensitive data
- Output minimization (returning only what is needed, not full probability distributions)
The auditor accepted this as reasonable mitigation, but flagged a finding for us to formalize a model security policy.
What worked: we had implemented technical controls correctly.
What needed work: we lacked formal policy documentation for AI-specific security concerns. We wrote the policy during the audit response cycle.
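Two of the mitigations listed above, rate limiting and output minimization, are simple enough to sketch. This is an illustrative in-memory version, not our production implementation: the class and function names are hypothetical, and a real deployment would typically enforce limits at the API gateway with shared state.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> request timestamps

    def allow(self, client_id: str, now: float) -> bool:
        hits = self._hits[client_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


def minimize_output(probabilities: dict[str, float]) -> dict[str, str]:
    """Return only the top label, withholding the full distribution."""
    label = max(probabilities, key=probabilities.get)
    return {"label": label}
```

Withholding the probability vector reduces the information each query leaks, and the limiter raises the cost of systematic exploration. Neither is sufficient alone, which is why they sit alongside query pattern monitoring.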
Question 13: "What is your retention policy for AI training data, model artifacts, and inference logs?"
Retention requirements cross multiple regulations. The audit team wanted explicit policies.
Our retention policy by category:
- Raw training datasets: retained per data class (transaction data: 7 years per regulatory requirement, customer service logs: 2 years per privacy policy)
- Preprocessed/aggregated training data: retained 18 months after model retirement
- Production model artifacts: retained for the operational life of the model plus 12 months
- Test/experimental models: retained 90 days after experiment closure
- Inference logs: retained per the input data class
- Model metrics and performance data: retained 5 years
Documentation: explicit retention policy with rationale for each timeframe, integration with automated lifecycle management.
What worked: explicit categorization. Auditors could trace each data class to a specific retention policy.
What needed work: lifecycle automation was incomplete when first audited. Some test models persisted longer than 90 days because automation didn't catch them. We fixed the automation gap.
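The lifecycle rule that test models missed can be expressed as a scheduled sweep. The sketch below is a simplification under stated assumptions: the category keys are hypothetical, month-based periods are approximated in days, and only the categories with a fixed window after an anchor event are covered.

```python
from datetime import date, timedelta

# Retention windows in days, measured from each category's anchor event.
# Month-based periods are approximated in days (assumption).
RETENTION_DAYS = {
    "test_model": 90,                   # after experiment closure
    "production_model": 365,            # about 12 months after operational life ends
    "preprocessed_training_data": 548,  # about 18 months after model retirement
}


def expiry_date(category: str, anchor: date) -> date:
    """Date on which an artifact of this category becomes deletable."""
    return anchor + timedelta(days=RETENTION_DAYS[category])


def due_for_deletion(artifacts, today: date):
    """artifacts: iterable of (name, category, anchor_date) tuples."""
    return [
        name
        for name, category, anchor in artifacts
        if today >= expiry_date(category, anchor)
    ]
```

The gap we hit was not in the rule but in the inventory: artifacts that never got a category or an anchor date were invisible to the sweep, so the input list needs its own completeness check.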
Question 14: "Can you demonstrate that AI workloads cannot access data they should not access?"
This is the integrity question. Auditors want positive proof of isolation, not just policy documentation.
Our answer: "We perform isolation testing quarterly. Test workloads attempt to access prohibited data and verify access is denied at multiple layers."
We provided:
- Test plan documentation
- Quarterly test execution evidence
- Test result summary showing all access attempts blocked
- Specific examples of layered controls preventing access
What worked: regular automated testing. Auditors could see the test was actually run and saw the results.
What needed work: test coverage was uneven across data categories. We expanded test cases to cover all data classes systematically.
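An isolation test is a negative test: it passes when access is denied. A minimal harness might look like the following. Everything here is illustrative: `fake_read` stands in for a real access attempt made under a workload identity, and the identity and dataset names are hypothetical.

```python
from typing import Callable


def run_isolation_probes(probes: dict[str, Callable[[], object]]) -> dict[str, str]:
    """Run each probe; a probe passes only if access is explicitly denied."""
    results = {}
    for name, probe in probes.items():
        try:
            probe()
        except PermissionError:
            results[name] = "blocked"
        except Exception as exc:
            # An unexpected error is not proof of isolation; report it separately.
            results[name] = f"error: {type(exc).__name__}"
        else:
            results[name] = "ALLOWED"
    return results


def failed_probes(results: dict[str, str]) -> list[str]:
    """Probe names that did not end in an explicit denial."""
    return sorted(name for name, outcome in results.items() if outcome != "blocked")


def fake_read(identity: str, dataset: str) -> str:
    """Stand-in for a real access attempt (assumption for this sketch)."""
    allowed = {("ai-workload", "training-lake")}
    if (identity, dataset) not in allowed:
        raise PermissionError(f"{identity} denied on {dataset}")
    return "ok"
```

Treating timeouts and other errors as failures rather than passes matters for evidence quality: a probe that could not connect has not demonstrated that the control works.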
Operational controls
Operational questions focus on day-to-day management of AI infrastructure.
Question 15: "How do you monitor AI infrastructure for security events?"
This question is about detection, not prevention.
Our answer:
- DCGM integration with SIEM for GPU-specific events
- Standard infrastructure monitoring (vCenter, OneView) integrated with SIEM
- Network flow monitoring for unusual patterns
- Audit log aggregation across all AI-relevant systems
- Defined alert rules for security-relevant events
The auditor asked for examples of alerts: "What would trigger a security alert, and what is the response procedure?"
We provided:
- Alert rules table (with severity, condition, response)
- Sample security incidents from the past 12 months
- Response time evidence (mean time to acknowledge, mean time to resolve)
- Postmortem documents for non-trivial incidents
What worked: existing monitoring was extended to cover AI infrastructure rather than bolted on as a separate stack. Auditors saw integrated visibility.
What needed work: some AI-specific events (model serving anomalies, training data drift) were not in the original alert rules. We expanded coverage during the audit.
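An alert rules table with severity, condition, and response maps naturally onto a small data structure. The sketch below is a hedged illustration: the rule names, event field names, and responses are hypothetical and do not reflect any specific SIEM's rule syntax.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AlertRule:
    name: str
    severity: str
    condition: Callable[[dict], bool]
    response: str


# Example rules; event field names are assumptions for this sketch.
RULES = [
    AlertRule(
        name="gpu-xid-error",
        severity="high",
        condition=lambda e: e.get("source") == "dcgm" and e.get("xid") is not None,
        response="Open incident and drain the affected node",
    ),
    AlertRule(
        name="inference-egress-to-internet",
        severity="critical",
        condition=lambda e: (
            e.get("source") == "netflow"
            and e.get("src_zone") == "inference"
            and e.get("dest_zone") == "internet"
        ),
        response="Page on-call and isolate the workload",
    ),
]


def match_rules(event: dict, rules=RULES) -> list[AlertRule]:
    """Return every rule whose condition matches the event."""
    return [rule for rule in rules if rule.condition(event)]
```

Keeping the rules as data makes the table we hand to auditors and the rules that actually run the same artifact.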
Question 16: "What is your incident response procedure if AI infrastructure is compromised?"
This question targets incident response specific to AI workloads.
Our answer integrated AI scenarios into existing incident response playbooks:
- AI workload compromise → standard malicious code response
- Training data exfiltration suspected → data breach response with AI-specific evidence collection
- Model integrity concerns → model rollback procedure plus investigation
- GPU/NVAIE licensing alert → vendor coordination plus operational continuity
Documentation provided:
- Updated IR playbook including AI scenarios
- Tabletop exercise results testing AI-related scenarios
- Coordination procedures with NVIDIA and OEM support
- Communication plans for AI-specific incidents
What worked: integration with existing IR rather than parallel procedures.
What needed work: tabletop exercises had not specifically tested AI scenarios. We ran two new tabletops during the audit response cycle.
Question 17: "How do you handle vulnerability management for NVIDIA software and GPU firmware?"
This question is about staying current with security updates.
Our answer:
- NVIDIA security advisory subscription
- CVE tracking for NVIDIA components
- Standard patch management workflow with AI-specific compatibility validation
- Emergency patch procedures for critical CVEs
The auditor asked: "What is your patch SLA for AI infrastructure compared to traditional infrastructure?"
We provided:
- Patch SLA: Critical (7 days), High (30 days), Medium (90 days), Low (next maintenance window)
- Evidence of patches applied within SLA in the audit period
- Exceptions documented with risk acceptance from appropriate authority
What worked: same SLA as other infrastructure, no AI-specific exceptions.
What needed work: NVIDIA driver compatibility sometimes blocked us from applying patches immediately. We needed clearer escalation procedures when compatibility issues delayed patching. We documented escalation paths.
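Patch SLA evidence reduces to a date comparison per advisory. This is a minimal sketch using the SLA values stated above; the function names are hypothetical, and it assumes the clock starts on the advisory's publication date.

```python
from datetime import date, timedelta
from typing import Optional

# SLA in days by severity; "low" has no fixed deadline (next maintenance window).
SLA_DAYS = {"critical": 7, "high": 30, "medium": 90}


def patch_due(severity: str, published: date) -> Optional[date]:
    """Deadline for applying a patch, or None when no fixed deadline applies."""
    level = severity.lower()
    if level == "low":
        return None
    if level not in SLA_DAYS:
        raise ValueError(f"unknown severity: {severity}")
    return published + timedelta(days=SLA_DAYS[level])


def within_sla(severity: str, published: date, applied: date) -> bool:
    """True when the patch was applied on or before its deadline."""
    due = patch_due(severity, published)
    return due is None or applied <= due
```

Any patch where `within_sla` returns false should have a matching documented exception. When a driver compatibility issue is the cause, that exception record is where the escalation path gets referenced.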
Vendor and third-party risk
AI infrastructure introduces vendor dependencies that auditors want to understand.
Question 18: "What is your vendor risk assessment for NVIDIA?"
NVIDIA is essentially unavoidable for AI infrastructure. The question is about managing that dependency.
Our answer:
- Standard vendor risk assessment performed annually
- Vendor SOC 2 reports reviewed
- Contractual provisions for data protection, audit rights, breach notification
- Operational dependency mapping (what would happen if NVIDIA services were unavailable)
- Alternative supplier evaluation (limited but documented)
The auditor asked: "What is your business continuity plan if NVIDIA licensing services are unavailable?"
We documented:
- NVIDIA License System (NLS): 7-day grace period for cached licenses
- Local NLS deployment reduces dependency on internet connectivity
- Documented degraded mode procedures
- Communication plan for extended outages
What worked: explicit dependency analysis with documented mitigation.
What needed work: alternative supplier evaluation was thin. We added more detail on what GPU alternatives would entail operationally (AMD MI300X, Intel Gaudi, ASIC alternatives).
Question 19: "How are AI framework components reviewed before deployment?"
This question is about open-source supply chain.
Our answer: AI frameworks (PyTorch, TensorFlow, vLLM, etc.) go through our standard open-source software review:
- Dependency scanning for known CVEs
- License compatibility review
- Code provenance verification where possible
- Container image scanning for production images
- Internal mirror with controlled updates
The auditor probed: "How do you handle the case where a framework has a critical CVE but no patched version is available?"
Our procedure:
- Immediate risk assessment of the CVE in our specific deployment
- Compensating controls (network restrictions, monitoring) if remediation is delayed
- Risk acceptance documentation with appropriate approval
- Tracking for eventual patching
What worked: applying existing OSS review processes to AI frameworks.
What needed work: AI-specific framework velocity (releases every few weeks for some components) strained our review process. We added a fast-track review for AI frameworks with reduced approval cycles for incremental updates.
Findings and remediation
Across multiple audit cycles, the findings we received clustered around predictable patterns. I share them here because they may help others avoid similar issues.
Common finding 1: Documentation gaps
Most frequent finding category. We had implemented controls correctly but had not formally documented them.
Pattern: technical control exists → operationally working → not in written policy
Remediation: documentation projects to formalize existing practices.
Lesson: write documentation before deployment, not during audit response. The work is similar but the timeline is calmer.
Common finding 2: Policy gaps for new categories
When AI workloads introduced new data categories or new operational patterns, existing policies sometimes didn't apply cleanly.
Pattern: existing policy doesn't address AI-specific scenario → operational practice fills the gap → policy formalization happens after the fact
Remediation: policy updates to explicitly address AI categories.
Lesson: review existing policies for AI applicability before deployment, not after.
Common finding 3: Test coverage incomplete
Isolation testing, access reviews, and other regular validations sometimes had gaps in AI coverage.
Pattern: existing test coverage doesn't include AI-specific scenarios → audit identifies gap
Remediation: expand test coverage to include AI workloads.
Lesson: when adding new workload classes, expand test plans before the audit cycle.
Common finding 4: Automation gaps
Manual processes that worked operationally sometimes failed audit because they relied on individual diligence rather than systematic enforcement.
Pattern: process worked when operations team remembered → audit sample found cases where it didn't
Remediation: automation for processes that needed to scale.
Lesson: anything that requires "remember to do X" eventually fails. Automate or formalize escalation.
Finding I am proud of
Across multiple audit cycles, we received zero high-severity findings related to data protection. Our isolation controls held up under audit scrutiny because we designed them as primary architectural decisions, not afterthoughts.
This is not luck — it is investment in correct architecture upfront. The teams that struggle on audit are usually the teams that bolted security onto deployed infrastructure rather than designing it in.
What I would recommend to others starting this journey
For infrastructure operators preparing for AI workload deployment in regulated environments:
1. Engage compliance early
Bring the compliance team into the AI deployment conversation before you finalize architecture. Their requirements shape architecture, not the other way around.
We learned this lesson in the wrong order. Architecture review happened after preliminary design. Some design choices had to be reworked when compliance requirements became clearer. Engaging earlier would have saved rework.
2. Map existing controls to AI scenarios
Before assuming you need new AI-specific controls, map existing controls to AI scenarios. Most controls apply with minor adjustments. New controls add complexity without necessarily adding security.
Our approach: take each control from our existing control framework and ask "does this apply to AI workloads, and if so, how does it need adjustment?" This exercise produced cleaner audit outcomes than starting from a separate "AI-specific controls" framework.
3. Document the data lineage exhaustively
Audit conversations always come back to data flows. Invest in clear, current, detailed data flow documentation before deployment.
Our documentation included: source systems, processing steps, storage locations, access patterns, downstream consumers, retention rules. For every AI workflow.
This documentation answered most audit questions before they were asked.
4. Build test cases for isolation enforcement
Don't wait for the audit to test isolation. Build regular automated test cases that verify AI workloads can only access what they should access.
Quarterly testing with documented evidence solves a class of audit conversations efficiently.
5. Plan for findings even with good preparation
Even well-prepared teams receive findings. They are usually documentation gaps or test coverage gaps rather than fundamental control failures. Plan time for findings response in your AI deployment timeline.
We budget 4-6 weeks of post-audit remediation work for every major audit cycle. Not all findings are AI-related, but AI workloads typically generate some portion of findings during initial audit cycles.
6. Build relationships with auditors
The audit conversation works better when auditors trust the auditee team. Trust builds over time through consistent honest communication.
We invest in audit relationships proactively: explain new initiatives before they are deployed, share documentation in advance, respond to questions transparently. The investment pays back in smoother audit cycles.
What I would do differently
Looking back at our AI deployment audit experience:
1. Build compliance documentation in parallel with architecture
We treated compliance documentation as something that happened after deployment was complete. This was wrong. Documenting retrospectively was 3-4 times harder than documenting concurrently with architecture decisions would have been.
Recommendation: write the audit response document as you design the system. The questions are predictable. Having answers prepared during design forces better design decisions.
2. Engage external audit support earlier
We engaged external audit consultants late in the deployment cycle. They identified concerns we had not anticipated. Earlier engagement would have prevented some architectural rework.
Recommendation: budget for external audit consultation in the early design phase, not just before formal audit.
3. Train the internal audit team on AI infrastructure
Our internal audit team's first exposure to AI infrastructure was during the actual audit. They were learning while auditing. This was awkward for both sides.
Recommendation: brief the internal audit team on AI infrastructure plans during the architecture phase. Familiarity reduces audit friction.
4. Build control automation more systematically
Some controls worked manually but did not scale. We retrofitted automation under audit pressure.
Recommendation: design for automated enforcement of controls, not manual diligence. Manual controls fail audits eventually.
5. Maintain an AI-specific risk register from day one
We only started maintaining an AI-specific risk register in year two of operations. Year-one risks were tracked in general risk management. A dedicated AI risk register from the start would have made some audit conversations easier.
Recommendation: maintain an explicit AI-specific risk register from day one of AI deployment.
Closing notes
AI infrastructure in regulated environments is operationally feasible but requires deliberate compliance engineering. The audit questions are predictable enough that prepared teams handle them effectively. The teams that struggle are those that deployed AI first and worried about compliance second.
The questions documented here are not exhaustive. Every audit cycle brings new questions, especially as regulations evolve (EU AI Act provisions taking effect, sector-specific AI guidance maturing, financial regulators issuing AI-specific guidance). The pattern is that auditors learn what to ask about AI, and the question set expands.
The investment in compliance documentation, control mapping, isolation testing, and audit relationships pays back across multiple audit cycles. The teams that build this discipline operate AI workloads in regulated environments confidently. The teams that don't end up either constraining their AI deployments significantly or accepting higher audit risk than is comfortable.
For my own team, the cycle of audit questions has gotten easier over time. The first cycle was hard — lots of new ground, many follow-up questions, several findings. The second cycle was easier — we had documentation prepared, processes formalized, controls automated. The third cycle felt routine. The infrastructure didn't change much, but our ability to explain it to auditors got much better.
Future articles will cover the specific audit evidence preparation patterns we use (templates, automation, lifecycle), the change management workflows for AI infrastructure that satisfy compliance frameworks, and the operational metrics that compliance teams find most useful. Subscribe to follow along.
Notes from operating AI infrastructure under regulatory frameworks. Audit questions and patterns documented here reflect multiple audit cycles across PCI DSS, ISO 27001, and regulatory inspections. Specific findings, dates, and organizational details are anonymized. The patterns are real and reflect what auditors typically ask. Your specific audit framework, regulatory context, and organizational culture will produce different specifics; the general patterns should generalize. I am an architect and auditee, not a certified auditor — this is operator perspective on the audit relationship, not audit guidance.