Security and privacy of customer workloads is an essential design tenet for Oracle Cloud Infrastructure (OCI) Generative AI. With data breaches and AI security vulnerabilities regularly in the headlines, OCI has built a layered security framework intended to give AI workloads enterprise-grade protection. This deep dive explores that multi-layered approach, from GPU isolation through key management to monitoring.
The Foundation: GPU Isolation and Dedicated Infrastructure
Dedicated GPU Allocation
GPUs allocated for a customer's generative AI tasks are pooled within a dedicated RDMA network, ensuring they are exclusively allocated to a single customer and not shared with others. This fundamental isolation guarantees that customer data remains secure and inaccessible to unauthorized parties.
Key Isolation Characteristics:
Physical Separation: Each customer's GPU cluster operates on dedicated hardware, preventing any cross-contamination or data leakage between tenants.
Network Isolation: RDMA cluster networks with less than 10 microseconds of latency provide high-bandwidth connections (1.6 TB/sec internode bandwidth) while remaining isolated from other customers' workloads.
Compute Isolation: OCI Supercluster can deploy up to 32,768 GPUs per cluster, all dedicated to a single customer with no sharing of computational resources.
RDMA Network Architecture
Remote Direct Memory Access (RDMA) is a critical component of OCI's AI infrastructure, providing both performance and security benefits.
RDMA Technology Overview:
RDMA allows for low-latency connections between nodes and access to GPU memory without involving the CPU. This technology enables:
- Ultra-Low Latency: Less than 10 microseconds between nodes
- High Bandwidth: 1.6 TB/sec internode bandwidth for massive data transfers
- CPU Offload: Direct memory-to-memory transfers without CPU intervention
- Security: Dedicated network fabric preventing cross-tenant access
Network Types:
OCI's cluster network uses RDMA over Converged Ethernet Version 2 (RoCE v2) on top of NVIDIA ConnectX-7 network interface cards (NICs) to support high-throughput and latency-sensitive workloads.
For advanced deployments, OCI utilizes NVIDIA Quantum InfiniBand or NVIDIA Spectrum-X Ethernet with RDMA over Converged Ethernet (RoCE) configuration.
Scale and Performance:
OCI enables customers to cluster up to 4,096 bare metal nodes, each with 8 GPUs, for a total of up to 32,768 GPUs. The latest infrastructure scales further: the zettascale OCI Supercluster scales up to 131,072 GPUs, which Oracle describes as the largest AI supercomputer in the cloud.
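The cluster sizes quoted above can be checked with simple arithmetic. The short Python snippet below only uses the figures stated in this section; the variable names are illustrative.

```python
# Figures quoted in the text above.
nodes = 4096
gpus_per_node = 8
zettascale_gpus = 131_072

# 4,096 nodes with 8 GPUs each.
cluster_gpus = nodes * gpus_per_node

# How much larger the zettascale figure is than the 32,768-GPU cluster.
scale_factor = zettascale_gpus / cluster_gpus

print(cluster_gpus)   # 32768
print(scale_factor)   # 4.0
```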
Hardware Security Modules (HSM)
Keys are stored on Hardware Security Modules (HSMs) that meet the Federal Information Processing Standards (FIPS) 140-2 Security Level 3 certification.
HSM Protection Modes:
Software Protection: Keys stored and processed on servers, recommended for most use cases with strong encryption.
HSM Protection: Keys stored on HSMs meeting FIPS 140-2 Security Level 3 certification, recommended for stringent compliance requirements like financial services and healthcare.
Model Endpoint Security
Single-Customer Model Endpoints
For strong data privacy and security, a dedicated GPU cluster handles fine-tuned models for a single customer only. This architectural decision ensures:
Data Isolation: No model from one customer processes data from another customer
Performance Isolation: Resources allocated exclusively for your workloads
Compliance: Easier to meet regulatory requirements with clear tenant boundaries
Security Boundaries: Physical and logical separation of processing environments
Endpoint Access Control
Model endpoints implement multiple layers of access control:
Authentication: API keys, OAuth tokens, or IAM credentials required for all requests
Authorization: Role-Based Access Control (RBAC) determines what actions authenticated users can perform
Network Security: Private endpoints within Virtual Cloud Networks (VCNs), with optional public access through controlled ingress
Rate Limiting: Configurable throttling to prevent abuse and ensure fair resource allocation
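To make the rate-limiting layer concrete, here is a minimal token-bucket sketch in Python. It illustrates the general throttling technique, not how OCI implements it; the class name `TokenBucket`, its parameters, and the injectable `clock` are all hypothetical choices made for this example.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, capacity, rate, clock=time.monotonic):
        self.capacity = float(capacity)
        self.rate = float(rate)
        self._clock = clock          # injectable so the limiter can be tested
        self._tokens = self.capacity  # start with a full bucket
        self._last = clock()

    def allow(self):
        """Consume one token and return True if the request is within the limit."""
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


# Example: a burst of 2 requests, refilled at 1 request per second.
fake_time = [0.0]
bucket = TokenBucket(capacity=2, rate=1.0, clock=lambda: fake_time[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]  # third call is throttled
fake_time[0] = 1.0                                        # one second later
after_refill = bucket.allow()
```

A token bucket tolerates short bursts while capping the sustained request rate, which is why it is a common choice for per-client throttling.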
Customer Data and Model Isolation
Tenancy-Level Isolation
Customer data access is restricted within a customer's tenancy so that one customer's data cannot be seen by another customer. This fundamental principle manifests across multiple dimensions:
Compartment Isolation:
OCI uses compartments to organize and isolate resources. The architecture implements compartmentalization and private subnets to isolate different operational environments.
Best Practice: Create separate compartments for:
- Development environments
- Staging/testing
- Production workloads
- Different business units or projects
Data Separation:
Training data, fine-tuned model weights, embeddings, and inference logs are all stored within customer-controlled storage with strict access controls preventing cross-tenant access.
Workload Isolation:
A single NVIDIA GB200 NVL72 rack can be configured to launch multiple NVLink groupings for smaller workloads and provide strong isolation for efficient workload distribution.
Network-Level Isolation
Private Subnets:
Resources are deployed in private subnets within VCNs, preventing direct internet access unless explicitly configured.
Security Lists and Network Security Groups (NSGs):
Virtual firewall rules, stateful or stateless, that control inbound and outbound traffic at the subnet level (security lists) and the network interface level (NSGs).
Service Gateway and NAT Gateway:
Controlled access to Oracle services and internet resources without exposing instances to inbound internet traffic.
Integrated OCI Security Services
OCI Generative AI leverages Oracle's comprehensive security ecosystem to provide defense-in-depth.
OCI Identity and Access Management (IAM)
IAM provides the foundation for authentication and authorization across all OCI services.
Key IAM Capabilities:
User and Group Management: Create users, organize them into groups, and assign permissions based on job functions.
Policies: Stringent access controls using OCI IAM policies enforce least-privilege access.
Example IAM Policy for Generative AI:
Allow group GenAI-Engineers to manage generative-ai-family in compartment GenAI-Prod
Allow group GenAI-Engineers to use virtual-network-family in compartment GenAI-Networking
Allow group Data-Scientists to read generative-ai-endpoint in compartment GenAI-Prod
Identity Providers: Integration with corporate identity systems, using SAML for single sign-on and SCIM for user provisioning.
Multi-Factor Authentication (MFA): Additional security layer requiring multiple forms of verification.
OCI Key Management Service (Vault)
The Key Management Service is critical for protecting data at rest and managing encryption keys.
Vault Architecture:
An OCI vault is a logical grouping of keys. There are two types of vaults, virtual private vaults and the default virtual vaults, which differ in isolation, pricing, and compute.
Key Features:
Centralized Key Management: OCI Key Management provides centralized management of the encryption of your data.
Customer-Managed Keys (CMK): Customers create and control their own encryption keys, maintaining full ownership.
Oracle-Managed Keys: Default encryption with Oracle-managed keys that are automatically rotated.
Key Rotation: Each master encryption key (MEK) is automatically assigned a key version. When you rotate a key, the Vault service generates a new key version.
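The key-version behavior described above can be pictured with a small in-memory model. This is a toy sketch, not the Vault API: `MasterKey`, `rotate`, and `material` are hypothetical names, and keeping older versions available for reading existing data is an assumption of this sketch.

```python
import secrets
from dataclasses import dataclass, field
from typing import List


@dataclass
class MasterKey:
    """Toy stand-in for a master encryption key that tracks its versions."""

    name: str
    versions: List[bytes] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.versions:
            self.rotate()  # a new key starts with its first version

    @property
    def current_version(self) -> int:
        """Version numbers are 1-based; the newest version is the current one."""
        return len(self.versions)

    def rotate(self) -> int:
        """Generate fresh 256-bit key material and return the new version number."""
        self.versions.append(secrets.token_bytes(32))
        return self.current_version

    def material(self, version: int) -> bytes:
        """Look up key material by version (assumption: old versions are retained)."""
        return self.versions[version - 1]


key = MasterKey("genai-models")
first_material = key.material(1)
new_version = key.rotate()  # rotation adds version 2; version 1 stays readable
```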
Integration with Generative AI:
OCI Generative AI uses Key Management Service to:
- Store Base Model Keys Securely: Encryption keys for foundational models are managed through Vault
- Encrypt Fine-Tuned Models: Custom model weights encrypted with customer-controlled keys
- Protect Training Data: Datasets used for fine-tuning encrypted before storage
- Secure Inference Data: Input and output data encrypted in transit and at rest
OCI Object Storage Security
Object Storage buckets store critical AI assets including training data, model weights, and embeddings.
Default Encryption:
OCI encrypts all objects by default with Oracle-managed keys, which are periodically rotated. This ensures data is protected even if customers don't explicitly configure encryption.
Customer-Managed Encryption:
OCI also lets customers use their own keys for tighter control. A best practice is to create encryption keys in the Vault service, rotate them periodically, and use IAM policies to restrict who can use them to protect resources in Object Storage.
Encryption at Multiple Layers:
Object-Level Encryption: Each object encrypted with a unique Data Encryption Key (DEK)
Bucket-Level Encryption: Entire buckets can be encrypted with a specific Master Encryption Key (MEK)
Client-Side Encryption: Customers can encrypt data before sending it to Oracle Object Storage for additional control
Required IAM Policies:
To use customer-managed keys with Object Storage, create a policy like:
allow service objectstorage-<region_name> to use keys in compartment <compartment_name>
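When the same statement is needed for several regions or compartments, it can help to generate it from a template. The helper below is a minimal sketch; the function name is hypothetical, and the region and compartment values are examples only.

```python
def object_storage_key_policy(region_name: str, compartment_name: str) -> str:
    """Fill in the policy template shown above for one region and compartment."""
    region = region_name.strip()
    compartment = compartment_name.strip()
    if not region or not compartment:
        raise ValueError("region_name and compartment_name must be non-empty")
    return (
        f"allow service objectstorage-{region} "
        f"to use keys in compartment {compartment}"
    )


# Example values; substitute your own region identifier and compartment name.
statement = object_storage_key_policy("us-ashburn-1", "GenAI-Prod")
print(statement)
```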
Access Control:
Pre-Authenticated Requests (PARs): Time-bound access to objects without requiring IAM credentials
Bucket Visibility: Private by default, with explicit configuration required for public access
Versioning: Track and restore previous versions of objects
Security Best Practices:
Restrict BUCKET_UPDATE permission to a minimal set of IAM groups to minimize the possibility of existing buckets being made public inadvertently or maliciously.
Cloud Guard: Active Threat Detection
Cloud Guard continuously monitors your OCI environment for security weaknesses and threats.
Alongside it, comprehensive logging and audit trails are maintained for all significant operations.
Detector Rules for Object Storage:
Cloud Guard includes detector rules for Object Storage to identify public buckets, unencrypted data, and other security issues.
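The kind of check such detector rules perform can be sketched locally. The Python function below is an illustration only, not Cloud Guard's logic. It assumes bucket metadata is available as plain dictionaries; the field names `public_access_type` and `kms_key_id`, and the value `NoPublicAccess`, are assumptions modeled on the Object Storage bucket attributes.

```python
from typing import Dict, List, Optional


def audit_buckets(buckets: List[Dict[str, Optional[str]]]) -> List[str]:
    """Return human-readable findings for public buckets and buckets without a customer key."""
    findings = []
    for bucket in buckets:
        name = bucket.get("name") or "<unnamed>"
        # Buckets are private by default, so a missing value is treated as private.
        access = bucket.get("public_access_type") or "NoPublicAccess"
        if access != "NoPublicAccess":
            findings.append(f"{name}: bucket is public ({access})")
        if not bucket.get("kms_key_id"):
            findings.append(f"{name}: no customer-managed key configured")
    return findings


# Hypothetical sample data.
sample = [
    {"name": "training-data", "public_access_type": "NoPublicAccess",
     "kms_key_id": "ocid1.key.oc1..example"},
    {"name": "demo-assets", "public_access_type": "ObjectRead", "kms_key_id": None},
]
findings = audit_buckets(sample)
for finding in findings:
    print(finding)
```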
Capabilities:
Automated Detection: Identifies misconfigurations and security threats in real time
Responder Actions: Automatically remediate issues or notify administrators
Problem Dashboard: Centralized view of all detected security problems
Integration: Works across all OCI services including Generative AI resources
Data Protection Throughout the AI Lifecycle
Training Data Security
Ingestion:
- Data uploaded to Object Storage with automatic encryption
- Customer-managed keys available for additional control
- Access restricted via IAM policies
Processing:
- Training occurs on dedicated GPUs isolated within customer's RDMA network
- No data sharing between customers
- Temporary data purged after training completion
Storage:
- Fine-tuned model weights stored in Object Storage buckets
- Encrypted by default and managed by Key Management Service
- Version control for model iterations
Inference Data Security
Input Processing:
- Requests authenticated via IAM or API keys
- Data encrypted in transit using TLS 1.2+
- Processed on dedicated endpoints with no cross-tenant access
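On the client side, the TLS 1.2+ requirement can also be enforced explicitly. The sketch below uses only Python's standard `ssl` module; the function name `make_tls_context` is a hypothetical choice for this example.

```python
import ssl


def make_tls_context() -> ssl.SSLContext:
    """Build a client context that verifies certificates and refuses TLS below 1.2."""
    context = ssl.create_default_context()  # enables certificate and hostname checks
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


ctx = make_tls_context()
```

The resulting context can be passed to standard-library clients, for example `urllib.request.urlopen(url, context=ctx)`.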
Output Generation:
- Responses encrypted in transit
- Optional logging with encryption at rest
- Retention policies configurable per use case
Monitoring:
- Continuous monitoring of resource usage to proactively mitigate contention
- Audit logs for all API calls
- Integration with OCI Logging Analytics
Advanced Security Features
Security Zones
Security Zones provide compartment-associated Oracle-defined recipes of security policies based on best practices.
Enforced Policies:
No Public Access: Resources cannot be accessible from public internet
Customer-Managed Keys: Data encryption enforced using Customer-Managed Keys for block volumes, boot volumes, and Object Storage buckets
Deny Public Buckets: Public access to Object Storage buckets is denied to prevent accidental exposure
Use Case:
Deploy Generative AI workloads in Security Zones to automatically enforce security best practices and maintain compliance.
Compliance and Certifications
OCI maintains extensive compliance certifications relevant to AI workloads:
Regional Compliance:
- GDPR: European data protection requirements
- HIPAA: Healthcare data in the United States
- FedRAMP: U.S. government cloud security
- ISO 27001/27017/27018: International security standards
Industry-Specific:
- PCI DSS: Payment card industry
- SOC 1/2/3: Service organization controls
- FIPS 140-2 Level 3: Cryptographic module security
Sovereign Cloud Options
Oracle's distributed cloud, AI infrastructure, and generative AI services enable governments and enterprises to deploy AI factories that run cloud services locally and within a country's secure premises with operational controls supporting sovereign goals.
Available Options:
OCI Dedicated Region: Entire cloud region operated exclusively for a single customer
Oracle Alloy: Cloud platform for partners to deliver sovereign cloud
Oracle EU Sovereign Cloud: European data residency and operational sovereignty
Oracle Government Cloud: FedRAMP-authorized regions for U.S. government
Best Practices for Secure Generative AI Deployments
1. Implement Least-Privilege Access
Principle: Grant only the minimum permissions necessary for each role.
Implementation:
# Development team - read-only on production
Allow group GenAI-Developers to read generative-ai-family in compartment GenAI-Prod
# Production team - full management
Allow group GenAI-Production to manage generative-ai-family in compartment GenAI-Prod
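Policies like these can be reviewed automatically before they are applied. The following is a minimal sketch of such a review in Python, not an OCI tool: it handles only simple `Allow group ... to <verb> <resource> in ...` statements without `where` clauses, and the function name and warning wording are hypothetical.

```python
import re
from typing import List

# Matches simple group statements only; anything else is reported as unparsed.
STATEMENT = re.compile(
    r"^allow\s+group\s+(?P<group>\S+)\s+to\s+(?P<verb>inspect|read|use|manage)\s+"
    r"(?P<resource>\S+)\s+in\s+(?P<scope>tenancy|compartment\s+\S+)\s*$",
    re.IGNORECASE,
)


def review_policy(statements: List[str]) -> List[str]:
    """Flag statements that grant broader access than least privilege suggests."""
    warnings = []
    for line in statements:
        text = line.strip()
        if not text or text.startswith("#"):
            continue  # skip blanks and the annotation lines used in this article
        match = STATEMENT.match(text)
        if match is None:
            warnings.append(f"unparsed: {text}")
            continue
        verb = match["verb"].lower()
        reasons = []
        if verb == "manage" and match["resource"].lower() == "all-resources":
            reasons.append("manages all-resources")
        if verb == "manage" and match["scope"].lower() == "tenancy":
            reasons.append("manage granted tenancy-wide")
        if reasons:
            warnings.append(f"{match['group']}: " + "; ".join(reasons))
    return warnings


policy = [
    "# Development team - read-only on production",
    "Allow group GenAI-Developers to read generative-ai-family in compartment GenAI-Prod",
    "Allow group GenAI-Production to manage generative-ai-family in compartment GenAI-Prod",
    "Allow group Admins to manage all-resources in tenancy",  # hypothetical broad grant
]
warnings = review_policy(policy)
```

Only the last, deliberately broad statement is flagged; the two statements from the example above pass because they are scoped to one resource family in one compartment.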
2. Use Customer-Managed Keys
Benefit: Full control over encryption key lifecycle and usage.
Steps:
- Create Vault in Key Management Service
- Generate or import master encryption key
- Create IAM policy allowing services to use key
- Configure buckets and resources to use customer key
3. Enable Comprehensive Logging
Audit Logs: Enable audit logging for all API calls to Generative AI services
Flow Logs: Monitor network traffic to/from GPU clusters
Application Logs: Capture inference requests and responses for analysis
4. Leverage Network Isolation
Private Endpoints: Deploy model endpoints in private subnets
Service Gateway: Access Oracle services without internet exposure
Bastion Hosts: Secure administrative access to resources
5. Implement Defense-in-Depth
Multiple Layers:
- Network security (NSGs, Security Lists)
- Identity security (IAM, MFA)
- Data security (encryption at rest and in transit)
- Application security (input validation, rate limiting)
- Monitoring security (Cloud Guard, Logging Analytics)
6. Regular Security Reviews
Periodic Assessments:
- Review IAM policies quarterly
- Rotate encryption keys according to policy
- Audit user access and remove unused accounts
- Update network security rules based on traffic patterns
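Parts of this review can be scripted. The sketch below assumes you have already exported key rotation dates and user last-login times into dictionaries; how you obtain them is out of scope here. The 90-day thresholds, the names, and the dates are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def stale_items(last_seen: Dict[str, datetime], max_age: timedelta, now: datetime) -> List[str]:
    """Return the names whose timestamp is older than max_age, sorted by name."""
    cutoff = now - max_age
    return sorted(name for name, seen in last_seen.items() if seen < cutoff)


now = datetime(2025, 1, 1, tzinfo=timezone.utc)

# Hypothetical inventory data.
key_last_rotated = {
    "genai-mek": datetime(2024, 6, 1, tzinfo=timezone.utc),
    "logs-mek": datetime(2024, 12, 1, tzinfo=timezone.utc),
}
user_last_login = {
    "alice": datetime(2024, 12, 20, tzinfo=timezone.utc),
    "bob": datetime(2024, 3, 1, tzinfo=timezone.utc),
}

keys_due = stale_items(key_last_rotated, timedelta(days=90), now)
inactive_users = stale_items(user_last_login, timedelta(days=90), now)
print(keys_due, inactive_users)
```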
7. Secure Development Practices
The architecture includes a CI/CD pipeline for promoting models from the playground environment to production.
Separation of Environments:
- Development: Unrestricted experimentation with synthetic data
- Staging: Production-like with access controls
- Production: Maximum security with customer data
Monitoring and Incident Response
Real-Time Monitoring
Metrics to Track:
- API request rates and latency
- Authentication failures
- Unusual access patterns
- Resource utilization anomalies
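As an illustration of tracking one of these metrics, the sketch below counts authentication failures per principal in a sliding time window. It is a generic technique, not an OCI feature; `FailureMonitor`, its threshold, and the principal name are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class FailureMonitor:
    """Flag a principal once it reaches `threshold` failures within `window_seconds`."""

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold
        self.window = window_seconds
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def record_failure(self, principal: str, timestamp: float) -> bool:
        """Record one failure and return True if the principal should trigger an alert."""
        events = self._events[principal]
        events.append(timestamp)
        # Drop failures that have fallen out of the window.
        while events and events[0] <= timestamp - self.window:
            events.popleft()
        return len(events) >= self.threshold


monitor = FailureMonitor(threshold=3, window_seconds=60)
alerts = [monitor.record_failure("svc-inference", t) for t in (0, 10, 20, 100)]
```

The first three failures fall inside one 60-second window, so the third one trips the alert; the failure at 100 seconds starts a fresh window.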
Alerting:
Configure the OCI Events service to trigger notifications for:
- IAM policy changes
- Network security group modifications
- Encryption key access
- Dedicated cluster scaling events
Incident Response
Preparation:
- Define incident response procedures
- Establish communication channels
- Designate response team members
Detection:
- Cloud Guard active monitoring
- Log analysis for anomalies
- User-reported issues
Containment:
- Isolate affected resources
- Revoke compromised credentials
- Block suspicious network traffic
Recovery:
- Restore from backups if necessary
- Rotate encryption keys
- Update security configurations
Lessons Learned:
- Document incident timeline
- Identify root causes
- Update procedures and controls
Future Security Enhancements
Oracle continues investing in security capabilities:
AI-Powered Security:
Machine learning models detecting anomalous behavior in API usage patterns and access patterns.
Zero Trust Architecture:
Moving toward continuous verification of all access requests regardless of source.
Confidential Computing:
NVIDIA BlueField-3 DPUs accelerate networking, storage, and security workloads, enabling hardware-based isolation for sensitive computations.
Enhanced Compliance:
Expanding certifications and attestations for emerging regulatory requirements.
OCI Generative AI security is built on a comprehensive, multi-layered approach that addresses the unique challenges of enterprise AI deployments:
Infrastructure Security:
- Dedicated GPUs pooled within RDMA networks exclusively for single customers
- Dedicated GPU clusters handling only customer's base and fine-tuned models
- Ultra-low latency RDMA networking with complete tenant isolation
Data Security:
- Customer data access restricted within tenancy
- Object Storage encrypted by default with periodic key rotation
- Customer-managed keys for maximum control
Identity and Access:
- Comprehensive IAM with fine-grained policies
- Integration with enterprise identity providers
- Multi-factor authentication support
Compliance:
- FIPS 140-2 Level 3 certified HSMs
- Sovereign cloud options for data residency
- Extensive regulatory certifications
Security and privacy of customer workloads is truly an essential design tenet at OCI. By combining dedicated infrastructure, comprehensive encryption, identity controls, and continuous monitoring, OCI Generative AI provides enterprise-grade security that enables organizations to confidently deploy AI at scale.
How does your organization approach AI security? What security features are most critical for your use cases? Share your experiences in the comments.