
Ryan Giggs


OCI Generative AI Security: Dedicated GPUs, RDMA Networks, and Enterprise-Grade Data Protection

Security and privacy of customer workloads is an essential design tenet for Oracle Cloud Infrastructure (OCI) Generative AI. In an era where data breaches and AI security vulnerabilities dominate headlines, OCI has architected a comprehensive security framework that ensures enterprise-grade protection for AI workloads. This deep dive explores the multi-layered security approach that makes OCI Generative AI one of the most secure platforms for enterprise AI deployments.

The Foundation: GPU Isolation and Dedicated Infrastructure

Dedicated GPU Allocation

GPUs allocated for a customer's generative AI tasks are pooled within a dedicated RDMA network, ensuring they are exclusively allocated to a single customer and not shared with others. This fundamental isolation guarantees that customer data remains secure and inaccessible to unauthorized parties.

Key Isolation Characteristics:

Physical Separation: Each customer's GPU cluster operates on dedicated hardware, preventing any cross-contamination or data leakage between tenants.

Network Isolation: RDMA cluster networks with less than 10 microsecond latency provide ultra-high-bandwidth connections (1.6 TB/sec internode bandwidth) while maintaining complete isolation from other customers' workloads.

Compute Isolation: OCI Supercluster can deploy up to 32,768 GPUs per cluster, all dedicated to a single customer with no sharing of computational resources.

RDMA Network Architecture

Remote Direct Memory Access (RDMA) is a critical component of OCI's AI infrastructure, providing both performance and security benefits.

RDMA Technology Overview:

RDMA allows for low-latency connections between nodes and access to GPU memory without involving the CPU. This technology enables:

  • Ultra-Low Latency: Less than 10 microseconds between nodes
  • High Bandwidth: 1.6 TB/sec internode bandwidth for massive data transfers
  • CPU Offload: Direct memory-to-memory transfers without CPU intervention
  • Security: Dedicated network fabric preventing cross-tenant access

Network Types:

OCI's cluster network uses RDMA over Converged Ethernet Version 2 (RoCE v2) on top of NVIDIA ConnectX-7 network interface cards (NICs) to support high-throughput and latency-sensitive workloads.

For advanced deployments, OCI utilizes NVIDIA Quantum InfiniBand or NVIDIA Spectrum-X Ethernet with RDMA over Converged Ethernet (RoCE) configuration.

Scale and Performance:

OCI enables customers to cluster up to 4,096 bare metal nodes, each with 8 GPUs, for a total of up to 32,768 GPUs. The latest infrastructure scales further: the zettascale OCI Supercluster scales up to 131,072 GPUs, which Oracle describes as the largest AI supercomputer in the cloud.
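The GPU totals above are simple multiplication. A quick sketch of the arithmetic follows; note that 8 GPUs per node is stated only for the 32,768-GPU configuration, so applying it to the zettascale figure is an assumption made here for illustration.

```python
# Node and GPU counts quoted in the text. Applying 8 GPUs per node to the
# zettascale cluster is an assumption, not something the source states.
GPUS_PER_NODE = 8


def total_gpus(nodes: int, gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Total GPUs in a cluster of identical bare metal nodes."""
    return nodes * gpus_per_node


def nodes_needed(gpus: int, gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Nodes needed to host `gpus` GPUs, rounding up to whole nodes."""
    return -(-gpus // gpus_per_node)


print(total_gpus(4096))      # 32768
print(nodes_needed(131072))  # 16384, only if each node really carries 8 GPUs
```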

Hardware Security Modules (HSM)

Keys are stored on Hardware Security Modules (HSMs) that meet Federal Information Processing Standards (FIPS) 140-2 Security Level 3 certification.

HSM Protection Modes:

Software Protection: Keys stored and processed on servers, recommended for most use cases with strong encryption.

HSM Protection: Keys stored on HSMs meeting FIPS 140-2 Security Level 3 certification, recommended for stringent compliance requirements like financial services and healthcare.

Model Endpoint Security

Single-Customer Model Endpoints

For strong data privacy and security, a dedicated GPU cluster handles fine-tuned models for only a single customer. This architectural decision ensures:

Data Isolation: No model from one customer processes data from another customer
Performance Isolation: Resources allocated exclusively for your workloads
Compliance: Easier to meet regulatory requirements with clear tenant boundaries
Security Boundaries: Physical and logical separation of processing environments

Endpoint Access Control

Model endpoints implement multiple layers of access control:

Authentication: API keys, OAuth tokens, or IAM credentials required for all requests
Authorization: Role-Based Access Control (RBAC) determines what actions authenticated users can perform
Network Security: Private endpoints within Virtual Cloud Networks (VCNs), with optional public access through controlled ingress
Rate Limiting: Configurable throttling to prevent abuse and ensure fair resource allocation
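To make the rate limiting layer concrete, here is a minimal token-bucket sketch of the kind a client or gateway might place in front of a model endpoint. It is not OCI's throttling implementation; the class name `TokenBucket` and its parameters are hypothetical.

```python
import time


class TokenBucket:
    """Token-bucket limiter: `rate` tokens per second, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock            # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        """Return True and consume a token if the request may proceed."""
        now = self.clock()
        # Refill for the elapsed time, never above capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A caller would check `bucket.allow()` before forwarding each inference request and return an HTTP 429 style error when it is False.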

Customer Data and Model Isolation

Tenancy-Level Isolation

Customer data access is restricted within a customer's tenancy so that one customer's data cannot be seen by another customer. This fundamental principle manifests across multiple dimensions:

Compartment Isolation:

OCI uses compartments to organize and isolate resources. The architecture implements compartmentalization and private subnets to isolate different operational environments.

Best Practice: Create separate compartments for:

  • Development environments
  • Staging/testing
  • Production workloads
  • Different business units or projects

Data Separation:

Training data, fine-tuned model weights, embeddings, and inference logs are all stored within customer-controlled storage with strict access controls preventing cross-tenant access.

Workload Isolation:

A single NVIDIA GB200 NVL72 rack can be configured to launch multiple NVLink groupings for smaller workloads and provide strong isolation for efficient workload distribution.

Network-Level Isolation

Private Subnets:

Resources are deployed in private subnets within VCNs, preventing direct internet access unless explicitly configured.

Security Lists and Network Security Groups (NSGs):

Virtual firewall rules, stateful or stateless, that control inbound and outbound traffic at the subnet level (security lists) and the network interface level (NSGs).

Service Gateway and NAT Gateway:

Controlled access to Oracle services and internet resources without exposing instances to inbound internet traffic.

Integrated OCI Security Services

OCI Generative AI leverages Oracle's comprehensive security ecosystem to provide defense-in-depth.

OCI Identity and Access Management (IAM)

IAM provides the foundation for authentication and authorization across all OCI services.

Key IAM Capabilities:

User and Group Management: Create users, organize them into groups, and assign permissions based on job functions.

Policies: Stringent access controls using OCI IAM policies enforce least-privilege access.

Example IAM Policy for Generative AI:

Allow group GenAI-Engineers to manage generative-ai-family in compartment GenAI-Prod
Allow group GenAI-Engineers to use virtual-network-family in compartment GenAI-Networking
Allow group Data-Scientists to read generative-ai-endpoint in compartment GenAI-Prod

Identity Providers: Integration with corporate identity systems, using SAML for single sign-on and SCIM for user provisioning.

Multi-Factor Authentication (MFA): Additional security layer requiring multiple forms of verification.

OCI Key Management Service (Vault)

The Key Management Service is critical for protecting data at rest and managing encryption keys.

Vault Architecture:

A vault is a logical grouping of keys. OCI offers two vault types, a virtual private vault and the default (shared) vault, which differ in isolation, pricing, and computing resources.

Key Features:

Centralized Key Management: OCI Key Management provides centralized management of the encryption of your data.

Customer-Managed Keys (CMK): Customers create and control their own encryption keys, maintaining full ownership.

Oracle-Managed Keys: Default encryption with Oracle-managed keys that are automatically rotated.

Key Rotation: Each master encryption key (MEK) is automatically assigned a key version. When you rotate a key, the Vault service generates a new key version.
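The versioning behavior described above can be modeled in a few lines. This is a toy in-memory model, not the Vault API: the `MasterKey` class and its methods are hypothetical, and real key material would never leave the service like this.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class MasterKey:
    """Toy model of a master encryption key with numbered versions."""
    name: str
    versions: list = field(default_factory=list)

    def __post_init__(self):
        if not self.versions:
            self.rotate()  # a new key starts with its first version

    def rotate(self) -> int:
        """Generate fresh key material and return the new version number."""
        self.versions.append(secrets.token_bytes(32))
        return len(self.versions)

    @property
    def current_version(self) -> int:
        return len(self.versions)

    def material(self, version: int) -> bytes:
        """Older versions are kept so existing data can still be decrypted."""
        return self.versions[version - 1]
```

New data is encrypted with the current version, while anything written before a rotation is decrypted with the version recorded alongside it.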

Integration with Generative AI:

OCI Generative AI uses Key Management Service to:

  1. Store Base Model Keys Securely: Encryption keys for foundational models are managed through Vault
  2. Encrypt Fine-Tuned Models: Custom model weights encrypted with customer-controlled keys
  3. Protect Training Data: Datasets used for fine-tuning encrypted before storage
  4. Secure Inference Data: Input and output data encrypted in transit and at rest

OCI Object Storage Security

Object Storage buckets store critical AI assets including training data, model weights, and embeddings.

Default Encryption:

OCI encrypts all objects by default with Oracle-managed keys, which are periodically rotated. This ensures data is protected even if customers don't explicitly configure encryption.

Customer-Managed Encryption:

OCI also lets customers use their own keys for additional control. A best practice is to create encryption keys in the Vault service, rotate them periodically, and use IAM policies to control which users and services can use them to protect resources in Object Storage.

Encryption at Multiple Layers:

Object-Level Encryption: Each object encrypted with a unique Data Encryption Key (DEK)
Bucket-Level Encryption: Entire buckets can be encrypted with a specific Master Encryption Key (MEK)
Client-Side Encryption: Customers can encrypt data before sending it to Oracle Object Storage for additional control
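The DEK and MEK layering above is commonly called envelope encryption. The sketch below illustrates the idea using only the Python standard library. The cipher is a toy keystream built from HMAC-SHA256, chosen so the example runs without extra packages; it is not secure, it is not how OCI encrypts objects, and a real implementation would use an authenticated cipher such as AES-GCM. All function names are hypothetical.

```python
import hashlib
import hmac
import secrets


def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher (HMAC-SHA256 in counter mode). Illustration only."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        pad = hmac.new(key, nonce + counter, hashlib.sha256).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)


def encrypt_object(mek: bytes, plaintext: bytes) -> dict:
    """Encrypt data with a fresh DEK, then wrap the DEK with the MEK."""
    dek = secrets.token_bytes(32)  # unique data encryption key per object
    data_nonce = secrets.token_bytes(16)
    wrap_nonce = secrets.token_bytes(16)
    return {
        "ciphertext": _keystream_xor(dek, data_nonce, plaintext),
        "wrapped_dek": _keystream_xor(mek, wrap_nonce, dek),
        "data_nonce": data_nonce,
        "wrap_nonce": wrap_nonce,
    }


def decrypt_object(mek: bytes, blob: dict) -> bytes:
    """Unwrap the DEK with the MEK, then decrypt the object."""
    dek = _keystream_xor(mek, blob["wrap_nonce"], blob["wrapped_dek"])
    return _keystream_xor(dek, blob["data_nonce"], blob["ciphertext"])
```

The design point is that rotating or revoking the MEK only requires re-wrapping small DEKs, not re-encrypting every object.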

Required IAM Policies:

To use customer-managed keys with Object Storage, create a policy like:

allow service objectstorage-<region_name> to use keys in compartment <compartment_name>

Access Control:

Pre-Authenticated Requests (PARs): Time-bound access to objects without requiring IAM credentials
Bucket Visibility: Private by default, with explicit configuration required for public access
Versioning: Track and restore previous versions of objects

Security Best Practices:

Restrict BUCKET_UPDATE permission to a minimal set of IAM groups to minimize the possibility of existing buckets being made public inadvertently or maliciously.

Cloud Guard: Active Threat Detection

Comprehensive logging and audit trails are maintained for all significant operations.

Cloud Guard continuously monitors your OCI environment for security weaknesses and threats.

Detector Rules for Object Storage:

Cloud Guard includes detector rules for Object Storage to identify public buckets, unencrypted data, and other security issues.

Capabilities:

Automated Detection: Identifies misconfigurations and security threats in real-time
Responder Actions: Automatically remediate issues or notify administrators
Problem Dashboard: Centralized view of all detected security problems
Integration: Works across all OCI services including Generative AI resources
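As a rough mental model of what a detector rule does, the sketch below scans bucket settings for the two problems mentioned above. It is not Cloud Guard's implementation: the `Bucket` class is hypothetical, its field names are only loosely modeled on Object Storage bucket settings, and treating a missing customer-managed key as a finding is a stricter policy than the default encryption described earlier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bucket:
    # Illustrative fields; assumed here, not taken from the article.
    name: str
    public_access_type: str = "NoPublicAccess"
    kms_key_id: Optional[str] = None


def detect_problems(buckets):
    """Return (bucket name, finding) pairs for risky configurations."""
    findings = []
    for bucket in buckets:
        if bucket.public_access_type != "NoPublicAccess":
            findings.append((bucket.name, "bucket is public"))
        if bucket.kms_key_id is None:
            findings.append((bucket.name, "no customer-managed key"))
    return findings
```

A responder would then act on each finding, for example by notifying an administrator or reverting the bucket to private.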

Data Protection Throughout the AI Lifecycle

Training Data Security

Ingestion:

  • Data uploaded to Object Storage with automatic encryption
  • Customer-managed keys available for additional control
  • Access restricted via IAM policies

Processing:

  • Training occurs on dedicated GPUs isolated within the customer's RDMA network
  • No data sharing between customers
  • Temporary data purged after training completion

Storage:

  • Fine-tuned model weights stored in Object Storage buckets
  • Encrypted by default and managed by Key Management Service
  • Version control for model iterations

Inference Data Security

Input Processing:

  • Requests authenticated via IAM or API keys
  • Data encrypted in transit using TLS 1.2+
  • Processed on dedicated endpoints with no cross-tenant access
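On the client side, the TLS 1.2+ requirement can also be enforced explicitly. A minimal sketch using Python's standard `ssl` module follows; the function name `make_tls_context` is hypothetical, and how you attach the context depends on your HTTP client.

```python
import ssl


def make_tls_context() -> ssl.SSLContext:
    """Client TLS context that refuses anything older than TLS 1.2."""
    # The default context verifies server certificates and hostnames.
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx
```

Pinning the minimum version in code guards against a misconfigured host silently negotiating an older protocol.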

Output Generation:

  • Responses encrypted in transit
  • Optional logging with encryption at rest
  • Retention policies configurable per use case

Monitoring:

  • Continuous monitoring of resource usage to proactively mitigate contention
  • Audit logs for all API calls
  • Integration with OCI Logging Analytics

Advanced Security Features

Security Zones

Security Zones associate compartments with Oracle-defined recipes of security policies based on best practices, and block operations that would violate those policies.

Enforced Policies:

No Public Access: Resources cannot be accessible from public internet
Customer-Managed Keys: Data encryption enforced using Customer-Managed Keys for block volumes, boot volumes, and Object Storage buckets
Deny Public Buckets: Public access to Object Storage buckets is denied to prevent accidental exposure

Use Case:

Deploy Generative AI workloads in Security Zones to automatically enforce security best practices and maintain compliance.

Compliance and Certifications

OCI maintains extensive compliance certifications relevant to AI workloads:

Regional Compliance:

  • GDPR: European data protection requirements
  • HIPAA: Healthcare data in the United States
  • FedRAMP: U.S. government cloud security
  • ISO 27001/27017/27018: International security standards

Industry-Specific:

  • PCI DSS: Payment card industry
  • SOC 1/2/3: Service organization controls
  • FIPS 140-2 Level 3: Cryptographic module security

Sovereign Cloud Options

Oracle's distributed cloud, AI infrastructure, and generative AI services enable governments and enterprises to deploy AI factories that run cloud services locally and within a country's secure premises with operational controls supporting sovereign goals.

Available Options:

OCI Dedicated Region: Entire cloud region operated exclusively for a single customer
Oracle Alloy: Cloud platform for partners to deliver sovereign cloud
Oracle EU Sovereign Cloud: European data residency and operational sovereignty
Oracle Government Cloud: FedRAMP-authorized regions for U.S. government

Best Practices for Secure Generative AI Deployments

1. Implement Least-Privilege Access

Principle: Grant only the minimum permissions necessary for each role.

Implementation:

# Development team - read-only on production
Allow group GenAI-Developers to read generative-ai-family in compartment GenAI-Prod

# Production team - full management
Allow group GenAI-Production to manage generative-ai-family in compartment GenAI-Prod
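Least privilege is easier to maintain when policies are checked automatically. The sketch below parses simple statements of the form shown above and flags any that grant more than an allowed verb. It relies on the OCI policy verbs being ordered inspect, read, use, manage. The parser is deliberately minimal and hypothetical: it ignores conditions (`where` clauses), tenancy-level statements, and service principals.

```python
import re

# OCI policy verbs, from least to most privileged.
VERBS = ["inspect", "read", "use", "manage"]

STATEMENT = re.compile(
    r"^allow\s+group\s+(?P<group>\S+)\s+to\s+(?P<verb>\w+)\s+"
    r"(?P<resource>\S+)\s+in\s+compartment\s+(?P<compartment>\S+)$",
    re.IGNORECASE,
)


def parse(statement: str) -> dict:
    """Split a simple 'Allow group ... in compartment ...' statement."""
    match = STATEMENT.match(statement.strip())
    if not match:
        raise ValueError(f"unsupported statement: {statement!r}")
    return match.groupdict()


def exceeds(statement: str, max_verb: str) -> bool:
    """True if the statement grants more than `max_verb`."""
    verb = parse(statement)["verb"].lower()
    return VERBS.index(verb) > VERBS.index(max_verb)
```

Run against the two statements above with `max_verb="read"`, the developer statement passes and the production `manage` statement is flagged for review.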

2. Use Customer-Managed Keys

Benefit: Full control over encryption key lifecycle and usage.

Steps:

  1. Create Vault in Key Management Service
  2. Generate or import master encryption key
  3. Create IAM policy allowing services to use key
  4. Configure buckets and resources to use customer key

3. Enable Comprehensive Logging

Audit Logs: Enable audit logging for all API calls to Generative AI services
Flow Logs: Monitor network traffic to/from GPU clusters
Application Logs: Capture inference requests and responses for analysis

4. Leverage Network Isolation

Private Endpoints: Deploy model endpoints in private subnets
Service Gateway: Access Oracle services without internet exposure
Bastion Hosts: Secure administrative access to resources

5. Implement Defense-in-Depth

Multiple Layers:

  • Network security (NSGs, Security Lists)
  • Identity security (IAM, MFA)
  • Data security (encryption at rest and in transit)
  • Application security (input validation, rate limiting)
  • Monitoring security (Cloud Guard, Logging Analytics)

6. Regular Security Reviews

Periodic Assessments:

  • Review IAM policies quarterly
  • Rotate encryption keys according to policy
  • Audit user access and remove unused accounts
  • Update network security rules based on traffic patterns
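Part of a periodic review can be scripted. As one example, the sketch below lists keys whose last rotation is older than a policy threshold. The input is a plain dictionary of key names to timestamps; how you obtain those timestamps is left out, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def keys_due_for_rotation(last_rotated: dict, max_age_days: int, now=None):
    """Return the names of keys not rotated within `max_age_days`."""
    now = now or datetime.now(timezone.utc)
    limit = timedelta(days=max_age_days)
    return sorted(
        name for name, rotated_at in last_rotated.items()
        if now - rotated_at > limit
    )
```

The same pattern applies to unused accounts: replace the rotation timestamp with a last-login timestamp.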

7. Secure Development Practices

The architecture includes a CI/CD pipeline for promoting models from the playground environment to production.

Separation of Environments:

  • Development: Unrestricted experimentation with synthetic data
  • Staging: Production-like with access controls
  • Production: Maximum security with customer data

Monitoring and Incident Response

Real-Time Monitoring

Metrics to Track:

  • API request rates and latency
  • Authentication failures
  • Unusual access patterns
  • Resource utilization anomalies
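Authentication failures are a useful metric because bursts of them are easy to detect. A minimal sliding-window sketch follows, assuming failure events arrive with a principal name and a timestamp in seconds. The class name and thresholds are hypothetical, and this is not an OCI monitoring API.

```python
from collections import deque


class FailureMonitor:
    """Flags a principal that exceeds `threshold` failures within `window` seconds."""

    def __init__(self, threshold: int = 5, window: float = 60.0):
        self.threshold = threshold
        self.window = window
        self.events = {}  # principal -> deque of failure timestamps

    def record_failure(self, principal: str, timestamp: float) -> bool:
        """Record one failure; return True if the principal should be flagged."""
        queue = self.events.setdefault(principal, deque())
        queue.append(timestamp)
        # Drop failures that have aged out of the window.
        while queue and queue[0] <= timestamp - self.window:
            queue.popleft()
        return len(queue) > self.threshold
```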

Alerting:
Configure OCI Events rules to trigger notifications for:

  • IAM policy changes
  • Network security group modifications
  • Encryption key access
  • Dedicated cluster scaling events

Incident Response

Preparation:

  • Define incident response procedures
  • Establish communication channels
  • Designate response team members

Detection:

  • Cloud Guard active monitoring
  • Log analysis for anomalies
  • User-reported issues

Containment:

  • Isolate affected resources
  • Revoke compromised credentials
  • Block suspicious network traffic

Recovery:

  • Restore from backups if necessary
  • Rotate encryption keys
  • Update security configurations

Lessons Learned:

  • Document incident timeline
  • Identify root causes
  • Update procedures and controls

Future Security Enhancements

Oracle continues investing in security capabilities:

AI-Powered Security:
Machine learning models that detect anomalous API usage and access patterns.

Zero Trust Architecture:
Moving toward continuous verification of all access requests regardless of source.

Confidential Computing:
NVIDIA BlueField-3 DPUs accelerate networking, storage, and security workloads, enabling hardware-based isolation for sensitive computations.

Enhanced Compliance:
Expanding certifications and attestations for emerging regulatory requirements.

OCI Generative AI security is built on a comprehensive, multi-layered approach that addresses the unique challenges of enterprise AI deployments:

Infrastructure Security:

  • Dedicated GPUs pooled within RDMA networks exclusively for single customers
  • Dedicated GPU clusters handling only a single customer's base and fine-tuned models
  • Ultra-low latency RDMA networking with complete tenant isolation

Data Security:

  • Customer data access restricted within tenancy
  • Object Storage encrypted by default with periodic key rotation
  • Customer-managed keys for maximum control

Identity and Access:

  • Comprehensive IAM with fine-grained policies
  • Integration with enterprise identity providers
  • Multi-factor authentication support

Compliance:

  • FIPS 140-2 Level 3 certified HSMs
  • Sovereign cloud options for data residency
  • Extensive regulatory certifications

Security and privacy of customer workloads is truly an essential design tenet at OCI. By combining dedicated infrastructure, comprehensive encryption, identity controls, and continuous monitoring, OCI Generative AI provides enterprise-grade security that enables organizations to confidently deploy AI at scale.

How does your organization approach AI security? What security features are most critical for your use cases? Share your experiences in the comments.
