Ryan Giggs

Posted on Jan 17

OCI Generative AI Security: Dedicated GPUs, RDMA Networks, and Enterprise-Grade Data Protection

#ocigenai #cloudsecurity #aigovernance #oracle

Security and privacy of customer workloads are an essential design tenet of Oracle Cloud Infrastructure (OCI) Generative AI. With data breaches and AI security vulnerabilities regularly in the headlines, OCI has built a layered security framework for enterprise AI workloads. This deep dive walks through those layers: dedicated GPU infrastructure, tenancy and network isolation, encryption and key management, and continuous monitoring.

The Foundation: GPU Isolation and Dedicated Infrastructure

Dedicated GPU Allocation

GPUs allocated for a customer's generative AI tasks are pooled within a dedicated RDMA network, ensuring they are exclusively allocated to a single customer and not shared with others. This fundamental isolation guarantees that customer data remains secure and inaccessible to unauthorized parties.

Key Isolation Characteristics:

Physical Separation: Each customer's GPU cluster operates on dedicated hardware, preventing any cross-contamination or data leakage between tenants.

Network Isolation: RDMA cluster networks deliver latency under 10 microseconds and high-bandwidth connections (1.6 TB/sec internode bandwidth) while remaining completely isolated from other customers' workloads.

Compute Isolation: OCI Supercluster can deploy up to 32,768 GPUs per cluster, all dedicated to a single customer with no sharing of computational resources.

RDMA Network Architecture

Remote Direct Memory Access (RDMA) is a critical component of OCI's AI infrastructure, providing both performance and security benefits.

RDMA Technology Overview:

RDMA allows for low-latency connections between nodes and access to GPU memory without involving the CPU. This technology enables:

Ultra-Low Latency: Less than 10 microseconds between nodes
High Bandwidth: 1.6 TB/sec internode bandwidth for massive data transfers
CPU Offload: Direct memory-to-memory transfers without CPU intervention
Security: Dedicated network fabric preventing cross-tenant access
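
To get a feel for what the latency and bandwidth figures above mean in practice, here is a rough back-of-the-envelope sketch. It assumes that "1.6 TB/sec" means 1.6e12 bytes per second and that a transfer costs one latency hit plus serialization time; both are simplifying assumptions for illustration, not Oracle's published performance model.

```python
# Figures quoted in the list above; the units are an assumption (decimal terabytes).
LATENCY_S = 10e-6        # upper bound: under 10 microseconds between nodes
BANDWIDTH_BPS = 1.6e12   # 1.6 TB/sec read as 1.6e12 bytes per second


def transfer_time_seconds(payload_bytes: float,
                          latency_s: float = LATENCY_S,
                          bandwidth_bps: float = BANDWIDTH_BPS) -> float:
    """First-order estimate: one latency hit plus time to push the bytes."""
    if payload_bytes < 0:
        raise ValueError("payload_bytes must be non-negative")
    return latency_s + payload_bytes / bandwidth_bps


if __name__ == "__main__":
    for size in (1e6, 1e9, 1e12):
        millis = transfer_time_seconds(size) * 1e3
        print(f"{size:.0e} bytes -> {millis:.3f} ms")
```

For small messages the latency term dominates, while for large tensors the bandwidth term does, which is why both numbers matter for distributed training.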

Network Types:

OCI's cluster network uses RDMA over Converged Ethernet Version 2 (RoCE v2) on top of NVIDIA ConnectX-7 network interface cards (NICs) to support high-throughput and latency-sensitive workloads.

For advanced deployments, OCI utilizes NVIDIA Quantum InfiniBand or NVIDIA Spectrum-X Ethernet with RDMA over Converged Ethernet (RoCE) configuration.

Scale and Performance:

OCI enables customers to cluster up to 4,096 bare metal nodes, each with 8 GPUs, for a total of up to 32,768 GPUs. The latest infrastructure scales further: the zettascale OCI Supercluster scales up to 131,072 GPUs, which Oracle describes as the largest AI supercomputer in the cloud.
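
The cluster sizes quoted here follow from simple arithmetic, sketched below. The 8-GPUs-per-node figure comes from the text; applying the same figure to the 131,072-GPU configuration is an assumption made only for illustration, since that system may use a different node layout.

```python
GPUS_PER_NODE = 8  # from the text; assumed to hold for every configuration here


def total_gpus(nodes: int, gpus_per_node: int = GPUS_PER_NODE) -> int:
    """GPUs available in a cluster of the given node count."""
    return nodes * gpus_per_node


def nodes_needed(gpus: int, gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Smallest node count that provides at least `gpus` GPUs (ceiling division)."""
    return -(-gpus // gpus_per_node)


if __name__ == "__main__":
    print(total_gpus(4096))        # GPUs in a 4,096-node cluster
    print(nodes_needed(131_072))   # nodes needed at 8 GPUs per node
```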

Hardware Security Modules (HSM)

Keys are stored on Hardware Security Modules (HSMs) that meet Federal Information Processing Standards (FIPS) 140-2 Security Level 3 certification.

HSM Protection Modes:

Software Protection: Keys stored and processed on servers, recommended for most use cases with strong encryption.

HSM Protection: Keys stored on HSMs meeting FIPS 140-2 Security Level 3 certification, recommended for stringent compliance requirements like financial services and healthcare.

Model Endpoint Security

Single-Customer Model Endpoints

For strong data privacy and security, a dedicated GPU cluster handles only the fine-tuned models of a single customer. This architectural decision ensures:

Data Isolation: No model from one customer processes data from another customer
Performance Isolation: Resources allocated exclusively for your workloads
Compliance: Easier to meet regulatory requirements with clear tenant boundaries
Security Boundaries: Physical and logical separation of processing environments

Endpoint Access Control

Model endpoints implement multiple layers of access control:

Authentication: API keys, OAuth tokens, or IAM credentials required for all requests
Authorization: Role-Based Access Control (RBAC) determines what actions authenticated users can perform
Network Security: Private endpoints within Virtual Cloud Networks (VCNs), with optional public access through controlled ingress
Rate Limiting: Configurable throttling to prevent abuse and ensure fair resource allocation
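
Rate limiting is commonly implemented with a token bucket. The sketch below shows that general technique in plain Python; it is not OCI's implementation, and the class name, parameters, and injectable clock are all illustrative assumptions.

```python
import time


class TokenBucket:
    """Generic token-bucket limiter (illustrative, not OCI's actual mechanism)."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_sec      # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity        # start with a full bucket
        self.clock = clock            # injectable so the logic can be tested
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume tokens if the request fits, else False."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A caller would keep one bucket per API key or per endpoint and reject (or queue) a request whenever `allow()` returns False.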

Customer Data and Model Isolation

Tenancy-Level Isolation

Customer data access is restricted within a customer's tenancy so that one customer's data cannot be seen by another customer. This fundamental principle manifests across multiple dimensions:

Compartment Isolation:

OCI uses compartments to organize and isolate resources. The architecture implements compartmentalization and private subnets to isolate different operational environments.

Best Practice: Create separate compartments for:

Development environments
Staging/testing
Production workloads
Different business units or projects

Data Separation:

Training data, fine-tuned model weights, embeddings, and inference logs are all stored within customer-controlled storage with strict access controls preventing cross-tenant access.

Workload Isolation:

A single NVIDIA GB200 NVL72 rack can be configured to launch multiple NVLink groupings for smaller workloads and provide strong isolation for efficient workload distribution.

Network-Level Isolation

Private Subnets:

Resources are deployed in private subnets within VCNs, which prevents direct internet access unless it is explicitly configured.

Security Lists and Network Security Groups (NSGs):

Virtual firewall rules, stateful by default, control inbound and outbound traffic. Security lists apply at the subnet level, and NSGs apply at the network interface (VNIC) level.

Service Gateway and NAT Gateway:

These gateways provide controlled access to Oracle services and internet resources without exposing instances to inbound internet traffic.
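
Security list and NSG rules are allow rules: traffic that matches no rule is dropped. The sketch below models that default-deny evaluation for ingress traffic using only the standard library. The `IngressRule` type and the sample CIDR are hypothetical, and real rules also cover protocols, ICMP options, and statefulness, which this simplification omits.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class IngressRule:
    """Simplified allow rule: a source CIDR plus a destination port range."""
    source_cidr: str
    port_min: int
    port_max: int


def is_allowed(rules, source_ip: str, dest_port: int) -> bool:
    """Default deny: traffic passes only if at least one rule matches it."""
    addr = ip_address(source_ip)
    return any(
        addr in ip_network(rule.source_cidr)
        and rule.port_min <= dest_port <= rule.port_max
        for rule in rules
    )


if __name__ == "__main__":
    # Hypothetical rule: HTTPS from an application subnet only.
    rules = [IngressRule("10.0.1.0/24", 443, 443)]
    print(is_allowed(rules, "10.0.1.25", 443))
    print(is_allowed(rules, "10.0.2.25", 443))
```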

Integrated OCI Security Services

OCI Generative AI leverages Oracle's comprehensive security ecosystem to provide defense-in-depth.

OCI Identity and Access Management (IAM)

IAM provides the foundation for authentication and authorization across all OCI services.

Key IAM Capabilities:

User and Group Management: Create users, organize them into groups, and assign permissions based on job functions.

Policies: Stringent access controls using OCI IAM policies enforce least-privilege access.

Example IAM Policy for Generative AI:

Allow group GenAI-Engineers to manage generative-ai-family in compartment GenAI-Prod
Allow group GenAI-Engineers to use virtual-network-family in compartment GenAI-Networking
Allow group Data-Scientists to read generative-ai-endpoint in compartment GenAI-Prod

Identity Providers: Integration with corporate identity systems (SAML, SCIM) for single sign-on.

Multi-Factor Authentication (MFA): Additional security layer requiring multiple forms of verification.
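
OCI policy verbs form a ladder of increasing privilege: inspect, read, use, manage. The sketch below parses the simple "Allow group ... to ... in compartment ..." form shown above and checks whether a statement grants at least a required verb. The parser is an illustrative assumption that handles only this one form; real policy syntax also supports conditions, dynamic groups, and tenancy scope.

```python
import re
from typing import NamedTuple

VERBS = ("inspect", "read", "use", "manage")  # ascending order of privilege


class Statement(NamedTuple):
    group: str
    verb: str
    resource: str
    compartment: str


_PATTERN = re.compile(
    r"^allow\s+group\s+(\S+)\s+to\s+(\w+)\s+(\S+)\s+in\s+compartment\s+(\S+)$",
    re.IGNORECASE,
)


def parse_statement(text: str) -> Statement:
    """Parse the basic group/verb/resource/compartment statement form."""
    match = _PATTERN.match(text.strip())
    if not match:
        raise ValueError(f"unsupported policy statement: {text!r}")
    group, verb, resource, compartment = match.groups()
    verb = verb.lower()
    if verb not in VERBS:
        raise ValueError(f"unknown verb: {verb!r}")
    return Statement(group, verb, resource, compartment)


def grants_at_least(statement: Statement, required_verb: str) -> bool:
    """True if the statement's verb is at or above the required verb."""
    return VERBS.index(statement.verb) >= VERBS.index(required_verb)
```

A check like this can be run over exported policies to spot groups that hold `manage` where `read` or `use` would be enough.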

OCI Key Management Service (Vault)

The Key Management Service is critical for protecting data at rest and managing encryption keys.

Vault Architecture:

A vault is a logical grouping of keys. There are two types of vaults, virtual private vaults and default (virtual) vaults, which differ in isolation, pricing, and compute capacity.

Key Features:

Centralized Key Management: OCI Key Management provides centralized management of the encryption of your data.

Customer-Managed Keys (CMK): Customers create and control their own encryption keys, maintaining full ownership.

Oracle-Managed Keys: Default encryption with Oracle-managed keys that are automatically rotated.

Key Rotation: Each master encryption key (MEK) is automatically assigned a key version. When you rotate a key, the Vault service generates a new key version.
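
The key-version behavior described above can be pictured with a small bookkeeping model. This is only a conceptual sketch: in the real Vault service the key material stays inside the service or HSM and is never handed to callers, and the class and method names here are hypothetical.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class MasterKey:
    """Toy model of a master encryption key with numbered versions."""
    name: str
    versions: list = field(default_factory=list)

    def __post_init__(self):
        # A new key automatically receives its first version.
        if not self.versions:
            self.rotate()

    def rotate(self) -> int:
        """Generate a new key version and return its 1-based version number."""
        self.versions.append(secrets.token_bytes(32))
        return len(self.versions)

    @property
    def current_version(self) -> int:
        return len(self.versions)

    def material(self, version: int) -> bytes:
        """Look up a specific version; older versions stay available."""
        if not 1 <= version <= len(self.versions):
            raise KeyError(f"no such key version: {version}")
        return self.versions[version - 1]
```

The point of keeping old versions is that data encrypted before a rotation can still be decrypted, while new writes use the current version.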

Integration with Generative AI:

OCI Generative AI uses Key Management Service to:

Store Base Model Keys Securely: Encryption keys for foundational models are managed through Vault
Encrypt Fine-Tuned Models: Custom model weights encrypted with customer-controlled keys
Protect Training Data: Datasets used for fine-tuning encrypted before storage
Secure Inference Data: Input and output data encrypted in transit and at rest

OCI Object Storage Security

Object Storage buckets store critical AI assets including training data, model weights, and embeddings.

Default Encryption:

OCI encrypts all objects by default with Oracle-managed keys, which are periodically rotated. This ensures data is protected even if customers don't explicitly configure encryption.

Customer-Managed Encryption:

OCI also lets customers use their own keys for tighter control. A best practice is to create encryption keys in the Vault service, rotate them periodically, and use IAM policies to restrict which users and services can use those keys to protect resources in Object Storage.

Encryption at Multiple Layers:

Object-Level Encryption: Each object encrypted with a unique Data Encryption Key (DEK)
Bucket-Level Encryption: Entire buckets can be encrypted with a specific Master Encryption Key (MEK)
Client-Side Encryption: Customers can encrypt data before sending it to Oracle Object Storage for additional control

Required IAM Policies:

To use customer-managed keys with Object Storage, create a policy like:

allow service objectstorage-<region_name> to use keys in compartment <compartment_name>

Access Control:

Pre-Authenticated Requests (PARs): Time-bound access to objects without requiring IAM credentials
Bucket Visibility: Private by default, with explicit configuration required for public access
Versioning: Track and restore previous versions of objects
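
Because a pre-authenticated request is only as safe as its expiry, it helps to compute and check expiration times explicitly. The helpers below are a minimal sketch of that time-bound logic using UTC timestamps; the function names and the timestamp format are assumptions for illustration, and no OCI API is called.

```python
from datetime import datetime, timedelta, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # UTC timestamp with a trailing Z


def expiry_timestamp(valid_for: timedelta, now: datetime = None) -> str:
    """Return the UTC expiry time for a link that is valid for `valid_for`."""
    now = now or datetime.now(timezone.utc)
    return (now + valid_for).astimezone(timezone.utc).strftime(TIMESTAMP_FORMAT)


def is_expired(expires_at: str, now: datetime = None) -> bool:
    """True once the current time has reached the expiry timestamp."""
    now = now or datetime.now(timezone.utc)
    expiry = datetime.strptime(expires_at, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)
    return now >= expiry
```

Keeping the validity window short (hours rather than months) limits the damage if a link leaks.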

Security Best Practices:

Restrict BUCKET_UPDATE permission to a minimal set of IAM groups to minimize the possibility of existing buckets being made public inadvertently or maliciously.

Cloud Guard: Active Threat Detection

Comprehensive logging and audit trails are maintained for all significant operations.

Cloud Guard continuously monitors your OCI environment for security weaknesses and threats.

Detector Rules for Object Storage:

Cloud Guard includes detector rules for Object Storage to identify public buckets, unencrypted data, and other security issues.

Capabilities:

Automated Detection: Identifies misconfigurations and security threats in real-time
Responder Actions: Automatically remediate issues or notify administrators
Problem Dashboard: Centralized view of all detected security problems
Integration: Works across all OCI services including Generative AI resources

Data Protection Throughout the AI Lifecycle

Training Data Security

Ingestion:

Data uploaded to Object Storage with automatic encryption
Customer-managed keys available for additional control
Access restricted via IAM policies

Processing:

Training occurs on dedicated GPUs isolated within the customer's RDMA network
No data sharing between customers
Temporary data purged after training completion

Storage:

Fine-tuned model weights stored in Object Storage buckets
Encrypted by default and managed by Key Management Service
Version control for model iterations

Inference Data Security

Input Processing:

Requests authenticated via IAM or API keys
Data encrypted in transit using TLS 1.2+
Processed on dedicated endpoints with no cross-tenant access
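
On the client side, the TLS 1.2+ requirement can be enforced rather than assumed. The sketch below uses Python's standard `ssl` module to build a context that refuses older protocol versions; the function name is illustrative, and how the context is passed to an HTTP client depends on the library in use.

```python
import ssl


def make_tls_context() -> ssl.SSLContext:
    """Client context that verifies certificates and refuses TLS below 1.2."""
    ctx = ssl.create_default_context()            # certificate and hostname checks on
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # reject TLS 1.0 and 1.1
    return ctx
```

For example, `urllib.request.urlopen(url, context=make_tls_context())` would apply these settings to a single request.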

Output Generation:

Responses encrypted in transit
Optional logging with encryption at rest
Retention policies configurable per use case

Monitoring:

Continuous monitoring of resource usage to proactively mitigate contention
Audit logs for all API calls
Integration with OCI Logging Analytics

Advanced Security Features

Security Zones

Security Zones associate compartments with Oracle-defined recipes of security policies that are based on best practices.

Enforced Policies:

No Public Access: Resources cannot be accessible from public internet
Customer-Managed Keys: Data encryption enforced using Customer-Managed Keys for block volumes, boot volumes, and Object Storage buckets
Deny Public Buckets: Public access to Object Storage buckets is denied to prevent accidental exposure

Use Case:

Deploy Generative AI workloads in Security Zones to automatically enforce security best practices and maintain compliance.

Compliance and Certifications

OCI maintains extensive compliance certifications relevant to AI workloads:

Regional Compliance:

GDPR: European data protection requirements
HIPAA: Healthcare data in the United States
FedRAMP: U.S. government cloud security
ISO 27001/27017/27018: International security standards

Industry-Specific:

PCI DSS: Payment card industry
SOC 1/2/3: Service organization controls
FIPS 140-2 Level 3: Cryptographic module security

Sovereign Cloud Options

Oracle's distributed cloud, AI infrastructure, and generative AI services enable governments and enterprises to deploy AI factories that run cloud services locally and within a country's secure premises with operational controls supporting sovereign goals.

Available Options:

OCI Dedicated Region: Entire cloud region operated exclusively for a single customer
Oracle Alloy: Cloud platform for partners to deliver sovereign cloud
Oracle EU Sovereign Cloud: European data residency and operational sovereignty
Oracle Government Cloud: FedRAMP-authorized regions for U.S. government

Best Practices for Secure Generative AI Deployments

1. Implement Least-Privilege Access

Principle: Grant only the minimum permissions necessary for each role.

Implementation:

# Development team - read-only on production
Allow group GenAI-Developers to read generative-ai-family in compartment GenAI-Prod

# Production team - full management
Allow group GenAI-Production to manage generative-ai-family in compartment GenAI-Prod
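
When policies like these are managed as code, generating them from a small role-to-verb mapping keeps them consistent and reviewable. The sketch below is one hypothetical way to do that; the mapping mirrors the two statements above, and the helper names are assumptions rather than part of any OCI tooling.

```python
# Role-to-verb mapping mirroring the two statements above.
ROLE_VERBS = {
    "GenAI-Developers": "read",
    "GenAI-Production": "manage",
}

VALID_VERBS = ("inspect", "read", "use", "manage")


def policy_statement(group: str, verb: str,
                     resource: str = "generative-ai-family",
                     compartment: str = "GenAI-Prod") -> str:
    """Render one policy statement, rejecting verbs OCI does not define."""
    if verb not in VALID_VERBS:
        raise ValueError(f"unknown verb: {verb!r}")
    return f"Allow group {group} to {verb} {resource} in compartment {compartment}"


def render_policies(role_verbs: dict) -> list:
    """Render statements for every role, sorted by group name for stable diffs."""
    return [policy_statement(group, verb) for group, verb in sorted(role_verbs.items())]


if __name__ == "__main__":
    for line in render_policies(ROLE_VERBS):
        print(line)
```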

2. Use Customer-Managed Keys

Benefit: Full control over encryption key lifecycle and usage.

Steps:

Create Vault in Key Management Service
Generate or import master encryption key
Create IAM policy allowing services to use key
Configure buckets and resources to use customer key

3. Enable Comprehensive Logging

Audit Logs: Enable audit logging for all API calls to Generative AI services
Flow Logs: Monitor network traffic to/from GPU clusters
Application Logs: Capture inference requests and responses for analysis

4. Leverage Network Isolation

Private Endpoints: Deploy model endpoints in private subnets
Service Gateway: Access Oracle services without internet exposure
Bastion Hosts: Secure administrative access to resources

5. Implement Defense-in-Depth

Multiple Layers:

Network security (NSGs, Security Lists)
Identity security (IAM, MFA)
Data security (encryption at rest and in transit)
Application security (input validation, rate limiting)
Monitoring security (Cloud Guard, Logging Analytics)

6. Regular Security Reviews

Periodic Assessments:

Review IAM policies quarterly
Rotate encryption keys according to policy
Audit user access and remove unused accounts
Update network security rules based on traffic patterns

7. Secure Development Practices

The architecture includes a CI/CD pipeline for promoting models from the playground environment to production.

Separation of Environments:

Development: Unrestricted experimentation with synthetic data
Staging: Production-like with access controls
Production: Maximum security with customer data

Monitoring and Incident Response

Real-Time Monitoring

Metrics to Track:

API request rates and latency
Authentication failures
Unusual access patterns
Resource utilization anomalies
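
Authentication failures are a good first metric to alert on. The sketch below counts failures in a sliding time window and signals when a threshold is reached. It is a generic technique shown for illustration, not a Cloud Guard or OCI Monitoring feature; the class name and thresholds are assumptions, and timestamps are expected in non-decreasing order.

```python
from collections import deque


class FailureWindow:
    """Sliding-window counter that flags bursts of authentication failures."""

    def __init__(self, window_seconds: float, threshold: int):
        self.window = window_seconds
        self.threshold = threshold
        self.events = deque()

    def record(self, timestamp: float) -> bool:
        """Record one failure; return True if the window now meets the threshold."""
        self.events.append(timestamp)
        # Discard failures that have fallen out of the window.
        while self.events and self.events[0] <= timestamp - self.window:
            self.events.popleft()
        return len(self.events) >= self.threshold
```

In practice one window would be kept per user or per source address, so a single noisy client does not mask or trigger alerts for everyone else.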

Alerting:
Configure rules in the OCI Events service to trigger notifications for:

IAM policy changes
Network security group modifications
Encryption key access
Dedicated cluster scaling events

Incident Response

Preparation:

Define incident response procedures
Establish communication channels
Designate response team members

Detection:

Cloud Guard active monitoring
Log analysis for anomalies
User-reported issues

Containment:

Isolate affected resources
Revoke compromised credentials
Block suspicious network traffic

Recovery:

Restore from backups if necessary
Rotate encryption keys
Update security configurations

Lessons Learned:

Document incident timeline
Identify root causes
Update procedures and controls

Future Security Enhancements

Oracle continues investing in security capabilities:

AI-Powered Security:
Machine learning models detecting anomalous behavior in API usage patterns and access patterns.

Zero Trust Architecture:
Moving toward continuous verification of all access requests regardless of source.

Confidential Computing:
NVIDIA BlueField-3 DPUs accelerate networking, storage, and security workloads, enabling hardware-based isolation for sensitive computations.

Enhanced Compliance:
Expanding certifications and attestations for emerging regulatory requirements.

Conclusion

OCI Generative AI security is built on a comprehensive, multi-layered approach that addresses the unique challenges of enterprise AI deployments:

Infrastructure Security:

Dedicated GPUs pooled within RDMA networks exclusively for single customers
Dedicated GPU clusters handling only a single customer's base and fine-tuned models
Ultra-low latency RDMA networking with complete tenant isolation

Data Security:

Customer data access restricted within tenancy
Object Storage encrypted by default with periodic key rotation
Customer-managed keys for maximum control

Identity and Access:

Comprehensive IAM with fine-grained policies
Integration with enterprise identity providers
Multi-factor authentication support

Compliance:

FIPS 140-2 Level 3 certified HSMs
Sovereign cloud options for data residency
Extensive regulatory certifications

Security and privacy of customer workloads is truly an essential design tenet at OCI. By combining dedicated infrastructure, comprehensive encryption, identity controls, and continuous monitoring, OCI Generative AI provides enterprise-grade security that enables organizations to confidently deploy AI at scale.

How does your organization approach AI security? What security features are most critical for your use cases? Share your experiences in the comments.
