In building a multi-tenant SaaS platform for vulnerability management, the backend architecture must efficiently process diverse JSON vulnerability data from tools like Prowler, Trivy, AWS Inspector, and Kubehunter. The system requires secure data ingestion, normalization to a custom schema, optional AI-driven embeddings for critical vulnerabilities using Amazon Bedrock, and storage in Amazon RDS for real-time dashboard queries and LLM-based remediation. After evaluating multiple serverless approaches, a design leveraging AWS Step Functions with Lambda tasks emerged as the best fit. This article describes the chosen architecture, compares three candidates (purely Lambda-based processing, Step Functions with Lambda, and Step Functions with Glue), and details the rationale for selecting Step Functions with Lambda as a balanced approach to cost, scalability, reliability, and simplicity for an MVP.
Problem Statement and Requirements
The platform must handle variable JSON structures from vulnerability scans, ensuring:
- Secure Ingestion: Support pre-signed URLs for direct S3 uploads to avoid backend bottlenecks.
- Metadata Tracking: Store upload details (e.g., `customer_id`, `source_tool`, `embedding_flag`) for auditing and idempotency.
- Processing Pipeline: Validate uploads, preprocess/normalize data, check for existing customers and deltas, embed critical vulnerabilities, upsert to RDS, validate writes, log metrics, and clean up temporary data.
- Error Handling: Use a Dead Letter Queue (DLQ) for robust failure recovery.
- Post-Processing: Enable optional notifications or analysis triggers.
- Constraints: Target ~$14-23/month for 900 uploads/month across 10 tenants, ensure multi-tenant isolation, support real-time dashboard queries, and scale to 50+ tenants.
The architecture evolved from initial considerations of AWS Glue for ETL to Step Functions for its flexibility in handling conditional logic and tool-specific branching.
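The metadata-tracking and idempotency requirements above can be illustrated with a small sketch. The attribute names `customer_id`, `source_tool`, and `embedding_flag` come from the requirements; the `upload_id`, `status`, and `s3_key` attributes, the key layout, and the function names are illustrative assumptions rather than the platform's actual schema.

```python
import uuid
from datetime import datetime, timezone

SUPPORTED_TOOLS = {"prowler", "trivy", "inspector", "kubehunter"}


def build_upload_record(customer_id: str, source_tool: str, embedding_flag: bool) -> dict:
    """Build the metadata item stored when a pre-signed URL is issued (hypothetical shape)."""
    if source_tool not in SUPPORTED_TOOLS:
        raise ValueError(f"unsupported source_tool: {source_tool}")
    upload_id = str(uuid.uuid4())
    return {
        "upload_id": upload_id,
        "customer_id": customer_id,
        "source_tool": source_tool,
        "embedding_flag": embedding_flag,
        "status": "pending",
        # Assumed convention: a tenant-scoped prefix so access can be limited per customer.
        "s3_key": f"uploads/{customer_id}/{source_tool}/{upload_id}.json",
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


def should_process(record: dict) -> bool:
    """Idempotency guard: skip files already marked as processed."""
    return record.get("status") != "processed"
```

In a real deployment the record would be written to DynamoDB, and the idempotency guard would read it back before any processing starts.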
Chosen Architecture Overview
The selected architecture is fully serverless, using AWS Step Functions to orchestrate Lambda tasks for a streamlined workflow:
- Frontend Upload: API Gateway handles upload requests, invoking a PreSignedUrl Lambda to generate S3 URLs and store metadata in DynamoDB.
- Event Triggering: S3 uploads trigger EventBridge, which directly invokes Step Functions with the event payload (bucket, key).
- Workflow Orchestration: Step Functions coordinates tasks:
  - Idempotency Check: Skips processed files (status = 'processed').
  - Metadata Hydration/Validation: Fetches and validates `customer_id`, `source_tool`, and `embedding_flag` from DynamoDB.
  - Customer/Delta Checks: Verifies customer existence in RDS and identifies new/patched vulns.
  - Tool-Specific Preprocessing: Branches for Prowler, Inspector, Trivy, Kubehunter to validate, deduplicate, and normalize JSON.
  - Data Validation: Ensures required fields and row counts.
  - Batch Upsert: Writes normalized data to RDS (PostgreSQL).
  - Post-Upsert Validation: Verifies write success (e.g., row count match).
  - Conditional Embedding: Fetches critical vulns (if `embedding_flag = true` and `severity = 'critical'`), generates Bedrock embeddings, and updates RDS.
  - Metrics Logging: Logs processed vuln counts to CloudWatch.
  - Cleanup: Deletes S3 files and DynamoDB entries.
- Error Handling: Errors route to a DLQ for analysis.
- Post-Processing: RDS updates trigger an optional Agentic Auto Scaling Group (ASG) for notifications.
This design ensures asynchronous processing, decoupling uploads from computation, with conditional embedding to optimize AI costs. Estimated MVP cost is ~$14-23/month.
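As a rough illustration of how this workflow might be expressed, the sketch below builds a simplified Amazon States Language definition as a Python dict. It is not the platform's actual state machine: the Lambda function names, the ARN and queue URL placeholders, the `critical_count` field, and the retry settings are all assumptions, several steps (customer/delta checks, data validation, post-upsert validation) are omitted for brevity, and each Lambda is assumed to return the full payload it received plus its own results.

```python
import json

# Placeholder ARN prefix; real regions, account IDs, and function names will differ.
LAMBDA_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:"
TOOLS = ["prowler", "inspector", "trivy", "kubehunter"]


def task(function_name, next_state):
    """A Lambda task with retries; errors that exhaust retries go to the DLQ state."""
    return {
        "Type": "Task",
        "Resource": LAMBDA_ARN + function_name,
        "Retry": [{"ErrorEquals": ["States.TaskFailed"], "IntervalSeconds": 2,
                   "MaxAttempts": 3, "BackoffRate": 2.0}],
        "Catch": [{"ErrorEquals": ["States.ALL"], "ResultPath": "$.error",
                   "Next": "SendToDLQ"}],
        "Next": next_state,
    }


def build_definition():
    states = {
        "CheckIdempotency": task("idempotency-check", "AlreadyProcessed"),
        "AlreadyProcessed": {
            "Type": "Choice",
            "Choices": [{"Variable": "$.status", "StringEquals": "processed",
                         "Next": "Skip"}],
            "Default": "HydrateMetadata",
        },
        "Skip": {"Type": "Succeed"},
        "HydrateMetadata": task("hydrate-metadata", "RouteByTool"),
        # One Choice rule per supported tool; unknown tools go straight to the DLQ.
        "RouteByTool": {
            "Type": "Choice",
            "Choices": [{"Variable": "$.source_tool", "StringEquals": tool,
                         "Next": f"Preprocess_{tool}"} for tool in TOOLS],
            "Default": "SendToDLQ",
        },
        "UpsertToRds": task("batch-upsert", "ShouldEmbed"),
        # Embed only when the tenant opted in and critical findings exist.
        "ShouldEmbed": {
            "Type": "Choice",
            "Choices": [{"And": [
                {"Variable": "$.embedding_flag", "BooleanEquals": True},
                {"Variable": "$.critical_count", "NumericGreaterThan": 0},
            ], "Next": "EmbedCritical"}],
            "Default": "LogAndCleanup",
        },
        "EmbedCritical": task("embed-critical", "LogAndCleanup"),
        "LogAndCleanup": task("log-and-cleanup", "Done"),
        "Done": {"Type": "Succeed"},
        "SendToDLQ": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sqs:sendMessage",
            "Parameters": {"QueueUrl": "DLQ_URL_PLACEHOLDER", "MessageBody.$": "$"},
            "Next": "Failed",
        },
        "Failed": {"Type": "Fail", "Error": "IngestionFailed"},
    }
    for tool in TOOLS:
        states[f"Preprocess_{tool}"] = task(f"preprocess-{tool}", "UpsertToRds")
    return {"StartAt": "CheckIdempotency", "States": states}


def dangling_targets(definition):
    """Return transition targets that do not name a defined state."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        targets.update(v for k, v in state.items() if k in ("Next", "Default"))
        for rule in state.get("Choices", []) + state.get("Catch", []):
            targets.add(rule["Next"])
    return targets - set(states)


if __name__ == "__main__":
    print(json.dumps(build_definition(), indent=2))
```

Generating the definition from code keeps the per-tool branches in one loop, and the `dangling_targets` helper gives a cheap sanity check before deploying.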
Comparison of Architecture Alternatives
Three serverless architectures were evaluated to meet the platform’s requirements. Each is assessed on cost, scalability, reliability, operational complexity, and suitability for conditional logic and multi-tool branching.
1. Purely Lambda-Based Processing
Description: A single Lambda function (or chained Lambdas) handles the entire workflow: downloading from S3, preprocessing, normalizing, embedding, and upserting to RDS. Metadata is stored in DynamoDB, with errors logged to CloudWatch or a DLQ.
- Cost: Low (~$0.06/month for 900 invocations, 128 MB memory, 5-second duration). Pay-per-request billing suits sporadic uploads.
- Scalability: Excellent, with Lambda auto-scaling to thousands of concurrent executions, supporting growing tenants without reconfiguration.
- Reliability: Moderate. Built-in retries (up to 3 attempts) handle transient failures, but complex flows (e.g., tool branching, conditionals) require custom error handling, risking uncaught exceptions.
- Operational Complexity: Medium. Sequencing, branching, and retries must be coded manually, leading to monolithic or fragile Lambda chains. Monitoring requires custom CloudWatch metrics, increasing development effort (~2-3 days).
- Suitability: Limited for conditional logic (e.g., embedding only critical vulns) and multi-tool branching, as these bloat Lambda code. Testing is challenging without visual orchestration, and retry costs add up (~$0.01 per failed invocation).
Drawbacks: Lacks structured orchestration, making it error-prone for complex workflows.
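To make the "coded manually" point concrete, here is a minimal sketch of the sequencing and retry plumbing a monolithic handler would need to carry itself. The `TransientError` class, the helper names, and the backoff values are hypothetical and only illustrate the pattern.

```python
import time


class TransientError(Exception):
    """Stand-in for throttling or timeouts from downstream services."""


def with_retries(fn, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on TransientError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == attempts:
                raise  # retries exhausted; the caller must route this to a DLQ
            sleep(base_delay * 2 ** (attempt - 1))


def run_pipeline(steps, payload, sleep=time.sleep):
    """Run steps in order; each step takes the payload dict and returns a new one."""
    for step in steps:
        payload = with_retries(lambda: step(payload), sleep=sleep)
    return payload
```

Every branch, conditional, and failure path added to the workflow grows this hand-written plumbing, which is the maintenance cost the comparison refers to.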
2. Step Functions with Lambda
Description: The chosen architecture uses Step Functions to orchestrate Lambda tasks for discrete steps: idempotency, metadata validation, tool-specific preprocessing, conditional embedding, upserting, and cleanup. EventBridge triggers the workflow directly, with DynamoDB for metadata and DLQ for errors.
- Cost: Slightly higher than pure Lambda (~$0.16/month for 900 executions, 7 transitions each at $0.000025/transition), but total ~$14-23/month including RDS (~$12.41), Bedrock (~$0.09), and other (~$1.48). Pay-per-use aligns with MVP constraints.
- Scalability: High, with Step Functions supporting 1,000+ concurrent executions and Lambdas scaling automatically. Ideal for adding tenants/tools without refactoring.
- Reliability: Strong, with built-in retries (configurable per task), error catching, and branching ensuring graceful failure handling (e.g., skipping embedding for non-critical vulns). The visual console simplifies debugging.
- Operational Complexity: Low. Step Functions’ visual editor reduces sequencing/retry code, and modular Lambdas (e.g., separate for Prowler preprocessing) enhance maintainability. Setup takes ~1-2 days.
- Suitability: Excellent for conditional logic (Choice states for embedding) and tool branching (Map/Choice states). Simplifies testing (execution traces) and extends easily (e.g., add validation tasks).
Advantages: Balances flexibility, reliability, and simplicity, making it ideal for the platform’s workflow.
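The cost figures quoted for this option can be reproduced with simple arithmetic. The sketch below uses only the numbers stated above and ignores any free-tier allowance; the variable names are illustrative.

```python
UPLOADS_PER_MONTH = 900
TRANSITIONS_PER_EXECUTION = 7
PRICE_PER_TRANSITION = 0.000025  # USD per state transition, as quoted above

step_functions = UPLOADS_PER_MONTH * TRANSITIONS_PER_EXECUTION * PRICE_PER_TRANSITION

# Remaining line items are the article's own estimates (USD/month).
monthly = {
    "rds": 12.41,
    "bedrock": 0.09,
    "other": 1.48,
    "step_functions": step_functions,
}
total = sum(monthly.values())

print(f"Step Functions: ${step_functions:.2f}/month")  # Step Functions: $0.16/month
print(f"Total (low end): ${total:.2f}/month")          # Total (low end): $14.14/month
```

The sum lands at the low end of the ~$14-23/month range, and orchestration accounts for roughly one percent of it.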
3. Step Functions with Glue
Description: Step Functions orchestrates AWS Glue jobs for ETL tasks (e.g., normalization, preprocessing) and Lambda for non-ETL tasks (e.g., embedding, metadata). Glue handles JSON parsing, while Step Functions manages flow.
- Cost: Moderate (~$2-15/month for Glue Python shell jobs, 10-min daily runs) + Step Functions (~$0.16/month) = ~$2.16-15.16/month ETL. Higher than Lambda-only for low volumes.
- Scalability: Strong for big data (Glue DPUs scale with volume), but overkill for MVP’s 50 GB. Step Functions adds orchestration.
- Reliability: Good, with Glue retrying ETL tasks and Step Functions handling flow. However, Glue’s per-run billing minimum (10 minutes on legacy Glue versions, 1 minute on current ones) wastes resources for small jobs.
- Operational Complexity: Medium-high. Glue’s visual editor aids ETL, but tool-specific branching and conditional embedding require custom PySpark scripts, increasing complexity (~2-3 days setup).
- Suitability: Effective for normalization (DynamicFrames handle schema variations), but less flexible for conditional embedding (Glue integrates Bedrock via SDK, but branching is cumbersome). Better suited for large-scale ETL than MVP’s moderate uploads.
Drawbacks: Adds unnecessary cost and complexity for small datasets.
Rationale for Selecting Step Functions with Lambda
The Step Functions with Lambda architecture was selected for its optimal balance of cost, scalability, reliability, and operational simplicity, tailored to the platform’s requirements.
- Cost-Effectiveness: At ~$14-23/month, it outperforms Step Functions with Glue (~$2-15/month extra due to Glue’s billing minimum) and mitigates pure Lambda’s hidden costs from custom retry logic (~$0.01/invocation). Pay-per-use billing leverages AWS Free Tier, keeping MVP costs low. For example, 900 uploads/month with 7 transitions each (~$0.16) is negligible compared to Glue’s $2-15/month.
- Scalability and Flexibility: Step Functions’ visual orchestration excels for conditional logic (e.g., Choice states to embed only critical vulns, saving ~50-80% Bedrock costs) and tool branching (e.g., Map states for parallel preprocessing). It scales serverlessly to 1,000+ executions, supporting growth to 50+ tenants, unlike pure Lambda’s monolithic code or Glue’s batch focus.
- Reliability and Error Handling: Built-in retries (3 attempts/task), error catching, and DLQ ensure robust processing (e.g., handle invalid JSON, Bedrock throttling). Pure Lambda requires manual exception handling, risking failures, while Glue’s reliability is ETL-specific.
- Operational Simplicity: Setup takes ~1-2 days with Step Functions’ visual editor and modular Lambdas, vs. ~3-5 days for EC2 or Glue-heavy flows. Execution traces simplify testing (e.g., verify RDS writes), and integration with Bedrock/RDS via Lambda SDKs is seamless, unlike pure Lambda’s custom orchestration or Glue’s PySpark complexity.
- Future-Proofing: The workflow supports extensions like OCSF normalization for Security Lake or additional tools via new Choice branches, without disrupting the core flow. Pure Lambda would require redesign, and Glue limits non-ETL tasks.
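One way to keep tool-specific preprocessing modular, so that supporting a new scanner means adding one function and one Choice branch, is a small normalizer registry. The sketch below is an assumption-laden illustration: the input shape given to `normalize_trivy` is deliberately simplified and is not Trivy's real report schema, and the normalized field names (`vuln_id`, `severity`, `resource`) are hypothetical.

```python
from typing import Callable, Dict, List

Normalizer = Callable[[dict], List[dict]]
NORMALIZERS: Dict[str, Normalizer] = {}


def register(tool: str):
    """Decorator that registers a normalizer under a source_tool name."""
    def decorator(fn: Normalizer) -> Normalizer:
        NORMALIZERS[tool] = fn
        return fn
    return decorator


@register("trivy")
def normalize_trivy(report: dict) -> List[dict]:
    # Simplified, assumed input shape; real tool output needs its own mapping.
    return [
        {"vuln_id": v["id"], "severity": v["severity"].lower(), "resource": v["package"]}
        for v in report.get("vulnerabilities", [])
    ]


def normalize(source_tool: str, report: dict) -> List[dict]:
    """Normalize a report and drop duplicate (vuln_id, resource) pairs, keeping order."""
    try:
        normalizer = NORMALIZERS[source_tool]
    except KeyError:
        raise ValueError(f"no normalizer registered for {source_tool!r}") from None
    seen, rows = set(), []
    for row in normalizer(report):
        key = (row["vuln_id"], row["resource"])
        if key not in seen:
            seen.add(key)
            rows.append(row)
    return rows


def critical_only(rows: List[dict], embedding_flag: bool) -> List[dict]:
    """Select the rows eligible for embedding: opt-in tenants, critical severity only."""
    return [r for r in rows if embedding_flag and r["severity"] == "critical"]
```

Registering another tool then only requires a new `@register("...")` function; the deduplication and the embedding filter stay shared.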
Benefits of the Chosen Architecture
- Cost Efficiency: ~$14-23/month for 900 uploads, leveraging pay-per-use and free tiers, vs. ~$40-50/month for EC2 t4g.medium.
- Performance: Asynchronous processing (~seconds for uploads), low-latency RDS writes (~ms) for dashboard queries.
- Security: Pre-signed URLs, IAM roles, and VPC endpoints protect data; DLQ aids auditing.
- Extensibility: Modular tasks allow new tools or AI features (e.g., Bedrock agents) with minimal changes.
- Monitoring: CloudWatch and X-Ray provide visibility (~$0.50/month in logs, largely covered by the free tier).
Conclusion
The Step Functions with Lambda architecture exemplifies AWS best practices, delivering a scalable, reliable, and cost-effective solution for vulnerability data processing. By prioritizing orchestration over pure Lambda’s fragility and Glue’s complexity, it meets MVP needs while enabling future growth. Developers and architects can explore AWS documentation for implementation details or share feedback to refine this approach.