Speaker: Samantha Gignac @ AWS FSI Meetup Q4/2024
Agenda
Start with the basics
What is log management and why is it critical for financial sector organizations
Its role in compliance, security, and operational efficiency
Explore unique challenges faced by financial institutions in log management
Handling large data volumes
Meeting regulatory requirements
Managing costs
Discuss Zurich's specific goals for a log management solution
Understand the approach chosen
Dive into the technical details of the solution
Outline key AWS services used
Explain design principles for scalability, cost-effectiveness, and efficiency
Review the outcomes and benefits achieved
Apply lessons to other financial sector organizations
Wrap up with key takeaways
Provide actionable insights for organizations using or exploring AWS for log management
Keep it practical and relatable, focusing on real-world strategies and inspiration for the audience's own journey
Begin with the fundamentals:
Why is log management important in the finance sector?
The basics of log management
Log sources: Logs come from servers, appliances, network devices, cloud services, etc.
Log collection: Centralizing log collection is essential for visibility and compliance
Processing and analysis: Filtering noise, adding context, and identifying patterns or red flags
Storage and retention: Secure, organized, and long-term storage (e.g., S3 tiers) is crucial for compliance
Insights and actions: Leveraging data for actionable insights and responses
Log management is crucial for financial services for three main reasons:
Compliance: Regulators require proof of due diligence, and logs serve as evidence.
Security: Logs are essential for early threat detection and mitigation.
Operational efficiency: Smoothly running systems lead to happy customers, employees, and regulators.
Log management is not just about data collection but about transforming data into useful insights for system protection, regulation compliance, and operational efficiency.
Key pain points:
Specific challenges Zurich faced in log management:
Centralized logging can become expensive quickly.
The previous approach treated all logs equally, storing everything in high-cost storage and processing every log as critical, which led to rapid cost escalation.
Data volume:
Financial services generate vast amounts of log data from various sources (firewall rules, access attempts, vulnerability scans, etc.).
Because they handle sensitive data, security systems generate terabytes of logs daily.
Not all security logs are equally important; some can be analyzed later, while critical alerts require immediate attention.
Compliance regulations:
Regulations like PCI DSS and GDPR mandate log retention (e.g., seven years for audits) and detailed access records.
Ensuring logs are secure, tamper-proof, and accessible without excessive cost is challenging.
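The talk doesn't name a specific tamper-proofing mechanism, but on AWS this is commonly done with S3 Object Lock. A minimal boto3 sketch, assuming a bucket that was created with Object Lock enabled (the bucket name and retention period below are illustrative, with the seven-year figure taken from the audit example above):

```python
# Minimal sketch: apply a default WORM retention policy so archived logs
# cannot be modified or deleted during the audit window.
# Assumption: "example-audit-logs" is hypothetical, and Object Lock must
# already be enabled on the bucket at creation time.
import boto3

s3 = boto3.client("s3")

s3.put_object_lock_configuration(
    Bucket="example-audit-logs",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {
            "DefaultRetention": {
                "Mode": "COMPLIANCE",  # cannot be shortened or removed, even by root
                "Years": 7,            # matches the seven-year audit retention example
            }
        },
    },
)
```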
Cost efficiency:
Balancing the need for comprehensive log management with cost-effective solutions is crucial.
Treating all logs equally leads to resource waste:
Not all logs require high-cost, real-time processing.
Example: One-time debug logs vs. API access logs containing security-critical information.
Need to store and retain logs based on their importance and potential use.
Complexity and integration challenges:
Logs come from various sources (legacy systems, cloud services, apps, devices) with different formats.
Translating between systems can lead to delays and blind spots.
Example: a vulnerability management system and a CASB (Cloud Access Security Broker) tool using different formats.
Security and real-time analysis:
SIEM (Security Information and Event Management) logs are critical for threat detection and response.
Need to analyze logs quickly and efficiently to detect patterns (e.g., failed login attempts indicating a brute force attack).
Trust is crucial in the financial sector, so timely threat detection is essential.
Summary of challenges:
Overwhelming data volumes
Strict compliance requirements
High cost of logging
Messy integrations
Need for real-time security
Solution goals:
Modernize log management approach:
Move away from treating every log equally.
Use scalable tools to prioritize data collection based on importance.
Reduce cost:
Lower cost per gigabyte ingested without compromising performance and compliance.
Manage data smarter by filtering out unnecessary logs before ingestion.
Decommission the old SIEM infrastructure:
Retire legacy systems to reduce technical debt, high maintenance costs, slow performance, and limited scalability.
Free up resources for more modern and efficient solutions.
Improve analytics performance:
Reduce latency and log search time for faster insights and real-time decision-making.
Focus on delivering better performance without unnecessary delays.
Objectives summary:
Achieve smarter log management that reduces costs, simplifies the environment, and delivers better performance.
Focus on what matters most.
Solution architecture:
Prioritize logs based on importance:
High-priority logs (e.g., security events) are routed for immediate analysis.
Lower-priority logs (e.g., compliance-related) are stored in cost-effective archives like S3.
Ensures resources are focused on critical data while controlling costs.
ETL pipeline:
Acts as a traffic controller for logs, filtering out unnecessary data to reduce ingestion costs.
Enriches data with additional metadata to make logs more actionable.
Routes data to the appropriate destination (real-time analytics or long-term storage).
Critical for reducing cost per gigabyte ingested and ensuring valuable data is available where needed.
Route logs into the SIEM and AWS infrastructure:
After prioritization and processing, logs are directed into the SIEM and the AWS infrastructure for further analysis and storage (a simplified routing sketch follows below).
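The talk doesn't share pipeline code, so here is a simplified Python sketch of the routing idea (not Zurich's actual Cribl configuration; source categories, log levels, and destinations are all assumptions):

```python
# Illustrative traffic-controller logic: drop noise before ingestion,
# enrich what remains, and route by priority.
from datetime import datetime, timezone

HIGH_PRIORITY_SOURCES = {"firewall", "siem", "auth"}  # assumed categories
NOISE_LEVELS = {"DEBUG", "TRACE"}                     # filtered out, never ingested

def route_event(event: dict) -> tuple[str, dict] | None:
    """Return (destination, enriched_event), or None to drop the event."""
    if event.get("level") in NOISE_LEVELS:
        return None  # dropped pre-ingestion, so it costs nothing downstream

    enriched = {
        **event,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "environment": "prod",  # example of added context/metadata
    }
    if event.get("source") in HIGH_PRIORITY_SOURCES:
        return ("opensearch", enriched)  # real-time analysis path
    return ("s3_archive", enriched)      # low-cost compliance storage

# A debug log is dropped; a firewall event goes to the real-time path.
print(route_event({"source": "app", "level": "DEBUG", "msg": "cache miss"}))
print(route_event({"source": "firewall", "level": "WARN", "msg": "port scan"}))
```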
AWS infrastructure components and their roles:
Amazon OpenSearch Service:
Enables real-time analytics with fast log searches.
Provides dashboards and monitoring tools for DevOps and security teams.
Helps identify root causes quickly (e.g., sudden spike in failed login attempts).
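As an illustration of the failed-login example, a terms aggregation over recent auth logs might look like the sketch below (the endpoint, index pattern, field names, and credentials are all hypothetical; Amazon OpenSearch Service domains typically require SigV4 signing rather than basic auth):

```python
# Count failed logins per source IP over the last 15 minutes.
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "example-domain.es.amazonaws.com", "port": 443}],
    http_auth=("user", "pass"),  # placeholder; use SigV4 for a real AWS domain
    use_ssl=True,
)

resp = client.search(
    index="auth-logs-*",  # hypothetical index pattern
    body={
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"event.outcome": "failure"}},
                    {"range": {"@timestamp": {"gte": "now-15m"}}},
                ]
            }
        },
        "aggs": {"by_source_ip": {"terms": {"field": "source.ip", "size": 10}}},
    },
)
for bucket in resp["aggregations"]["by_source_ip"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])  # IPs hammering the login endpoint
```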
Amazon S3:
Stores all logs for compliance purposes.
Scalable, secure, and cost-effective, making it a natural fit for long-term compliance storage.
AWS Glue and Data Catalog:
Organizes and makes logs searchable.
Reduces time spent searching through raw data.
Amazon Athena:
Allows querying data directly in S3 without data movement.
Cost-effective for deep dive investigations and ad hoc analysis.
Useful for investigating potential compliance issues.
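A sketch of what such an ad hoc Athena query could look like from Python (database, table, columns, and the results bucket are hypothetical; Athena reads the table schema from the Glue Data Catalog, so the data stays in S3):

```python
# Ad hoc compliance investigation: top users by access count for a month.
import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="""
        SELECT user_id, COUNT(*) AS access_count
        FROM access_logs                -- Glue Data Catalog table over S3
        WHERE event_date BETWEEN DATE '2024-01-01' AND DATE '2024-01-31'
        GROUP BY user_id
        ORDER BY access_count DESC
        LIMIT 20
    """,
    QueryExecutionContext={"Database": "compliance_logs"},      # hypothetical
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print("query id:", resp["QueryExecutionId"])  # poll get_query_execution for status
```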
Combining data prioritization, ETL capabilities, and AWS services:
Modernized log management approach.
Reduced costs by storing and processing only the most important logs.
Simplified infrastructure by retiring the legacy SIEM systems.
Improved performance with faster log searches and better analytics.
Benefits and outcomes of the solution:
Cost savings:
Reduced SIEM ingestion from 5 TB per day to 500 GB per day.
Achieved by deploying Cribl with version control and automatic backups.
Unnecessary logs were filtered out before ingestion.
The Cribl deployment scales rapidly via Terraform.
Performance improvements:
A sample query counting firewall events per firewall dropped from 93 seconds to 2 seconds.
OpenSearch proved to be fast and easy to manage.
One-click deployment and upgrades.
Consistent and reliable rollout using version-controlled Terraform.
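The exact benchmark query wasn't shared; a plausible shape for it is a simple terms aggregation (field name and time window are assumptions):

```python
# Assumed shape of the firewall-events-per-firewall query from the benchmark.
query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-24h"}}},  # assumed window
    "aggs": {
        "events_per_firewall": {
            "terms": {"field": "observer.name", "size": 100}  # hypothetical field
        }
    },
}
# Run with an OpenSearch client, e.g.:
# client.search(index="firewall-logs-*", body=query)
```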
Housecleaning and technical debt reduction:
Retired the old SIEM infrastructure, eliminating technical debt.
Transitioned to a sleek, modern architecture.
Scalability:
Utilized AWS services like S3 for tiered data storage (frequent access, infrequent access, and Glacier).
Achieved rapid scaling using Terraform for Cribl and other components.
The system grows with organizational needs.
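The tiering setup itself wasn't shown; here is a minimal boto3 sketch of an S3 lifecycle rule implementing the idea (bucket name, prefix, and day thresholds are assumptions):

```python
# Tier logs down over time: Standard -> Infrequent Access -> Glacier,
# then expire after the retention window.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-log-archive",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "log-tiering",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # rarely searched
                {"Days": 180, "StorageClass": "GLACIER"},     # compliance archive
            ],
            "Expiration": {"Days": 2555},  # ~7 years, per the audit example
        }],
    },
)
```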
Compliance and security:
Logs securely stored in S3 and easily searchable in OpenSearch.
Made regulatory reporting and security monitoring faster, simpler, and more reliable.
Key lessons learned from the journey:
Prioritize the right data:
Not all logs are equal; treat them based on importance.
Zurich achieved over 50% reduction in ingestion costs by prioritizing critical logs.
Invest in modern tools:
Use tools designed for current challenges (e.g., Cribl, Amazon OpenSearch Service, S3).
These tools scale seamlessly with the organization.
Automate for consistency:
Utilize automation (e.g., Terraform) for deployments, upgrades, and scaling.
Reduces human error and frees teams for strategic tasks.
Think beyond compliance:
Modern log management offers faster searches, real-time insights, and better security monitoring.
Focus on making data work for the organization, not just meeting regulatory requirements.
Summary:
A smarter approach to log management is transformative.
Prioritize what matters, invest in modern tools, automate processes, and think beyond compliance.
Future plans for log management:
Continue focusing on cost savings:
Refine S3 tiering strategies to optimize storage efficiency.
Shift logs between frequent, infrequent, and Glacier tiers based on usage.
Expand automation:
Further streamline scaling and configuration updates using Terraform.
Aim for infrastructure that practically runs itself.
Integrate advanced analytics:
Utilize OpenSearch's built-in machine learning capabilities like anomaly detection.
Catch unusual patterns in log data before they escalate.
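For reference, creating a detector through the Anomaly Detection plugin's REST API could look roughly like this (endpoint, index, fields, and credentials are assumptions; a real Amazon OpenSearch Service domain would typically need SigV4-signed requests):

```python
# Sketch: define an anomaly detector over failed-login counts.
import requests

host = "https://example-domain.es.amazonaws.com"  # hypothetical endpoint

detector = {
    "name": "failed-login-anomalies",
    "description": "Flag unusual spikes in failed logins",
    "time_field": "@timestamp",
    "indices": ["auth-logs-*"],  # hypothetical index pattern
    "feature_attributes": [{
        "feature_name": "failed_logins",
        "feature_enabled": True,
        "aggregation_query": {
            "failed_logins": {"value_count": {"field": "event.id"}}
        },
    }],
    "detection_interval": {"period": {"interval": 10, "unit": "Minutes"}},
}

resp = requests.post(
    f"{host}/_plugins/_anomaly_detection/detectors",
    json=detector,
    auth=("user", "pass"),  # placeholder; SigV4 signing on a real AWS domain
    timeout=30,
)
print(resp.json())  # returns a detector id; the detector must then be started
```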
Import and optimize SIEM correlation searches and alerts:
Continue the transition off the legacy SIEM by importing its correlation searches and alerts into OpenSearch and optimizing them there.
Maintain high monitoring standards while embracing the modern platform.
Strengthen data governance:
Enhance metadata management with tools like AWS Glue.
Keep logs organized and stay ahead of compliance requirements.
Explore generative AI with OpenSearch:
Use OpenSearch as a vector database to power retrieval-augmented generation (RAG) for AI use cases.
Example: Train AI models using logs to predict system issues before they happen.
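A sketch of the vector-store side of that idea: a k-NN index for log-derived embeddings (index name, field names, and dimension are illustrative; the embedding model and retrieval chain are out of scope here):

```python
# Create a k-NN index so OpenSearch can serve as the vector store for RAG.
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "example-domain.es.amazonaws.com", "port": 443}],
    http_auth=("user", "pass"),  # placeholder auth
    use_ssl=True,
)

client.indices.create(
    index="log-embeddings",  # hypothetical index
    body={
        "settings": {"index": {"knn": True}},  # enable the k-NN plugin
        "mappings": {
            "properties": {
                "text": {"type": "text"},  # raw log snippet fed to the LLM
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 1536,  # must match the embedding model's output
                },
            }
        },
    },
)

# At query time: embed the question, then retrieve similar log snippets with
# {"query": {"knn": {"embedding": {"vector": [...], "k": 5}}}}
```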