
Eliana Lam

Posted on • Originally published at aws-user-group.com

Zurich Insurance Group: Building an Effective Log Management Solution on AWS

Speaker: Samantha Gignac @ AWS FSI Meetup Q4/2024



Agenda

Start with the basics

  • What is log management and why is it critical for financial sector organizations

  • Its role in compliance, security, and operational efficiency

Explore unique challenges faced by financial institutions in log management

  • Handling large data volumes

  • Meeting regulatory requirements

  • Managing costs

Discuss Zurich's specific goals for a log management solution

  • Understand the approach chosen

Dive into the technical details of the solution

  • Outline key AWS services used

  • Explain design principles for scalability, cost-effectiveness, and efficiency

Review the outcomes and benefits achieved

  • Apply lessons to other financial sector organizations

Wrap up with key takeaways

  • Provide actionable insights for organizations using or exploring AWS for log management


The talk was practical and relatable, focusing on real-world strategies and inspiration for the audience's own journey.

Begin with the fundamentals: 

  • Why is log management important in the finance sector?

The basics of log management

  • Log sources: Logs come from servers, appliances, network devices, cloud services, etc.

  • Log collection: Centralizing log collection is essential for visibility and compliance

  • Processing and analysis: Filtering noise, adding context, and identifying patterns or red flags

  • Storage and retention: Secure, organized, and long-term storage (e.g., S3 tiers) is crucial for compliance

  • Insights and actions: Leveraging data for actionable insights and responses

Log management is crucial for financial services for three main reasons:

  • Compliance: Regulators require proof of due diligence, and logs serve as evidence.

  • Security: Logs are essential for early threat detection and mitigation.

  • Operational efficiency: Smoothly running systems lead to happy customers, employees, and regulators.

In short, log management is not just about collecting data but about transforming it into useful insights for system protection, regulatory compliance, and operational efficiency.



Key pain points:

Specific challenges Zurich faced in log management:

  • Centralized logging can become expensive quickly.

  • The previous approach treated all logs equally, storing everything in high-cost storage and processing every log as critical, which led to rapid cost escalation.

Data volume:

  • Financial services generate vast amounts of log data from various sources (firewall rules, access attempts, vulnerability scans, etc.).

  • Security systems, due to sensitive data handling, generate terabytes of logs daily.

  • Not all security logs are equally important; some can be analyzed later, while critical alerts require immediate attention.

Compliance regulations:

  • Regulations like PCI DSS and GDPR mandate log retention (e.g., seven years for audits) and detailed access records.

  • Ensuring logs are secure, tamper-proof, and accessible without excessive cost is challenging.

Cost efficiency:

  • Balancing the need for comprehensive log management with cost-effective solutions is crucial.


Treating all logs equally leads to resource waste:

  • Not all logs require high-cost, real-time processing.

  • Example: One-time debug logs vs. API access logs containing security-critical information.

  • Need to store and retain logs based on their importance and potential use.

Complexity and integration challenges:

  • Logs come from various sources (legacy systems, cloud services, apps, devices) with different formats.

  • Translating between systems can lead to delays and blind spots.

  • Example: The vulnerability management system and the CASB (Cloud Access Security Broker) tool use different formats.

Security and real-time analysis:

  • SIEM (Security Information and Event Management) logs are critical for threat detection and response.

  • Need to analyze logs quickly and efficiently to detect patterns (e.g., failed login attempts indicating a brute force attack).

  • Trust is crucial in the financial sector, so timely threat detection is essential.
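The brute-force example above can be sketched as a simple sliding-window count of failed logins per source IP. The window length and failure threshold below are illustrative assumptions, not Zurich's actual correlation rules:

```python
from collections import defaultdict, deque

# Illustrative thresholds -- real SIEM correlation rules are tuned per environment.
WINDOW_SECONDS = 60
MAX_FAILURES = 5

def detect_brute_force(events, window=WINDOW_SECONDS, threshold=MAX_FAILURES):
    """Flag source IPs with too many failed logins inside a sliding time window.

    `events` is an iterable of (timestamp_seconds, source_ip, succeeded) tuples.
    """
    failures = defaultdict(deque)  # ip -> timestamps of recent failures
    flagged = set()
    for ts, ip, succeeded in events:
        if succeeded:
            continue
        q = failures[ip]
        q.append(ts)
        # Drop failures that have fallen out of the window.
        while q and ts - q[0] > window:
            q.popleft()
        if len(q) >= threshold:
            flagged.add(ip)
    return flagged
```

In a production pipeline this kind of rule runs continuously inside the SIEM; the point here is only that the pattern itself is a cheap streaming computation once the logs are centralized.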

Summary of challenges:

  • Overwhelming data volumes

  • Strict compliance requirements

  • High cost of logging

  • Messy integrations

  • Need for real-time security

Solution goals:

  • Modernize log management approach:

  • Move away from treating every log equally.

  • Use scalable tools to prioritize data collection based on importance.

Reduce cost:

  • Lower cost per gigabyte ingested without compromising performance and compliance.

  • Manage data smarter by filtering out unnecessary logs before ingestion.



Decommission the old SIEM (Security Information and Event Management) infrastructure:

  • Retire legacy systems to reduce technical debt, high maintenance costs, slow performance, and limited scalability.

  • Free up resources for more modern and efficient solutions.

Improve analytics performance:

  • Reduce latency and log search time for faster insights and real-time decision-making.

  • Focus on delivering better performance without unnecessary delays.

Objectives summary:

  • Achieve smarter log management that reduces costs, simplifies the environment, and delivers better performance.

  • Focus on what matters most.



Solution architecture:

Prioritize logs based on importance:

  • High-priority logs (e.g., security events) are routed for immediate analysis.

  • Lower-priority logs (e.g., compliance-related) are stored in cost-effective archives like S3.

  • Ensures resources are focused on critical data while controlling costs.

ETL pipeline:

  • Acts as a traffic controller for logs, filtering out unnecessary data to reduce ingestion costs.

  • Enriches data with additional metadata to make logs more actionable.

  • Routes data to the appropriate destination (real-time analytics or long-term storage).

  • Critical for reducing cost per gigabyte ingested and ensuring valuable data is available where needed.
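The filter → enrich → route stages can be sketched in a few lines. The source names, metadata fields, and routing rules here are illustrative assumptions; the actual pipeline at Zurich was built with Cribl:

```python
# Minimal sketch of the ETL "traffic controller": filter noise, enrich with
# metadata, route by priority. Source names and fields are hypothetical.

DROP_SOURCES = {"debug", "healthcheck"}            # noise filtered before ingestion
HIGH_PRIORITY_SOURCES = {"siem", "firewall", "auth"}

def process(event):
    """Return (destination, enriched_event), or None if the event is dropped."""
    if event.get("source") in DROP_SOURCES:
        return None                                 # filter: cut ingestion cost
    enriched = {**event, "env": "prod", "pipeline_version": "1.0"}  # enrich
    if event.get("source") in HIGH_PRIORITY_SOURCES:
        return ("opensearch", enriched)             # route: real-time analytics
    return ("s3-archive", enriched)                 # route: cheap compliance storage
```

Dropping events before ingestion is what drives down the cost per gigabyte: anything filtered here is never billed by the downstream analytics tier.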

Route logs into the SIEM and AWS infrastructure:

  • After prioritization and processing, logs are directed into the SIEM and the AWS infrastructure for further analysis and storage.


AWS infrastructure components and their roles:

Amazon OpenSearch Service:

  • Enables real-time analytics with fast log searches.

  • Provides dashboards and monitoring tools for DevOps and security teams.

  • Helps identify root causes quickly (e.g., sudden spike in failed login attempts).
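The failed-login-spike example maps naturally onto an OpenSearch terms aggregation. The index and field names below ("auth-logs", "event.outcome", "source.ip") are hypothetical; a real deployment would match its own log schema:

```python
# Sketch of an OpenSearch query body that surfaces a spike in failed logins,
# grouped by source IP. Field names are illustrative assumptions.

def failed_login_spike_query(minutes=15, top_ips=10):
    """Build a search body counting recent failed logins per source IP."""
    return {
        "size": 0,  # we only want the aggregation, not the raw hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"event.outcome": "failure"}},
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
                ]
            }
        },
        "aggs": {
            "by_source_ip": {
                "terms": {"field": "source.ip", "size": top_ips}
            }
        },
    }

# With the opensearch-py client this body would be passed as:
#   client.search(index="auth-logs", body=failed_login_spike_query())
```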

Amazon S3:

  • Stores all logs for compliance purposes.

  • Scalable, secure, and cost-effective.

  • Perfect for compliance needs.

AWS Glue and Data Catalog:

  • Organizes and makes logs searchable.

  • Reduces time spent searching through raw data.

Amazon Athena:

  • Allows querying data directly in S3 without data movement.

  • Cost-effective for deep dive investigations and ad hoc analysis.

  • Useful for investigating potential compliance issues.
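An ad hoc compliance investigation with Athena might look like the sketch below. The table and column names ("access_logs", "user_id", "event_time") are hypothetical; in practice the Glue Data Catalog defines the real schema, and `run_query` needs AWS credentials and an output bucket to execute:

```python
# Sketch of querying logs in S3 directly via Athena, as described above.
# Table/column names and the results bucket are placeholders, not real values.

def access_history_sql(user_id, days=30):
    """Build the Athena SQL for one user's recent access records.

    String interpolation is used for brevity only; real code should
    validate or parameterize inputs.
    """
    return (
        "SELECT event_time, source_ip, action "
        "FROM access_logs "
        f"WHERE user_id = '{user_id}' "
        f"AND event_time > current_timestamp - interval '{days}' day "
        "ORDER BY event_time DESC"
    )

def run_query(sql, output_s3="s3://example-athena-results/"):
    """Submit the query to Athena; requires AWS credentials."""
    import boto3  # AWS SDK for Python
    client = boto3.client("athena")
    resp = client.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return resp["QueryExecutionId"]
```

Because Athena bills per data scanned and the logs never leave S3, this keeps deep-dive investigations cheap relative to loading everything into the analytics tier.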



Combining data prioritization, ETL capabilities, and AWS services:

  • Modernized log management approach.

  • Reduced costs by storing and processing only the most important logs.

  • Simplified infrastructure by retiring legacy SIEM systems.

  • Improved performance with faster log searches and better analytics.

Benefits and outcomes of the solution:

Cost savings:

  • Reduced SIEM ingestion from 5 TB per day to 500 GB per day.

  • Achieved by deploying Cribl with version control and automatic backups.

  • Unnecessary logs were filtered out before ingestion.

  • Scales rapidly using Terraform.

Performance improvements:

  • A sample query counting firewall events per firewall dropped from 93 seconds to 2 seconds.

  • OpenSearch proved to be fast and easy to manage.

  • One-click deployment and upgrades.

  • Consistent and reliable rollout using version-controlled Terraform.

Clean house and technical debt reduction:

  • Retired the old SIEM infrastructure, eliminating technical debt.

  • Transitioned to a sleek, modern architecture.

Scalability:

  • Utilized AWS services like S3 for tiered data storage (frequent access, infrequent access, and Glacier).

  • Achieved rapid scaling using Terraform for Cribl and other components.

  • The system grows with organizational needs.

Compliance and security:

  • Logs securely stored in S3 and easily searchable in OpenSearch.

  • Made regulatory reporting and security monitoring faster, simpler, and more reliable.



Key lessons learned from the journey:

Prioritize the right data:

  • Not all logs are equal; treat them based on importance.

  • Zurich achieved over 50% reduction in ingestion costs by prioritizing critical logs.

Invest in modern tools:

  • Use tools designed for current challenges (e.g., Cribl, AWS OpenSearch, S3).

  • These tools scale seamlessly with the organization.

Automate for consistency:

  • Utilize automation (e.g., Terraform) for deployments, upgrades, and scaling.

  • Reduces human error and frees teams for strategic tasks.

Think beyond compliance:

  • Modern log management offers faster searches, real-time insights, and better security monitoring.

  • Focus on making data work for the organization, not just meeting regulatory requirements.

Summary:

  • A smarter approach to log management is transformative.

  • Prioritize what matters, invest in modern tools, automate processes, and think beyond compliance.



Future plans for log management:

Continue focusing on cost savings:

  • Refine S3 tiering strategies to optimize storage efficiency.

  • Shift logs between frequent, infrequent, and Glacier tiers based on usage.
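The tiering idea can be expressed as an S3 lifecycle configuration. The transition days, prefix, and bucket name below are illustrative assumptions, not Zurich's actual policy; the dict matches the shape boto3's `put_bucket_lifecycle_configuration` expects:

```python
# Sketch of S3 tiering as a lifecycle configuration. All values are
# illustrative; tune transitions to actual access patterns.

LIFECYCLE_CONFIG = {
    "Rules": [
        {
            "ID": "log-tiering",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # infrequent access
                {"Days": 90, "StorageClass": "GLACIER"},      # cold archive
            ],
            "Expiration": {"Days": 2555},  # ~7 years, per the retention requirement
        }
    ]
}

# Applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-log-archive", LifecycleConfiguration=LIFECYCLE_CONFIG)
```

Once the rule is in place, S3 moves objects between tiers automatically, so the cost optimization requires no ongoing operational work.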

Expand automation:

  • Further streamline scaling and configuration updates using Terraform.

  • Aim for infrastructure that practically runs itself.

Integrate advanced analytics:

  • Utilize OpenSearch's built-in machine learning capabilities like anomaly detection.

  • Catch unusual patterns in log data before they escalate.

Import and optimize SIEM correlation searches and alerts:

  • Continue transitioning off the legacy SIEM by importing these correlation searches and alerts into OpenSearch and optimizing them there.

  • Maintain high monitoring standards while embracing the modern platform.

Strengthen data governance:

  • Enhance metadata management with tools like AWS Glue.

  • Keep logs organized and stay ahead of compliance requirements.

Explore generative AI with OpenSearch:

  • Use OpenSearch as a vector database to power retrieval augmented generation (RAG) for AI use cases.

  • Example: Train AI models using logs to predict system issues before they happen.
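The retrieval step of such a RAG flow maps onto OpenSearch's k-NN search. The index name, vector field ("embedding"), and the idea of embedding log snippets are assumptions about how the use case might be built, not a described implementation:

```python
# Sketch of the retrieval half of RAG over logs: a k-NN query body that
# fetches the most similar stored log snippets for a query embedding.
# Field and index names are hypothetical.

def knn_retrieval_query(query_vector, k=5):
    """Build a k-NN search body returning the k nearest log snippets."""
    return {
        "size": k,
        "query": {
            "knn": {
                "embedding": {          # vector field defined in the index mapping
                    "vector": query_vector,
                    "k": k,
                }
            }
        },
        "_source": ["message", "timestamp"],  # text fed back into the LLM prompt
    }
```

The retrieved snippets would then be injected into the model's prompt as context, which is what lets a general-purpose model reason over an organization's own log history.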
