<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ibrahim Salami</title>
    <description>The latest articles on DEV Community by Ibrahim Salami (@dphenomenal).</description>
    <link>https://dev.to/dphenomenal</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1827381%2F57c61384-2109-4707-9886-8be05d39b64f.jpeg</url>
      <title>DEV Community: Ibrahim Salami</title>
      <link>https://dev.to/dphenomenal</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dphenomenal"/>
    <language>en</language>
    <item>
      <title>How Ragie Outperformed the FinanceBench Test</title>
      <dc:creator>Ibrahim Salami</dc:creator>
      <pubDate>Wed, 23 Oct 2024 16:22:01 +0000</pubDate>
      <link>https://dev.to/ragieai/how-ragie-outperformed-the-financebench-test-cph</link>
      <guid>https://dev.to/ragieai/how-ragie-outperformed-the-financebench-test-cph</guid>
      <description>&lt;p&gt;In this article, we’ll walk you through how &lt;a href="https://ragie.ai/?utm_source=dev.to&amp;amp;utm_medium=organic-posts&amp;amp;utm_campaign=financebench"&gt;Ragie&lt;/a&gt; handled the ingestion of over 50,000+ pages in the &lt;a href="https://github.com/patronus-ai/financebench/tree/main/pdfs" rel="noopener noreferrer"&gt;FinanceBench dataset&lt;/a&gt; (360 PDF files, each roughly 150-250 pages long) in just 4 hours and outperformed the benchmarks in key areas like the Shared Store configuration, where we beat the benchmark by 42%.&lt;/p&gt;

&lt;p&gt;For those unfamiliar, FinanceBench is a rigorous benchmark designed to evaluate RAG systems using real-world financial documents, such as &lt;a href="https://www.investopedia.com/terms/1/10-k.asp" rel="noopener noreferrer"&gt;10-K filings&lt;/a&gt; and earnings reports from public companies. These documents are dense, often spanning hundreds of pages, and mix structured data, such as tables and charts, with unstructured text, making it challenging for RAG systems to ingest the content, retrieve the right passages, and generate accurate answers.&lt;/p&gt;

&lt;p&gt;In the FinanceBench test, RAG systems are tasked with answering real-world financial questions by retrieving relevant information from a dataset of 360 PDFs. The retrieved chunks are fed into a large language model (LLM) to generate the final answer. This test pushes RAG systems to their limits, requiring accurate retrieval across a vast dataset and precise generation from complex financial data.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Complexity of Document Ingestion in FinanceBench&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Ingesting complex financial documents at scale is a critical challenge in the FinanceBench test. These filings contain crucial financial information, legal jargon, and multi-modal content, and they require advanced ingestion capabilities to ensure accurate retrieval.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Document Size and Format Complexity&lt;/strong&gt;: Financial datasets consist of structured tables and unstructured text, requiring a robust ingestion pipeline capable of parsing and processing both data types. &lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Handling Large Documents&lt;/strong&gt;: A single 10-K often exceeds 150 pages, so your RAG system must efficiently manage thousands of pages while ensuring that ingestion speed does not compromise accuracy (a tough capability to build). &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How We Evaluated Ragie Using the FinanceBench Test&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The RAG system was tasked with answering 150 complex real-world financial questions. This rigorous evaluation process was pivotal in understanding how effectively Ragie could retrieve and generate answers compared to the gold answers set by human annotators. &lt;/p&gt;

&lt;p&gt;Each entry features a question (e.g., "Did AMD report customer concentration in FY22?"), the corresponding answer (e.g., “Yes, one customer accounted for 16% of consolidated net revenue”), and an evidence string that provides the necessary information to verify the accuracy of the answer, along with the relevant document's page number. &lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Grading Criteria:&lt;/strong&gt;
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt;: Matching the gold answers for correct responses.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Refusals&lt;/strong&gt;: Cases where the LLM avoided answering, reducing the likelihood of hallucinations.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Inaccurate Responses&lt;/strong&gt;: Instances where incorrect answers were generated.
&lt;/li&gt;
&lt;/ol&gt;
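&lt;p&gt;For concreteness, here's a minimal sketch of how graded responses can be tallied into these three buckets. The entry shape and labels are illustrative assumptions, not Ragie's actual evaluation harness:&lt;/p&gt;

```javascript
// Minimal sketch of the three-bucket grading described above. The entry
// shape and labels ("correct" / "refusal" / "incorrect") are illustrative
// assumptions, not Ragie's actual evaluation harness.
function gradeSummary(results) {
  const total = results.length;
  const count = (label) => results.filter((r) => r.grade === label).length;
  return {
    accuracy: count("correct") / total,
    refusals: count("refusal") / total,
    incorrect: count("incorrect") / total,
  };
}

// Example: 2 correct, 1 refusal, 1 incorrect out of 4 questions
const summary = gradeSummary([
  { question: "Q1", grade: "correct" },
  { question: "Q2", grade: "correct" },
  { question: "Q3", grade: "refusal" },
  { question: "Q4", grade: "incorrect" },
]);
console.log(summary); // { accuracy: 0.5, refusals: 0.25, incorrect: 0.25 }
```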

&lt;h2&gt;
  
  
  &lt;strong&gt;Ragie’s Performance vs. FinanceBench Benchmarks&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We evaluated Ragie across two configurations:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Single-Store Retrieval:&lt;/strong&gt; In this setup, the vector database contains chunks from a single document, and retrieval is limited to that document. Despite being simpler, this setup still presents challenges when dealing with large, complex financial filings. &lt;/p&gt;

&lt;p&gt;We matched the benchmark for Single Vector Store retrieval, achieving 51% accuracy using the setup below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Top_k=32, No rerank&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjf6stbwmbolep2ovl1h2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjf6stbwmbolep2ovl1h2.png" width="800" height="121"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shared Store Retrieval:&lt;/strong&gt; In this more complex setup, the vector database contains chunks from all 360 documents, requiring retrieval across the entire dataset. Ragie achieved 27% accuracy against the Shared Store benchmark of 19%, outperforming it by 42% using this setup:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Top_k=8, No rerank&lt;/code&gt;  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbis5mk04tstd5fbto4p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbis5mk04tstd5fbto4p.png" width="800" height="121"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Shared Store retrieval is the more challenging task because retrieval happens across all documents simultaneously. Ensuring relevance and precision becomes significantly harder: the RAG system must manage content from many different sources while maintaining high retrieval accuracy across a much larger scope of data.&lt;/p&gt;
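&lt;p&gt;The 42% figure is the relative improvement of Ragie's 27% accuracy over the 19% benchmark:&lt;/p&gt;

```javascript
// Relative improvement of Ragie's Shared Store accuracy over the benchmark
const ragie = 27;
const benchmark = 19;
const relativeImprovement = (ragie - benchmark) / benchmark;
console.log(Math.round(relativeImprovement * 100)); // 42
```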

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Insights:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;In a second Single Store run with top_k=8, we ran two tests with rerank on and off: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Without rerank, the test was 50% correct, 32% refusals, and 18% incorrect answers.
&lt;/li&gt;
&lt;li&gt;With rerank on, the test was 50% correct, but refusals increased to 37%, and incorrect answers dropped to 13%.&lt;/li&gt;
&lt;li&gt;Conclusion: Reranking cut incorrect answers from 18% to 13%, effectively reducing hallucinations. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;There was no significant difference between GPT-4o and GPT-4 Turbo’s performance during this test.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why Ragie Outperforms: The Technical Advantages&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Advanced Ingestion Process:&lt;/strong&gt; Ragie's advanced extraction in &lt;a href="https://docs.ragie.ai/reference/createdocument" rel="noopener noreferrer"&gt;&lt;code&gt;hi_res&lt;/code&gt;&lt;/a&gt; mode enables it to extract all the information from the PDFs using a multi-step extraction process described below:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Text Extraction&lt;/strong&gt;: First, we efficiently extract text from PDFs during ingestion to retain the core information.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tables and Figures&lt;/strong&gt;: For more complex elements like tables and images, we use advanced optical character recognition (OCR) techniques to extract structured data accurately.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM Vision Models&lt;/strong&gt;: Ragie also uses LLM vision models to generate descriptions for images, charts, and other non-text elements. This adds a semantic layer to the extraction process, making the ingested data richer and more contextually relevant.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hybrid Search:&lt;/strong&gt; We use hybrid search by default, which gives you the power of semantic search (for understanding context) and keyword-based retrieval (for capturing exact terms). This dual approach ensures both precision and recall. For example, precise financial jargon in the FinanceBench dataset is weighted differently by the keyword component, significantly improving the relevance of retrievals.  &lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalable Architecture:&lt;/strong&gt; While many RAG systems experience performance degradation as dataset size increases, Ragie’s architecture maintains high performance even with 50,000+ pages. Ragie also uses a &lt;a href="https://docs.ragie.ai/docs/summary-index" rel="noopener noreferrer"&gt;summary index&lt;/a&gt; for hierarchical and hybrid hierarchical search; this enhances chunk retrieval by processing chunks in layers and preserving context, so that highly relevant chunks are retrieved for generation. &lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;
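&lt;p&gt;Ragie's internals aren't public, but the hybrid idea itself is simple to sketch: fuse a semantic (vector) similarity score with a keyword-match score. The 50/50 weighting and the toy keyword scorer below are assumptions for illustration only:&lt;/p&gt;

```javascript
// Illustrative hybrid scoring: blend a semantic similarity score with a
// keyword-match score. The 50/50 weighting and the toy keyword scorer are
// assumptions for illustration, not Ragie's actual implementation.
function keywordScore(query, text) {
  // Fraction of query terms that appear verbatim in the chunk
  const terms = query.toLowerCase().split(/\s+/);
  const hay = text.toLowerCase();
  return terms.filter((t) => hay.includes(t)).length / terms.length;
}

function hybridScore(semanticScore, query, text, alpha = 0.5) {
  return alpha * semanticScore + (1 - alpha) * keywordScore(query, text);
}

// A chunk containing the exact jargon "10-K" outranks a chunk that is only
// semantically close when the query uses that exact term.
const exactMatch = hybridScore(0.70, "10-K filing", "AMD 10-K filing, fiscal 2022");
const semanticOnly = hybridScore(0.75, "10-K filing", "annual report for fiscal 2022");
console.log(exactMatch > semanticOnly); // true
```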

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before making a Build vs Buy decision, developers must consider a range of performance metrics, including scalability, ingestion efficiency, and retrieval accuracy. In this rigorous test against FinanceBench, Ragie demonstrated its ability to handle large-scale, complex financial documents with exceptional speed and precision, outperforming the Shared Store accuracy benchmark by 42%.&lt;/p&gt;

&lt;p&gt;If you’d like to see how Ragie can handle your own large-scale or multi-modal documents, you can try &lt;a href="https://secure.ragie.ai/sign-up" rel="noopener noreferrer"&gt;Ragie’s Free Developer Plan.&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Feel free to reach out to us at &lt;a href="mailto:support@ragie.ai"&gt;support@ragie.ai&lt;/a&gt; if you're interested in running the FinanceBench test yourself.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>showdev</category>
      <category>llm</category>
    </item>
    <item>
      <title>How to Build Smarter AI Apps and Reduce Hallucinations with RAG</title>
      <dc:creator>Ibrahim Salami</dc:creator>
      <pubDate>Thu, 10 Oct 2024 21:20:15 +0000</pubDate>
      <link>https://dev.to/ragieai/how-to-build-smarter-ai-apps-and-reduce-hallucinations-with-rag-79i</link>
      <guid>https://dev.to/ragieai/how-to-build-smarter-ai-apps-and-reduce-hallucinations-with-rag-79i</guid>
      <description>&lt;p&gt;With the rise of AI-powered apps, developers are continuously looking for ways to enhance the accuracy and relevance of AI-generated content. One of the most effective methods for achieving this is through &lt;a href="https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation" rel="noopener noreferrer"&gt;Retrieval-Augmented Generation (RAG)&lt;/a&gt;, which combines the power of LLMs with real-time access to external data sources. RAG makes AI applications more reliable, intelligent, and context-aware. Additionally, RAG can mitigate &lt;a href="https://www.ibm.com/topics/ai-hallucinations" rel="noopener noreferrer"&gt;hallucination&lt;/a&gt;&lt;strong&gt;,&lt;/strong&gt; which is when AI models generate false or misleading information.&lt;/p&gt;

&lt;p&gt;In this blog, we’ll explore how developers can use RAG to build smarter AI apps and reduce hallucinations.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What Is Retrieval-Augmented Generation (RAG)?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;RAG is an advanced technique that enhances LLMs by allowing them to pull real-time, relevant information from external databases, knowledge bases, or other sources. Traditional LLMs rely solely on the data they were trained on, which can lead to inaccurate or outdated results, especially when faced with complex, domain-specific questions. RAG provides a retrieval mechanism that can tap into live data sources, enabling LLMs to generate more accurate and relevant responses.&lt;/p&gt;
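&lt;p&gt;The retrieval-then-generation loop can be sketched in a few lines. The bag-of-words embedding below is a toy stand-in for a real embedding model, used only to make the flow concrete:&lt;/p&gt;

```javascript
// Minimal RAG loop: embed the query, retrieve the closest chunks by cosine
// similarity, and assemble a grounded prompt for the LLM. embed() is a toy
// bag-of-words stand-in for a real embedding model.
function embed(text, vocab) {
  const words = text.toLowerCase().split(/\W+/);
  return vocab.map((v) => words.filter((w) => w === v).length);
}

function cosine(a, b) {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return dot / (norm(a) * norm(b) || 1);
}

function retrieve(query, chunks, vocab, topK) {
  const q = embed(query, vocab);
  return chunks
    .map((c) => ({ ...c, score: cosine(q, embed(c.text, vocab)) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

const vocab = ["refund", "policy", "shipping", "days"];
const chunks = [
  { id: 1, text: "Refund policy: refunds accepted within 30 days" },
  { id: 2, text: "Shipping takes 5 business days" },
];
const hits = retrieve("what is the refund policy", chunks, vocab, 1);
const prompt = "Answer using only this context:\n" +
  hits.map((h) => h.text).join("\n") +
  "\n\nQuestion: what is the refund policy";
console.log(hits[0].id); // 1
```

The grounded prompt keeps the LLM anchored to retrieved text rather than its training data, which is the mechanism behind RAG's reduced hallucination rate.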

&lt;h4&gt;
  
  
  &lt;strong&gt;Why Use RAG to Build Smarter AI Applications?&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;RAG has several key benefits that make it ideal for developers looking to build more intelligent AI applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Real-Time Data Access&lt;/strong&gt;: Traditional LLMs are limited by their training data, which can become outdated. RAG addresses this issue by retrieving real-time data from external sources, ensuring responses are up-to-date and accurate.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Improved Accuracy and Reliability&lt;/strong&gt;: While LLMs are proficient at generating text, they can sometimes fabricate information when solid factual information isn’t present in their training data. RAG ensures responses are grounded in real, curated data, making it ideal for tasks where correctness is critical, such as research, journalism, or technical documentation.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Context-Aware Responses&lt;/strong&gt;: RAG’s retrieval mechanism selects information relevant to the input query, ensuring that the responses are accurate and contextually aligned with the specific question or task at hand.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Fast and Efficient Retrieval&lt;/strong&gt;: RAG utilizes vector and other specialized databases to quickly retrieve information based on semantic similarity, ensuring that the right information is available to the LLM in fractions of a second.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Increased Response Precision&lt;/strong&gt;: The combination of RAG with LLMs results in answers that are not only more coherent but also more precise and informative, allowing AI to generate more comprehensive responses across text and multi-modal formats.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Avoiding AI Hallucinations with RAG&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;One of the biggest challenges developers face when using LLMs is dealing with &lt;strong&gt;hallucinations&lt;/strong&gt;. Hallucinations occur when AI systems generate content that is factually incorrect, irrelevant, or misleading, often because the model attempts to fill gaps in its knowledge. Hallucination is a common problem with LLMs; RAG significantly reduces its occurrence by ensuring that responses are anchored in real, external data sources.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;How RAG Reduces Hallucinations:&lt;/strong&gt;
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Real-time Data Retrieval:&lt;/strong&gt; By accessing a continuously updated knowledge base, RAG allows models to generate responses based on current, factual data rather than outdated or incomplete training sets. This real-time retrieval ensures that AI-generated answers remain relevant and accurate.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Factual Consistency:&lt;/strong&gt; RAG encourages models to produce responses that are aligned with the factual data retrieved. Instead of relying on the model’s built-in knowledge, which might contain inaccuracies or contradictions, it conditions the generation process on accurate and structured information from external sources.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Improved Contextual Understanding:&lt;/strong&gt; One of the key benefits of RAG is its ability to retrieve information contextually relevant to the input query. It provides the AI with access to relevant and targeted data, enabling the model to generate more coherent and contextually appropriate responses. This helps avoid hallucinations where the AI might otherwise improvise.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How Ragie Helps Developers Build Smarter Generative AI Apps&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Ragie is a fully managed RAG-as-a-service platform that simplifies the process of building smarter, RAG-powered AI applications. Developers can easily use Ragie APIs to index and retrieve multi-modal data (text, images, PDFs, etc.) to ensure factual accuracy and minimize hallucinations.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Key Features of Ragie that Help Reduce Hallucinations&lt;/strong&gt;
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Easily Sync Data:&lt;/strong&gt; Ragie allows developers to &lt;a href="https://www.ragie.ai/connectors" rel="noopener noreferrer"&gt;connect&lt;/a&gt; their AI systems to external data sources like &lt;a href="https://www.ragie.ai/connectors/google-drive" rel="noopener noreferrer"&gt;Google Drive&lt;/a&gt;, &lt;a href="https://www.ragie.ai/connectors/notion" rel="noopener noreferrer"&gt;Notion&lt;/a&gt;, and &lt;a href="https://www.ragie.ai/connectors/confluence" rel="noopener noreferrer"&gt;Confluence&lt;/a&gt;. This ensures that the AI system always has real-time access to up-to-date and relevant information, reducing the chances of generating outdated or inaccurate responses.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Summary Index:&lt;/strong&gt; Ragie’s advanced “&lt;a href="https://www.ragie.ai/blog/intoducing-ragie-fully-managed-rag-as-a-service" rel="noopener noreferrer"&gt;Summary Index&lt;/a&gt;” feature helps prevent document affinity problems, where the AI might disproportionately rely on a small subset of documents that have high semantic similarity when key facts may be distributed across many documents. It helps the AI retrieve the most relevant sections from multiple diverse documents.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Entity Extraction for Structured Data:&lt;/strong&gt; Ragie offers &lt;a href="https://www.ragie.ai/blog/intoducing-ragie-fully-managed-rag-as-a-service" rel="noopener noreferrer"&gt;entity extraction&lt;/a&gt; capabilities, allowing developers to retrieve structured data from unstructured sources like PDFs or scanned documents. This feature helps AI systems understand and contextualize the information better, reducing the chances of hallucinating incorrect information.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Advanced Chunking and Retrieval:&lt;/strong&gt; Ragie uses advanced &lt;a href="https://www.ragie.ai/blog/our-approach-to-table-chunking" rel="noopener noreferrer"&gt;chunking&lt;/a&gt; methods to break down large documents into manageable parts. This ensures that the AI retrieves only the most relevant chunks of information, providing a more focused and accurate response.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Scalable and Fast Pipelines:&lt;/strong&gt; With Ragie, developers don’t need to worry about building and maintaining complex data ingestion and retrieval pipelines. Ragie’s fully managed service is scalable, reliable, and highly performant, allowing developers to focus on delivering their AI products without any compromises.&lt;/li&gt;
&lt;/ol&gt;
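&lt;p&gt;As a rough illustration of what chunking does (Ragie's actual chunker is considerably more sophisticated, e.g., table-aware), a naive fixed-size chunker with overlap looks like this:&lt;/p&gt;

```javascript
// Naive fixed-size chunking with overlap. This only illustrates the basic
// idea of splitting a large document into overlapping retrievable pieces;
// Ragie's actual chunking is table-aware and more sophisticated.
function chunkText(text, size = 40, overlap = 10) {
  const chunks = [];
  for (let start = 0; ; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break;
  }
  return chunks;
}

const doc = "x".repeat(100); // a 100-character "document"
const pieces = chunkText(doc);
console.log(pieces.length); // 3 chunks of 40 chars, each overlapping by 10
```

The overlap ensures that a fact straddling a chunk boundary still appears whole in at least one chunk, which keeps retrieval from missing it.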

&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;It is critical to ensure that AI-generated content is accurate. &lt;strong&gt;RAG&lt;/strong&gt; helps developers build smarter and more context-aware AI applications, significantly reducing the risk of hallucinations.&lt;/p&gt;

&lt;p&gt;Whether you’re building a &lt;a href="https://mearsheimer.ai/" rel="noopener noreferrer"&gt;chatbot&lt;/a&gt;, a knowledge base, an agent, or an enterprise-grade AI solution, Ragie’s fully managed RAG-as-a-Service platform provides the tools and infrastructure necessary to ensure your AI applications are smarter, faster, and, most importantly, accurate. Ragie’s SDKs are open source; please &lt;a href="https://github.com/ragieai" rel="noopener noreferrer"&gt;star us on GitHub&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://ragie.ai/?utm_source=dev.to&amp;amp;utm_medium=blogposts&amp;amp;utm_campaign=organic_content"&gt;&lt;strong&gt;Try Ragie for free →&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;


</description>
      <category>ai</category>
      <category>rag</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
    <item>
      <title>How not to do code reviews</title>
      <dc:creator>Ibrahim Salami</dc:creator>
      <pubDate>Mon, 07 Oct 2024 18:55:56 +0000</pubDate>
      <link>https://dev.to/dphenomenal/how-not-to-do-code-reviews-hh0</link>
      <guid>https://dev.to/dphenomenal/how-not-to-do-code-reviews-hh0</guid>
      <description>&lt;p&gt;Traditionally, code reviews involved engineers scrutinizing a colleague’s code for errors and ensuring its readability, efficiency, and maintainability.&lt;/p&gt;

&lt;p&gt;This approach results in bottlenecks, especially in large teams, because the right reviewers don’t always have the capacity to review changes when necessary. While solutions such as &lt;a href="https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners" rel="noopener noreferrer"&gt;CODEOWNERS files&lt;/a&gt; try to fix these issues, they can make matters worse by creating knowledge silos and overloading domain experts.&lt;/p&gt;

&lt;p&gt;All this leads to frustration and hinders progress—which, in turn, impacts release timelines and team morale.&lt;/p&gt;

&lt;p&gt;The good news is that teams no longer have to rely on code reviews to find bugs, as they did in the 1970s. These days, you’re much better off relying on automated testing and static code analyzers to identify bugs. Modern code reviews can move beyond error finding and instead focus on growing a team that can maintain a healthy codebase in the long term.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Traditional Approach: Error-Driven Code Reviews
&lt;/h2&gt;

&lt;p&gt;Michael Fagan first &lt;a href="https://ieeexplore.ieee.org/document/5388086/metrics#metrics" rel="noopener noreferrer"&gt;described code reviews&lt;/a&gt; in his 1976 paper titled “Design and Code Inspections to Reduce Errors in Program Development.”&lt;/p&gt;

&lt;p&gt;He focused on a formal inspection process that emphasized error-driven inspections of &lt;a href="http://lib.tkk.fi/Diss/2009/isbn9789512298570/article5.pdf" rel="noopener noreferrer"&gt;functional issues&lt;/a&gt;—issues that impact software’s runtime behavior, causing it to break or produce incorrect results. His proposed process focused exclusively on finding potential errors. It involved a group of developers manually running predefined inputs through the code and verifying that they produced the correct predefined outputs. If the code produced the incorrect output, it needed to be reworked and tested again using the same strategy.&lt;/p&gt;

&lt;p&gt;This process was time-consuming, but such a manual, error-driven approach made sense back then. Automated testing tools were limited, so you needed humans to debug your code.&lt;/p&gt;

&lt;p&gt;However, in practice, an error-driven approach tends to neglect evolvability issues—those that affect the code’s maintainability, readability, and future modifications. These defects often lead to technical debt, which hinders agility and increases development costs in the long run. Error-driven inspections also leave little room for discussions around alternative implementations and code structure considerations.&lt;/p&gt;

&lt;p&gt;An error-driven approach also comes with other challenges. Deciding who should review a code change is a common challenge regardless of how you do code reviews. An error-driven approach exacerbates this issue because it relies so heavily on experienced engineers who are familiar with the codebase and technologies. Without them, obscure errors can make it to production.&lt;/p&gt;

&lt;p&gt;Even if you use a &lt;a href="https://www.aviator.co/blog/a-modern-guide-to-codeowners/" rel="noopener noreferrer"&gt;CODEOWNERS file&lt;/a&gt; to designate individuals or teams as code owners of portions of your codebase—an approach that has many benefits—an error-driven approach still means that code owners become bottlenecks. You only have so many engineers with enough experience to detect issues, which delays code reviews and slows down the development process.&lt;/p&gt;

&lt;p&gt;Most importantly, this approach keeps the most experienced members of a team trapped in a never-ending cycle of reviews and debugging. It creates knowledge silos among code owners and hinders knowledge sharing, which prevents the rest of the team from expanding their knowledge of the codebase.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Modern Approach: Code Reviews as Knowledge Sharing
&lt;/h2&gt;

&lt;p&gt;Over the past twenty years, modern tooling, such as static analyzers, and development practices, like automated tests, have become exponentially better at finding errors, especially functional ones. These tools allow developers to focus on higher-level concerns, such as knowledge transfer, code architecture, and long-term code maintainability.&lt;/p&gt;

&lt;p&gt;Instead of using code reviews for error finding and code fixes, developers can now focus on the strategic aspects of code changes and knowledge sharing.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Role of Automated Tests
&lt;/h3&gt;

&lt;p&gt;Automated unit and integration tests are far better at finding logical bugs in code than human reviewers.&lt;/p&gt;

&lt;p&gt;When reviewers look for these logic issues, they often run through the code line-by-line using different inputs and see if any lines cause the code to produce the wrong output. This takes significantly longer than an automated test, which can execute the code instantly and verify different inputs produce the correct outputs.&lt;/p&gt;

&lt;p&gt;Automated tests can also consistently identify issues, whereas reviewers might miss them due to bias or human error.&lt;/p&gt;

&lt;p&gt;Effective automated testing requires discipline to write proper tests, though. You need to take the time to identify different inputs and determine the correct output for each to develop comprehensive test cases. This includes identifying erroneous inputs and figuring out how the code should respond to them. Once you’ve identified different test cases, you need to write automated tests to check each case. Reviewers should also analyze automated tests and code changes to find any edge cases that might not be covered by existing tests.&lt;/p&gt;

&lt;p&gt;This means effective automated testing does add development time—engineers need to write automated tests for every line of new code.&lt;/p&gt;

&lt;p&gt;However, this time is made up in the review process since reviewers can then rely on automated tests to find logical bugs rather than manually testing the code with different inputs.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Role of Static Analyzers
&lt;/h3&gt;

&lt;p&gt;While automated tests can pick up logical issues in code, they don’t identify code vulnerabilities. Automated tests focus more on &lt;em&gt;how&lt;/em&gt; software runs rather than &lt;em&gt;what&lt;/em&gt; it uses to run. However, static code analyzers solve this problem.&lt;/p&gt;

&lt;p&gt;A static code analyzer examines code and its dependencies for potential security flaws. If it finds vulnerabilities, it alerts the code author to fix them by changing the affected lines of code or updating the dependencies. Without a static code analyzer, you’d need an experienced engineer to catch many of these vulnerabilities and keep them from making it into production.&lt;/p&gt;

&lt;p&gt;For example, the JavaScript code sample below demonstrates an issue a static analyzer would detect that a developer might miss:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const n = NaN;

// This check is always false: NaN never compares equal to anything,
// including itself
if (n === NaN) {
    console.log("Number is invalid");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;JavaScript uses &lt;code&gt;NaN&lt;/code&gt;, which stands for &lt;a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/NaN" rel="noopener noreferrer"&gt;Not-A-Number&lt;/a&gt;, when you try to convert a non-numerical value to a number. To check if a variable is &lt;code&gt;NaN&lt;/code&gt;, &lt;a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Number/isNaN#description" rel="noopener noreferrer"&gt;you should always use &lt;code&gt;Number.isNaN(n)&lt;/code&gt; instead of &lt;code&gt;n === NaN&lt;/code&gt;&lt;/a&gt;. It’s likely that a developer would miss this small detail, but an analyzer would pick it up immediately.&lt;/p&gt;
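&lt;p&gt;The corrected check looks like this:&lt;/p&gt;

```javascript
const n = Number("not a number"); // produces NaN

console.log(n === NaN);       // false: NaN never equals itself
console.log(Number.isNaN(n)); // true: the reliable check
```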

&lt;p&gt;Static code analyzers can also enforce style guides. A static analyzer that’s been configured to use your style guide can run through changed code and identify any lines that violate those guidelines, such as incorrect naming or spacing issues. Static analyzers also often come preconfigured with coding best practices that allow them to find performance optimizations and maintainability issues such as missing documentation and overly complicated code.&lt;/p&gt;

&lt;p&gt;Modern AI-powered static code analyzers can identify even more issues. Code analyzers that don’t use AI parse code and look for patterns that might cause bugs or create security vulnerabilities. While these patterns can identify some evolvability issues, such as code structure and style, they’re still limited.&lt;/p&gt;

&lt;p&gt;However, AI analyzers can be trained on a codebase to understand the code architecture. When new changes are proposed, they can check if the changes align with the code’s architecture. They can also make sure the code fully meets the requirements and notify the author if any are not met. Because they’re better at understanding the bigger picture, AI-powered code analyzers can detect maintainability and code architecture issues that only human reviewers were able to catch before.&lt;/p&gt;

&lt;p&gt;As with automated tests, using static analyzers doesn’t mean human reviewers are no longer involved. Reviewers must still examine the results for any false positives, but it takes a fraction of the time it did before these tools were available.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code Review for Knowledge Sharing
&lt;/h3&gt;

&lt;p&gt;With so much manual labor out of the way, code reviews now have a new purpose: building a healthier codebase in the long term and a more resilient, adaptable team.&lt;/p&gt;

&lt;p&gt;Code reviews should now be conducted in a way that fosters collaboration and continuous learning. Feedback and discussion are no longer mainly meant to improve the current code; they’re meant to grow a team that can build a healthy codebase in the long term.&lt;/p&gt;

&lt;p&gt;Code reviews are also almost exclusively focused on evolvability defects, such as checking for missing documentation, improving algorithmic efficiency, and reducing cyclomatic complexity that might cause bottlenecks or maintainability issues. The aim isn’t only to fix a given issue, though, but to help the team learn from it. Code reviews might involve discussing these issues to expose less experienced engineers to higher-level programming concepts and help them see the big picture.&lt;/p&gt;

&lt;p&gt;The aim is for all engineers, regardless of experience level, to contribute to and benefit from the review process.&lt;/p&gt;

&lt;p&gt;One example of this modern approach to code reviews is peer programming, where multiple engineers work together on a single piece of functionality. The engineer writing the code assumes the driver’s role while others review the code as it’s written and offer suggestions or point out potential errors.&lt;/p&gt;

&lt;p&gt;You can strategically pair more experienced domain experts with less experienced reviewers to accelerate learning and reduce knowledge silos. Less experienced engineers gain exposure to expert feedback and best practices, while seniors benefit from fresh perspectives and must clearly articulate their reasoning.&lt;/p&gt;

&lt;p&gt;Peer programming isn’t always possible—a team might be too small or spread across different time zones. In these cases, you can use pull requests or even email threads and mailing lists to achieve the same aims.&lt;/p&gt;

&lt;p&gt;The emphasis should still be on thoroughly discussing and explaining issues and concepts rather than just pointing out issues that need fixing. Other engineers not involved in the review process can also read through these discussions at a later stage to understand why certain changes were made and how they fit into the big picture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Traditional code review processes focused solely on error hunting, which created bottlenecks and hindered team growth. While some tools, such as CODEOWNERS files, try to improve the process, an error-driven approach still doesn’t accommodate knowledge sharing, and you don’t get the full benefit of code reviews.&lt;/p&gt;

&lt;p&gt;But these days, automated tests and static analyzers can pick up defects faster and more accurately than human reviewers. This means code reviews should focus less on error finding and instead prioritize knowledge sharing to avoid knowledge silos and encourage team growth.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.aviator.co/flexreview" rel="noopener noreferrer"&gt;Aviator FlexReview&lt;/a&gt; is built to encourage knowledge sharing during code reviews. Like CODEOWNERS files, FlexReview reduces the effort of assigning reviewers to code changes—but with more flexibility. It considers reviewers’ workloads and availability, and you can configure it to assign less experienced reviewers with domain experts to facilitate knowledge sharing as part of the review process. You can &lt;a href="https://app.aviator.co/auth/register" rel="noopener noreferrer"&gt;register for a free account&lt;/a&gt; to try it out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.aviator.co%2Fblog%2Fwp-content%2Fuploads%2F2024%2F07%2Fblog-cta-8FR_CTA.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.aviator.co%2Fblog%2Fwp-content%2Fuploads%2F2024%2F07%2Fblog-cta-8FR_CTA.svg" alt="flexreview teams"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>showdev</category>
      <category>devops</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How to improve DORA metrics as a release engineer</title>
      <dc:creator>Ibrahim Salami</dc:creator>
      <pubDate>Tue, 01 Oct 2024 11:03:23 +0000</pubDate>
      <link>https://dev.to/aviator_co/how-to-improve-dora-metrics-as-a-release-engineer-4and</link>
      <guid>https://dev.to/aviator_co/how-to-improve-dora-metrics-as-a-release-engineer-4and</guid>
      <description>&lt;p&gt;Ensuring efficient, reliable, high-quality software releases is crucial in software development. This is where release engineering comes into play. This blog will explore release engineering, its importance, and how release engineers can significantly influence key DevOps Research and Assessment (DORA) metrics.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is Release Engineering?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Release engineering is a specialized discipline within software development focused on the processes and practices that ensure software is built, packaged, and delivered efficiently and reliably. It involves coordinating various aspects of software creation, from source code management to deployment.&lt;/p&gt;

&lt;p&gt;A release engineer ensures that software releases are smooth and efficient, maintaining high standards of quality and reliability. They manage the build and deployment pipelines, automate repetitive tasks, and work closely with development, operations, and QA teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Key Components of Release Engineering&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Version Control&lt;/strong&gt;: Managing code changes using systems like Git and implementing branching strategies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build Automation:&lt;/strong&gt; Utilizing tools like Maven, Gradle, or Make to automate the build process alongside CI tools like Jenkins or GitHub Actions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Artifact Management&lt;/strong&gt;: Storing build artifacts in repositories such as JFrog Artifactory, Nexus, or AWS S3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing:&lt;/strong&gt; Implementing automated testing strategies, including unit, integration, and end-to-end tests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment Automation:&lt;/strong&gt; Using CD tools like Spinnaker or ArgoCD to automate deployments, managed with IaC tools like Terraform or Ansible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration Management:&lt;/strong&gt; Handling environment-specific configurations with tools like HashiCorp Consul or AWS Parameter Store.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring and Logging:&lt;/strong&gt; Employing tools like Prometheus, Grafana, or the ELK Stack to monitor performance and centralize logging.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Importance of Release Engineering&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Release engineering is crucial for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Ensuring efficient and reliable software releases – Streamlined processes reduce downtime and ensure consistent releases.&lt;/li&gt;
&lt;li&gt;  Reducing human error through automation – Automation minimizes the risk of errors, ensuring more predictable outcomes.&lt;/li&gt;
&lt;li&gt;  Enhancing collaboration – Bridging gaps between development, operations, and QA teams improves overall workflow.&lt;/li&gt;
&lt;li&gt;  Quick rollback and recovery mechanisms – Effective release engineering ensures that issues can be swiftly addressed and systems restored.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DORA (DevOps Research and Assessment) metrics are essential performance indicators used to assess the effectiveness of software delivery and operational practices. They provide insights into the performance and health of DevOps processes.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Importance of DORA Metrics&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;DORA metrics are essential because they help organizations understand their software delivery performance, identify areas for improvement, and drive continuous improvement. They offer a data-driven approach to enhancing efficiency and reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Key DORA Metrics&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Deployment Frequency&lt;/strong&gt;: Deployment frequency measures how often new code is deployed to production. Higher frequency indicates a more agile and responsive development process.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Lead Time for Changes&lt;/strong&gt;: Lead time for changes measures the duration from when a code change is committed until it is deployed to production. Shorter lead times indicate a more efficient development pipeline.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Change Failure Rate&lt;/strong&gt;: Change failure rate indicates the percentage of deployments that lead to a failure in the production environment. Lower rates indicate more reliable releases.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Mean Time to Recovery (MTTR)&lt;/strong&gt;: MTTR calculates the duration required to restore service following a failure. A lower MTTR signifies a more resilient and responsive system.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Real-world Implementation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We will use a PowerShell script to calculate the four critical metrics from Azure DevOps pipelines. The computed results will be stored in a Log Analytics workspace, and we will use Grafana as the data visualization tool to build the dashboard.&lt;/p&gt;

&lt;p&gt;Below is the sample dashboard we can see after adding Azure data sources in Grafana. Snippets from the PowerShell scripts used to compute each metric are also below. &lt;/p&gt;

&lt;p&gt;The complete code can be found at:&lt;br&gt;&lt;br&gt;
&lt;a href="https://github.com/rajputrishabh/DORA-Metrics" rel="noopener noreferrer"&gt;https://github.com/rajputrishabh/DORA-Metrics&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXeJpFTjdbie8N9lJ5oLVZSBF4P1POTJPMkJcV_W00il3-rsMxYUa8I4G9lyCmeGq_j5coxFbVn-4_6TIuNQQHsPsWAG-EteRkToZ3Ti-NKjalfFJ5_OFDOAU5QRblkG9CF5RUF7o0jyplw4wGn0s3lvzF0%3Fkey%3DjF0H3Qaei_wG5Mri8T2v0Q" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXeJpFTjdbie8N9lJ5oLVZSBF4P1POTJPMkJcV_W00il3-rsMxYUa8I4G9lyCmeGq_j5coxFbVn-4_6TIuNQQHsPsWAG-EteRkToZ3Ti-NKjalfFJ5_OFDOAU5QRblkG9CF5RUF7o0jyplw4wGn0s3lvzF0%3Fkey%3DjF0H3Qaei_wG5Mri8T2v0Q"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Calculating Mean Time to Recovery&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To calculate MTTR, sum up the time taken to recover from all incidents over time and divide by the number of incidents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MTTR = Total downtime / Number of incidents&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#calculate MTTR per day
  if($maintainencetime -eq 0){
    $maintainencetime=1
  }
  if($failureCount -gt 0 -and $noofdays -gt 0){
    $MeanTimetoRestore=($maintainencetime/$failureCount)
  }
  $dailyDeployment=1
  $hourlyrestoration=(1/24)
  $weeklyDeployment=(1/7)
 
  #calculate Maturity
  $rating=""

  if($MeanTimeToRestore -eq 0){
  $rating=" NA"
  }
  elseif($MeanTimeToRestore -lt $hourlyrestoration){
    $rating="Elite"
  }
  elseif($MeanTimeToRestore -lt $dailyDeployment){
    $rating="High"
  }
  elseif($MeanTimeToRestore -lt $weeklyDeployment){
    $rating ="Medium"
  }
  elseif($MeanTimeToRestore -ge $weeklyDeployment){
  $rating="Low"
  } 
  if($failureCount -gt 0 -and $noofdays -gt 0){
    Write-Output "Mean Time to Restore of $($pipelinename) for $($stgname) for release id $($relid)
 over last $($noofdays) days, is $($displaymetric) $($displayunit), with DORA rating of '$rating'"
  }
  else{
    Write-Output "Mean Time to Restore of $($pipelinename) for $($stgname) for release id $($relid) 
 over last $($noofdays) days ,is $($displaymetric) $($displayunit), with DORA rating of '$rating'"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
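&lt;p&gt;The formula itself is easy to sketch outside a pipeline. Here’s a minimal Python example (the function and variable names are illustrative, not taken from the repository above):&lt;/p&gt;

```python
def mean_time_to_recovery(total_downtime_hours, incident_count):
    """MTTR = total downtime / number of incidents."""
    if incident_count == 0:
        return 0.0  # no incidents recorded, nothing to average
    return total_downtime_hours / incident_count

# Three incidents totalling 6 hours of downtime yields a 2-hour MTTR
print(mean_time_to_recovery(6.0, 3))  # 2.0
```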



&lt;h3&gt;
  
  
  &lt;strong&gt;Calculating Deployment Frequency&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Count the number of deployments to production over a specific period to calculate deployment frequency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment Frequency = Number of deployments / Time period&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#calculate DF per day
  $deploymentsperday=0
  if($releasetotal -gt 0 -and $noofdays -gt 0){
  $deploymentsperday=$timedifference/$releasetotal
  }

  $dailyDeployment=1
  $weeklyDeployment=(1/7)
  $monthlyDeployment=(1/30)
  $everysixmonthDeployment=(1/(6*30))
  $yearlyDeployment=(1/365)
 
  #calculate Maturity
  $rating=""
  if($deploymentsperday -eq 0){
    $rating=" NA"
  }
  elseif($deploymentsperday -lt $dailyDeployment){
    $rating="Elite"
  }
  elseif($deploymentsperday -ge $dailyDeployment -and  $deploymentsperday -gt 
 $weeklyDeployment){
    $rating="High"
  }
  elseif($deploymentsperday -ge $weeklyDeployment -and $deploymentsperday -gt 
 $monthlyDeployment){
      $rating ="Medium"
  }
  elseif($deploymentsperday -ge $monthlyDeployment -and  $deploymentsperday -ge 
 $everysixmonthDeployment){
      $rating="Low"
  }
  if($releasetotal -gt 0 -and $noofdays -gt 0){
    Write-Output "Deployment frequency of $($pipelinename) for $($stgname)  for release id $($relid) 
 over last $($noofdays)  days, is $($displaymetric) $($displayunit), with DORA rating of  '$rating'"
  }
  else{
    Write-Output "Deployment frequency of $($pipelinename)  for $($stgname)  for release id $($relid)
 over last $($noofdays)  days, is $($displaymetric) $($displayunit), with DORA rating of '$rating'"
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
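&lt;p&gt;Stripped of the pipeline plumbing, the calculation reduces to a one-line ratio. A minimal Python sketch (names are illustrative, not from the repository above):&lt;/p&gt;

```python
def deployment_frequency(deployment_count, period_days):
    """Deployment Frequency = number of deployments / time period (per day)."""
    if period_days == 0:
        return 0.0
    return deployment_count / period_days

# 14 deployments over a 7-day window is 2 deployments per day
print(deployment_frequency(14, 7))  # 2.0
```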



&lt;h3&gt;
  
  
  &lt;strong&gt;Calculating Change Failure Rate&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To calculate the change failure rate, divide the number of failed deployments by the total number of deployments and multiply by 100 to get a percentage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Change Failure Rate (%) = (Failed deployments / Total deployments) * 100&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The PowerShell script to calculate CFR is in the &lt;a href="https://github.com/rajputrishabh/DORA-Metrics/blob/main/DoraMetrcis-release-ChangeFailureRate.ps1" rel="noopener noreferrer"&gt;repository linked above&lt;/a&gt;.&lt;/p&gt;
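&lt;p&gt;For reference, the formula can be sketched in a few lines of Python (illustrative names, not from the linked script):&lt;/p&gt;

```python
def change_failure_rate(failed_deployments, total_deployments):
    """Change Failure Rate (%) = (failed deployments / total deployments) * 100."""
    if total_deployments == 0:
        return 0.0
    return failed_deployments / total_deployments * 100

# 3 failed deployments out of 60 gives a 5% change failure rate
print(change_failure_rate(3, 60))  # 5.0
```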

&lt;h3&gt;
  
  
  &lt;strong&gt;Calculating Lead Times for Changes&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To calculate the lead time for changes, measure the time from code commit to deployment for each change and calculate the average.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lead Time for Changes = Sum of (Deployment time – Commit time) / Number of changes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The PowerShell script to calculate LTC can be found in the &lt;a href="https://github.com/rajputrishabh/DORA-Metrics/blob/main/DoraMetrics-release-LeadTimetoChange.ps1" rel="noopener noreferrer"&gt;repository linked above&lt;/a&gt;.&lt;/p&gt;
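&lt;p&gt;The averaging step can be sketched in Python as follows (a simplified illustration with hypothetical timestamps, not code from the linked script):&lt;/p&gt;

```python
from datetime import datetime

def lead_time_for_changes(changes):
    """Average (deployment time - commit time) across all changes, in hours."""
    if not changes:
        return 0.0
    total_seconds = sum(
        (deployed - committed).total_seconds() for committed, deployed in changes
    )
    return total_seconds / len(changes) / 3600

# Two changes: one took 4 hours commit-to-deploy, the other 2 hours
changes = [
    (datetime(2024, 10, 1, 9, 0), datetime(2024, 10, 1, 13, 0)),
    (datetime(2024, 10, 2, 10, 0), datetime(2024, 10, 2, 12, 0)),
]
print(lead_time_for_changes(changes))  # 3.0
```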

&lt;h2&gt;
  
  
  &lt;strong&gt;How Release Engineers Can Influence DORA Metrics&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Release engineers play a pivotal role in shaping and improving key DORA metrics, which are crucial for assessing the efficiency and reliability of software delivery. Below, we delve into practical strategies with real-world examples from companies like Etsy, Google, Netflix, and Amazon to illustrate how release engineers can positively impact Deployment Frequency, Change Failure Rate, Lead Time for Changes, and Mean Time to Recovery.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Improving Deployment Frequency&lt;/strong&gt; – Etsy
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: Implementing CI/CD Pipelines at Etsy&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strategy&lt;/strong&gt;: To enhance deployment frequency, Etsy adopted continuous integration and continuous deployment (CI/CD) practices and several tools, such as &lt;a href="https://github.com/etsy/TryLib" rel="noopener noreferrer"&gt;Try&lt;/a&gt; and &lt;a href="https://github.com/etsy/deployinator" rel="noopener noreferrer"&gt;Deployinator&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Implementation&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Automation&lt;/strong&gt;: They automated their build, test, and deployment processes using Jenkins and custom scripts, enabling multiple daily deployments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature Toggles&lt;/strong&gt;: Introduced feature toggles to safely deploy incomplete features without affecting end users.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Outcome&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Etsy achieved the capability to deploy code changes to production around 50 times a day, significantly increasing their deployment frequency.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://www.simform.com/blog/etsy-devops-case-study/" rel="noopener noreferrer"&gt;https://www.simform.com/blog/etsy-devops-case-study/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://codeascraft.com/" rel="noopener noreferrer"&gt;https://codeascraft.com/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Reducing Change Failure Rate&lt;/strong&gt; – Google
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: Comprehensive Testing at Google&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strategy&lt;/strong&gt;: Google emphasizes comprehensive automated testing to reduce the change failure rate.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Implementation&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Testing&lt;/strong&gt;: Google integrated unit tests, integration tests, and end-to-end tests into its CI pipeline. It uses tools like GoogleTest and Selenium for various levels of testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code Reviews&lt;/strong&gt;: Established a rigorous code review process where peers review each change before it is merged, ensuring high code quality.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Outcome&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;By catching issues early in the development process, Google reduced the number of failed deployments, lowering their change failure rate.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://testing.googleblog.com/" rel="noopener noreferrer"&gt;https://testing.googleblog.com/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Shortening Lead Time for Changes&lt;/strong&gt; – Netflix
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: Streamlined Build Process at Netflix&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strategy&lt;/strong&gt;: Netflix optimized its build and deployment processes to shorten the lead time for changes.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Implementation&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Optimized Pipelines&lt;/strong&gt;: Netflix used Spinnaker, an open-source multi-cloud continuous delivery platform, to streamline their deployment pipelines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Microservices Architecture&lt;/strong&gt;: Adopted a microservices architecture, which allowed smaller, more manageable changes to be deployed independently.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Outcome&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Netflix reduced its lead time for changes from days to minutes, allowing for rapid iteration and deployment.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://thenewstack.io/netflix-built-spinnaker-high-velocity-continuous-delivery-platform/" rel="noopener noreferrer"&gt;https://thenewstack.io/netflix-built-spinnaker-high-velocity-continuous-delivery-platform/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://blog.spinnaker.io/tagged/netflix" rel="noopener noreferrer"&gt;https://blog.spinnaker.io/tagged/netflix&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://netflixtechblog.com/" rel="noopener noreferrer"&gt;https://netflixtechblog.com/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Reducing Mean Time to Recovery (MTTR)&lt;/strong&gt; – Amazon
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: Robust Monitoring and Quick Rollback at Amazon&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strategy&lt;/strong&gt;: Amazon focuses on robust monitoring and quick rollback mechanisms to minimize MTTR.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Implementation&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Monitoring&lt;/strong&gt;: Extensive monitoring was implemented using AWS CloudWatch, enabling proactive detection of issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rollback Mechanisms&lt;/strong&gt;: Developed automated rollback procedures using AWS Lambda functions and CloudFormation scripts to revert to a previous stable state quickly.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Outcome&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Amazon reduced their MTTR significantly, ensuring quick recovery from incidents and maintaining high service availability.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://aws.amazon.com/documentation/" rel="noopener noreferrer"&gt;https://aws.amazon.com/documentation/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deployment Frequency and Lead Time for Changes evaluate the speed of delivery, whereas Change Failure Rate and Time to Restore Service evaluate stability. By tracking and continuously improving these metrics, teams can achieve significantly better business results. Based on these metrics, DORA categorizes teams into Elite, High, Medium, and Low performers, finding that Elite teams are twice as likely to achieve or surpass their organizational performance goals.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXdCBhJhKW4iOTrjVNHQYy_gufWeBEsGaDqPLKQVVs0QcZzLCk6GMkQY_vtKM-zfmsublMNSUqczqKpBVnL1rk0duhRvqlS77NBb_3Ser60BJJNAO6JVqgx_YYUOvyjRjDa6cloahNLtXQk2u4Bpn4MBBobg%3Fkey%3DjF0H3Qaei_wG5Mri8T2v0Q" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXdCBhJhKW4iOTrjVNHQYy_gufWeBEsGaDqPLKQVVs0QcZzLCk6GMkQY_vtKM-zfmsublMNSUqczqKpBVnL1rk0duhRvqlS77NBb_3Ser60BJJNAO6JVqgx_YYUOvyjRjDa6cloahNLtXQk2u4Bpn4MBBobg%3Fkey%3DjF0H3Qaei_wG5Mri8T2v0Q"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Pitfalls of DORA Metrics for Release Engineers&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;While DORA metrics provide valuable insights into software delivery performance and operational practices, they come with challenges and potential pitfalls. Understanding these can help release engineers avoid common mistakes and make more informed decisions.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Overemphasis on Metrics Over Quality&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pitfall&lt;/strong&gt;: Focusing solely on improving DORA metrics can lead to overlooking the overall quality of the software. Teams might rush changes to increase deployment frequency or reduce lead time, compromising the product’s robustness and security. This is a classic case of Goodhart’s Law: “when a measure becomes a target, it ceases to be a good measure.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Balance the focus on metrics with a commitment to maintaining high-quality standards. Implement thorough testing and code review processes to ensure quality is not sacrificed for speed.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Misinterpreting Metrics&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pitfall&lt;/strong&gt;: DORA metrics can be misinterpreted without context. For example, a high deployment frequency might look positive but could indicate frequent hotfixes for recurring issues, highlighting underlying problems rather than improvements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Analyze metrics within the context of overall performance and other relevant data. Use complementary metrics and qualitative insights to view the team’s effectiveness comprehensively.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Neglecting Team Morale&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pitfall&lt;/strong&gt;: Intense focus on improving DORA metrics can result in burnout and decreased morale among team members. Pushing for more frequent deployments or faster lead times without considering workload can negatively impact the team’s well-being.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Foster a healthy work environment by setting realistic goals and ensuring adequate support and resources for the team. Encourage open communication about workloads and stress levels.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Lack of Actionable Insights&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pitfall&lt;/strong&gt;: Collecting and reporting DORA metrics without deriving actionable insights can lead to data without purpose. Teams might track metrics but fail to implement changes based on the findings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Review and analyze DORA metrics regularly to identify trends and areas for improvement. Using the insights obtained from the metrics, develop and execute action plans.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Insufficient Tooling and Automation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pitfall&lt;/strong&gt;: Inadequate tooling and automation can hinder efforts to improve DORA metrics. Manual processes and outdated tools can slow down deployments and increase lead times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Invest in modern CI/CD tools, automated testing frameworks, and infrastructure as code solutions. Continuously evaluate and update the toolchain to ensure it supports efficient workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Release engineering is a cornerstone of modern software development, ensuring that software is released efficiently, reliably, and with high quality. Release engineers can significantly enhance their software delivery performance by understanding and effectively utilizing DORA metrics. However, it’s essential to be mindful of the potential pitfalls and to balance metric improvement with maintaining overall quality and team morale. Best practices and utilizing appropriate tools can help release engineers drive meaningful improvements and achieve better outcomes.&lt;/p&gt;

&lt;p&gt;To effectively influence these metrics, release engineers should focus on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Automation&lt;/strong&gt;: Automate build, test, and deployment processes using robust CI/CD pipelines to increase deployment frequency and reduce lead times.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Comprehensive Testing&lt;/strong&gt;: Implement comprehensive automated testing to catch issues early and lower the change failure rate.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Efficient Rollback Mechanisms&lt;/strong&gt;: Establish quick rollback strategies and robust monitoring to minimize MTTR.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Continuous Improvement&lt;/strong&gt;: Regularly review and iterate on processes based on DORA metrics to foster continuous improvement and ensure high-quality software delivery.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Frequently Asked Questions&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Q1: What are DORA metrics?&lt;/p&gt;

&lt;p&gt;DORA (DevOps Research and Assessment) metrics are essential performance indicators for evaluating the effectiveness of software delivery and operational practices. The four main DORA metrics are Deployment Frequency (DF), Lead Time for Changes, Change Failure Rate, and Mean Time to Recovery (MTTR).&lt;/p&gt;

&lt;p&gt;Q2: Why are DORA metrics important?&lt;/p&gt;

&lt;p&gt;DORA metrics provide valuable insights into the performance and health of software delivery processes. They help identify bottlenecks, measure improvements, and drive continuous improvement in DevOps practices, leading to more efficient and reliable software delivery.&lt;/p&gt;

&lt;p&gt;Q3: How often should I review and analyze DORA metrics?&lt;/p&gt;

&lt;p&gt;Regularly review DORA metrics, ideally on a weekly or bi-weekly basis, to continuously monitor performance and identify areas for improvement. Use these reviews to inform decisions and drive ongoing enhancements in the software delivery process.&lt;/p&gt;

&lt;p&gt;Q4: What tools can help improve DORA metrics?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  CI/CD Tools: Jenkins, GitHub Actions, GitLab CI, CircleCI&lt;/li&gt;
&lt;li&gt;  Build Automation Tools: Maven, Gradle, Make, Ant&lt;/li&gt;
&lt;li&gt;  Artifact Management: JFrog Artifactory, Nexus, AWS S3&lt;/li&gt;
&lt;li&gt;  Configuration Management: HashiCorp Consul, Spring Cloud Config, AWS Parameter Store&lt;/li&gt;
&lt;li&gt;  Monitoring and Logging: Prometheus, Grafana, New Relic, ELK Stack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Q5: How can I measure the current state of my DORA metrics?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Deployment Frequency: Measure the number of deployments within a defined timeframe.&lt;/li&gt;
&lt;li&gt;  Lead Time for Changes: Measure the time from code commit to production deployment.&lt;/li&gt;
&lt;li&gt;  Change Failure Rate: Divide the number of failed deployments by the total deployments.&lt;/li&gt;
&lt;li&gt;  Mean Time to Recovery: Track and average the time from incident detection to resolution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Q6: What is release engineering?&lt;/p&gt;

&lt;p&gt;Release engineering is a discipline within software development focused on the processes and practices for building, packaging, and delivering software efficiently and reliably. It involves coordinating various aspects of software creation, from source code management to deployment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.aviator.co/releases" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.aviator.co%2Fblog%2Fwp-content%2Fuploads%2F2024%2F08%2Fblog-cta-9Release_CTA.svg" alt="aviator releases"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Overcoming Web Scraping challenges with Firecrawl, an open-source AI tool</title>
      <dc:creator>Ibrahim Salami</dc:creator>
      <pubDate>Fri, 27 Sep 2024 09:07:19 +0000</pubDate>
      <link>https://dev.to/dphenomenal/overcoming-common-web-scraping-challenges-with-firecrawl-an-open-source-ai-tool-64l</link>
      <guid>https://dev.to/dphenomenal/overcoming-common-web-scraping-challenges-with-firecrawl-an-open-source-ai-tool-64l</guid>
      <description>&lt;p&gt;Web scraping is an art, and &lt;a href="https://www.firecrawl.dev" rel="noopener noreferrer"&gt;Firecrawl&lt;/a&gt; is your paintbrush. It can be difficult because we’re constantly faced with blockers like JavaScript-heavy content, CAPTCHAs, and strict rate limits. Fortunately, Firecrawl is designed to address common web scraping problems. This guide will take you through Firecrawl’s capabilities, showing you how to scrape, crawl, and extract data like a pro.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started with Firecrawl
&lt;/h2&gt;

&lt;p&gt;Let’s begin with a quick setup. To scrape a single page and extract clean markdown data, with Firecrawl handling all the complexities in the background, use the &lt;code&gt;/scrape&lt;/code&gt; endpoint.&lt;/p&gt;

&lt;p&gt;Here’s a simple example using Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# pip install firecrawl-py
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key="YOUR_API_KEY")
content = app.scrape_url("https://docs.firecrawl.dev")

print(content["data"]["markdown"])  # Outputs the scraped content in markdown format
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But Firecrawl isn’t just about scraping plain web pages. Let’s dive into some advanced options that make Firecrawl truly shine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advanced Scraping Options
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Scraping PDFs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By default, the &lt;code&gt;/scrape&lt;/code&gt; endpoint can extract text content from PDFs. However, if you want to skip this, simply set &lt;code&gt;pageOptions.parsePDF&lt;/code&gt; to &lt;code&gt;false&lt;/code&gt;.&lt;/p&gt;
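&lt;p&gt;For example, the request body might look like this (a sketch only; the URL is a placeholder, and the endpoint and headers are the same as in the curl examples in this article):&lt;/p&gt;

```python
import json

# /v0/scrape request body that skips PDF text extraction (placeholder URL).
payload = {
    "url": "https://example.com/report.pdf",
    "pageOptions": {"parsePDF": False},
}

# POST this to https://api.firecrawl.dev/v0/scrape with
# an "Authorization: Bearer YOUR_API_KEY" header.
print(json.dumps(payload, indent=2))
```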

&lt;p&gt;&lt;strong&gt;Page Options:&lt;/strong&gt; Fine-Tuning Your Scrape&lt;br&gt;
Firecrawl gives you control over what you scrape and how. Here’s a breakdown of the key &lt;code&gt;pageOptions&lt;/code&gt; parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;onlyMainContent:&lt;/strong&gt; Scrape only the main content of a page, ignoring headers, footers, and sidebars.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;includeHtml:&lt;/strong&gt; Enable this when you need the processed HTML version of the content; it adds an &lt;code&gt;html&lt;/code&gt; key to the response.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;includeRawHtml:&lt;/strong&gt; For those who want raw HTML, this option adds a &lt;code&gt;rawHtml&lt;/code&gt; key to the response.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;screenshot:&lt;/strong&gt; Captures a screenshot of the top of the page.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;waitFor:&lt;/strong&gt; Sometimes pages take time to load. Use this to specify a wait time in milliseconds before scraping.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example: Combining Page Options&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here’s how you might combine these options in a single request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -X POST https://api.firecrawl.dev/v0/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization : Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev",
      "pageOptions": {
        "onlyMainContent": true,
        "includeHtml": true,
        "includeRawHtml": true,
        "screenshot": true,
        "waitFor": 5000
      }
    }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this request, Firecrawl will return only the main content (in both processed and raw HTML), capture a screenshot of the top of the page, and wait 5 seconds for the page to fully load before scraping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extractor Options: Getting Structured Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Beyond scraping, Firecrawl helps you extract structured data from any content using the &lt;code&gt;extractorOptions&lt;/code&gt; parameter.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;mode:&lt;/strong&gt; Choose between &lt;code&gt;llm-extraction&lt;/code&gt; (from cleaned data) and &lt;code&gt;llm-extraction-from-raw-html&lt;/code&gt; (directly from raw HTML).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;extractionPrompt:&lt;/strong&gt; Describe what information you want to extract.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;extractionSchema:&lt;/strong&gt; Define the structure of the extracted data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example: Extracting Data with a Schema&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -X POST https://api.firecrawl.dev/v0/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev/",
      "extractorOptions": {
        "mode": "llm-extraction",
        "extractionPrompt": "Extract the company mission, SSO support, open-source status, and YC status.",
        "extractionSchema": {
          "type": "object",
          "properties": {
            "company_mission": { "type": "string" },
            "supports_sso": { "type": "boolean" },
            "is_open_source": { "type": "boolean" },
            "is_in_yc": { "type": "boolean" }
          },
          "required": ["company_mission", "supports_sso", "is_open_source", "is_in_yc"]
        }
      }
    }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This request not only scrapes the content but also extracts specific pieces of information according to your defined schema: the company mission, SSO support, open-source status, and YC affiliation, pulled directly from the page content.&lt;/p&gt;

&lt;h2&gt;
  
  
  Crawling Multiple Pages
&lt;/h2&gt;

&lt;p&gt;Sometimes one page isn’t enough. That’s where the &lt;code&gt;/crawl&lt;/code&gt; endpoint comes in; it allows you to scrape an entire site. You can specify a base URL, and Firecrawl will handle the rest, capturing all accessible subpages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: Customizing Your Crawl&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This setup shows how to customize your crawl with specific options:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -X POST https://api.firecrawl.dev/v0/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev",
      "crawlerOptions": {
        "includes": ["/blog/*", "/products/*"],
        "excludes": ["/admin/*", "/login/*"],
        "returnOnlyUrls": false,
        "maxDepth": 2,
        "mode": "fast",
        "limit": 1000
      }
    }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this configuration, Firecrawl will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Crawl pages matching the /blog/* and /products/* subpaths.&lt;/li&gt;
&lt;li&gt;Skip pages matching /admin/* and /login/*.&lt;/li&gt;
&lt;li&gt;Crawl up to two levels deep and up to 1000 pages in total.&lt;/li&gt;
&lt;li&gt;Use the fast crawling mode for quicker results.&lt;/li&gt;
&lt;/ul&gt;
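&lt;p&gt;Note that &lt;code&gt;/crawl&lt;/code&gt; is asynchronous: the response contains a job ID, and you fetch results by polling a status endpoint. Here’s a minimal polling sketch using only the Python standard library (the &lt;code&gt;jobId&lt;/code&gt; and &lt;code&gt;status&lt;/code&gt; fields follow the v0 API; treat the exact response shapes as an assumption to verify against the Firecrawl docs):&lt;/p&gt;

```python
import json
import time
import urllib.request

API = "https://api.firecrawl.dev/v0"

def _call(method, path, api_key, payload=None):
    """Small helper around urllib for authenticated JSON requests."""
    req = urllib.request.Request(
        f"{API}{path}",
        data=None if payload is None else json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def crawl_and_wait(api_key, url, crawler_options, poll_seconds=5):
    """Start a crawl, then poll its status until the job completes."""
    job_id = _call("POST", "/crawl", api_key,
                   {"url": url, "crawlerOptions": crawler_options})["jobId"]
    while True:
        status = _call("GET", f"/crawl/status/{job_id}", api_key)
        if status["status"] == "completed":
            return status["data"]  # list of scraped pages
        time.sleep(poll_seconds)
```

&lt;p&gt;Called as, e.g., &lt;code&gt;crawl_and_wait(key, "https://docs.firecrawl.dev", {"maxDepth": 2, "limit": 10})&lt;/code&gt;, it returns the scraped pages once the job finishes.&lt;/p&gt;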

&lt;h2&gt;
  
  
  Combining Page and Crawler Options
&lt;/h2&gt;

&lt;p&gt;For more control, combine pageOptions with crawlerOptions in a single request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -X POST https://api.firecrawl.dev/v0/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev",
      "pageOptions": {
        "onlyMainContent": true,
        "includeHtml": true,
        "includeRawHtml": true,
        "screenshot": true,
        "waitFor": 5000
      },
      "crawlerOptions": {
        "includes": ["/blog/*", "/products/*"],
        "maxDepth": 2,
        "mode": "fast"
      }
    }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this setup, Firecrawl will deliver precisely the data you need, exactly how you need it. &lt;/p&gt;

&lt;p&gt;You can get started with &lt;a href="https://www.firecrawl.dev/playground" rel="noopener noreferrer"&gt;500 free Firecrawl credits&lt;/a&gt; (no credit card required), or you can &lt;a href="https://github.com/mendableai/firecrawl" rel="noopener noreferrer"&gt;self-host&lt;/a&gt; the open-source version.&lt;/p&gt;

</description>
      <category>showdev</category>
      <category>webscraping</category>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>Rethinking code reviews with stacked PRs</title>
      <dc:creator>Ibrahim Salami</dc:creator>
      <pubDate>Mon, 16 Sep 2024 18:03:57 +0000</pubDate>
      <link>https://dev.to/dphenomenal/rethinking-code-reviews-with-stacked-prs-3dih</link>
      <guid>https://dev.to/dphenomenal/rethinking-code-reviews-with-stacked-prs-3dih</guid>
      <description>&lt;p&gt;The peer code review process is an essential part of software development. It helps maintain software quality and promotes adherence to standards, project requirements, style guides, and facilitates learning and knowledge transfer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code review effectiveness
&lt;/h3&gt;

&lt;p&gt;Review effectiveness is high for sufficiently small code changes, but it drops exponentially as the size of the change increases. Large code reviews are exhausting because of the sustained mental focus they demand, and the longer a review takes, the less effective it becomes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuaa9t6nlt8pvynp6ek8s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuaa9t6nlt8pvynp6ek8s.png" width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So why can’t we just restrict the size of pull requests (PRs)? Many changes start small, but a two-line fix can quickly grow into a 500-line refactor over multiple back-and-forth conversations with reviewers. Some engineering teams also maintain long-running feature branches as they continue working, which makes them hard to review.&lt;/p&gt;

&lt;p&gt;So, how do we strike the right balance? Simple. Use stacked PRs.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are stacked PRs?
&lt;/h3&gt;

&lt;p&gt;Stacked pull requests break a large change into smaller, iterative changes that are stacked on top of each other, instead of bundling monolithic changes into a single pull request. Each PR in the stack focuses on one logical change only, making the review process more manageable and less time-consuming.&lt;/p&gt;

&lt;p&gt;We also wrote a post last year explaining how this helps represent &lt;a href="https://www.aviator.co/blog/stacked-prs-code-changes-as-narrative/" rel="noopener noreferrer"&gt;code changes as a narrative&lt;/a&gt; instead of breaking things down by files or features.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why stacked PRs?
&lt;/h3&gt;

&lt;p&gt;Other than building a culture of more effective code reviews, there are a few other benefits of stacked PRs:&lt;/p&gt;

&lt;h4&gt;
  
  
  Early code review feedback
&lt;/h4&gt;

&lt;p&gt;Imagine that you are implementing a large feature. Instead of creating the entire feature and then requesting a code review, consider carving out the initial framework and promptly putting it up for feedback. This could potentially save you countless hours by getting early feedback on your design.&lt;/p&gt;

&lt;h4&gt;
  
  
  Faster CI feedback cycle
&lt;/h4&gt;

&lt;p&gt;Stacked PRs support the &lt;a href="https://en.wikipedia.org/wiki/Shift-left_testing" rel="noopener noreferrer"&gt;shift-left&lt;/a&gt; practice because changes are continuously integrated and tested, which allows for early detection and rectification of issues. The changes are merged in bits and pieces, catching any issues early, versus merging one giant change and hoping it does not bring down prod!&lt;/p&gt;

&lt;h4&gt;
  
  
  Knowledge sharing
&lt;/h4&gt;

&lt;p&gt;Code reviews are also wonderful for posterity. Your code changes narrate the thought process behind implementing a feature, so breaking the changes down creates a more effective record. It’s easier for team members to understand the changes, which promotes better knowledge sharing for the future.&lt;/p&gt;

&lt;h4&gt;
  
  
  Staying unblocked
&lt;/h4&gt;

&lt;p&gt;Waiting to get code reviewed and approved can be a frustrating process. With stacked PRs, developers can work on multiple parts of a feature without waiting for reviewers to approve previous PRs.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s the catch?
&lt;/h3&gt;

&lt;p&gt;So, why don’t more developers use stacked PRs for code reviews?&lt;/p&gt;

&lt;p&gt;Although this stacked PR workflow addresses both the desired practices of keeping code reviews manageable and developers productive, unfortunately, it is not supported very well natively by either git or GitHub. As a result, &lt;a href="https://docs.google.com/spreadsheets/d/1riYPbdprf6E3QP1wX1BeASn2g8FKBgbJlrnKmwfU3YE/edit?usp=sharing" rel="noopener noreferrer"&gt;several tools&lt;/a&gt; have been developed across the open-source community to enable engineers to incorporate this stacking technique into the existing git and GitHub platforms. But stacking the PRs is only part of the story.&lt;/p&gt;

&lt;h4&gt;
  
  
  Updating
&lt;/h4&gt;

&lt;p&gt;As code review feedback comes in and we make changes to one part of the stack, we now have to rebase and resolve conflicts on every subsequent branch.&lt;/p&gt;

&lt;p&gt;Let’s take an example. Imagine that you are working on a change that requires a schema change, a backend change, and a frontend change. You can send the simple schema change for review first, and while that’s being reviewed, start working on the backend and frontend. Using stacked PRs, all three changes can be reviewed by three different reviewers.&lt;/p&gt;

&lt;p&gt;In this case, you may have a stack that looks like this, where &lt;code&gt;demo/schema&lt;/code&gt;, &lt;code&gt;demo/backend&lt;/code&gt;, and &lt;code&gt;demo/frontend&lt;/code&gt; represent the three branches stacked on top of each other.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu05qj92fau0qnaem5fsw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu05qj92fau0qnaem5fsw.png" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So far this makes sense, but what if you got some code review comments on the schema change that requires creating a new commit? Suddenly your commit history looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6tgv1njfn0jhd1jfqtxo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6tgv1njfn0jhd1jfqtxo.png" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now you have to manually rebase all subsequent branches and resolve conflicts at every stage. With a stack of 10 branches, you may have to resolve conflicts 10 times.&lt;/p&gt;
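&lt;p&gt;To make that concrete, here’s what the manual chain looks like in a scratch repository (plain Git only, with branch names mirroring the example above):&lt;/p&gt;

```shell
# Scratch repo reproducing the stack: demo/schema -> demo/backend -> demo/frontend.
cd "$(mktemp -d)"
git init -q -b main
git config user.email dev@example.com
git config user.name dev
git commit -q --allow-empty -m "mainline"
git checkout -q -b demo/schema;   echo v1 > schema.sql; git add .; git commit -q -m "A1: schema"
git checkout -q -b demo/backend;  echo v1 > api.py;     git add .; git commit -q -m "B1: backend"
git checkout -q -b demo/frontend; echo v1 > app.js;     git add .; git commit -q -m "C1: frontend"

# Review feedback lands as a new commit A2 on demo/schema...
git checkout -q demo/schema; echo v2 > schema.sql; git add .; git commit -q -m "A2: review fixes"

# ...so every later branch must now be rebased by hand, one at a time
# (resolving any conflicts at each step):
git checkout -q demo/backend;  git rebase -q demo/schema
git checkout -q demo/frontend; git rebase -q demo/backend
git log --oneline demo/frontend   # C1, B1, A2, A1, mainline
```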

&lt;h4&gt;
  
  
  Merging
&lt;/h4&gt;

&lt;p&gt;But that’s not all: merging a PR in the stack can be a real nightmare. You have three options for merging a PR: &lt;code&gt;squash&lt;/code&gt;, &lt;code&gt;merge&lt;/code&gt;, and &lt;code&gt;rebase&lt;/code&gt;. Let’s look at what happens behind the scenes in each one.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  In the case of a &lt;code&gt;squash&lt;/code&gt; commit, Git takes the changes from all the existing commits of the PR and rewrites them into a single commit. No history is maintained about where those changes came from.&lt;/li&gt;
&lt;li&gt;  A &lt;code&gt;merge&lt;/code&gt; commit is a special type of Git commit that represents the combination of two or more commits. It works very similarly to a &lt;code&gt;squash&lt;/code&gt; commit, but it also captures information about its parents. In a typical scenario, a merge commit has two parents: the last commit on the base branch (where the PR is merged) and the top commit on the feature branch that was merged. Although this approach gives more context to the commit history, it inadvertently creates a &lt;a href="https://idiv-biodiversity.github.io/git-knowledge-base/linear-vs-nonlinear.html" rel="noopener noreferrer"&gt;non-linear git history&lt;/a&gt; that can be undesirable.&lt;/li&gt;
&lt;li&gt;  Finally, in the case of a &lt;code&gt;rebase&lt;/code&gt; and merge, Git rewrites the commits onto the base branch. Similar to the &lt;code&gt;squash&lt;/code&gt; option, it loses any history associated with the original commits.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Typically, if you use the &lt;code&gt;merge&lt;/code&gt; commit strategy while stacking PRs, your life will be a bit simpler, but most teams discourage that strategy to keep the git history clean. That means you are likely using either a &lt;code&gt;squash&lt;/code&gt; or a &lt;code&gt;rebase&lt;/code&gt; merge, and that creates merge conflicts for all subsequent unmerged stacked branches.&lt;/p&gt;

&lt;p&gt;In the example above, let’s say we squash merge the first branch &lt;code&gt;demo/schema&lt;/code&gt; into mainline. It will create a new commit &lt;code&gt;D1&lt;/code&gt; that contains changes of &lt;code&gt;A1&lt;/code&gt; and &lt;code&gt;A2&lt;/code&gt;. Since Git does not know where &lt;code&gt;D1&lt;/code&gt; came from, and &lt;code&gt;demo/backend&lt;/code&gt; is still based on &lt;code&gt;A2&lt;/code&gt;, trying to rebase &lt;code&gt;demo/backend&lt;/code&gt; on top of the mainline will create merge conflicts.&lt;/p&gt;

&lt;p&gt;Likewise, rebasing &lt;code&gt;demo/frontend&lt;/code&gt; after rebasing &lt;code&gt;demo/backend&lt;/code&gt; will also cause the same issues. So if you had ten stacked branches and you squash merged one of them, you would have to resolve these conflicts nine times.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk9crbfwteg1c16359ays.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk9crbfwteg1c16359ays.png" width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are still just scratching the surface; there are &lt;a href="https://docs.aviator.co/aviator-cli/how-to-guides" rel="noopener noreferrer"&gt;many other use cases&lt;/a&gt;, such as reordering commits, splitting, folding, and renaming branches, that can create huge management overhead when dealing with stacked PRs.&lt;/p&gt;

&lt;p&gt;That’s why we built stacked PRs management as part of Aviator.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Aviator CLI is different
&lt;/h3&gt;

&lt;p&gt;Think of Aviator as an augmentation layer that sits on top of your existing tooling. Aviator connects with GitHub, Slack, Chrome, and Git CLI to provide an enhanced developer experience.&lt;/p&gt;

&lt;p&gt;Aviator CLI works seamlessly with everything else! The CLI isn’t just a layer on top of Git, but also understands the context of stacks across GitHub. Let’s consider an example.&lt;/p&gt;

&lt;h4&gt;
  
  
  Creating a stack
&lt;/h4&gt;

&lt;p&gt;Creating a stack is fairly straightforward, except that in this case we use the &lt;code&gt;av&lt;/code&gt; CLI to create the branches, ensuring the stack is tracked. For instance, to create your schema branch and corresponding PR, follow the steps below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;av stack branch demo/schema
# make schema changes
git commit -a -m "[demo] schema changes"
av pr create
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since Aviator is also connected to your GitHub, it makes it easy for you to visualize the stack.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwl1gcm49q3lrqs7vdoj5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwl1gcm49q3lrqs7vdoj5.png" width="800" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or if you want to visualize it from the terminal, you can still do that with the CLI commands:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fve27c776oldh5u0e263w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fve27c776oldh5u0e263w.png" width="800" height="159"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Updating the stack
&lt;/h4&gt;

&lt;p&gt;Using the stack now becomes a cakewalk. You can add new commits to any branch, and simply run &lt;code&gt;av stack sync&lt;/code&gt; from anywhere in the stack to synchronize all branches. Aviator automatically rebases all the branches for you, and if there’s a real merge conflict, you just have to resolve it once.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F187w2fpauey6rzq7seqn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F187w2fpauey6rzq7seqn.png" width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Merging the stack
&lt;/h4&gt;

&lt;p&gt;This is where Aviator easily stands out from existing tooling. At Aviator, we have built MergeQueue, one of the most advanced merge queues, to manage auto-merging thousands of changes at scale, and it integrates seamlessly with the CLI and stacked PRs. To merge a partial or full stack of PRs, you can assign them to Aviator MergeQueue using the CLI command &lt;code&gt;av pr queue&lt;/code&gt; or by posting a comment in GitHub: &lt;code&gt;/aviator stack merge&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Aviator automatically handles validating, updating, and auto-merging all queued stacks in order.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5t2b4hv80ycbykxocih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5t2b4hv80ycbykxocih.png" width="800" height="572"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the PRs are merged, you can run &lt;code&gt;av stack sync --trunk&lt;/code&gt; to update all remaining PRs and clean out the merged ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  Shift-Left is the future
&lt;/h3&gt;

&lt;p&gt;Stacked PRs might initially seem like more work due to the need to break down changes into smaller parts. However, the increase in code review efficiency, faster feedback loops, and enhanced learning opportunities will surely outweigh this overhead. As we continue embracing the shift-left principles, stacked PRs will become increasingly useful.&lt;/p&gt;

&lt;p&gt;The Aviator CLI provides a great way to manage stacked PRs with a lot less tedium. The CLI is &lt;a href="https://github.com/aviator-co/av" rel="noopener noreferrer"&gt;open-source&lt;/a&gt; and completely free. We would love for you to try it out and share your feedback on our &lt;a href="https://github.com/aviator-co/av/discussions" rel="noopener noreferrer"&gt;discussion board&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At Aviator, we are building developer productivity tools from first principles to empower developers to build faster and better.&lt;br&gt;
&lt;a href="https://docs.aviator.co/mergequeue/quick-setup" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn14hi1ge9h0ymkp0tgsr.png" width="800" height="97"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>codereview</category>
      <category>devops</category>
      <category>productivity</category>
      <category>showdev</category>
    </item>
    <item>
      <title>How to configure IAM using Terraform</title>
      <dc:creator>Ibrahim Salami</dc:creator>
      <pubDate>Fri, 06 Sep 2024 15:36:29 +0000</pubDate>
      <link>https://dev.to/aviator_co/how-to-configure-iam-using-terraform-4hif</link>
      <guid>https://dev.to/aviator_co/how-to-configure-iam-using-terraform-4hif</guid>
      <description>&lt;p&gt;Organizations or individuals typically manage IAM using consoles and hesitate to use Infrastructure-as-code (IaC) as it is complex and sensitive to define IAM policies due to security risks. With frequent dynamic changes, you do not get immediate feedback. And more expertise is needed to configure and manage IAM rules with IaC. However, configuring IAM though IaC also have several benefits. &lt;/p&gt;

&lt;p&gt;In this blog we’ll explore those benefits, discuss strategies for IAM management via Terraform, explain why implementing Zero Trust policies within IAM is crucial for security, and show how to enforce IAM best practices, Policy-as-Code, and IAM governance.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why manage IAM through Infrastructure-as-code?&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Automation and consistency&lt;/strong&gt; – It brings automation, consistency, repeatability, and versioning to IAM policies and role management.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Audit trails&lt;/strong&gt; – It allows you to maintain a comprehensive audit trail of changes to IAM configurations. This helps with compliance requirements and allows you to easily track who made changes, when they were made, and why.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Least privilege&lt;/strong&gt; – Terraform’s expressive language allows for defining complex IAM policies with fine-grained control over permissions. Teams can provision their own access in a controlled manner through pull requests, which undergo a review process before being applied, fostering a self-service infrastructure model.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Zero Trust policies within IAM&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Identity and Access Management (IAM) is a critical component of Zero-Trust security, which assumes that no one is trusted by default. Zero Trust in IAM means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Never Trust, Always Verify&lt;/strong&gt; – There is no automatic trust. Always verify everyone who is trying to access the resource.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Least Privilege Access&lt;/strong&gt; – Limit access to the minimum necessary to perform a specific task, reducing the blast radius. This prevents granting unnecessary permissions to users/roles, which can cause breaking changes.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;MFA&lt;/strong&gt; – Multi-factor authentication adds an extra security layer for IAM users/roles.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Setting Up AWS IAM Policies with Terraform&lt;/strong&gt;  
&lt;/h2&gt;

&lt;p&gt;Setting up AWS IAM policies with Terraform involves defining your IAM resources in Terraform configuration files, applying best practices for security and organization, and using Terraform’s capabilities to manage these resources as code. Below, we’ll outline a basic approach to setting up IAM policies in AWS using Terraform, including an example configuration.&lt;/p&gt;

&lt;p&gt;In this blog post, we will cover Terraform configs that are also compatible with OpenTofu.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Prerequisites&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Before you start, ensure you have the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt; is installed on your machine.&lt;/li&gt;
&lt;li&gt;  An AWS account and AWS CLI configured with access credentials.&lt;/li&gt;
&lt;li&gt;  Basic knowledge of IAM concepts (e.g., policies, roles, users) and Terraform syntax.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 1: Initialize Terraform Project&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Create a new directory for your Terraform project and add a &lt;strong&gt;main.tf&lt;/strong&gt; file. Then, run &lt;em&gt;terraform init&lt;/em&gt; in the project directory to prepare it for Terraform operations.&lt;/p&gt;
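&lt;p&gt;From the shell, this step looks something like the following (the directory name is arbitrary):&lt;/p&gt;

```shell
mkdir -p iam-terraform
cd iam-terraform
touch main.tf    # the provider and resource blocks from the next steps go here

# Download the AWS provider plugin and prepare the working directory
# (guarded so this sketch is a no-op where Terraform is not installed):
if command -v terraform >/dev/null; then
  terraform init
fi
```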

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 2: Define the AWS Provider&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In &lt;strong&gt;main.tf&lt;/strong&gt;, start by defining the AWS provider. This specifies which version of the AWS provider to use and configures the region and other provider settings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;provider “aws” {
  region = “us-east-1”
  #AWS account access keys credentials
  access_key = “A***************”
  secret_key = “U******************”
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 3: Define IAM Policy&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Define an IAM policy using the &lt;strong&gt;aws_iam_policy&lt;/strong&gt; resource. You need to provide a name and a policy document. The policy document can be defined inline using the &lt;em&gt;&amp;lt;&amp;lt;EOF … EOF&lt;/em&gt; syntax, or it can be loaded from a file using the &lt;em&gt;file()&lt;/em&gt; function.&lt;/p&gt;

&lt;p&gt;Example of an inline policy definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create IAM policy to allow S3 read access
resource "aws_iam_policy" "s3_read_policy" {
  name        = "s3_read_policy"
  description = "Allows read access to files in the specified S3 bucket"
  policy      = &amp;lt;&amp;lt;EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name/*",
        "arn:aws:s3:::your-bucket-name"
      ]
    }
  ]
}
EOF
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see the &lt;strong&gt;iam_policy&lt;/strong&gt; creation by running the &lt;em&gt;&lt;code&gt;terraform plan&lt;/code&gt;&lt;/em&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbijcz1rm6klz9i2roxv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbijcz1rm6klz9i2roxv.png" alt="terraform plan" width="800" height="96"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;terraform plan command&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7bwo286jihc9el475ui.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7bwo286jihc9el475ui.png" alt="s3 read access" width="800" height="901"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;output from terraform plan&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 4: Create IAM user&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Define an IAM user:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create IAM user
resource "aws_iam_user" "iam_user" {
  name = "iam_user"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see the &lt;strong&gt;iam_user&lt;/strong&gt; creation by running the &lt;em&gt;&lt;code&gt;terraform plan&lt;/code&gt;&lt;/em&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydxzbewen7s4m5n4y944.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydxzbewen7s4m5n4y944.png" alt="iam user output" width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;output for terraform plan&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 5: Create an IAM Role and Attach the Policy&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Define an IAM role and set its assume-role policy to allow the IAM user to assume the role:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create an IAM role
resource "aws_iam_role" "iam_role" {
  name = "iam-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17",
    Statement = [{
      Action = "sts:AssumeRole",
      Principal = {
        AWS = "arn:aws:iam::AWS_ACCOUNT_ID:user/${aws_iam_user.iam_user.name}" # Replace AWS_ACCOUNT_ID with your AWS account ID
      },
      Effect = "Allow",
      Sid    = ""
    }]
  })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see the &lt;strong&gt;iam_role&lt;/strong&gt; creation by running the &lt;em&gt;&lt;code&gt;terraform plan&lt;/code&gt;&lt;/em&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmmw4riraxsdmqddsxi9u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmmw4riraxsdmqddsxi9u.png" alt="iam role output" width="800" height="816"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;output for terraform plan&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 6: Attach IAM policy to IAM role&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Attach IAM policy to IAM role
resource "aws_iam_policy_attachment" "s3_read_attach" {
  name       = "s3_read_attach"
  roles      = [aws_iam_role.iam_role.name]
  policy_arn = aws_iam_policy.s3_read_policy.arn
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see the &lt;strong&gt;s3_read_attach&lt;/strong&gt; policy attachment creation by running the &lt;em&gt;terraform plan&lt;/em&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhontdyigzf7cv0pykwme.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhontdyigzf7cv0pykwme.png" alt="iam s3 policy" width="800" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;iam s3 policy – terraform plan output&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 7: Apply Configuration&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The Terraform configuration has been defined. Apply it using the command &lt;em&gt;&lt;code&gt;terraform apply&lt;/code&gt;&lt;/em&gt; in your project directory. This command will prompt you to review the proposed changes and confirm them. Upon confirmation, Terraform will create the resources in AWS according to your configuration.&lt;/p&gt;
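&lt;p&gt;The full workflow, run from the project directory, looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform init    # download the AWS provider and initialize the working directory
terraform plan    # preview the resources that will be created
terraform apply   # create the resources (prompts for confirmation)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;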

&lt;p&gt;After running the final &lt;em&gt;&lt;code&gt;terraform apply&lt;/code&gt;&lt;/em&gt; command, we can see the &lt;strong&gt;iam-role, iam_user,&lt;/strong&gt; and &lt;strong&gt;s3_read_policy&lt;/strong&gt; resources. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchhbioj5ur28v4zngosf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchhbioj5ur28v4zngosf.png" alt="aws user" width="800" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8hgx4saf8cg006qgnyc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8hgx4saf8cg006qgnyc.png" alt="aws iam role" width="800" height="142"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwv7yd2mm5vsoe5bcpuse.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwv7yd2mm5vsoe5bcpuse.png" alt="aws s3 policy" width="800" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Utilizing Terraform’s templatefile function for dynamic policy generation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;em&gt;templatefile()&lt;/em&gt; function in Terraform allows you to dynamically generate configuration files using templates. You can use this function to generate IAM policy documents dynamically, which can be helpful in cases where policies need to be customized based on dynamic inputs.&lt;/p&gt;

&lt;p&gt;Here’s an example of how you can use the &lt;em&gt;templatefile()&lt;/em&gt; function to dynamically generate an IAM policy document for S3 read access:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;provider “aws” {
  region = “us-east-1”
  #AWS account access keys credentials
  access_key = “A***************”
  secret_key = “U******************”
}

# Define IAM policy template using the template_file data source
# (Terraform 0.11 and earlier; use templatefile() on 0.12+)
data "template_file" "s3_read_policy_template" {
  template = &amp;lt;&amp;lt;EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "$${bucket_arn}/*",
        "$${bucket_arn}"
      ]
    }
  ]
}
EOF
  vars = {
    bucket_arn = "arn:aws:s3:::your-bucket-name"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Terraform 0.12 and later, you can instead store the template in a separate &lt;strong&gt;s3_read_policy.tmpl&lt;/strong&gt; file and render it with the &lt;em&gt;templatefile()&lt;/em&gt; function, as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Content of s3_read_policy.tmpl file
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "${bucket_arn}/*",
        "${bucket_arn}"
      ]
    }
  ]
}

# Create IAM policy using the template
resource "aws_iam_policy" "s3_read_policy" {
  name   = "s3_read_policy"
  policy = templatefile("${path.module}/s3_read_policy.tmpl", {
    bucket_arn = "arn:aws:s3:::BUCKET_NAME" # Provide the S3 bucket name
  })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see the &lt;strong&gt;s3_read_policy&lt;/strong&gt; creation by running the &lt;em&gt;terraform plan&lt;/em&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7bwo286jihc9el475ui.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7bwo286jihc9el475ui.png" alt="iam s3 read policy" width="800" height="901"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;iam s3 read policy – terraform plan output&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create IAM user
resource "aws_iam_user" "iam_user" {
  name = "iam_user"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see the &lt;strong&gt;iam_user&lt;/strong&gt; creation by running the &lt;em&gt;terraform plan&lt;/em&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydxzbewen7s4m5n4y944.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydxzbewen7s4m5n4y944.png" alt="iam user output" width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;output for terraform plan&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create an IAM role
resource "aws_iam_role" "reader_role" {
  name = "reader_role"
  assume_role_policy = jsonencode({
    Version   = "2012-10-17",
    Statement = [{
      Action    = "sts:AssumeRole",
      Principal = {
        AWS = "arn:aws:iam::AWS_ACCOUNT_ID:user/${aws_iam_user.iam_user.name}" # Replace AWS_ACCOUNT_ID with your AWS account ID
      },
      Effect    = "Allow",
      Sid       = "AssumeRole"
    }]
  })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see the &lt;strong&gt;iam_role&lt;/strong&gt; creation by running the &lt;em&gt;terraform plan&lt;/em&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmmw4riraxsdmqddsxi9u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmmw4riraxsdmqddsxi9u.png" alt="iam role output" width="800" height="816"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;output for terraform plan&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Attach IAM policy to IAM role
resource "aws_iam_policy_attachment" "s3_read_attach" {
  name       = "s3_read_attach"
  roles      = [aws_iam_role.reader_role.name]
  policy_arn = aws_iam_policy.s3_read_policy.arn
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see the &lt;strong&gt;iam_policy_attachment&lt;/strong&gt; creation by running the &lt;em&gt;terraform plan&lt;/em&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xhqkkh4z0pin2i5rq4z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xhqkkh4z0pin2i5rq4z.png" alt="s3 read policy" width="800" height="275"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;iam s3 policy – terraform plan output&lt;/p&gt;

&lt;p&gt;This configuration renders the &lt;strong&gt;s3_read_policy.tmpl&lt;/strong&gt; file with the &lt;em&gt;templatefile()&lt;/em&gt; function and uses the result to create the IAM policy. You can adjust the file path, policy name, and bucket ARN per your use case.&lt;/p&gt;

&lt;p&gt;This approach allows you to generate IAM policies dynamically based on inputs or variables, providing flexibility in your Terraform configurations. Adjust the template and variables as needed for your specific use case.&lt;/p&gt;
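&lt;p&gt;For example, the bucket name can come from an input variable instead of being hard-coded, so the same template serves multiple environments. A sketch, assuming a &lt;strong&gt;bucket_name&lt;/strong&gt; variable supplied at plan time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable "bucket_name" {
  type        = string
  description = "Name of the S3 bucket the policy grants read access to"
}

resource "aws_iam_policy" "s3_read_policy" {
  name   = "s3_read_policy"
  policy = templatefile("${path.module}/s3_read_policy.tmpl", {
    bucket_arn = "arn:aws:s3:::${var.bucket_name}"
  })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;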

&lt;p&gt;After running the final &lt;em&gt;terraform apply&lt;/em&gt; command, we can see that the &lt;strong&gt;iam_user, reader_role,&lt;/strong&gt; and attached &lt;strong&gt;s3_read_policy&lt;/strong&gt; resources have been created in AWS, as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchhbioj5ur28v4zngosf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchhbioj5ur28v4zngosf.png" alt="aws user" width="800" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8hgx4saf8cg006qgnyc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8hgx4saf8cg006qgnyc.png" alt="aws iam role" width="800" height="142"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fohdz2uctbqwfbhscnso0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fohdz2uctbqwfbhscnso0.png" alt="aws reader role" width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Cross-Account Management&lt;/strong&gt; 
&lt;/h2&gt;

&lt;p&gt;Cross-account access is particularly beneficial when organizations maintain multiple AWS accounts for different purposes, such as development, staging, and production.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Benefits of cross-account management&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  Without cross-account access, managing user access across these accounts can become complex and cumbersome.&lt;/li&gt;
&lt;li&gt;  It helps to enforce least privilege principles by allowing administrators to define granular access controls using IAM roles. This ensures users can only access the resources required for their roles or responsibilities.&lt;/li&gt;
&lt;li&gt;  Cross-account access facilitates centralized user management and auditing.&lt;/li&gt;
&lt;li&gt;  Administrators can create and manage users centrally in an AWS management account, reducing the administrative overhead of managing user identities across multiple accounts. &lt;/li&gt;
&lt;li&gt;  Auditing and tracking user access become more straightforward as all access requests and actions are logged centrally in the AWS management account.&lt;/li&gt;
&lt;li&gt;  Cross-account access is crucial for ensuring streamlined operations in large organizations where a security team manages multiple AWS accounts centrally.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Unlocking Cross-Account Access in AWS with Terraform&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let’s use Terraform to create an IAM user in AWS Account A and establish cross-account access with the “AssumeRole” action.&lt;/p&gt;

&lt;p&gt;In this example, we’ll create an IAM user in AWS Account A and configure a cross-account role in AWS Account B that this user can assume, allowing the &lt;strong&gt;CrossAccountUser&lt;/strong&gt; in Account A to read files from buckets in Account B. We’ll define an IAM policy granting the necessary permissions and attach it to the cross-account role in Account B.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Here’s how you can achieve this:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  Set alias for Account A
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;provider “aws” {
  region = “us-east-1”
#AWS account A access key credentials
 access_key = “A***************”
 secret_key = “U******************”
  alias  = “account_a”
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  Create an IAM user in AWS Account A
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource “aws_iam_user” “cross_account_user” {
  provider = aws.account_a
  name = “CrossAccountUser”
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  Setup AWS Account B
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;provider “aws” {
  region = “us-east-1”
 #AWS account B access key credentials
 access_key = “A***************”
 secret_key = “U******************”
  alias  = “account_b”
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  Define an IAM role in AWS Account B
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource “aws_iam_role” “cross_account_role” {
  provider = aws.account_b
  name = “CrossAccountRole”
  assume_role_policy = jsonencode({
    Version   = “2012-10-17”,
    Statement = [{
      Effect    = “Allow”,
      Principal = {
        AWS = “arn:aws:iam::Account_A_ID:user/CrossAccountUser”  # Replace [Account A ID] with AWS Account A’s AWS account ID
      },
      Action    = “sts:AssumeRole”,
    }]
  })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  Define IAM policy to allow reading files from S3 buckets in Account B
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Replace "bucket-name" with the name of your bucket in AWS Account B
resource "aws_iam_policy" "s3_read_policy" {
  provider = aws.account_b
  name     = "S3ReadPolicy"
  policy   = &amp;lt;&amp;lt;EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "s3:GetObject",
      "s3:ListBucket"
    ],
    "Resource": [
      "arn:aws:s3:::bucket-name/*",
      "arn:aws:s3:::bucket-name"
    ]
  }]
}
EOF
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  Attach IAM policy to the IAM role in AWS Account B
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource “aws_iam_role_policy_attachment” “s3_read_attach_policy” {
  provider = aws.account_b
  role       = aws_iam_role.cross_account_role.name
  policy_arn = aws_iam_policy.s3_read_policy.arn
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure to replace &lt;strong&gt;Account_A_ID&lt;/strong&gt; with AWS Account A’s account ID and &lt;strong&gt;bucket-name&lt;/strong&gt; with the name of the &lt;strong&gt;S3 bucket&lt;/strong&gt; in AWS Account B that you want to grant access to. Also, ensure that the bucket policy in AWS Account B allows access from the &lt;strong&gt;CrossAccountRole&lt;/strong&gt; role.&lt;/p&gt;

&lt;p&gt;With this setup, the &lt;strong&gt;CrossAccountUser&lt;/strong&gt; in Account A can assume the &lt;strong&gt;CrossAccountRole&lt;/strong&gt; in Account B and access files from the specified &lt;strong&gt;s3&lt;/strong&gt; bucket in Account B.&lt;/p&gt;
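&lt;p&gt;If the bucket in Account B enforces its own bucket policy, that policy must also allow the role. A minimal sketch, assuming the bucket policy is managed in the same configuration (replace &lt;strong&gt;bucket-name&lt;/strong&gt; as before):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Bucket policy in Account B granting read access to CrossAccountRole
resource "aws_s3_bucket_policy" "allow_cross_account_read" {
  provider = aws.account_b
  bucket   = "bucket-name"
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [{
      Effect    = "Allow",
      Principal = { AWS = aws_iam_role.cross_account_role.arn },
      Action    = ["s3:GetObject", "s3:ListBucket"],
      Resource  = [
        "arn:aws:s3:::bucket-name",
        "arn:aws:s3:::bucket-name/*"
      ]
    }]
  })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;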

&lt;h2&gt;
  
  
  &lt;strong&gt;Enforcing IAM Best Practices with Policy-as-Code&lt;/strong&gt; 
&lt;/h2&gt;

&lt;p&gt;Enforcing IAM best practices with Policy-as-Code ensures that security policies are consistently applied across an organization’s cloud infrastructure. By codifying IAM policies, teams can automate the enforcement of security controls, reducing the risk of misconfigurations and unauthorized access.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.checkov.io/" rel="noopener noreferrer"&gt;checkov&lt;/a&gt; is one of the Policy-as-Code tools available in cloud security. It is an open-source static code analysis tool developed by Bridgecrew.&lt;/p&gt;

&lt;p&gt;It scans infrastructure as code (IaC) templates like Terraform and CloudFormation to detect security and compliance issues early. By analyzing configurations against predefined policies and industry standards, Checkov helps identify misconfigurations, vulnerabilities, and compliance violations. It focuses on cloud security, particularly in AWS, Azure, and GCP environments, and integrates seamlessly into CI/CD pipelines for proactive issue remediation before deployment.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Utilizing Checkov to detect IAM configuration issues early, focusing on preventing overly permissive policies&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Regarding IAM configuration issues, Checkov plays a crucial role in detecting overly permissive policies early in the development process.&lt;/p&gt;

&lt;p&gt;Here’s how Checkov helps in detecting IAM configuration issues, mainly focusing on preventing overly permissive policies:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Static Analysis with Checkov: Configuring Checkov for IAM policy scans.&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let’s set up Checkov to scan the following Terraform code example for potential security risks and misconfigurations.&lt;/p&gt;

&lt;p&gt;Example Terraform code we’ll be analyzing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  AWS Provider Configuration – Set the AWS &lt;strong&gt;region to us-east-1&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;provider “aws” {
  region = “us-east-1”
  access_key = “A*********************”
  secret_key =   “U****************************”
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  IAM Policy for S3 Read Access – Create an IAM policy named &lt;strong&gt;s3_read_policy&lt;/strong&gt; that allows read access (s3:GetObject, s3:ListBucket) to a specified S3 bucket.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource “aws_iam_policy” “s3_read_policy” {
  name        = “s3_read_policy”
  description = “Allows read access to files in the specified S3 bucket”
  policy      = &amp;lt;&amp;lt;EOF
{
  “Version”: “2012-10-17”,
  “Statement”: [
    {
      “Effect”: “Allow”,
      “Action”: [
        “s3:GetObject”,
        “s3:ListBucket”
      ],
      “Resource”: [
        “arn:aws:s3:::your-bucket-name/*”,
        “arn:aws:s3:::your-bucket-name”
      ]
    }
  ]
}
EOF
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  IAM User Creation – Define an IAM user named &lt;strong&gt;iam_user&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource “aws_iam_user” “iam_user” {
  name = “iam_user”
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  IAM Role and Assume Role Policy – Set up an IAM role named &lt;strong&gt;iam-role&lt;/strong&gt; with an assume role policy that allows the &lt;strong&gt;iam_user&lt;/strong&gt; to assume this role.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource “aws_iam_role” “iam_role” {
  name = “iam-role”
  assume_role_policy = jsonencode({
    Version = “2012-10-17”,
    Statement = [{
      Action = “sts:AssumeRole”,
      Principal = {
        AWS = “arn:aws:iam::AWS_ACCOUNT_ID:user/${aws_iam_user.iam_user.name}”
      },
      Effect = “Allow”,
      Sid    = “AssumeRole”
    }]
  })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  Policy Attachment – Attach the &lt;strong&gt;s3_read_policy&lt;/strong&gt; to the &lt;strong&gt;iam_role&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource “aws_iam_policy_attachment” “s3_read_attach” {
  roles       = [aws_iam_role.iam_role.name]
  policy_arn = aws_iam_policy.s3_read_policy.arn
  name     = “Attaching s3 policy to iam role”
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Configuring Checkov for IAM Policy Scans&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Install Checkov&lt;/strong&gt; – First, ensure that Checkov is installed in your environment. If not, install it via pip by running:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install checkov
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Run a Checkov Scan&lt;/strong&gt; – Checkov will provide a detailed report of any issues, including security vulnerabilities, best-practice violations, and compliance problems. In our example, Checkov can help identify potential risks in &lt;strong&gt;IAM policies&lt;/strong&gt;, such as overly broad permissions, and suggest mitigations.&lt;/p&gt;

&lt;p&gt;For a single file – run the &lt;em&gt;&lt;code&gt;checkov --file file_name.tf&lt;/code&gt;&lt;/em&gt; command.&lt;/p&gt;

&lt;p&gt;For a directory – change into the directory containing your Terraform files and run the following command to scan all files within it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;checkov –d .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fceyxldm8ywahquhwns6h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fceyxldm8ywahquhwns6h.png" alt="checkov" width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;checkov output&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cv8nxexo40mlyi8l7nj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cv8nxexo40mlyi8l7nj.png" alt="checkov output 2" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frd8z17qicxxvavx8h37v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frd8z17qicxxvavx8h37v.png" alt="checkov output 3 " width="800" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Refining the Scan&lt;/strong&gt; – If you want to focus specifically on IAM-related checks, you can use Checkov’s &lt;em&gt;&lt;code&gt;--check&lt;/code&gt;&lt;/em&gt; flag to include or exclude certain checks based on their IDs, tailoring the scan to your specific needs. For example, to ensure IAM policies that allow full “*-*” administrative privileges are not created, you can run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;checkov -d . --check CKV_AWS_62
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output for Terraform code example scan&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz4prxvlp73xdlbn6h4gn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz4prxvlp73xdlbn6h4gn.png" alt="checkov example scan" width="800" height="228"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Results&lt;/strong&gt; – By running Checkov as described, we can identify potential security issues such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Excessive permissions in the IAM policy.&lt;/li&gt;
&lt;li&gt;  IAM policies attached directly to users instead of roles.&lt;/li&gt;
&lt;li&gt;  Missing or overly permissive assume-role policies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Addressing these issues involves modifying the Terraform code to adhere to best practices, such as implementing least privilege, using roles for cross-account access, and ensuring policies are scoped appropriately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Custom Policies&lt;/strong&gt; – Checkov also lets you write custom policies to enforce security or compliance requirements that its built-in checks do not cover. This is helpful if your organization has standards specific to its own environment.&lt;/p&gt;
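&lt;p&gt;For instance, a custom policy written in Checkov’s YAML format can flag IAM policy documents that grant wildcard actions. A sketch of the format – the check name, ID, and directory below are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# custom_policies/iam_no_wildcard_actions.yaml
metadata:
  name: "Ensure IAM policies do not grant wildcard actions"
  id: "CKV2_CUSTOM_1"
  category: "IAM"
definition:
  cond_type: "attribute"
  resource_types:
    - "aws_iam_policy"
  attribute: "policy"
  operator: "not_contains"
  value: '"Action": "*"'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can then include it in a scan with &lt;em&gt;&lt;code&gt;checkov -d . --external-checks-dir ./custom_policies&lt;/code&gt;&lt;/em&gt;.&lt;/p&gt;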

&lt;h3&gt;
  
  
  &lt;strong&gt;Integrating Policy Scans in CI/CD: Automating IAM policy compliance checks before deployment.&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqt3dxal4yr09iiztwvn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqt3dxal4yr09iiztwvn.png" alt="policy compliance with CI/CD" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;policy compliance with CI/CD&lt;/p&gt;
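&lt;p&gt;As a sketch, a CI job can fail the build whenever Checkov reports violations, so non-compliant IAM changes never reach &lt;em&gt;terraform apply&lt;/em&gt;. A minimal GitHub Actions workflow – the file name and trigger are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# .github/workflows/checkov.yml
name: checkov
on: [pull_request]

jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Checkov
        run: pip install checkov
      - name: Scan Terraform files
        run: checkov -d .   # a non-zero exit code fails the job on violations
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;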

&lt;h2&gt;
  
  
  &lt;strong&gt;Wrapping with IAM governance &amp;amp; best practices&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Focusing on Identity and Access Management (IAM) governance and best practices is essential for ensuring the security and compliance of cloud environments. This approach helps systematically manage digital identities, their authentication, authorization, roles, and privileges within or across system and enterprise boundaries.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Integrating IAM Governance&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;IAM governance should be an integral part of any organization’s security strategy. It involves several key components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Organizations should strive for centralized management of user identities and their access across all systems and platforms. This simplifies the enforcement of access policies and compliance with regulatory requirements.&lt;/li&gt;
&lt;li&gt;  Assigning permissions based on roles tightly aligned with organizational structures and job functions streamlines access management and enforces the principle of least privilege.&lt;/li&gt;
&lt;li&gt;  Conducting regular audits of IAM policies and practices helps identify and remediate unused or excessive permissions and ensures compliance with relevant standards and regulations.&lt;/li&gt;
&lt;li&gt;  Implementing robust processes for the entire lifecycle of user identities – from creation through management, to deletion – ensures that access rights are always up to date and reduces the risk of orphaned accounts.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;IAM Best Practices&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To enhance IAM governance, organizations should adhere to a set of best practices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt; Ensuring that users have only the minimum levels of access required to perform their functions minimizes potential damage from errors or malicious intent.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use multi-factor authentication (MFA) and strong password policies to enhance security. For critical resources, consider additional authentication factors and stringent authorization checks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Separate roles and responsibilities to prevent conflicts of interest or fraud. This is crucial in preventing any single point of compromise.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Automate the process of granting and revoking access to minimize the risk of oversight.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Leverage managed policies for easier administration and reuse of standard permission sets.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Methods to perform compliance audits on IAM configurations&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Open Policy Agent (OPA) is an open-source, general-purpose policy engine that unifies policy enforcement across the cloud-native stack, and it can be incorporated into IaC workflows. OPA enables you to craft policies that govern and secure your cloud environments without embedding policy logic within your applications, enhancing their security, compliance, and governance. &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;OPA Policy Example for Terraform&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For creating an Open Policy Agent (OPA) policy relevant to the provided Terraform code example (which involves IAM policies for S3 read access, IAM user, and IAM role creations), we’ll focus on enforcing a rule that IAM policies should specify a specific &lt;strong&gt;S3 bucket&lt;/strong&gt; and not allow broad access.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;OPA Policy for Specific S3 Bucket Access in IAM Policies&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;OPA policies are written in a high-level declarative language called Rego. This policy aims to ensure that any IAM policy granting access to S3 buckets explicitly specifies the bucket name, rather than allowing access to all buckets.&lt;/p&gt;

&lt;p&gt;Define a Rego policy file, e.g., &lt;strong&gt;iam_policy.rego&lt;/strong&gt;, that includes the rule to check IAM policy statements for specific S3 bucket access.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package terraform.analysis

default allow = false

# Rule to check for specific bucket access in IAM policies
allow {
    some i
    policy := json.unmarshal(input.resource.aws_iam_policy[i].policy)
    statement := policy.Statement[_]
    statement.Effect == "Allow"
    action_allowed(statement.Action)
    not wildcard_bucket_access(statement.Resource)
}

# Helper to check if actions related to S3 read are allowed
action_allowed(actions) {
    allowed_actions := ["s3:GetObject", "s3:ListBucket"]
    allowed_actions[_] == actions[_]
}

# Helper to check for wildcard bucket access
wildcard_bucket_access(resources) {
    resources[_] == "arn:aws:s3:::*"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This Rego policy does the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Checks IAM policies: It looks for &lt;strong&gt;aws_iam_policy&lt;/strong&gt; resources in the Terraform plan.&lt;/li&gt;
&lt;li&gt;  Parses the policy JSON: It unmarshals the JSON policy document to inspect the policy statements.&lt;/li&gt;
&lt;li&gt;  Evaluates policy statements: It checks if any “Allow” statements permit s3:GetObject or s3:ListBucket actions.&lt;/li&gt;
&lt;li&gt;  Ensures specific bucket access: It ensures that resources do not include a wildcard (arn:aws:s3:::*), indicating that the policy specifies particular buckets.&lt;/li&gt;
&lt;/ul&gt;
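&lt;p&gt;To make the rule’s intent concrete, here is a minimal Python sketch (this is our illustration, not part of OPA; the helper names simply mirror the Rego rules) applying the same checks to a policy document:&lt;/p&gt;

```python
import json

# Hypothetical policy document, shaped like the article's S3 read policy.
policy_json = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": ["arn:aws:s3:::my-app-bucket", "arn:aws:s3:::my-app-bucket/*"],
    }],
})

ALLOWED_ACTIONS = {"s3:GetObject", "s3:ListBucket"}

def action_allowed(actions):
    # Mirrors the Rego helper: the statement grants at least one S3 read action.
    return bool(ALLOWED_ACTIONS.intersection(actions))

def wildcard_bucket_access(resources):
    # Mirrors the Rego helper: the statement grants access to all buckets.
    return "arn:aws:s3:::*" in resources

def allow(document):
    # Mirrors the Rego rule: some "Allow" statement grants S3 read access
    # without resorting to a wildcard bucket ARN.
    policy = json.loads(document)
    return any(
        s["Effect"] == "Allow"
        and action_allowed(s["Action"])
        and not wildcard_bucket_access(s["Resource"])
        for s in policy["Statement"]
    )

print(allow(policy_json))  # True: the policy is scoped to a specific bucket
```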

&lt;h3&gt;
  
  
  &lt;strong&gt;Using the OPA Policy&lt;/strong&gt;:
&lt;/h3&gt;

&lt;p&gt;To use this policy, you would typically evaluate it against your Terraform plan output in JSON format, using the opa eval command. First, generate the Terraform plan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform plan -out=tfplan.binary
terraform show -json tfplan.binary &amp;gt; tfplan.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, evaluate your policy with OPA:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;opa eval --format pretty --data iam_policy.rego --input tfplan.json "data.terraform.analysis.allow"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This OPA policy scrutinizes your Terraform plan, specifically checking whether IAM policies for S3 access are narrowly scoped to specific buckets. Enforcing such policies ensures that your cloud environment adheres to security best practices, significantly mitigating potential risks.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We’ve explored the nuances of managing AWS IAM through Terraform, highlighting its significance in bolstering cloud security, treating IAM configurations as Infrastructure-as-Code, and the critical role of Zero-Trust policies within IAM. We covered setting up IAM policies, creating users and roles, and managing cross-account access and trust relationships.&lt;/p&gt;

&lt;p&gt;The exploration into enforcing IAM best practices through policy-as-code with tools like checkov underscored the transformative impact of static code analysis in preempting configuration errors and security risks.&lt;/p&gt;

&lt;p&gt;Finally, we touched upon IAM governance and compliance, underscoring methods like Rego policy definitions with OPA for performing compliance audits on IAM configurations. This ensures alignment with security best practices and regulatory standards, cementing IAM’s role in securing cloud environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Commonly Asked Questions&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;1) &lt;strong&gt;What are the best practices for managing AWS IAM policies with Terraform?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Use Least Privilege Principle – Grant only the permissions necessary for a user, group, or role to perform their intended tasks.&lt;/li&gt;
&lt;li&gt;  Separation of Concerns – Organize IAM policies logically by separating them based on roles, responsibilities, or permissions.&lt;/li&gt;
&lt;li&gt;  Enable Policy Testing – Implement automated tests to validate IAM policies for correctness and compliance with organizational policies and regulatory requirements.&lt;/li&gt;
&lt;li&gt;  Rotate IAM Credentials Regularly – Reduce risk by rotating IAM access keys and credentials; AWS Secrets Manager can automate rotation, and AWS IAM Access Analyzer can help surface unused credentials.&lt;/li&gt;
&lt;li&gt;  Use Infrastructure-as-Code – Manage IAM policies and their changes as code to maintain consistency and keep them reviewable.&lt;/li&gt;
&lt;li&gt;  Monitor and Audit IAM Changes – Implement and review logging and monitoring of IAM actions and changes using AWS CloudTrail and AWS Config.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2) &lt;strong&gt;Can Terraform manage dynamic IAM policies for temporary access?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes, Terraform can manage dynamic IAM policies for temporary access using AWS IAM roles with session policies and AWS Security Token Service (STS).&lt;/p&gt;

&lt;p&gt;3) &lt;strong&gt;How do I create and manage AWS IAM users and their access keys with Terraform?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Terraform’s AWS provider exposes resources such as &lt;strong&gt;aws_iam_user&lt;/strong&gt;, &lt;strong&gt;aws_iam_access_key&lt;/strong&gt;, and &lt;strong&gt;aws_iam_user_policy&lt;/strong&gt;, which let you create and manage AWS IAM users and their access keys.&lt;/p&gt;

&lt;p&gt;4) &lt;strong&gt;What are instance profiles, and how do they relate to IAM roles in AWS?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instance profiles associate IAM roles with EC2 instances, allowing the instances to inherit the role’s permissions. When an IAM role is attached to an EC2 instance, the corresponding instance profile is attached. This mechanism enables EC2 instances and other services to securely access AWS resources without requiring long-term credentials like access keys or passwords.&lt;/p&gt;
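&lt;p&gt;As a hedged sketch of that mechanism (role, profile, and AMI identifiers below are hypothetical), wiring a role to an EC2 instance via an instance profile in Terraform looks roughly like this:&lt;/p&gt;

```hcl
# Hypothetical example: a role trusted by EC2, exposed to an instance
# through an instance profile.
resource "aws_iam_role" "ec2_role" {
  name = "demo-ec2-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

resource "aws_iam_instance_profile" "ec2_profile" {
  name = "demo-ec2-profile"
  role = aws_iam_role.ec2_role.name
}

resource "aws_instance" "app" {
  ami                  = "ami-0123456789abcdef0" # hypothetical AMI ID
  instance_type        = "t3.micro"
  iam_instance_profile = aws_iam_instance_profile.ec2_profile.name
}
```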

&lt;p&gt;&lt;a href="https://www.aviator.co/merge-queue" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5wqgwrk9nbggrwv2wmu.png" width="800" height="97"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>terraform</category>
      <category>iam</category>
    </item>
    <item>
      <title>Rethinking code reviews with stacked PRs</title>
      <dc:creator>Ibrahim Salami</dc:creator>
      <pubDate>Fri, 06 Sep 2024 15:33:14 +0000</pubDate>
      <link>https://dev.to/aviator_co/rethinking-code-reviews-with-stacked-prs-70f</link>
      <guid>https://dev.to/aviator_co/rethinking-code-reviews-with-stacked-prs-70f</guid>
      <description>&lt;p&gt;The peer code review process is an essential part of software development. It helps maintain software quality and promotes adherence to standards, project requirements, style guides, and facilitates learning and knowledge transfer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aviator.co/mergequeue/quick-setup" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn14hi1ge9h0ymkp0tgsr.png" width="800" height="97"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Code review effectiveness
&lt;/h3&gt;

&lt;p&gt;While review effectiveness is high for sufficiently small code changes, it drops sharply as the size of the change grows. Large code reviews are exhausting because of the sustained mental focus they demand, and the longer a review drags on, the less effective it becomes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuaa9t6nlt8pvynp6ek8s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuaa9t6nlt8pvynp6ek8s.png" width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So why can’t we just restrict the size of the pull requests (PRs)? While many changes can start small, suddenly a small two-line change can grow into a 500-line refactor including multiple back-and-forth conversations with reviewers. Some engineering teams also maintain long-running feature branches as they continue working, making it hard to review.&lt;/p&gt;

&lt;p&gt;So, how do we strike the right balance? Simple. Use stacked PRs.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are stacked PRs?
&lt;/h3&gt;

&lt;p&gt;Stacked pull requests break a large change into smaller, iterative changes stacked on top of each other, instead of bundling everything into a single monolithic pull request. Each PR in the stack focuses on one logical change only, making the review process more manageable and less time-consuming.&lt;/p&gt;

&lt;p&gt;We also wrote a post last year explaining how this helps represent &lt;a href="https://www.aviator.co/blog/stacked-prs-code-changes-as-narrative/" rel="noopener noreferrer"&gt;code changes as a narrative&lt;/a&gt; instead of breaking things down by files or features.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why stacked PRs?
&lt;/h3&gt;

&lt;p&gt;Other than building a culture of more effective code reviews, there are a few other benefits of stacked PRs:&lt;/p&gt;

&lt;h4&gt;
  
  
  Early code review feedback
&lt;/h4&gt;

&lt;p&gt;Imagine that you are implementing a large feature. Instead of creating the entire feature and then requesting a code review, consider carving out the initial framework and promptly putting it up for feedback. This could potentially save you countless hours by getting early feedback on your design.&lt;/p&gt;

&lt;h4&gt;
  
  
  Faster CI feedback cycle
&lt;/h4&gt;

&lt;p&gt;Stacked PRs support the &lt;a href="https://en.wikipedia.org/wiki/Shift-left_testing" rel="noopener noreferrer"&gt;shift-left&lt;/a&gt; practice because changes are continuously integrated and tested, which allows for early detection and rectification of issues. The changes are merged in bits and pieces, catching any issues early, versus merging one giant change and hoping it does not bring down prod!&lt;/p&gt;

&lt;h4&gt;
  
  
  Knowledge sharing
&lt;/h4&gt;

&lt;p&gt;Code reviews are also wonderful for posterity. Your code changes are narrating your thought process behind implementing a feature, therefore, the breakdown of changes creates more effective knowledge transfer. It’s easier for team members to understand the changes, which promotes better knowledge sharing for the future.&lt;/p&gt;

&lt;h4&gt;
  
  
  Staying unblocked
&lt;/h4&gt;

&lt;p&gt;Waiting to get code reviewed and approved can be a frustrating process. With stacked PRs, developers can work on multiple parts of a feature without waiting for reviewers to approve previous PRs.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s the catch?
&lt;/h3&gt;

&lt;p&gt;So, why don’t more developers use stacked PRs for code reviews?&lt;/p&gt;

&lt;p&gt;Although this stacked PR workflow addresses both the desired practices of keeping code reviews manageable and developers productive, unfortunately, it is not supported very well natively by either git or GitHub. As a result, &lt;a href="https://docs.google.com/spreadsheets/d/1riYPbdprf6E3QP1wX1BeASn2g8FKBgbJlrnKmwfU3YE/edit?usp=sharing" rel="noopener noreferrer"&gt;several tools&lt;/a&gt; have been developed across the open-source community to enable engineers to incorporate this stacking technique into the existing git and GitHub platforms. But stacking the PRs is only part of the story.&lt;/p&gt;

&lt;h4&gt;
  
  
  Updating
&lt;/h4&gt;

&lt;p&gt;As we get code review feedback and we make changes to part of the stack, we have to now rebase and resolve conflicts at all subsequent branches.&lt;/p&gt;

&lt;p&gt;Let’s take an example. Imagine that you are working on a change that requires making a schema change, a backend change, and a frontend change. With that, you can now send a simple schema change for review first, and while that’s being reviewed you can start working on the backend and frontend. Using stacked PRs, all three changes can be reviewed by three different reviewers.&lt;/p&gt;

&lt;p&gt;In this case, you may have a stack that looks like this, where &lt;code&gt;demo/schema&lt;/code&gt;, &lt;code&gt;demo/backend&lt;/code&gt; and &lt;code&gt;demo/frontend&lt;/code&gt; represent the 3 branches stacked on top of each other.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu05qj92fau0qnaem5fsw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu05qj92fau0qnaem5fsw.png" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So far this makes sense, but what if you got some code review comments on the schema change that requires creating a new commit? Suddenly your commit history looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6tgv1njfn0jhd1jfqtxo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6tgv1njfn0jhd1jfqtxo.png" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now you have to manually rebase all subsequent branches and resolve conflicts at every stage. Imagine if you have 10 stacked branches where you may have to resolve the conflicts 10 times.&lt;/p&gt;
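&lt;p&gt;That manual churn can be reproduced end to end in a throwaway repository; the following self-contained sketch (file names and commit messages are ours, the branch names follow the article’s example) builds the three stacked branches, amends the bottom one, and then rebases each later branch by hand:&lt;/p&gt;

```shell
# Hypothetical standalone demo of manually propagating a change up a stack.
set -e
cd "$(mktemp -d)"
git init -q stack-demo
cd stack-demo
git config user.email demo@example.com
git config user.name demo
echo base | tee app.txt
git add app.txt
git commit -qm "initial"

# Build the stack: schema -> backend -> frontend.
git checkout -q -b demo/schema
echo schema | tee -a app.txt
git commit -qam "schema change"
git checkout -q -b demo/backend
echo backend | tee -a app.txt
git commit -qam "backend change"
git checkout -q -b demo/frontend
echo frontend | tee -a app.txt
git commit -qam "frontend change"

# A review comment forces a new commit on the bottom branch...
git checkout -q demo/schema
echo notes | tee notes.txt
git add notes.txt
git commit -qm "address review"

# ...so every branch stacked above it must now be rebased, one by one:
git checkout -q demo/backend
git rebase -q demo/schema
git checkout -q demo/frontend
git rebase -q demo/backend

git log --oneline   # all five commits, back in one linear history
```

With ten stacked branches, that final rebase loop runs nine more times, which is exactly the overhead the rest of this section is about.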

&lt;h4&gt;
  
  
  Merging
&lt;/h4&gt;

&lt;p&gt;But that’s not all, merging a PR in the stack can be a real nightmare. You have 3 options &lt;code&gt;squash&lt;/code&gt;, &lt;code&gt;merge&lt;/code&gt; and &lt;code&gt;rebase&lt;/code&gt; to merge a PR. Let’s try to understand what goes behind the scenes in each one.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  In the case of a &lt;code&gt;squash&lt;/code&gt; commit, Git takes changes from all the existing commits of the PR and rewrites them into a single commit. In this case, no history is maintained on where those changes came from.&lt;/li&gt;
&lt;li&gt;  A &lt;code&gt;merge&lt;/code&gt; commit is a special type of Git commit that is represented by a combination of two or more commits. So, it works very similar to a &lt;code&gt;squash&lt;/code&gt; commit but it also captures information about its parents. In a typical scenario, a merge commit has two parents: the last commit on the base branch (where the PR is merged) and the top commit on the feature branch that was merged. Although this approach gives more context to the commit history, it inadvertently creates &lt;a href="https://idiv-biodiversity.github.io/git-knowledge-base/linear-vs-nonlinear.html" rel="noopener noreferrer"&gt;non-linear git-history&lt;/a&gt; that can be undesirable.&lt;/li&gt;
&lt;li&gt;  Finally, in case of a &lt;code&gt;rebase&lt;/code&gt; and merge, Git will rewrite the commits onto the base branch. So similar to &lt;code&gt;squash&lt;/code&gt; commit option, it will lose any history associated with the original commits.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Typically if you are using the &lt;code&gt;merge&lt;/code&gt; commit strategy while stacking PRs, your life will be a bit simpler, but most teams discourage using that strategy to keep the git-history clean. That means you are likely using either a &lt;code&gt;squash&lt;/code&gt; or a &lt;code&gt;rebase&lt;/code&gt; merge. And that creates a merge conflict for all subsequent unmerged stacked branches.&lt;/p&gt;

&lt;p&gt;In the example above, let’s say we squash merge the first branch &lt;code&gt;demo/schema&lt;/code&gt; into mainline. It will create a new commit &lt;code&gt;D1&lt;/code&gt; that contains changes of &lt;code&gt;A1&lt;/code&gt; and &lt;code&gt;A2&lt;/code&gt;. Since Git does not know where &lt;code&gt;D1&lt;/code&gt; came from, and &lt;code&gt;demo/backend&lt;/code&gt; is still based on &lt;code&gt;A2&lt;/code&gt;, trying to rebase &lt;code&gt;demo/backend&lt;/code&gt; on top of the mainline will create merge conflicts.&lt;/p&gt;

&lt;p&gt;Likewise, rebasing &lt;code&gt;demo/frontend&lt;/code&gt; after rebasing &lt;code&gt;demo/backend&lt;/code&gt; will also cause the same issues. So if you had ten stacked branches and you squash merged one of them, you would have to resolve these conflicts nine times.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk9crbfwteg1c16359ays.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk9crbfwteg1c16359ays.png" width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are still just scratching the surface, there are &lt;a href="https://docs.aviator.co/aviator-cli/how-to-guides" rel="noopener noreferrer"&gt;many other use cases&lt;/a&gt; such as reordering commits, splitting, folding, and renaming branches, that can create huge overhead to manage when dealing with stacked PRs.&lt;/p&gt;

&lt;p&gt;That’s why we built stacked PRs management as part of Aviator.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Aviator CLI is different
&lt;/h3&gt;

&lt;p&gt;Think of Aviator as an augmentation layer that sits on top of your existing tooling. Aviator connects with GitHub, Slack, Chrome, and Git CLI to provide an enhanced developer experience.&lt;/p&gt;

&lt;p&gt;Aviator CLI works seamlessly with everything else! The CLI isn’t just a layer on top of Git, but also understands the context of stacks across GitHub. Let’s consider an example.&lt;/p&gt;

&lt;h4&gt;
  
  
  Creating a stack
&lt;/h4&gt;

&lt;p&gt;Creating a stack is fairly straightforward; the only difference is that we use the &lt;code&gt;av&lt;/code&gt; CLI to create the branches so that the stack is tracked. For instance, to create your schema branch and corresponding PR, follow the steps below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;av stack branch demo/schema
# make schema changes
git commit -a -m "[demo] schema changes"
av pr create
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since Aviator is also connected to your GitHub, it makes it easy for you to visualize the stack.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwl1gcm49q3lrqs7vdoj5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwl1gcm49q3lrqs7vdoj5.png" width="800" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or if you want to visualize it from the terminal, you can still do that with the CLI commands:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fve27c776oldh5u0e263w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fve27c776oldh5u0e263w.png" width="800" height="159"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Updating the stack
&lt;/h4&gt;

&lt;p&gt;Using the stack now becomes a cakewalk. You can add new commits to any branch, and simply run &lt;code&gt;av stack sync&lt;/code&gt; from anywhere in the stack to synchronize all branches. Aviator automatically rebases all the branches for you, and if there’s a real merge conflict, you just have to resolve it once.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F187w2fpauey6rzq7seqn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F187w2fpauey6rzq7seqn.png" width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Merging the stack
&lt;/h4&gt;

&lt;p&gt;This is where Aviator tools easily stand out from any existing tooling. At Aviator, we have built one of the most advanced merge queues, MergeQueue, to manage auto-merging thousands of changes at scale, with seamless integration between the CLI and stacked PRs. To merge a partial or full stack of PRs, you can assign them to Aviator MergeQueue using the CLI command &lt;code&gt;av pr queue&lt;/code&gt; or by posting a comment in GitHub: &lt;code&gt;/aviator stack merge&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Aviator automatically handles validating, updating, and auto-merging all queued stacks in order.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5t2b4hv80ycbykxocih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5t2b4hv80ycbykxocih.png" width="800" height="572"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the PRs are merged, you can run &lt;code&gt;av stack sync --trunk&lt;/code&gt; to update all remaining PRs and clean out the merged ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  Shift-Left is the future
&lt;/h3&gt;

&lt;p&gt;Stacked PRs might initially seem like more work due to the need to break down changes into smaller parts. However, the increase in code review efficiency, faster feedback loops, and enhanced learning opportunities will surely outweigh this overhead. As we continue embracing the shift-left principles, stacked PRs will become increasingly useful.&lt;/p&gt;

&lt;p&gt;The Aviator CLI provides a great way to manage stacked PRs with a lot less tedium. The CLI is &lt;a href="https://github.com/aviator-co/av" rel="noopener noreferrer"&gt;open-source&lt;/a&gt; and completely free. We would love for you to try it out and share your feedback on our &lt;a href="https://github.com/aviator-co/av/discussions" rel="noopener noreferrer"&gt;discussion board&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At Aviator, we are building developer productivity tools from first principles to empower developers to build faster and better.&lt;/p&gt;

</description>
      <category>codereview</category>
      <category>prs</category>
      <category>devops</category>
    </item>
    <item>
      <title>Scanning AWS S3 Buckets for Security Vulnerabilities</title>
      <dc:creator>Ibrahim Salami</dc:creator>
      <pubDate>Tue, 27 Aug 2024 18:55:23 +0000</pubDate>
      <link>https://dev.to/aviator_co/scanning-aws-s3-buckets-for-security-vulnerabilities-3ie5</link>
      <guid>https://dev.to/aviator_co/scanning-aws-s3-buckets-for-security-vulnerabilities-3ie5</guid>
      <description>&lt;p&gt;All cloud providers offer some variations of file bucket services. These file bucket services allow users to store and retrieve data in the cloud, offering scalability, durability, and accessibility through web portals and APIs. For instance, AWS offers &lt;a href="https://aws.amazon.com/s3/" rel="noopener noreferrer"&gt;Amazon Simple Storage Service (S3)&lt;/a&gt;, GCP offers &lt;a href="https://cloud.google.com/storage" rel="noopener noreferrer"&gt;Google Cloud Storage&lt;/a&gt;, and DigitalOcean provides &lt;a href="https://www.digitalocean.com/products/spaces" rel="noopener noreferrer"&gt;Spaces&lt;/a&gt;. However, if unsecured, these file buckets pose a major security risk, potentially leading to data breaches, data leakages, malware distribution, and data tampering. For example, the United Kingdom Council’s data on &lt;a href="https://www.theregister.com/2023/05/22/capita_security_pensions_aws_bucket_city_councils/" rel="noopener noreferrer"&gt;member’s benefits&lt;/a&gt; was exposed by an unsecured AWS bucket. In another incident in 2021, an unsecured bucket belonging to a &lt;a href="https://www.healthcareinfosecurity.com/report-unsecured-aws-bucket-leaked-cancer-website-user-data-a-19024" rel="noopener noreferrer"&gt;non-profit cancer organization&lt;/a&gt; exposed sensitive images and data for tens of thousands of individuals.&lt;/p&gt;

&lt;p&gt;Thankfully, &lt;a href="https://github.com/sa7mon/S3Scanner" rel="noopener noreferrer"&gt;S3Scanner&lt;/a&gt; can help. S3Scanner is a free and easy-to-use tool that can help you identify and fix unsecured file buckets in all major cloud providers: Amazon S3, Google Cloud Storage, and Spaces:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzm3tu9glx57nfp3gph5s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzm3tu9glx57nfp3gph5s.png" alt="s3 storage bucket architecture" width="800" height="537"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this article, you’ll learn all about S3Scanner and how it can help identify unsecured file buckets on multiple cloud providers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Security Risks in Amazon S3 Buckets&lt;a href="https://github.com/jainankit/demorepo/new/master#common-security-risks-in-amazon-s3-buckets" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Amazon S3 buckets offer a simple and scalable solution for storing your data in the cloud. However, just like any other online storage platform, there are security risks you need to be aware of.&lt;/p&gt;

&lt;p&gt;Following are some of the most common security risks associated with Amazon S3 buckets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Unintentional public access:&lt;/strong&gt; Misconfigurations, such as overly permissive permissions (&lt;em&gt;ie&lt;/em&gt; granting public read access), can allow unauthorized users to access and perform actions on your S3 bucket.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Insecure bucket policies and permissions:&lt;/strong&gt; S3 buckets use identity and access management (IAM) to control access to data. This allows you to define permissions for individual users and groups using bucket policies. If your bucket policies are not properly configured, it can give unauthorized users access to your data (&lt;em&gt;eg&lt;/em&gt; policies using wildcard). Poorly configured IAM settings can also result in compliance violations due to unauthorized data access or modification, which impacts regulatory requirements and can expose the organization to legal consequences.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Data exposure and leakage:&lt;/strong&gt; Even if your S3 bucket isn’t public, data can still be exposed. For instance, data can be exposed if you accidentally share the URL of an object with someone else or if there are overly permissive permissions for that bucket. Additionally, data exposure can occur if you download data from your S3 bucket to an insecure location.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Lack of encryption:&lt;/strong&gt; The &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingEncryption.html" rel="noopener noreferrer"&gt;lack of encryption&lt;/a&gt; for data stored in S3 buckets is another significant security risk. Without encryption, intercepted data during transit or compromised storage devices may expose sensitive information.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Managing AWS access control and encryption options can be difficult. AWS has numerous tools, ranging from intricate access controls to robust encryption options, that help protect your data and accounts from unauthorized access. Navigating this wide range of tools can be daunting, especially for individuals who don’t have a background in security. A single misconfigured policy or permission can leave sensitive data exposed to unintended audiences.&lt;/p&gt;

&lt;p&gt;This is where S3Scanner could be useful.&lt;/p&gt;

&lt;h2&gt;
  What Is S3Scanner
&lt;/h2&gt;

&lt;p&gt;S3Scanner is an &lt;a href="https://github.com/sa7mon/S3Scanner" rel="noopener noreferrer"&gt;open source tool&lt;/a&gt; designed for scanning and identifying security vulnerabilities in Amazon S3 buckets.&lt;/p&gt;

&lt;p&gt;S3Scanner supports many popular platforms, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  AWS (the subject platform of this article)&lt;/li&gt;
&lt;li&gt;  GCP&lt;/li&gt;
&lt;li&gt;  DigitalOcean&lt;/li&gt;
&lt;li&gt;  Linode&lt;/li&gt;
&lt;li&gt;  Scaleway&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also use S3Scanner with custom providers such as your own bespoke bucket solution. This makes it a versatile solution for various organizations.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Please note that for non-AWS services, S3Scanner currently only supports scanning for anonymous user permissions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The following command shows S3Scanner’s basic usage: it scans for the buckets listed in a file called &lt;code&gt;names.txt&lt;/code&gt; and enumerates their objects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ s3scanner -bucket-file names.txt -enumerate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following are some of &lt;a href="https://github.com/sa7mon/S3Scanner#features" rel="noopener noreferrer"&gt;S3Scanner’s key features&lt;/a&gt;:&lt;/p&gt;

&lt;h3&gt;
  Multithreaded Scanning
&lt;/h3&gt;

&lt;p&gt;S3Scanner uses &lt;a href="https://en.wikipedia.org/wiki/Multithreading_(computer_architecture)" rel="noopener noreferrer"&gt;multithreading&lt;/a&gt; to assess multiple S3 buckets concurrently, speeding up vulnerability detection. To specify the number of threads, pass the &lt;code&gt;-threads&lt;/code&gt; flag followed by the desired count.&lt;/p&gt;

&lt;p&gt;For instance, if you want to use ten threads, you’ll use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3scanner -bucket my_bucket -threads 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  Config File
&lt;/h3&gt;

&lt;p&gt;If you’re using flags that require config options like custom providers, you’ll need to create a &lt;a href="https://github.com/sa7mon/S3Scanner?tab=readme-ov-file#config-file" rel="noopener noreferrer"&gt;config file&lt;/a&gt;. To do so, create a file named &lt;code&gt;config.yml&lt;/code&gt; and put it in one of the following locations where S3Scanner will look for it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(current directory)
/etc/s3scanner/
$HOME/.s3scanner/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
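
&lt;p&gt;For instance, to set up the per-user location above, you can create the directory and an empty &lt;code&gt;config.yml&lt;/code&gt; to fill in later (a sketch; which keys you add depends on the flags you use):&lt;/p&gt;

```shell
# Create the per-user S3Scanner config directory and an empty config file.
# Keys such as db.uri or providers.custom can be added as needed.
mkdir -p "$HOME/.s3scanner"
touch "$HOME/.s3scanner/config.yml"
```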



&lt;h3&gt;
  Built-In and Custom Storage Provider Support
&lt;/h3&gt;

&lt;p&gt;As previously stated, S3Scanner seamlessly integrates with various providers. You can use the &lt;code&gt;-provider&lt;/code&gt; option to specify the object storage provider when checking buckets.&lt;/p&gt;

&lt;p&gt;For instance, if you use GCP, you’d use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3scanner -bucket my_bucket -provider gcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;To use a custom provider when working with a currently unsupported or local network storage provider, set the provider value to &lt;code&gt;custom&lt;/code&gt;, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3scanner -bucket my_bucket -provider custom
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Please note that when you’re working with a custom provider, you also need to set up config file keys under &lt;code&gt;providers.custom&lt;/code&gt;, as listed in the config file. Some examples include &lt;code&gt;address_style&lt;/code&gt;, &lt;code&gt;endpoint_format&lt;/code&gt;, and &lt;code&gt;insecure&lt;/code&gt;. Here’s an example of a custom provider config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# providers.custom required by `-provider custom`
#   address_style - Addressing style used by endpoints.
#     type: string
#     values: "path" or "vhost"
#   endpoint_format - Format of endpoint URLs. Should contain '$REGION' as placeholder for region name
#     type: string
#   insecure - Ignore SSL errors
#     type: boolean
# regions must contain at least one option
providers:
  custom: 
    address_style: "path"
    endpoint_format: "https://$REGION.vultrobjects.com"
    insecure: false
    regions:
      - "ewr1"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  Comprehensive Permission Analysis
&lt;/h3&gt;

&lt;p&gt;S3Scanner performs access scans by examining bucket permissions. It identifies misconfigurations in access controls, bucket policies, and permissions associated with each S3 bucket.&lt;/p&gt;

&lt;h3&gt;
  PostgreSQL Database Integration
&lt;/h3&gt;

&lt;p&gt;S3Scanner can save scan results directly to a &lt;a href="https://www.postgresql.org/" rel="noopener noreferrer"&gt;PostgreSQL&lt;/a&gt; database. This helps maintain a structured and easily accessible repository of vulnerabilities. Storing results in a database also enhances your ability to track historical data and trends.&lt;/p&gt;

&lt;p&gt;To save all scan results to a PostgreSQL database, you can use the &lt;code&gt;-db&lt;/code&gt; flag, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3scanner -bucket my_bucket -db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This option requires the &lt;code&gt;db.uri&lt;/code&gt; config file key in the &lt;code&gt;config&lt;/code&gt; file. This is what your &lt;code&gt;config&lt;/code&gt; file should look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Required by -db
db:
  uri: "postgresql://user:password@db.host.name:5432/schema_name"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  RabbitMQ Connection for Automation
&lt;/h3&gt;

&lt;p&gt;You can also integrate with &lt;a href="https://www.rabbitmq.com/" rel="noopener noreferrer"&gt;RabbitMQ&lt;/a&gt;, which is an open source message broker for automation purposes. This allows you to set up automated workflows triggered by scan results or schedule them for regular execution. Automated responses can include alerts, notifications, or further actions based on the identified vulnerabilities, ensuring proactive and continuous security.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;-mq&lt;/code&gt; flag is used to connect to a RabbitMQ server and consume messages containing the bucket names to scan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3scanner -mq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;-mq&lt;/code&gt; flag requires &lt;code&gt;mq.queue_name&lt;/code&gt; and &lt;code&gt;mq.uri&lt;/code&gt; keys to be set up in the &lt;code&gt;config&lt;/code&gt; file.&lt;/p&gt;
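
&lt;p&gt;A sketch of what those keys might look like in &lt;code&gt;config.yml&lt;/code&gt; (the queue name and URI below are placeholders, not values from the S3Scanner docs):&lt;/p&gt;

```yaml
# Required by -mq (placeholder values)
mq:
  queue_name: "buckets-to-scan"
  uri: "amqp://user:password@localhost:5672"
```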

&lt;h3&gt;
  Customizable Reporting
&lt;/h3&gt;

&lt;p&gt;With S3Scanner, you can generate reports tailored to your specific requirements. This flexibility ensures that you can communicate findings effectively and present information in a format that aligns with your organization’s reporting standards.&lt;/p&gt;

&lt;p&gt;For instance, you can use the &lt;code&gt;-json&lt;/code&gt; flag to output the scan results in JSON format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3scanner -bucket my-bucket -json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once the output is in JSON, you can pipe it to &lt;a href="https://jqlang.github.io/jq/" rel="noopener noreferrer"&gt;&lt;code&gt;jq&lt;/code&gt;&lt;/a&gt;, a command-line JSON processor, or other tools that accept JSON, and format the fields as needed.&lt;/p&gt;
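
&lt;p&gt;For example, a &lt;code&gt;jq&lt;/code&gt; filter can pull out a single field. The JSON below stands in for S3Scanner’s output; the field names are illustrative assumptions, not S3Scanner’s documented schema:&lt;/p&gt;

```shell
# Stand-in for `s3scanner -bucket my-bucket -json` output (illustrative schema)
scan_result='{"bucket": {"name": "my-bucket", "exists": true}}'

# Extract just the bucket name with jq
echo "$scan_result" | jq -r '.bucket.name'
```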

&lt;h2&gt;
  How S3Scanner Works
&lt;/h2&gt;

&lt;p&gt;To use S3Scanner, you need to install it on your system. The tool is available on &lt;a href="https://github.com/sa7mon/S3Scanner" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, and the installation instructions vary based on your platform. Supported platforms currently include Windows, macOS, several Linux distributions, and Docker.&lt;/p&gt;

&lt;p&gt;The installation steps for the various platforms and version numbers are shown below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Platform: Homebrew (macOS)

&lt;ul&gt;
&lt;li&gt;Version: v3.0.4&lt;/li&gt;
&lt;li&gt;Steps: &lt;code&gt;brew install s3scanner&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Platform: Kali Linux

&lt;ul&gt;
&lt;li&gt;Version: 3.0.0&lt;/li&gt;
&lt;li&gt;Steps: &lt;code&gt;apt install s3scanner&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Platform: Parrot OS

&lt;ul&gt;
&lt;li&gt;Version: –&lt;/li&gt;
&lt;li&gt;Steps: &lt;code&gt;apt install s3scanner&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Platform: BlackArch

&lt;ul&gt;
&lt;li&gt;Version: 464.fd24ab1&lt;/li&gt;
&lt;li&gt;Steps: &lt;code&gt;pacman -S s3scanner&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Platform: Docker

&lt;ul&gt;
&lt;li&gt;Version: v3.0.4&lt;/li&gt;
&lt;li&gt;Steps: &lt;code&gt;docker run ghcr.io/sa7mon/s3scanner&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Platform: Winget (Windows)

&lt;ul&gt;
&lt;li&gt;Version: v3.0.4&lt;/li&gt;
&lt;li&gt;Steps: &lt;code&gt;winget install s3scanner&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Platform: Go

&lt;ul&gt;
&lt;li&gt;Version: v3.0.4&lt;/li&gt;
&lt;li&gt;Steps: &lt;code&gt;go install -v github.com/sa7mon/s3scanner@latest&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Platform: Other (build from source)

&lt;ul&gt;
&lt;li&gt;Version: v3.0.4&lt;/li&gt;
&lt;li&gt;Steps: &lt;code&gt;git clone git@github.com:sa7mon/S3Scanner.git &amp;amp;&amp;amp; cd S3Scanner &amp;amp;&amp;amp; go build -o s3scanner .&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For instance, on a Windows system, you would use &lt;a href="https://github.com/microsoft/winget-cli" rel="noopener noreferrer"&gt;winget&lt;/a&gt; and run the following command: &lt;code&gt;winget install s3scanner&lt;/code&gt;. Your output would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Found S3Scanner [sa7mon.S3Scanner] Version 3.0.4
This application is licensed to you by its owner.
Microsoft is not responsible for, nor does it grant any licenses to, third-party packages.
Downloading https://github.com/sa7mon/S3Scanner/releases/download/v3.0.4/S3Scanner_Windows_x86_64.zip
  ██████████████████████████████  6.52 MB / 6.52 MB
Successfully verified installer hash
Extracting archive...
Successfully extracted archive
Starting package install...
Command line alias added: "S3Scanner"
Successfully installed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last line of the output shows that S3Scanner was successfully installed.&lt;/p&gt;

&lt;p&gt;If you’d rather not use any of the methods above, you can also install S3Scanner from the &lt;a href="https://pypi.org/" rel="noopener noreferrer"&gt;Python Package Index (PyPI)&lt;/a&gt;. To do so, search for S3Scanner on PyPI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmt5vy05xdr82a107w5x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmt5vy05xdr82a107w5x.png" alt="s3scanner on pip" width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And select the first option that appears (&lt;em&gt;ie&lt;/em&gt; S3Scanner):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffsgjmqvcc15d537aszmm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffsgjmqvcc15d537aszmm.png" alt="s3scanner pip" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Create and navigate to a directory of your choosing (&lt;em&gt;eg&lt;/em&gt; &lt;code&gt;s3scanner_directory&lt;/code&gt;) and run the command &lt;code&gt;pip install S3Scanner&lt;/code&gt; to install it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Please note that you need to have Python and &lt;a href="https://pypi.org/project/pip/" rel="noopener noreferrer"&gt;pip&lt;/a&gt; installed on your computer to be able to run the &lt;code&gt;pip&lt;/code&gt; command.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Your output looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Collecting S3Scanner
  Downloading S3Scanner-2.0.2-py3-none-any.whl (15 kB)
Requirement already satisfied: boto3&amp;gt;=1.20 in c:\python\python39\lib\site-packages (from S3Scanner) (1.34.2)
Requirement already satisfied: botocore&amp;lt;1.35.0,&amp;gt;=1.34.2 in c:\python\python39\lib\site-packages (from boto3&amp;gt;=1.20-&amp;gt;S3Scanner) (1.34.2)
Requirement already satisfied: jmespath&amp;lt;2.0.0,&amp;gt;=0.7.1 in c:\python\python39\lib\site-packages (from boto3&amp;gt;=1.20-&amp;gt;S3Scanner) (1.0.1)
Requirement already satisfied: s3transfer&amp;lt;0.10.0,&amp;gt;=0.9.0 in c:\python\python39\lib\site-packages (from boto3&amp;gt;=1.20-&amp;gt;S3Scanner) (0.9.0)
Requirement already satisfied: python-dateutil&amp;lt;3.0.0,&amp;gt;=2.1 in c:\python\python39\lib\site-packages (from botocore&amp;lt;1.35.0,&amp;gt;=1.34.2-&amp;gt;boto3&amp;gt;=1.20-&amp;gt;S3Scanner) (2.8.2)
Requirement already satisfied: urllib3&amp;lt;1.27,&amp;gt;=1.25.4 in c:\python\python39\lib\site-packages (from botocore&amp;lt;1.35.0,&amp;gt;=1.34.2-&amp;gt;boto3&amp;gt;=1.20-&amp;gt;S3Scanner) (1.26.18)
Requirement already satisfied: six&amp;gt;=1.5 in c:\python\python39\lib\site-packages (from python-dateutil&amp;lt;3.0.0,&amp;gt;=2.1-&amp;gt;botocore&amp;lt;1.35.0,&amp;gt;=1.34.2-&amp;gt;boto3&amp;gt;=1.20-&amp;gt;S3Scanner) (1.16.0)
Installing collected packages: S3Scanner
Successfully installed S3Scanner-2.0.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This confirms that S3Scanner was successfully installed. Note that PyPI provides the older 2.0.2 release, while the other installation methods provide the 3.x versions listed above.&lt;/p&gt;

&lt;h3&gt;
  Configure Scanning Parameters
&lt;/h3&gt;

&lt;p&gt;Before running any scans, you need to make sure everything is working and configure your scanning parameters.&lt;/p&gt;

&lt;p&gt;Run one of the following commands to make sure S3Scanner is configured correctly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3scanner -h
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3scanner --help
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should receive some information about the various options you can use when scanning buckets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;usage: s3scanner [-h] [--version] [--threads n] [--endpoint-url ENDPOINT_URL]
                 [--endpoint-address-style {path,vhost}] [--insecure]
                 {scan,dump} ...

s3scanner: Audit unsecured S3 buckets
           by Dan Salmon - github.com/sa7mon, @bltjetpack

optional arguments:
  -h, --help            show this help message and exit
  --version             Display the current version of this tool
  --threads n, -t n     Number of threads to use. Default: 4
  --endpoint-url ENDPOINT_URL, -u ENDPOINT_URL
                        URL of S3-compliant API. Default: https://s3.amazonaws.com
  --endpoint-address-style {path,vhost}, -s {path,vhost}
                        Address style to use for the endpoint. Default: path
  --insecure, -i        Do not verify SSL

mode:
  {scan,dump}           (Must choose one)
    scan                Scan bucket permissions
    dump                Dump the contents of buckets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you have the &lt;a href="https://aws.amazon.com/cli/" rel="noopener noreferrer"&gt;AWS Command Line Interface (AWS CLI)&lt;/a&gt; installed and have AWS credentials specified in the &lt;code&gt;.aws&lt;/code&gt; folder, S3Scanner will pick up these credentials when scanning. Otherwise, you have to install the &lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html" rel="noopener noreferrer"&gt;AWS CLI&lt;/a&gt; to be able to scan buckets in your environment:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4tirjxlkaxholr7b7wf6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4tirjxlkaxholr7b7wf6.png" alt="aws config" width="800" height="144"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  Run Scans and Interpret Results
&lt;/h3&gt;

&lt;p&gt;To run a scan, you run &lt;code&gt;s3scanner&lt;/code&gt; with a mode, such as &lt;code&gt;scan&lt;/code&gt; or &lt;code&gt;dump&lt;/code&gt;, and the name of the bucket. For example, to scan for permissions on a bucket called &lt;code&gt;my-bucket&lt;/code&gt;, you would run &lt;code&gt;s3scanner scan --bucket my-bucket&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This gives you a similar output to the following (the columns are delimited by the pipe character, &lt;strong&gt;|&lt;/strong&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my-bucket | bucket_exists | AuthUsers: [], AllUsers: []
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first portion of the output gives you the name of the bucket and tells you whether that bucket exists. The last portion shows the permissions granted to authenticated users (anyone with an AWS account) as well as to all users.&lt;/p&gt;

&lt;p&gt;Run a scan command for a bucket that is in your AWS environment, such as &lt;code&gt;ans3scanner-bucket&lt;/code&gt;, like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqwtsk742akhqanet3be.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqwtsk742akhqanet3be.png" alt="aws buckets" width="800" height="196"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You should get the following output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ans3scanner-bucket | bucket_exists | AuthUsers: [Read], AllUsers: []
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This output shows that the bucket has authenticated users granted &lt;code&gt;[Read]&lt;/code&gt; rights.&lt;/p&gt;
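
&lt;p&gt;Beyond checking permissions, the &lt;code&gt;dump&lt;/code&gt; mode listed in the help output downloads a bucket’s contents. A sketch of its usage (the &lt;code&gt;--dump-dir&lt;/code&gt; flag is an assumption for this version; confirm with &lt;code&gt;s3scanner dump -h&lt;/code&gt;):&lt;/p&gt;

```shell
# Download the contents of a readable bucket into a local directory
# (--dump-dir is assumed from this version's help; verify before use)
mkdir -p ./dumped-buckets
s3scanner dump --bucket my-bucket --dump-dir ./dumped-buckets/
```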

&lt;h4&gt;
  Scan Your GCP Buckets
&lt;/h4&gt;

&lt;p&gt;To test your GCP buckets, create a bucket in your GCP account and make sure it doesn’t have public access:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foybu22mzwpmnb2t4fhhq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foybu22mzwpmnb2t4fhhq.png" alt="gcp buckets" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Inside the bucket, add a text file:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5e2oal7ixtkvu0wc0p67.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5e2oal7ixtkvu0wc0p67.png" alt="google cloud storage" width="800" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To scan the bucket, run the previously mentioned command: &lt;code&gt;s3scanner -bucket s3scanner-demo -provider gcp&lt;/code&gt;. You have to provide the &lt;code&gt;-provider gcp&lt;/code&gt; flag to tell S3Scanner that you want to scan a GCP bucket. If you don’t provide this flag, S3Scanner uses AWS (the default option).&lt;/p&gt;

&lt;p&gt;Your output shows that a bucket exists:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;level=info msg="exists    | s3scanner-demo | default | AuthUsers: [] | AllUsers: []"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, change the GCP bucket access to “public” and grant all users &lt;a href="https://cloud.google.com/storage/docs/access-control/making-data-public" rel="noopener noreferrer"&gt;access&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wx5s5xp6zfj6ypvtlf8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wx5s5xp6zfj6ypvtlf8.png" alt="gcp" width="800" height="232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, scan the GCP bucket. Your output will show that the bucket is available to all users:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;level=info msg="exists    | s3scanner-demo | default | AuthUsers: [] | AllUsers: [READ, READ_ACP]"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  Best Practices for Remediation
&lt;/h3&gt;

&lt;p&gt;After you review the results of your scan, make sure to prioritize the identified issues based on their severity. Some common remediations are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Adjust bucket permissions:&lt;/strong&gt; You can restrict access to buckets by adjusting &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-policy-language-overview.html" rel="noopener noreferrer"&gt;permissions and policies&lt;/a&gt; to adhere to the principle of least privilege. Make sure to remove unnecessary public access and ensure that only authorized entities have the required permissions.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Regularly audit and monitor your S3 bucket configurations:&lt;/strong&gt; Establish a routine for auditing and monitoring your S3 bucket configurations. You can also set up alerts for any changes to permissions or policies, enabling timely detection and response to potential security incidents. Additionally, you can utilize tools and services such as &lt;a href="https://aws.amazon.com/config/" rel="noopener noreferrer"&gt;AWS Config&lt;/a&gt;, which helps you assess, audit, and evaluate the configuration of your resources. Moreover, &lt;a href="https://aws.amazon.com/premiumsupport/technology/trusted-advisor/" rel="noopener noreferrer"&gt;AWS Trusted Advisor&lt;/a&gt; helps inspect your environment and provides recommendations to improve security, performance, and cost.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Encrypt data:&lt;/strong&gt; Securing data through encryption involves implementing measures for both in transit and at rest. For data that is in transit, employing secure communication channels like &lt;a href="https://en.wikipedia.org/wiki/HTTPS" rel="noopener noreferrer"&gt;HTTPS&lt;/a&gt; during transfer ensures that information remains encrypted between clients and servers. On the server side, AWS S3 offers different &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingEncryption.html" rel="noopener noreferrer"&gt;options for encrypting data at rest&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
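
&lt;p&gt;As a concrete starting point for the first item, you can enable S3’s public access block from the AWS CLI (the bucket name is a placeholder; this is a sketch, not a complete hardening guide):&lt;/p&gt;

```shell
# Block all forms of public access on a single bucket (name is a placeholder)
aws s3api put-public-access-block \
  --bucket my-bucket \
  --public-access-block-configuration \
  BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
```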

&lt;h2&gt;
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, you learned about some of the common security risks associated with Amazon S3 buckets and how &lt;a href="https://github.com/sa7mon/S3Scanner" rel="noopener noreferrer"&gt;S3Scanner&lt;/a&gt; can help.&lt;/p&gt;

&lt;p&gt;S3Scanner is a valuable tool for anyone leveraging cloud storage through buckets because it helps you scan for vulnerabilities in your environment. With multithreaded scanning, comprehensive permission analysis, custom storage provider support, PostgreSQL database integration, and customizable reporting, S3Scanner is definitely worth exploring.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.aviator.co/merge-queue" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5wqgwrk9nbggrwv2wmu.png" width="800" height="97"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>vulnerabilities</category>
      <category>s3</category>
      <category>security</category>
    </item>
    <item>
      <title>The irrational fear of deployments</title>
      <dc:creator>Ibrahim Salami</dc:creator>
      <pubDate>Fri, 09 Aug 2024 21:38:41 +0000</pubDate>
      <link>https://dev.to/aviator_co/the-irrational-fear-of-deployments-5e8m</link>
      <guid>https://dev.to/aviator_co/the-irrational-fear-of-deployments-5e8m</guid>
      <description>&lt;p&gt;A &lt;a href="https://en.wikipedia.org/wiki/2024_CrowdStrike_incident" rel="noopener noreferrer"&gt;recent outage&lt;/a&gt; involving CrowdStrike impacted 8.5 million Windows operating systems, leading to disruptions in various global services, including airlines and hospitals. Multiple analyses have examined the root cause of this incident itself.&lt;/p&gt;

&lt;p&gt;However, as a software engineer, I think we are missing the human side of deployments, specifically the fear of breaking production. That’s what we’ll dive into in this article. We will cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Understanding the function of release engineering. &lt;/li&gt;
&lt;li&gt;  What software engineers care about and what they don’t.&lt;/li&gt;
&lt;li&gt;  Impact of continuous delivery (CD). &lt;/li&gt;
&lt;li&gt;  A look at manual deployments. &lt;/li&gt;
&lt;li&gt;  Problems with manual deployment and the solution to these problems.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Release Engineering&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before delving into the fear of deployments from a software engineer’s perspective, let’s first understand the role of a release engineer.&lt;/p&gt;

&lt;p&gt;Release engineering has evolved considerably in recent years, thanks to modern CI/CD tools and the standardization of Kubernetes. Despite these advancements, the primary responsibilities remain the same:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consistent and repeatable deployments:&lt;/strong&gt; Standardizing release processes reduces the risk of bad deployments to production.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reducing service disruptions&lt;/strong&gt;: Standardized processes also ensure teams are equipped to tackle harmful production environment incidents—for example, a rollback strategy for scenarios where a release causes problems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitor and optimize performance:&lt;/strong&gt; Look for performance improvements that make deployments faster and more reliable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Collaborate with engineering:&lt;/strong&gt; Work closely with developers, QA, and DevOps teams to ensure all new and existing services have a well-defined deployment process.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Software Engineers Care About&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Unlike release engineers, software engineers working on a product team may only care about certain aspects of deployments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Quick code merges:&lt;/strong&gt; Merging quickly allows them to validate their work and move on to new tasks or unblock dependent tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Production incidents&lt;/strong&gt;: Although engineers may not care about all production incidents, they definitely care about their code changes causing any production outages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deployment schedule&lt;/strong&gt;: Engineers also like to track when their changes go live or have gone live, so that they can have access to real-time feedback on their changes.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Software Engineers Don’t Care About&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Although there are things we care about, there are also those we don’t:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deployment methodology&lt;/strong&gt;: Although we recognize the need for an efficient and reliable deployment process, we don’t care how it is performed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Effect of other changes&lt;/strong&gt;: Unless things go wrong, we don’t worry about unrelated changes from other developers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deployment management&lt;/strong&gt;: We are indifferent to who manages deployments in a software team; we would only care about managing them if tasked with doing so.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Impact of Continuous Deployments (CD)&lt;/strong&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;So what does this fear have to do with continuous deployment?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A lot.&lt;/p&gt;

&lt;p&gt;Studies have demonstrated &lt;a href="https://dora.dev/capabilities/continuous-delivery" rel="noopener noreferrer"&gt;several benefits&lt;/a&gt; of Continuous Deployment (CD), and unsurprisingly, many of them are &lt;a href="https://en.wikipedia.org/wiki/Psychological_safety" rel="noopener noreferrer"&gt;psychological&lt;/a&gt; in nature. Continuous deployment removes the human-in-the-loop, so it requires strong trust in the test infrastructure.&lt;/p&gt;

&lt;p&gt;In other words, automated tests not only ensure the reliability of production but also provide &lt;a href="https://en.wikipedia.org/wiki/Psychological_safety" rel="noopener noreferrer"&gt;psychological safety&lt;/a&gt;, sometimes irrationally, reducing the fear of deployments. As a developer, I’m more comfortable making changes in a CD process than when I’m asked to verify the changes manually.&lt;/p&gt;

&lt;p&gt;However, despite the popularity of these CD strategies, a lot of companies still trigger deployments manually (keeping a human in the loop), indicating a cautious approach to CD implementations. This behavior suggests that teams prefer to retain supervision of the release process and intervene where necessary.&lt;/p&gt;

&lt;p&gt;This is important to understand from a psychological safety perspective. Manual deployments imply that someone is overseeing the process and handling issues when things go wrong. While this provides a sense of security, it can also induce fear in the person deploying and is prone to human error.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Manual deployments&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Despite the drawbacks, most teams manage deployments manually. A typical manual deployment setup includes some of the following practices:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Supervision&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Someone babysits the entire deployment process before a release goes out and is tasked with intervening at the first sign of trouble. Teams maintain an on-call person who manages their deployments and handles problems when they arise.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Dedicated Release Teams&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Some teams have a dedicated release engineering team that ensures releases go smoothly. Because this brings a high degree of specialization, the deployment process can be more efficient and reliable.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Spreadsheets&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Some companies maintain a spreadsheet to track and validate changes. This allows them to systematically review and approve each change, ensuring it meets predefined quality standards.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Manual QA&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In addition to spreadsheets, manual QA is another layer companies add. Manual QA tests new releases in staging environments before deploying them to production. However, a testing environment isn’t foolproof, so some real-life scenarios won’t be accounted for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Do Things Go Wrong With Manual Deployments?
&lt;/h2&gt;

&lt;p&gt;Many things can go wrong for any software development team relying solely on manual deployments: &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Dependence on a small group&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Relying on a small group creates bottlenecks, which lead to release delays and, in some instances, human error. A team can also run into trouble when one of these key people leaves or can’t deliver on the required tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;No risk-mitigation strategy&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Without one, there is no defined way to respond to an unfavorable production incident. When an incident happens, the release team has to scramble to find the relevant stakeholders to help resolve it and make decisions.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Prone to human error&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Typographical errors creep into commands or scripts, or someone forgets to run the pre-deployment or post-deployment steps.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;High effort&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Since deployments require babysitting, they become a time-consuming effort, which causes the frequency of deployments to drop significantly. For instance, if it takes an hour to monitor an entire deployment, the release team may decide to skip deployments on days with minor changes to save that time.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Communication Breakdown&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Product teams are left unclear about the state of releases and when their changes are getting into production.&lt;/p&gt;

&lt;p&gt;Looking at these challenges, it’s easy to understand why engineers dread deployments. The risk of deployment failures, the high stakes, and the pressure to keep downtime low also contribute to this fear. &lt;/p&gt;

&lt;p&gt;These failures can be minimized by increasing test automation. Still, since these tests run in a test environment, you should not expect automated tests to catch every possible error. Failures are to be expected, but at a reduced rate.&lt;/p&gt;

&lt;h2&gt;
  
  
  What can we do about it?
&lt;/h2&gt;

&lt;p&gt;Simply set up continuous deployment? Easier said than done. Despite the drawbacks, manual deployments are still okay if managed well. The goals should be to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  provide guardrails to avoid production incidents&lt;/li&gt;
&lt;li&gt;  reduce human errors&lt;/li&gt;
&lt;li&gt;  enable anyone to trigger deploys&lt;/li&gt;
&lt;li&gt;  ensure deployments happen frequently&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Guardrails – Canary and Rollbacks&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Canary and rollback strategies can help reduce the impact of an outage and, in many cases, avert the crisis automatically.&lt;/p&gt;

&lt;p&gt;A canary release exposes your new release to a small portion of production environment traffic. This gives teams insight into issues that might not have come up during testing. &lt;/p&gt;

&lt;p&gt;On the other hand, a rollback strategy helps engineers revert a release to its previous stable version when new problems arise after deployment to the production environment.&lt;/p&gt;
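&lt;p&gt;As a rough, tool-agnostic sketch (all names and thresholds below are hypothetical), the two guardrails boil down to a weighted routing decision and an automatic rollback check:&lt;/p&gt;

```python
import random

# Illustrative sketch only: route a small fraction of traffic to the canary
# release, and trip an automatic rollback when its error rate is too high.
CANARY_WEIGHT = 0.05          # 5% of requests hit the new release
ERROR_RATE_THRESHOLD = 0.02   # tolerate up to 2% errors on the canary

def pick_version(rng: random.Random) -> str:
    """Weighted routing decision for a single request."""
    return "canary" if CANARY_WEIGHT > rng.random() else "stable"

def should_rollback(canary_errors: int, canary_requests: int) -> bool:
    """Roll back when the canary's observed error rate crosses the threshold."""
    if canary_requests == 0:
        return False
    return canary_errors / canary_requests > ERROR_RATE_THRESHOLD
```

&lt;p&gt;For example, 12 errors across 300 canary requests is a 4% error rate, which exceeds the 2% threshold and trips the rollback.&lt;/p&gt;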

&lt;h3&gt;
  
  
  &lt;strong&gt;Reduce human errors – Standardization&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Define standard deployment methodologies that result in efficiency, consistency, reliability, and high software quality. In their &lt;a href="https://cloud.google.com/devops/state-of-devops" rel="noopener noreferrer"&gt;state of DevOps report&lt;/a&gt;, &lt;a href="https://dora.dev/" rel="noopener noreferrer"&gt;DORA&lt;/a&gt; shows that reliability predicts better operational performance. Furthermore, having a standardized process allows repeatability in release processes, which can be automated. Automating this process helps a team keep production costs lower. &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Democratize deployment process&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Democratizing the deployment process removes the reliance on specific individuals. If we empower any software engineer to deploy, it slowly reduces the fear: if anyone can deploy, it should not be too hard. Share your legos!&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Frequent deployments&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To reduce deployment anxiety, we need to deploy more frequently, not less. The DORA report also highlights that smaller batch deployments are less likely to cause issues and help lower the psychological barrier for developers.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Improve developer experience&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Clarifying what is being deployed enhances the developer experience. Make it easy for developers to know when deployments occur and what changes are included. This transparency helps developers track when their changes go live and simplifies incident investigations.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Defined risk-mitigation strategies&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;There should be defined steps to follow for rollbacks and hotfixes, as this helps eliminate any indecision during production incidents. For instance, there should be separate build and deploy steps for teams to follow for easy rollbacks.&lt;/p&gt;

&lt;p&gt;Similarly, standardizing how to deal with hotfixes and cherry-picks can make it simple to operate when the stakes are high.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Feature flags&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Feature flags act as kill switches that can turn off a new feature that caused an incident in production, enabling engineers to resolve such incidents quickly.&lt;/p&gt;
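&lt;p&gt;A minimal sketch of the kill-switch idea (hypothetical names; real teams usually back this with a feature-flag service or config store rather than an in-process dict):&lt;/p&gt;

```python
# Hypothetical minimal feature-flag kill switch.
FLAGS = {"new_checkout_flow": True}

def is_enabled(flag: str) -> bool:
    # Unknown flags default to off, so a missing entry can never enable code.
    return FLAGS.get(flag, False)

def checkout(order_total: float) -> str:
    if is_enabled("new_checkout_flow"):
        return f"new flow charged {order_total:.2f}"
    return f"legacy flow charged {order_total:.2f}"

# During an incident, an operator flips the flag off: no redeploy needed.
FLAGS["new_checkout_flow"] = False
```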

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Software teams must treat release engineering as a priority from the outset of product development to avoid costly mistakes. And we should not let incidents like the CrowdStrike outage cripple our development practices. Addressing the fear of deployment and preventing production incidents involves several key strategies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Invest in the standardization of deployment processes&lt;/li&gt;
&lt;li&gt;  Set up well-defined risk-mitigating strategies, such as canary releases, strategic rollouts, rollbacks, and hotfixes. &lt;/li&gt;
&lt;li&gt;  Simplify the developer experience by democratizing deployments, and encourage everyone to participate.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;a href="https://www.aviator.co/releases" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4e4r63b88mdpu1fcc696.png" width="800" height="97"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cicd</category>
      <category>dx</category>
    </item>
    <item>
      <title>Comparing Flux CD, Argo CD, and Spinnaker</title>
      <dc:creator>Ibrahim Salami</dc:creator>
      <pubDate>Fri, 26 Jul 2024 19:02:53 +0000</pubDate>
      <link>https://dev.to/aviator_co/comparing-flux-cd-argo-cd-and-spinnaker-5f4n</link>
      <guid>https://dev.to/aviator_co/comparing-flux-cd-argo-cd-and-spinnaker-5f4n</guid>
      <description>&lt;p&gt;Continuous delivery (CD) tools play a crucial role in modern software development workflows, enabling teams to automate the process of deploying applications. Among the available CD tools, Flux CD, Argo CD, and Spinnaker stand out for their unique features and capabilities. This article provides an in-depth comparison of these three tools. In it, we’ll explore their architectures, key features, integration capabilities, and ideal use cases, and we’ll go into each tool’s basic implementation.&lt;/p&gt;

&lt;p&gt;Comparing Flux CD, Argo CD, and Spinnaker is essential for organizations seeking the right CD tool to fit their specific requirements. By understanding the architectural differences, key features, and integration capabilities of each tool, teams can make informed decisions and optimize their deployment workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Brief introduction to Flux CD, Argo CD, and Spinnaker
&lt;/h2&gt;

&lt;p&gt;Flux CD, Argo CD, and Spinnaker are prominent players in the field of CD tools — each offers a unique approach to application deployment and management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Flux CD:&lt;/strong&gt; Flux CD, or Flux, is an open-source tool that follows the GitOps methodology, where the desired state of the system is version-controlled in Git repositories. It continuously monitors these repositories for changes and automatically applies them to the Kubernetes cluster.&lt;br&gt;
&lt;strong&gt;- Argo CD:&lt;/strong&gt; Argo CD is another open-source tool designed for Kubernetes-native continuous deployment. It utilizes declarative YAML manifests in a Git repository to define the desired application state and synchronizes that with the actual state in the Kubernetes cluster.&lt;br&gt;
&lt;strong&gt;- Spinnaker:&lt;/strong&gt; Spinnaker is a more comprehensive CD platform that provides support for multicloud deployments. It offers advanced features such as automated canary analysis and pipeline orchestration, making it suitable for complex deployment scenarios.&lt;/p&gt;
&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Flux CD&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Flux is constructed with &lt;a href="https://fluxcd.io/flux/components" rel="noopener noreferrer"&gt;GitOps Toolkit components&lt;/a&gt;. In the Flux ecosystem, those components are Flux Controllers, composable APIs, and reusable Go packages. They’re used for developing CD workflows on Kubernetes using GitOps principles.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fle4lgspxwpu6qs92tz13.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fle4lgspxwpu6qs92tz13.png" alt="Image description" width="800" height="525"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Key components of Flux CD include the source controller, which establishes a collection of Kubernetes entities, enabling cluster administrators and automated operators to manage Git and Helm repository tasks through a dedicated controller.&lt;/p&gt;

&lt;p&gt;You have the option of using the toolkit for expanding Flux capabilities and creating custom systems tailored for continuous delivery. A recommended starting point for this is &lt;a href="https://fluxcd.io/flux/gitops-toolkit/source-watcher" rel="noopener noreferrer"&gt;the source-watcher guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Argo CD&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Argo CD operates as a Kubernetes controller, continually monitoring active applications and comparing their existing operational state with the intended target state defined in a Git repository. Applications that do not match the desired state are flagged as out of sync. After that, Argo CD provides reporting and visualization of these disparities, offering options for automatic or manual synchronization to bring the operational state in line with the desired target state.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpo81topj0w12s9cztyv2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpo81topj0w12s9cztyv2.png" alt="Image description" width="743" height="708"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Any modifications made to the desired target state in the Git repository are automatically applied and reflected in the specified target environments (usually a Kubernetes cluster). All the changes made are also displayed in the Argo CD UI.&lt;/p&gt;

&lt;p&gt;This architecture ensures automated application deployment and lifecycle management, aligning with the GitOps pattern of using Git repositories as the source of truth for defining application states. Argo CD supports several ways of specifying Kubernetes manifests, including plain directories of YAML/JSON manifests, kustomize applications, Helm charts, and Jsonnet files.&lt;/p&gt;
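&lt;p&gt;As an illustration, a minimal Application manifest ties a Git repository path to a target cluster and namespace (the guestbook example below is a sketch; field values will vary per project):&lt;/p&gt;

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: guestbook
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/argoproj/argocd-example-apps.git
    targetRevision: HEAD
    path: guestbook
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```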

&lt;p&gt;Argo CD provides a CLI for automation and integration with CI pipelines, webhook integration with version control systems, and so on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spinnaker&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Spinnaker employs a microservices architecture comprising several components that interact to facilitate the deployment process. Core components of Spinnaker include the Deck UI for user interaction, the Gate API for authentication and authorization, and various cloud-specific Clouddriver services for interacting with cloud providers.&lt;/p&gt;

&lt;p&gt;The diagram below illustrates the interdependencies among microservices. The green rectangles denote “external” elements, such as the Deck UI, a single-page JavaScript application operating within your web browser. The gold rectangles signify Halyard components, which are utilized solely during the configuration of Spinnaker.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmikb6p4ecob0qatfsfij.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmikb6p4ecob0qatfsfij.png" alt="Image description" width="775" height="724"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Key features
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Flux CD&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- GitOps-based continuous delivery:&lt;/strong&gt; Flux CD leverages Git repositories as the source of truth for defining the desired state of the system.&lt;br&gt;
&lt;strong&gt;- Automated deployments:&lt;/strong&gt; Flux CD automates the deployment process based on changes detected in Git repositories.&lt;br&gt;
&lt;strong&gt;- Git repository synchronization:&lt;/strong&gt; Flux CD synchronizes Kubernetes resources with Git repositories, ensuring consistency between environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Argo CD&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Declarative GitOps application deployment:&lt;/strong&gt; Argo CD enables declarative application deployments using YAML manifests stored in Git repositories.&lt;br&gt;
&lt;strong&gt;- Rollback and version control:&lt;/strong&gt; Argo CD supports rollback functionality and maintains version control for application configurations.&lt;br&gt;
&lt;strong&gt;- SSO integration:&lt;/strong&gt; Argo CD provides integration with single sign-on (SSO) systems for authentication and access control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spinnaker&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Multi-cloud support:&lt;/strong&gt; Spinnaker offers native support for multiple cloud providers, allowing easy deployment across heterogeneous environments.&lt;br&gt;
&lt;strong&gt;- Automated canary analysis:&lt;/strong&gt; Spinnaker facilitates automated canary analysis for evaluating new versions of applications before pushing them to production.&lt;br&gt;
&lt;strong&gt;- Pipeline orchestration:&lt;/strong&gt; Spinnaker provides robust pipeline orchestration capabilities, enabling complex deployment workflows.&lt;/p&gt;
&lt;h2&gt;
  
  
  Integration and extensibility
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Flux CD&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Integration with Kubernetes and Helm:&lt;/strong&gt; Flux CD integrates easily with Kubernetes and Helm for managing containerized applications.&lt;br&gt;
&lt;strong&gt;- Extensibility through custom controllers:&lt;/strong&gt; Flux CD allows extending the Kubernetes API with custom resource definitions and validation webhooks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Argo CD&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Kubernetes native integration:&lt;/strong&gt; Argo CD is tightly integrated with Kubernetes, leveraging custom resource definitions (CRDs) for managing application deployments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spinnaker&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Integration with major cloud providers:&lt;/strong&gt; Spinnaker provides out-of-the-box integration with major cloud providers such as AWS, Google Cloud Platform (GCP), and Microsoft Azure.&lt;br&gt;
&lt;strong&gt;- Extensibility through custom stages and plugins:&lt;/strong&gt; It supports extensibility through custom stages and plugins, allowing users to integrate with additional services and tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use cases and best practices
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Flux CD&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Flux CD is suitable for small- to medium-scale Kubernetes deployments. It’s ideal for teams practicing GitOps methodologies, where the entire deployment process is managed through version-controlled Git repositories. It’s more flexible than Argo CD.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Argo CD&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Argo CD is good for DevOps teams looking for Kubernetes-native continuous deployment solutions. It’s recommended for CI/CD pipelines requiring declarative application definitions stored in Git repositories.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spinnaker&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Spinnaker is recommended for enterprises with complex, multi-cloud deployment requirements because of its robust multi-cloud support. It’s ideal for organizations needing advanced CD workflows, including canary deployments and automated analysis. It’s more flexible than Flux CD and Argo CD but harder to get started with.&lt;/p&gt;
&lt;h2&gt;
  
  
  Examples of how to use Flux CD, Argo CD, and Spinnaker
&lt;/h2&gt;

&lt;p&gt;This section will cover the basics of how to set up and use Flux CD, Argo CD, and Spinnaker — it’s meant to give you an idea of what you’re getting into before you implement a CD tool in a real project. To follow the steps, you should have a Kubernetes cluster running.&lt;/p&gt;
&lt;h2&gt;
  
  
  How to use Flux CD
&lt;/h2&gt;

&lt;p&gt;Using Flux CD involves setting up a Git repository to store your Kubernetes manifests and configuring Flux CD to synchronize these manifests with your Kubernetes cluster. Here’s a step-by-step guide:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Install Flux CD&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You need to install the Flux CLI to run Flux commands. On macOS and Linux with Bash, you can use the following command (other installation methods are covered in the &lt;a href="https://fluxcd.io/flux/installation/#install-the-flux-cli" rel="noopener noreferrer"&gt;CLI install documentation&lt;/a&gt;):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;curl -s https://fluxcd.io/install.sh | sudo bash&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You can check whether it installed properly with the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;flux check --pre # use sudo if you get error like "connection refused"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Configure GitHub credentials&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Flux needs your GitHub credentials in order to log in and perform some actions on your repository. Export your GitHub personal access token and username:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;export GITHUB_TOKEN=&amp;lt;your-personal-access-token&amp;gt;&lt;br&gt;
export GITHUB_USER=&amp;lt;your-github-username&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Install Flux CD onto your cluster&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The flux bootstrap github command installs the Flux controllers on a Kubernetes cluster and configures them to synchronize the cluster’s state with a Git repository. It also uploads the Flux manifests to the Git repository and sets up Flux CD to automatically update itself based on changes in the Git repository.&lt;/p&gt;

&lt;p&gt;To do this, run the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;echo $GITHUB_TOKEN | flux bootstrap github \&lt;br&gt;
--owner=$GITHUB_USER \&lt;br&gt;
--repository=&amp;lt;repository-name&amp;gt; \&lt;br&gt;
--branch=main \&lt;br&gt;
--path=./flux-clusters \&lt;br&gt;
--personal \&lt;br&gt;
--private=false&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The bootstrap command above does the following:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creates a Git repository (in my case, flux-test-app) on your GitHub account.&lt;/li&gt;
&lt;li&gt;Adds Flux component manifests to the repository.&lt;/li&gt;
&lt;li&gt;Deploys Flux components to your Kubernetes cluster. You can run kubectl get all -n flux-system to check out the components.&lt;/li&gt;
&lt;li&gt;Configures Flux components to track the path /flux-clusters in the repository.&lt;/li&gt;
&lt;li&gt;The --private=false flag is used to create a public repository.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your output will look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhj8x1ev2jhbmzao7t203.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhj8x1ev2jhbmzao7t203.png" alt="Image description" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Add Podinfo repository to Flux CD (or any repository you want)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, clone the repository you created (in my case, flux-test-app) to your local machine:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;git clone https://github.com/$GITHUB_USER/flux-test-app&lt;br&gt;
cd flux-test-app&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now run the following to create a &lt;a href="https://fluxcd.io/flux/components/source/gitrepositories" rel="noopener noreferrer"&gt;GitRepository&lt;/a&gt; manifest pointing to the &lt;a href="http://github.com/stefanprodan/podinfo" rel="noopener noreferrer"&gt;github.com/stefanprodan/podinfo&lt;/a&gt; master branch. Podinfo is a web application written in Go.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;flux create source git podinfo \&lt;br&gt;
--url=https://github.com/stefanprodan/podinfo \&lt;br&gt;
--branch=master \&lt;br&gt;
--interval=2m \&lt;br&gt;
--export &amp;gt; ./flux-clusters/podinfo-source.yaml&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In the command above:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A GitRepository named podinfo is created.&lt;/li&gt;
&lt;li&gt;The source-controller checks the Git repository every two minutes, as indicated by the --interval flag.&lt;/li&gt;
&lt;li&gt;It clones the master branch of the &lt;a href="https://github.com/stefanprodan/podinfo" rel="noopener noreferrer"&gt;https://github.com/stefanprodan/podinfo&lt;/a&gt; repository.&lt;/li&gt;
&lt;li&gt;When the current GitRepository revision differs from the latest fetched revision, a new Artifact is archived.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After the command is run, you should have the corresponding file podinfo-source.yaml.&lt;/p&gt;
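&lt;p&gt;For reference, the exported podinfo-source.yaml should look roughly like this (the exact apiVersion may differ between Flux releases):&lt;/p&gt;

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: podinfo
  namespace: flux-system
spec:
  interval: 2m0s
  ref:
    branch: master
  url: https://github.com/stefanprodan/podinfo
```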

&lt;p&gt;&lt;strong&gt;Step 5: Deploy the podinfo application using GitOps&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Configure Flux CD to build and apply the &lt;a href="https://github.com/stefanprodan/podinfo/tree/master/kustomize" rel="noopener noreferrer"&gt;kustomize&lt;/a&gt; directory located in the podinfo repository. This directory contains the Kubernetes deployment files.&lt;/p&gt;

&lt;p&gt;Use the following flux create command to create a &lt;a href="https://fluxcd.io/flux/components/kustomize/kustomizations/" rel="noopener noreferrer"&gt;Kustomization&lt;/a&gt; that applies the podinfo deployment:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;flux create kustomization podinfo \&lt;br&gt;
--target-namespace=default \&lt;br&gt;
--source=podinfo \&lt;br&gt;
--path="./kustomize" \&lt;br&gt;
--prune=true \&lt;br&gt;
--wait=true \&lt;br&gt;
--interval=10m \&lt;br&gt;
--retry-interval=2m \&lt;br&gt;
--export &amp;gt; ./flux-clusters/podinfo-kustomization.yaml&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In the command above:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Flux GitRepository named podinfo is created that clones the master branch and makes the repository content available as an Artifact inside the cluster.&lt;/li&gt;
&lt;li&gt;A Flux Kustomization named podinfo is created that watches the GitRepository for Artifact changes.&lt;/li&gt;
&lt;li&gt;The Kustomization builds the YAML manifests located at the path specified by --path="./kustomize", validates the objects against the Kubernetes API, and applies them on the cluster.&lt;/li&gt;
&lt;li&gt;The --interval=10m flag sets the Kustomization to run a server-side dry-run every ten minutes to detect and correct drift inside the cluster.&lt;/li&gt;
&lt;li&gt;The --retry-interval=2m flag specifies the interval (two minutes) at which to retry a failed reconciliation.&lt;/li&gt;
&lt;li&gt;When the Git revision changes, the manifests are reconciled automatically. If previously applied objects are missing from the current revision, they are deleted from the cluster when pruning is enabled with --prune=true.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After the command is run, you should have the corresponding file podinfo-kustomization.yaml.&lt;/p&gt;
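&lt;p&gt;For reference, the exported podinfo-kustomization.yaml should look roughly like this (the exact apiVersion may differ between Flux releases):&lt;/p&gt;

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: podinfo
  namespace: flux-system
spec:
  interval: 10m0s
  retryInterval: 2m0s
  targetNamespace: default
  sourceRef:
    kind: GitRepository
    name: podinfo
  path: ./kustomize
  prune: true
  wait: true
```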

&lt;p&gt;Now commit and push the manifests to the repository:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;git add -A &amp;amp;&amp;amp; git commit -m "Add podinfo manifests"&lt;br&gt;
git push&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;After about ten minutes, your application should be running on your cluster. You can check with the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo kubectl -n default get deployments,services&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fct11uejvkin2g0n9k19s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fct11uejvkin2g0n9k19s.png" alt="Image description" width="800" height="116"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to use Argo CD
&lt;/h2&gt;

&lt;p&gt;To use Argo CD, you typically install Argo CD onto your Kubernetes cluster, deploy your applications to Kubernetes, configure Argo CD to watch your application manifests in a Git repository, and then let Argo CD synchronize the desired state of your applications with the actual state running in your cluster.&lt;/p&gt;

&lt;p&gt;Here’s a basic guide to get started:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Install Argo CD onto your Kubernetes cluster&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can install Argo CD using Kubernetes manifests. Below is an example of how you can install Argo CD using kubectl:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Also install the &lt;a href="https://argo-cd.readthedocs.io/en/stable/cli_installation" rel="noopener noreferrer"&gt;Argo CD CLI&lt;/a&gt; to run the argocd commands in later steps.&lt;/p&gt;

&lt;p&gt;Now change the argocd-server service type to LoadBalancer with the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl patch svc argocd-server -n argocd -p '{"spec": {"type": "LoadBalancer"}}'&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Access the Argo CD UI&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once it’s installed, you can access the Argo CD UI via a port forward or by exposing the service externally. Here’s how to port forward:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl port-forward svc/argocd-server -n argocd 8080:443&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You can then access the Argo CD UI by navigating to &lt;a href="https://localhost:8080" rel="noopener noreferrer"&gt;https://localhost:8080&lt;/a&gt; in your web browser (you may need to accept a self-signed certificate warning).&lt;/p&gt;

&lt;p&gt;The initial password for the admin (login username) account is automatically generated and saved as plain text in the password field within a secret named argocd-initial-admin-secret in your Argo CD installation namespace. To easily obtain this password, you can run the following argocd admin command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;argocd admin initial-password -n argocd&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Using the username admin and the password from above, log in to Argo CD’s host:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;argocd login localhost:8080&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Creating an app on Kubernetes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, you need to set the current namespace from default to argocd by running the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl config set-context --current --namespace=argocd&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Next, deploy a sample application to the Kubernetes cluster using YAML manifests. This manifest is on &lt;a href="https://github.com/khabdrick/argocd-example-apps.git" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; so you can check out the content. &lt;/p&gt;

&lt;p&gt;Create the example application with the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;argocd app create guestbook --repo https://github.com/khabdrick/argocd-example-apps.git --path . --dest-server https://kubernetes.default.svc --dest-namespace default&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If you’re using a different repository, update the &lt;code&gt;--repo&lt;/code&gt; and &lt;code&gt;--path&lt;/code&gt; values in the command as appropriate.&lt;/p&gt;

&lt;p&gt;In the Argo CD UI, you will see that your app has been deployed and synchronized successfully.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb9ia75xv57p3s63e03kg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb9ia75xv57p3s63e03kg.png" alt="Image description" width="800" height="582"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Argo CD will now start monitoring the Git repository for changes and automatically synchronize the application to the desired state specified in the manifests. By default, Argo CD polls the repository about every three minutes, so it can take a few minutes for changes in the repository to be detected and applied.&lt;/p&gt;
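&lt;p&gt;Automatic syncing can also be enabled declaratively through a &lt;code&gt;syncPolicy&lt;/code&gt; block on the &lt;code&gt;Application&lt;/code&gt; resource. A sketch of the relevant fragment (&lt;code&gt;prune&lt;/code&gt; and &lt;code&gt;selfHeal&lt;/code&gt; are optional fields of the Argo CD &lt;code&gt;Application&lt;/code&gt; spec):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  syncPolicy:
    automated:
      prune: true     # delete resources that were removed from Git
      selfHeal: true  # revert manual changes made directly in the cluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;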

&lt;p&gt;This is a basic guide to get started with Argo CD. Depending on your specific use case and requirements, you may need to explore more advanced features and configurations. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to use Spinnaker&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To install Spinnaker, you need &lt;a href="https://spinnaker.io/docs/reference/halyard/" rel="noopener noreferrer"&gt;Halyard&lt;/a&gt;. Halyard is a tool used to configure and manage Spinnaker deployments. This section outlines the process of setting up Spinnaker with a MySQL database on Kubernetes. We’ll start by running Halyard in a Docker container.&lt;/p&gt;

&lt;p&gt;Note: For this section, I will use a Kubernetes cluster from &lt;a href="https://docs.docker.com/desktop/" rel="noopener noreferrer"&gt;Docker Desktop&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setting up a MySQL database&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To begin, deploy a MySQL-compatible database on Kubernetes using the MariaDB Docker image. (The password below is for demonstration only; use a more secure one in practice.)&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl run mysql --image=mariadb:10.2 --env="MYSQL_ROOT_PASSWORD"="123" --env="MYSQL_DATABASE"="front50"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This command creates a MySQL instance named mysql, setting the root password and creating a database named front50. This will be used to configure &lt;a href="https://spinnaker.io/docs/setup/productionize/persistence/front50-sql" rel="noopener noreferrer"&gt;Front50&lt;/a&gt;. Front50 serves as the persistent storage and retrieval mechanism for Spinnaker’s pipeline configurations, application details, and other metadata.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuring Halyard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Next, we configure Halyard by creating a container that runs Halyard:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run --name halyard --rm \
  -v ~/.kube:/home/spinnaker/.kube \
  -it us-docker.pkg.dev/spinnaker-community/docker/halyard:stable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;In another terminal window, enter the Halyard container:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker exec -it halyard bash&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Once inside the Halyard container, configure the Spinnaker version:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hal config version
hal config version edit --version &amp;lt;version&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Enable Kubernetes as a &lt;a href="https://spinnaker.io/docs/setup/install/providers/#:~:text=In%20Spinnaker%2C%20providers%20are%20integrations,your%20applications%20via%20those%20accounts." rel="noopener noreferrer"&gt;provider&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;hal config provider kubernetes enable&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Add a Kubernetes account; docker-desktop in the command below is the context of the cluster running on Docker Desktop:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;hal config provider kubernetes account add my-account --context docker-desktop&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now associate your Kubernetes account (my-account) with Halyard:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;hal config deploy edit --type distributed --account-name my-account&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Configure storage using Redis. This will be changed later, since Halyard doesn’t allow setting MySQL directly:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;hal config storage edit --type redis&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now enable artifacts. The Artifacts feature in Spinnaker allows the system to manage and deploy artifacts (such as Docker images, JAR files, and Debian packages) as part of your deployment pipelines: &lt;/p&gt;

&lt;p&gt;&lt;code&gt;hal config features edit --artifacts true&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuring Spinnaker to use MySQL&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Next, you have to configure Spinnaker to use the MySQL database. Create the /home/spinnaker/.hal/default/profiles/front50-local.yml file and insert the following configurations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sql:
  enabled: true
  connectionPools:
    default:
      default: true
      jdbcUrl: jdbc:mysql://MYSQL_IP_ADDRESS:3306/front50
      user: root
      password: 123
  migration:
    user: root
    password: 123
    jdbcUrl: jdbc:mysql://MYSQL_IP_ADDRESS:3306/front50
spinnaker:
  redis:
    enabled: false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace MYSQL_IP_ADDRESS with the appropriate IP address. Also make sure that other credentials match with what you used to deploy MySQL earlier.&lt;/p&gt;

&lt;p&gt;You can get the MySQL pod’s IP by running the following command (outside the Halyard container):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl get pods -o wide&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Apply the deployment (inside the Halyard container). This command applies the changes made to the Spinnaker configuration and deploys or updates Spinnaker in the target environment:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;hal deploy apply&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now check whether all the pods are up and running:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl get pods -n spinnaker&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We need the &lt;code&gt;spin-deck&lt;/code&gt; and &lt;code&gt;spin-gate&lt;/code&gt; pods to be running so we can access the Spinnaker UI. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmgut9xnbt27v3vlz4oxx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmgut9xnbt27v3vlz4oxx.png" alt="Image description" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we can port-forward the deck and gate pods so that we can access them in the browser. Forward the deck pod with the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl -n spinnaker port-forward &amp;lt;spin-deck-pod-name&amp;gt; 9000&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In another terminal, forward the gate pod:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl -n spinnaker port-forward &amp;lt;spin-gate-pod-name&amp;gt; 8084&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now you can access the Spinnaker UI at &lt;a href="http://localhost:9000/" rel="noopener noreferrer"&gt;http://localhost:9000/&lt;/a&gt; and start developing your pipelines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fym1j00beoxht09aqtdak.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fym1j00beoxht09aqtdak.png" alt="Image description" width="800" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Flux CD, Argo CD, and Spinnaker offer distinct advantages and cater to different use cases within the realm of continuous delivery. By evaluating their architectures, features, and integrations, you can make informed decisions about the best way to automate your deployment and delivery processes.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
