DEV Community: Jean

How to scan local files for secrets in python using the GitGuardian API

Jean — Mon, 29 Jun 2020 15:54:49 +0000

Do you know how many secrets, like API keys or credentials, are hidden in your local files? Today, we're going to show you how you can scan files and directories for sensitive information like secrets. To accomplish this, we'll use the GitGuardian API and python wrapper. By the end of this tutorial you will understand how the API works so you can start building your own custom secrets detection scripts.

What our script will do

We will create a python script that will scan all files within a local directory for secrets. To do this we will be using the GitGuardian API and the API python wrapper. We recommend reviewing these resources before starting.

Our script will:

Detect secrets and other policy breaks from your file directory.
Print the filename, policy break and matches for all policy breaks found.
Output the result to a JSON format.

Getting set up

Before we get started writing our script, let's get the necessary components set up.

Installing the GitGuardian python API client

Install the GitGuardian python API client using ‘pip’, the de facto package manager for python.

In your terminal or command line execute the command:

pip3 install --upgrade pygitguardian

Obtaining the GitGuardian API token

Sign up for a free developer account from GitGuardian using your GitHub account or email at https://dashboard.gitguardian.com.

From the menu, navigate to the ‘API’ tab, scroll to ‘Generate new API key’ and select ‘Create new API key’. Make sure you give it an appropriate name.

You will not be able to view the API key again so make sure you immediately copy it to your clipboard before navigating away.

Setting up the directory and files

Open your terminal or command line.

Create a new directory in the location where you wish to save your script:

mkdir directory-scan

Enter the directory using:

cd directory-scan

Setting environment variables

As sticklers for good coding practices, we will use environment variables in this tutorial to store our API token (rule number one, never hardcode secrets in source code!).

I recommend using a tool called python-dotenv which will allow you to import environment variables, or you can set the API token in your console.

Create a new file called .env:

touch .env

Open this file in your chosen text editor and create a variable to store our API token:

GG_API_KEY=**INSERT API TOKEN**

Writing our script

Importing the modules and setting up our client

Next let's create a file called directory_scan.py:

touch directory_scan.py

Open this file with your chosen text editor.

First we need to import the modules we need:


import glob
import os
import sys
import traceback

These are all standard modules. Of these, ‘glob’ is the one that will allow us to get the path and file names of all the files within our directory.

Importing environment variables

If you are using python-dotenv then we need to load in our API token from our .env file and use the API token within:


from dotenv import load_dotenv
load_dotenv()
API_KEY = os.getenv("GG_API_KEY")

Now, thanks to load_dotenv(), you’ll be able to retrieve the GG_API_KEY this way and store it in a variable.
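If the .env file is missing or the variable name is misspelled, os.getenv quietly returns None and the problem only surfaces later. Below is a minimal sketch of an early check that uses only the standard library. The helper name require_api_key is hypothetical and is not part of python-dotenv or pygitguardian.

```python
import os


def require_api_key(env=os.environ, name="GG_API_KEY"):
    """Return the API token stored under `name`, or raise if it is missing or blank."""
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to .env or export it in your shell")
    return value


# Demonstration with a plain dictionary standing in for the real environment.
print(require_api_key({"GG_API_KEY": "example-token"}))
```

In the script, you could call require_api_key() with no arguments right after load_dotenv() so a missing token stops the run with a readable message.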

Importing our GitGuardian API client modules

Next load the GitGuardian API client.

from pygitguardian import GGClient
from pygitguardian.config import MULTI_DOCUMENT_LIMIT

‘GGClient’ is the core class of our API client. It will handle the data we are scanning, send it to the GitGuardian scanning engine and receive the results.

The GitGuardian API only allows a maximum package of 20 files or a total size of 2MB for each request to allow for asynchronous scanning. The MULTI_DOCUMENT_LIMIT constant we import holds the per-request file limit so we don’t send invalid requests to the server.

This does not mean you can only scan 20 files at a time. Our script will handle this by breaking the files into ‘chunks’ that meet the maximum API requirements and send multiple requests before collating the information at the end.

Now we just need to initialize the GGClient by attaching our API key:

# Initializing GGClient
client = GGClient(api_key=API_KEY)

Loading files into an array

We now need to load in all the files and file paths within our current directory. Our script scans recursively from the working directory (the directory from which the script is called):

# Create a list of dictionaries for scanning
to_scan = []
for name in glob.glob("**/*", recursive=True):
    # Skip the .env file and directories, which cannot be opened as files
    if ".env" in name or os.path.isdir(name):
        continue
    with open(name) as fn:
        to_scan.append({"document": fn.read(), "filename": os.path.basename(name)})

The ‘glob’ module allows us to create a list of files and path names that we will add into an array called to_scan so we can scan them.

We are also going to add an if statement that excludes our .env file and skips any folders, since trying to open a folder as a file would raise an error (the files within the folders will still be added).

This will scan files recursively (from the working directory the script is run from), but if you want to scan a different directory you can add it in the file path.

Example (forward slashes avoid backslash escape problems in python strings):
for name in glob.glob("users/user/documents/**", recursive=True):

If you want to check that your code is working so far, add ‘print(to_scan)’ on a new line. You should get a list of all the files and their contents within your current directory. Comment it out or remove it before continuing.
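A directory often contains binary files (images, archives) that cannot be read as text, and opening them in text mode raises an error. The sketch below is one way to build the same list while skipping those files. The function name collect_documents is hypothetical, and reading as UTF-8 is an assumption rather than something the API requires. Note that glob does not match names beginning with a dot by default, so hidden files such as .env are not returned by this pattern.

```python
import glob
import os
import tempfile


def collect_documents(root):
    """Return a list of {"document", "filename"} dictionaries for text files under root."""
    documents = []
    pattern = os.path.join(root, "**", "*")
    for path in sorted(glob.glob(pattern, recursive=True)):
        if os.path.isdir(path):
            continue  # directories cannot be opened as files
        try:
            with open(path, encoding="utf-8") as handle:
                content = handle.read()
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files instead of crashing
        documents.append({"document": content, "filename": os.path.basename(path)})
    return documents


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as demo_root:
    os.makedirs(os.path.join(demo_root, "sub"))
    with open(os.path.join(demo_root, "sub", "notes.txt"), "w", encoding="utf-8") as f:
        f.write("hello")
    with open(os.path.join(demo_root, "image.bin"), "wb") as f:
        f.write(b"\xff\xfe\x00\x80")  # bytes that are not valid UTF-8
    demo_documents = collect_documents(demo_root)

print([d["filename"] for d in demo_documents])
```

Only the text file ends up in the list; the directory and the binary file are skipped.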

Processing the files in ‘chunks’ and making the API request

As previously mentioned the API will only accept 20 files per request with a maximum of 1MB per file. So we are going to break up our files into acceptable chunks to send as a request:

# Process in a chunked way to avoid passing the multi document limit
to_process = []
for i in range(0, len(to_scan), MULTI_DOCUMENT_LIMIT):
   chunk = to_scan[i : i + MULTI_DOCUMENT_LIMIT]
   try:
       scan = client.multi_content_scan(chunk)
   except Exception as exc:
       # Handle exceptions such as schema validation
       traceback.print_exc(2, file=sys.stderr)
       print(str(exc))
       # There is no result to read for this chunk, so stop here
       break
   if not scan.success:
       print("Error scanning some files. Results may be incomplete.")
       print(scan)
       # Stop rather than read results from a failed response
       break
   to_process.extend(scan.scan_results)

First, create an empty array to hold the scan results from the API. We call this array to_process.

We are going to loop through our to_scan array containing our file paths and break them into chunks. To do this we are using a ‘range’ function which we will pass a start value, end value and stepping value.

range(start_value, end_value, step)

We are going to load the current values of our array into a variable called ‘chunk’.
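To see how the range and slice combination splits a list, here is a small standalone sketch. The helper name chunked and the placeholder file names are made up for illustration, and the limit of 20 is the per-request file limit mentioned in this tutorial.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start : start + size]


# 45 placeholder file names split with a limit of 20.
file_names = [f"file_{n}.txt" for n in range(45)]
chunk_sizes = [len(chunk) for chunk in chunked(file_names, 20)]
print(chunk_sizes)  # [20, 20, 5]
```

The last chunk is simply shorter, because slicing past the end of a list returns whatever is left.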

Using a try block, we will scan our current chunk using the multi content scan command of the GG API client.

Of course we need to handle any exceptions where the scan will fail, for example if the filename is too long for our schema.

The traceback will show the exact line where it failed.

Let's add in a message in the scenario our scan fails.

Finally we are going to append our scan results to our array ‘to_process’.

FAQ: If I need to scan 200 files, will this count as 1 or 10 API requests in my dashboard? It will count as 10 but don’t worry you have 1,000 API requests a month.
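The arithmetic behind this answer is a ceiling division. A quick sketch, assuming 20 files per request as described above (the function name requests_needed is made up for illustration):

```python
import math


def requests_needed(file_count, files_per_request=20):
    """Number of API requests needed when each request carries at most files_per_request files."""
    return math.ceil(file_count / files_per_request)


print(requests_needed(200))  # 10
print(requests_needed(201))  # 11
```

One extra file beyond a multiple of 20 costs one extra request.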

Printing results

Now we will loop through our results. If a policy break is detected, it will be captured by the .has_policy_breaks tag. If this is true, we will print that result:

# Printing the results
for i, scan_result in enumerate(to_process):
   if scan_result.has_policy_breaks:
       print(f"{to_scan[i]['filename']}: {scan_result.policy_break_count} break/s found")

Code Checkpoint 1

We are ready to run our first scan so let's quickly make sure our code is the same.

import glob
import os
import sys
import traceback
from dotenv import load_dotenv
load_dotenv()
API_KEY = os.getenv("GG_API_KEY") 

from pygitguardian import GGClient
from pygitguardian.config import MULTI_DOCUMENT_LIMIT

# Initializing GGClient
client = GGClient(api_key=API_KEY) 

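# Create a list of dictionaries for scanning
to_scan = []
for name in glob.glob("**/*", recursive=True):
    # Skip the .env file and directories, which cannot be opened as files
    if ".env" in name or os.path.isdir(name):
        continue
    with open(name) as fn:
        to_scan.append({"document": fn.read(), "filename": os.path.basename(name)})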

# Process in a chunked way to avoid passing the multi document limit
to_process = []
for i in range(0, len(to_scan), MULTI_DOCUMENT_LIMIT):
   chunk = to_scan[i : i + MULTI_DOCUMENT_LIMIT]
   try:
       scan = client.multi_content_scan(chunk)
   except Exception as exc:
       # Handle exceptions such as schema validation
       traceback.print_exc(2, file=sys.stderr)
       print(str(exc))
       # There is no result to read for this chunk, so stop here
       break
   if not scan.success:
       print("Error scanning some files. Results may be incomplete.")
       print(scan)
       # Stop rather than read results from a failed response
       break
   to_process.extend(scan.scan_results)

# Printing the results
for i, scan_result in enumerate(to_process):
   if scan_result.has_policy_breaks:
       print(f"{to_scan[i]['filename']}: {scan_result.policy_break_count} break/s found")

Running the script

We are now ready to run our first directory scan.

You can download some example files that contain expired secrets here so you can test your script.

Move the directory_scan.py file into the directory you want to scan.

Open your terminal, navigate to the directory and run the command:

python3 directory_scan.py

Congratulations, you just scanned your directory for policy breaks!

After your script has run, you will receive feedback with the number of policy breaks that have been found.

main.py: 1 break/s found
sample.yaml: 1 break/s found

Now we know which files have policy breaks.

But we don't know if the policy break is a secret and we don't know what kind of secret it is. So next we will add some additional detail into our results.

Displaying additional information

Including policy break type and matches

Now that we have detected policy breaks, we may wish to know which ones have been detected. For example, was it a Slack token, an AWS key or a filename policy that was broken?

Let's add a line in the output that tells us which policies have been broken.

You can find more information on policy breaks in the GitGuardian dashboard.

# Printing the results
for i, scan_result in enumerate(to_process):
   if scan_result.has_policy_breaks:
       print(f"{to_scan[i]['filename']}: {scan_result.policy_break_count} break/s found")
       # Printing policy break type
       for policy_break in scan_result.policy_breaks:
           print(f"\t{policy_break.break_type}:")

Now we are going to add a nested loop within our previous loop and for each policy break, we are going to use the break_type tag in the GG client to print the type of policy break that has occurred (in other words, the type of secret, filename or extension that has triggered the alert).

Now if we run our script again, we will get the same results, but this time we will also get the name of the policy break next to each file.

    main.py: 1 break/s found
        AWS Key: 
    sample.yaml: 1 break/s found
        Google API Key:

Adding matches

It is not always appropriate to print the matches we find, but in the case of this example we are going to do just that.

We are going to create another for loop, again nested. This time we are going to read the ‘match’ tag from the GG client and print it. This will give us the content that triggered our policy break.

# Printing the results
for i, scan_result in enumerate(to_process):
   if scan_result.has_policy_breaks:
       print(f"{to_scan[i]['filename']}: {scan_result.policy_break_count} break/s found")
       # Printing policy break type
       for policy_break in scan_result.policy_breaks:
           print(f"\t{policy_break.break_type}:")
           # Printing matches
           for match in policy_break.matches:
               print(f"\t\t{match.match_type}:{match.match}")

Let's run this again. We should now get:

File name and number of policy breaks.
Policy break types (secrets if any).

    main.py: 1 break/s found
        AWS Key: *********************************
    sample.yaml: 1 break/s found
        Google API Key: *****************************
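Since it is not always appropriate to print a match in full, one option is to mask most of it before printing. This is a minimal sketch; the helper name mask_secret and the choice of keeping four characters visible are assumptions for illustration, not behavior of the GitGuardian client.

```python
def mask_secret(value, visible=4):
    """Replace all but the last `visible` characters of `value` with asterisks."""
    if visible <= 0 or len(value) <= visible:
        # Nothing can be shown safely, so mask the whole value.
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


# A placeholder string, not a real credential.
print(mask_secret("not-a-real-secret-1234"))
```

In the results loop you could then print mask_secret(match.match) instead of the raw match.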

Retrieving the output as JSON

Now let's say we need to output these results in JSON format.

The API client has built-in functionality to convert results into JSON format.

Let's loop through our files and if our scan results have policy breaks within them, we will print them in JSON.

# Getting results in JSON format
for i, scan_result in enumerate(to_process):
   if scan_result.has_policy_breaks:
       print(scan_result.to_json())
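If you would rather have one JSON document for the whole run than one printed per file, the per-file results can be merged. The sketch below assumes that to_json() returns a JSON string. The function name build_report and the sample strings are hypothetical stand-ins and do not reflect the real response schema.

```python
import json


def build_report(filenames, json_results):
    """Merge per-file JSON strings into one JSON array with a file name per entry."""
    report = [
        {"filename": filename, "result": json.loads(raw)}
        for filename, raw in zip(filenames, json_results)
    ]
    return json.dumps(report, indent=2)


# Hypothetical stand-ins for what scan_result.to_json() might return.
sample_results = ['{"policy_break_count": 1}', '{"policy_break_count": 0}']
report_text = build_report(["main.py", "sample.yaml"], sample_results)
print(report_text)
```

The resulting string can be written to a file or passed to another tool.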

Checkpoint 2

You're done! Let's do a final code review to make sure your code is correct.

import glob
import os
import sys
import traceback
from dotenv import load_dotenv
load_dotenv()
API_KEY = os.getenv("GG_API_KEY") 

from pygitguardian import GGClient
from pygitguardian.config import MULTI_DOCUMENT_LIMIT

# Initializing GGClient
client = GGClient(api_key=API_KEY) 

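# Create a list of dictionaries for scanning
to_scan = []
for name in glob.glob("**/*", recursive=True):
    # Skip the .env file and directories, which cannot be opened as files
    if ".env" in name or os.path.isdir(name):
        continue
    with open(name) as fn:
        to_scan.append({"document": fn.read(), "filename": os.path.basename(name)})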

# Process in a chunked way to avoid passing the multi document limit
to_process = []
for i in range(0, len(to_scan), MULTI_DOCUMENT_LIMIT):
   chunk = to_scan[i : i + MULTI_DOCUMENT_LIMIT]
   try:
       scan = client.multi_content_scan(chunk)
   except Exception as exc:
       # Handle exceptions such as schema validation
       traceback.print_exc(2, file=sys.stderr)
       print(str(exc))
       # There is no result to read for this chunk, so stop here
       break
   if not scan.success:
       print("Error scanning some files. Results may be incomplete.")
       print(scan)
       # Stop rather than read results from a failed response
       break
   to_process.extend(scan.scan_results)

# Printing the results
for i, scan_result in enumerate(to_process):
   if scan_result.has_policy_breaks:
       print(f"{to_scan[i]['filename']}: {scan_result.policy_break_count} break/s found")
       # Printing policy break type
       for policy_break in scan_result.policy_breaks:
           print(f"\t{policy_break.break_type}:")
           # Printing matches
           for match in policy_break.matches:
               print(f"\t\t{match.match_type}:{match.match}")

# Getting results in JSON format
for i, scan_result in enumerate(to_process):
   if scan_result.has_policy_breaks:
       print(scan_result.to_json())

Warning

Please note that you should only scan for secrets in places they should not exist and revoke any that are found. As a general rule, any secrets that end up in remote locations not specifically designed to secure sensitive data should be considered compromised. This includes using the GitGuardian API.

Next Steps

Now that you have created your first script using the GitGuardian API and python wrapper you can create your own awesome scripts to scan files.

The next tutorial will help you scan files pre-commit or in the CI.

If you have any questions on the API, please email us at mackenzie.jackson@gitguardian.com.

What is secret sprawl, why it’s dangerous, and how developers can prevent it?

Jean — Thu, 04 Jun 2020 14:07:07 +0000

When developers refer to secret sprawl they are typically referring to the unwanted distribution of secrets across multiple platforms, services and machines. Once a secret ‘sprawls’ into other systems it can often have a follow-on effect, allowing attackers to use secrets to move laterally between services and uncover additional secrets.

To really understand how secrets sprawl, we first need to understand two key concepts: what a secret is and how we use secrets.

What is a secret?

As the name suggests, a secret is really any data that is sensitive, but when discussing secrets in the context of software development, developers are generally referring to anything that grants access to external services or data. These are most commonly API keys, credentials and security certificates.

How do developers use secrets?

Software used to include everything it needed to run internally. Today, the world is much more reliant on the internet, which has allowed software architecture to change fundamentally with the introduction of new services such as Cloud Architecture, SaaS Platforms and Microservices.

These services allow a lot of development work not related to the core of an application to be offloaded, which reduces upfront development costs while simultaneously making applications more robust and scalable. As beneficial as this is, it does introduce a new challenge: how to establish a trusted and secure connection with each of these services. This is generally done through an exchange of secrets, namely API keys, security certificates and credentials.

The challenge

Depending on the size and objective of an application, a project might need to connect to tens or even hundreds of services which all need individual secrets.

Developers not only need to store these secrets safely, they also need to be able to distribute and use secrets during their development process. Adding an even greater level of complication, secrets will often get rotated and revoked over time. This means that the distribution of secrets is a challenge that will last throughout the entire software development lifecycle.

How secrets sprawl across the internet

Secrets management requires a thoughtful understanding of what permissions to give secrets, who needs access to them, how to keep them in sync across multiple teams (often in different geographies), and what restrictions, tools and guidelines need to be in place when accessing and using them.

Strict secret management creates added procedures that are both difficult to implement and tempting to circumvent. This is why developers and organizations alike often store secrets in insecure locations, usually unintentionally. Secrets can be hardcoded into source code and included in a git repository which can get cloned onto multiple machines (professional and personal), get sent via Slack or emailed for convenience, saved to an internal Wiki, uploaded into a Google Drive, and so on.

Secrets sprawl increases the 'attackable area'

Even if secrets don’t end up on public internet space (for example on a public git repository) they should still be considered compromised if they are sprawled. Having secrets on multiple services (email, Slack, git, etc.) increases what is referred to as the 'attackable area'.

The attackable area refers to the number of systems that could be exploited to find secrets. In a situation where an organization has secrets sprawled over multiple locations, it only takes one compromised developer's git account, one compromised email or one compromised computer for an attacker to suddenly gain access to a trove of highly sensitive secrets.

Secrets can be used to travel laterally between systems too. For example, a secret allowing an attacker access to Slack messages might lead them to discover secrets with access to a cloud drive, which might uncover secrets to a database, and so on. Secrets should remain centralized and encrypted.

How do you prevent secret sprawl?

Secrets management, as stated previously, is difficult. But there are great tools available that can be implemented to help tackle the issue of secret sprawl.


Secrets Scanning

To combat secret sprawl it is essential to have visibility inside the systems where secrets may be located. .git repositories can contain a trove of secrets buried in the history including inside old versions of source code, application logs and config files. It is important to consider also that even the best secrets management systems and policies do not prevent newly generated secrets entering the code base or old secrets being extracted and included again, therefore all organizations should implement secrets scanning into their workflow.

GitGuardian offers a free secrets scanning tool for .git repositories which scans, in real time, every commit you make so you can immediately identify if your secrets have sprawled. GitGuardian also has an API so all office systems like Slack or email can be scanned for secrets too.

Encrypting Secrets

Git repositories offer unmatched collaboration features for developers. Git not only acts as a complete historic record of a project but also offers a single point of truth for the latest version and files, which is why it is so common for secrets to be stored within them. The good news is that there are ways to store secrets securely within git repositories.

Git-secret is a free open source tool that encrypts secrets within git repositories making them safe to distribute through git.

While encrypting secrets and storing them within git does prevent secrets from sprawling through git, it does not prevent secret sprawl on other services, and developers will still need to manage the secrets used to decrypt the secret file. (Secret sprawl can be like Inception: secrets for files that contain secrets inside, which give access to services with secrets inside them.)

Using secrets management solutions

One of the most popular secrets management tools on the market, HashiCorp Vault, offers both open-source and enterprise solutions to developers and organizations and provides the ability to tightly restrict and control access to secrets, enabling the easy rotation of secrets while also giving developers the ability to easily connect to external services.

HashiCorp Vault can be difficult to roll out and implement and might not be appropriate for all types of secrets.

Conclusion

No organization, big or small, is immune from secret sprawl, and the best policies and tools still won’t stop every possibility of secret sprawl. This is why, to combat secret sprawl, you need to combine a strategy to store and manage secrets (rotate and distribute) with a strategy to gain visibility into your services and systems.

Get visibility over your systems now with GitGuardian

GitHub security: what does it take to protect your company from credentials leaking on GitHub?

Jean — Wed, 20 May 2020 14:38:43 +0000

This post was originally written by GitGuardian's CEO Jérémy Thomas. This guide is intended for CISOs, Application Security, Threat Response, and other security professionals who want to protect their companies from credentials leaking on GitHub.

Read this guide if:

You are aware of the risks of corporate credentials leaking on public GitHub. If you still need convincing, we have 3 years of historical GitGuardian monitoring data that we can filter down to your company domain name, aggregate to remove sensitive content, and share with you upon request. You do not need to spend any time talking to our sales reps if you don’t want to:

Get three years of historical GitGuardian monitoring data filtered down to your company domain name

You are in the market for a solution, and would like to investigate the requirements such a solution should have.

Disclaimer: I am the CEO of GitGuardian, which offers solutions for detecting, alerting and remediating secrets leaked within GitHub, therefore this article may contain some biases. GitGuardian has been monitoring public GitHub for over 3 years which is why we are uniquely qualified to share our views on this important security issue. Security professionals are often overwhelmed by an army of vendors, many of which are equipped with disputable facts and figures, and favor the use of scare tactics. These professionals therefore prefer to leverage their network or peer recommendations to make buying decisions. I am confident that the information in this guide can be backed up by solid and objective evidence. If you’d like to share your comments on it, please email me directly at jeremy [dot] thomas [at] gitguardian [dot-com].

Requirements for public GitHub monitoring and why they are important

We’ve classified requirements in functional categories: define and monitor your perimeter, detect incidents, alert, remediate, and log and analyze.

We will go through these requirements one by one and explain why they are important.

Define & Monitor your perimeter

Monitoring your perimeter requires the ability to automatically associate repositories, developers and published code with your organization. There are millions of commits per day on public GitHub. How can organizations look through the noise and focus exclusively on the information that is of direct interest to them?

Organization repositories monitoring

These are the repositories that are listed under your company’s GitHub Organization, if your company has one. This only concerns companies which have open-source projects. Less than 20% of corporate leaks on GitHub occur within public repositories owned by organizations. The majority of the remaining leaks occur on developers’ personal repositories, and a small portion also occurs on IT service providers' or other suppliers’ repositories.

Developers’ personal public repositories monitoring

Around 80% of corporate leaks on GitHub occur on developers’ personal public repositories. And yes, I’m really talking about corporate leaks, not personal ones.

In the vast majority of the cases, these leaks are unintentional, not malevolent. They happen for many reasons:

Developers typically have one GitHub account that they use both for personal and professional purposes, sometimes mixing the repositories.
It is easy to misconfigure git and push wrong data.
It is easy to forget that the entire git history is still publicly visible even if sensitive data has since been deleted from the actual version of source code.

Detect incidents

Sensitive information that is leaked on the platform generally falls under two categories:

What developers call “secrets”,
Intellectual Property like proprietary source code.

Secret: anything that gives access to a system: API keys, database connection strings, private keys, usernames and passwords. Secrets can give access to cloud infrastructure, databases, payment systems, messaging systems, file sharing systems, CRMs, internal portals, ...

It is very rare, in our experience, to see valid PII leaked on the platform, although we often see secrets giving access to systems containing PII.

High precision

Precision answers the question: “What is the percentage of sensitive information that you detect that is actually sensitive?”. This question is perfectly legitimate, especially in the context of SOCs being overwhelmed with too many false positive alerts.

Precision is easily measurable: the vendor sends alerts, and users can give feedback through a “true alert” / “false alert” button. Your vendor should be able to present precision metrics, backed by strong evidence records and well-defined methodology.

High recall

This one is a bit tougher than precision. Recall answers the question: "What percentage of the sensitive information that exists did you actually detect?". Having a high recall means having a small number of missed secrets. This question is also very important, considering the impact that a single undetected credential can have for an organization.

Recall is more complicated to measure than precision. This is because finding sensitive information in source code is like finding needles in a haystack: there are a lot more sticks than there are needles. You need to manually go through thousands of sticks in order to realize that you’ve missed a needle or two. A decent proxy for recall is the number of individual API keys and additional sensitive information supported by your vendor.

A good algorithm is able to achieve excellence in precision AND recall.
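In terms of counts, the two metrics are usually computed as shown in this short sketch. The numbers are made up for illustration and are not GitGuardian measurements.

```python
def precision(true_positives, false_positives):
    """Share of raised alerts that were real secrets."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Share of real secrets that were detected."""
    return true_positives / (true_positives + false_negatives)


# Made-up numbers: 90 correct alerts, 10 false alerts, 30 secrets missed.
print(precision(90, 10))  # 0.9
print(recall(90, 30))     # 0.75
```

A detector can score well on one metric and poorly on the other, which is why both need to be reported.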

Ability to detect unprefixed credentials

Some secrets are easier to find than others, especially prefixed credentials that are strictly defined by a distinctive, unambiguous pattern.

The majority of published credentials however, don’t fall into this category. Therefore, any solution based entirely on prefix detection will miss a lot of leaked credentials. Your vendor must be able to detect Datadog keys or OAuth tokens for example, using techniques involving a combination of entropy statistics and sophisticated pattern matching applied not on the presumed key itself, but on its context.
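To make the difference concrete, here is a rough sketch of the two ideas: a fixed pattern for a prefixed credential, and Shannon entropy as one signal for strings that have no prefix. This is an illustration only, not GitGuardian's detection algorithm. The AKIA pattern reflects the commonly documented shape of AWS access key IDs, and the sample value is a placeholder in that format, not a real credential.

```python
import math
import re
from collections import Counter

# Prefixed credential: "AKIA" followed by 16 upper-case letters or digits.
PREFIXED_PATTERN = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def shannon_entropy(text):
    """Shannon entropy of `text` in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # Sum p * log2(1/p) over the distinct characters of the string.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


print(bool(PREFIXED_PATTERN.search("key = AKIAIOSFODNN7EXAMPLE")))
print(shannon_entropy("aaaaaaaa"))
print(shannon_entropy("abcdefgh"))
```

A repeated character has zero entropy while a string of distinct characters scores higher, which is why random-looking keys stand out. Entropy alone also flags hashes and identifiers, so it only works as one input alongside the surrounding context.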

Keyword monitoring

When choosing keywords that you would like to be alerted on, make sure keywords are distinctive enough to be uniquely linked to your company. Good keywords are typically: internal project names (providing they are not common), internal URLs or a reserved IP address range (although not technically a keyword).

Do you want to know if a given keyword is distinctive enough? Try using the GitHub built-in search for an estimation of the results it would bring (this is just an estimation, as the GitHub search is rather limited).

Take “docker.com” as a concrete example: with over 597K code results containing the keyword, it is not a good candidate for keyword matching.

Alert

Real-time alerting

When remediating GitHub leaks, you are in a race against time.

This is especially true when discussing leaked credentials (as opposed to Intellectual Property):

You are not fighting against the information being more and more widely spread. The moment you invalidate the credential, it no longer poses a threat, so you are no longer concerned if it is further disseminated (except for brand reputation considerations), since it does not give access to anything anymore.
You are fighting against the credentials being exploited. Credentials are extremely easy to exploit by anyone, without any specific knowledge.

Any detection that involves human operators (typically to filter out too many false positives) and is not fully automated is probably already too slow to react. You must expect your vendor to come up with strong evidence that their reaction time is counted in minutes, not hours or days.

Developer alerting

Facing a leak can be a tough process that requires speed, and knowledge from multiple people (typically Threat Response / Application Security / Developers).

Since the developer responsible for the leak is at the forefront of the issue, they can be your first responders. This is especially the case if your solution raised the alert fast enough after the leak occurred, so that your developer is still in front of their computer.

The developer generally knows what the credential gives access to, services and applications that rely on it, other developers who use it, etc. But they often don’t have the right to revoke the credentials and redistribute them.
Application Security or “DevOps” personnel have the ability to inspect logs generated during the time the key was exposed, evaluate the way a potential hacker could have moved to other systems from this entry point, revoke the credentials and redistribute them.
Threat Response will make sure that procedures are followed, in terms of investigation, remediation, internal communications, public relations, legal, lessons learned and feedback loop.

Remediate

Integration with your remediation workflow

It is quite obvious that the solution should be integrated with your preferred SIEM, ITSM, ticketing system or chat.

One thing to keep in mind: if your organization is spread over multiple geographies / time zones, automatically associating a GitHub leak with a geography for incident dispatching purposes might not always be possible without first providing additional information to your vendor. This means that you will either have to provide your vendor with a list of your developers and their geographies, or indicate a single entry point that has global responsibilities for your vendor to alert.

Collaboration with developers

Potential damage can rarely be estimated just by looking at the code surrounding the leaked credential, and sensitive corporate credentials are often leaked on developers’ personal projects. When alerted about a GitHub leak, your first remediation step will always be to check whether or not the developer is still working in your company, and to reach out to them with a questionnaire to gather input for impact assessment and incident prioritization.

Log & Analyze

Logging can answer multiple needs, depending of course on the data that is logged: post-incident analysis, reporting to management, demonstrating compliance to customers or auditors, security (audit trail), or transparency in what the solution is really doing.

I’d like to briefly insist on this last point: ask your vendor for proof points! An ideal solution would provide a detailed list of every monitored developer and repository, as well as logs of every single commit that was analyzed, and reproducible results of conducted scans.

About GitGuardian

As the CEO of GitGuardian, I’m always extremely grateful for the time and trust that security professionals give us. We’ve built a sales organization that is thoroughly trained to behave with extreme respect and professionalism. This is how we sell cybersecurity software at GitGuardian:

No scare tactics.
Since we’ve been monitoring GitHub for 3 years now, we can provide you with personalized data from your company’s perimeter. We aggregate the data to remove any sensitive content, so that you can judge whether or not it is worth dedicating time to evaluating our solution.
Consultative approach: we’ll come up with questions to help you evaluate your needs, and are always keen on sharing the technical details of what we do with your tech teams.
Later in the sales process, we’ll show your security team the GitGuardian dashboard populated with actual data from your company.
If we feel we’re not a good fit for your needs, we’ll let you know early in the process.

Contact us to protect your company from credentials leaking on GitHub

8 free security tools every developer should know and use to Shift Left

Jean — Fri, 15 May 2020 12:44:19 +0000

Shifting left is a development principle which states that security should move from the right (or end) of the software development life cycle (SDLC) to the left (the beginning). In other words: security should be integrated and designed into all stages of the development process. This new shift requires developers to take more ownership of security and security principles. The good news is that there are lots of tools available to help developers in this process.

In this blog we will break up Application Security into key areas and walk through some free and open-source solutions that will help developers and organizations make sure that, at every stage of their SDLC, the incremental changes they make improve the overall quality and security of their software.

Shifting left may feel like adding extra work to a developer's already full plate, but in reality, it empowers developers to learn more about great security practices which results in less time spent fixing bugs and more time spent building great applications.

Application Security

It is important to realize that no single product can fix all application security vulnerabilities. Successful security requires a layered approach with many lines of defence for different stages of the SDLC.

The tools we will investigate cover:

SAST - Static Application Security Testing
DAST - Dynamic Application Security Testing
IAST - Interactive Application Security Testing
RASP - Run-time Application Self Protection
Dependency Scanning
Secrets Detection

While it is true that vulnerabilities picked up early are easier - and cheaper - to remediate, you cannot rely on finding all vulnerabilities during the early stages of the development. Security needs to be a concern throughout the entire SDLC.

Static Application Security Testing - also known as “white box testing”, is the most common and earliest category of automatic application security. SAST scans an application's source code to discover any known vulnerabilities. Because SAST does not require an application to be compiled or running to start detecting vulnerabilities (unlike DAST), it can be implemented very early in the SDLC.
It also enforces coding guidelines and standards without executing the underlying code. This category of application testing has a wide variety of solutions available, so when deciding on using one, make sure the solution is well supported and maintained and works within your technology stack. Here are some of the best free SAST tools.
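
To make the idea of analyzing code without executing it concrete, here is a minimal sketch in Python. It is not how any of the tools below work internally: it simply parses source code with the standard-library `ast` module and reports direct calls to `eval` or `exec`. The function name `find_risky_calls` and the rule set `RISKY_CALLS` are assumptions chosen for illustration.

```python
import ast

# Hypothetical rule set for this sketch: built-in calls worth a second look.
RISKY_CALLS = {"eval", "exec"}


def find_risky_calls(source: str) -> list:
    """Return sorted (line number, function name) pairs for risky calls.

    The source is only parsed, never executed, which is the defining
    property of static analysis.
    """
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Only direct calls by name, such as eval(...), are matched here.
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in RISKY_CALLS
        ):
            findings.append((node.lineno, node.func.id))
    return sorted(findings)


# A made-up snippet to analyze.
sample = "x = input()\nprint(eval(x))\n"
for lineno, name in find_risky_calls(sample):
    print(f"line {lineno}: call to {name}()")
```

Real SAST tools apply far larger rule sets and follow data flow across files, but the principle of inspecting a parsed representation of the code is the same.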

NodeJsScan

NodeJsScan has a command line interface for easy integration with DevSecOps CI/CD pipelines and produces results in JSON.
A configuration file is available for each language which can be modified for customized searches. Overviews of files, as well as an entire codebase, can be visualized through stats and pie charts. The program can detect buffer overflows and flaws in Java code that may contain OWASP security risks.

SonarQube

SonarQube is widely regarded as one of the best automated code review tools on the market. It has thousands of automated static code analysis rules and supports 27 languages, a mix of modern and legacy, so that it can cover your entire project and its continuous development.

Dynamic Application Security Testing - also known as “black box” testing, doesn’t look for vulnerabilities in source code like SAST. Instead, it finds vulnerabilities in running applications by employing fault injection techniques on an app. DAST can identify common security vulnerabilities, such as SQL injection and cross-site scripting. DAST can also cast a spotlight on runtime problems that can’t be identified by static analysis, like authentication and server configuration issues, as well as flaws visible only when a known user logs in.

OWASP ZAP

OWASP ZAP is a full-featured, free and open source DAST tool that includes both automated scanning for vulnerabilities and tools to assist expert manual web app pen testing. ZAP has a large list of vulnerabilities that it can exploit and identify.

Interactive Application Security Testing - sometimes also known as "grey box" testing, is a technology that combines elements of both SAST and DAST. It is typically implemented as an agent within the test runtime environment (for example, instrumenting the Java Virtual Machine [JVM] or .NET CLR) that observes operation or attacks and identifies vulnerabilities.

Contrast Security - Community

Contrast is another developer-first product that is able to go deeper into vulnerabilities when compared to other SAST and DAST tools which are blind to the runtime context of applications such as the controller, application logic, data layer, presentation view, user libraries, open-source components, and application server.

Runtime Application Self Protection - is configured on a server and kicks in when an application runs. It's designed to detect attacks on an application in real-time. When the application begins to run, RASP can protect it from malicious input or behavior by analyzing both the app's behavior and the context of that behavior. By using the app to continuously monitor its own behavior, attacks can be identified and mitigated immediately without human intervention.

Sqreen

Sqreen’s Runtime Application Self-Protection identifies attacks that exploit vulnerabilities in production by leveraging the full execution context of requests.

Sqreen covers all of the OWASP top 10 security vulnerabilities such as SQL injection, XSS and SSRF. What makes Sqreen so powerful is its ability to leverage the execution logic of requests to block attacks with far fewer false positives than other solutions available. Sqreen is also able to adapt to your application's specific stack, so no redeployment or extra configuration is needed within your application, which makes setup simple and fast.

Dependency Scanning helps to automatically find security vulnerabilities in your dependencies while you are developing and testing your applications, for example when your application is using an external (open source) library which is known to be vulnerable.

Snyk

Snyk is a developer-first organization with well-maintained open-source solutions for developers and effective enterprise solutions for larger organizations.

Snyk has a range of great features that help make security part of the development process from day one such as the ability to detect vulnerabilities from within your IDE and native git scanning to test projects within the repositories. Snyk also provides a security gate to prevent new vulnerabilities from passing through the build process and a production environment to test your running environment to verify there is no exposure to existing vulnerabilities.

WhiteSource Bolt for GitHub

WhiteSource, like Snyk, has some great free tools for developers as well as enterprise solutions for larger organizations. WhiteSource Bolt for GitHub is a free app that continuously scans all your repos, detects vulnerabilities in open source components and provides fixes. It supports both private and public repositories.

Over 200 programming languages are supported with continuous tracking of multiple open source vulnerabilities databases like the NVD.

Secrets like API keys, database credentials and security certificates are the crown jewels of organizations and can provide access to sensitive systems. Secrets detection scans source code, logs and other files for hidden secrets. This is a specialist service: most secrets are high entropy strings (strings designed to appear random), but most high entropy strings are not secrets, which makes secrets very hard to detect. It requires advanced classification algorithms to detect secrets with high precision and recall.
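
As a rough illustration of what "high entropy" means, the sketch below computes the Shannon entropy of a string in bits per character. This is a textbook measure used here for illustration only, not a description of GitGuardian's detection algorithms, and both sample strings are made up.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of `text` in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Both values below are made up for illustration.
candidates = {
    "ordinary word": "configuration",
    "random-looking token": "k9Zq2LxP7vRt4WmB",
}
for label, value in candidates.items():
    print(f"{label}: {shannon_entropy(value):.2f} bits/char")
```

The random-looking token scores higher than the ordinary word, but so would a file hash or a UUID, which is why entropy alone produces many false alerts.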

Secrets detection is often confused with SAST because both scan through source code. Unlike SAST, which is concerned only with the current version of an application, secrets detection is concerned with the entire history of the project. Version control systems such as git keep track of and store all changes to a project. If previous versions of the source code contain hard-coded secrets that were removed at a later stage, code reviews and SAST tools will miss these secrets, which remain in the git repository and may become compromised. This is why secrets detection is a category on its own.
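
The following sketch shows why history matters. It walks through text in the format produced by `git log -p` and reports added lines that look like hard-coded credentials, even when a later commit removed them. The regular expression `SUSPECT`, the function names and the sample patch are all assumptions made for illustration; a single pattern like this is far cruder than a real secrets detector.

```python
import re
import subprocess

# Hypothetical rule for this sketch: a name containing key/secret/token/password
# assigned a quoted value of at least 16 characters.
SUSPECT = re.compile(
    r"""(?i)\b\w*(key|secret|token|password)\w*\s*[:=]\s*['"][^'"]{16,}['"]"""
)


def suspects_in_patch(patch_text: str) -> list:
    """Return (commit, line) pairs for added lines that match SUSPECT.

    `patch_text` is expected to look like the output of `git log -p`.
    """
    commit = None
    hits = []
    for line in patch_text.splitlines():
        if line.startswith("commit "):
            commit = line.split()[1]
        elif line.startswith("+") and not line.startswith("+++"):
            if SUSPECT.search(line):
                hits.append((commit, line[1:].strip()))
    return hits


def scan_history(repo_path: str = ".") -> list:
    """Scan every commit on every branch of the repository at `repo_path`."""
    patch = subprocess.run(
        ["git", "-C", repo_path, "log", "--all", "-p", "--no-color"],
        capture_output=True, text=True, check=True,
    ).stdout
    return suspects_in_patch(patch)


# Made-up history: the key is added in one commit and removed in the next.
SAMPLE_PATCH = """commit 2222222
    read the key from the environment
--- a/settings.py
+++ b/settings.py
-API_KEY = "abcdefghijklmnop1234"
+API_KEY = os.environ["API_KEY"]
commit 1111111
    add settings
--- /dev/null
+++ b/settings.py
+DEBUG = True
+API_KEY = "abcdefghijklmnop1234"
"""

for commit, line in suspects_in_patch(SAMPLE_PATCH):
    print(f"{commit}: {line}")
```

The current version of `settings.py` in this made-up history is clean, yet the key is still reported from the older commit. Calling `scan_history()` inside a real repository applies the same check to every branch.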

GitGuardian

GitGuardian’s technology works by scanning developers’ repositories for evidence of secrets.

GitGuardian covers more than 300 different types of secrets from keys to database connection strings, SSL certificates, usernames and passwords. These secrets are detected through a combination of algorithms, including sophisticated pattern matching techniques. GitGuardian can be integrated with your GitHub account and configured within minutes and developers can use the GitGuardian API to detect secrets in any services including within directories, email clients or Slack channels.

Try GitGuardian, the best free security tool to find secrets in your code

Wrap up

With so many solutions available it can feel daunting to decide what tool to select within each category. Always consider how each tool fits into your current workflow as even great tools can be rendered useless if they become too difficult to use.

Each application is different and the tools outlined above should be considered a minimum level of protection; you and your organization may need more detailed solutions. Security is one of the most highly valued skills in a developer. Although shifting security "left" can seem like a daunting task, it is a worthwhile investment to understand and implement these systems within your entire development life cycle.

Assessing model performance in secrets detection: accuracy, precision & recall explained

Jean — Wed, 06 May 2020 13:12:35 +0000

Detecting secrets in source code is like finding needles in a haystack: there are a lot more sticks than there are needles, and you don’t know how many needles might be in the haystack. In the case of secrets detection, you don’t even know what all the needles look like!

This is the issue we are presented with when trying to evaluate the performance of probabilistic classification algorithms like secrets detection. This blog will explain why the accuracy metric is not relevant in the context of secrets detection, and will introduce two other metrics to be considered together instead: precision and recall.

Accuracy, precision and recall metrics answer three very different questions:

Accuracy: What percentage of the time did you correctly tell a stick from a needle?

Precision: Looking at all the objects that you identified as needles, what percentage are actually needles?

Recall: Among all needles that were to be found, what percentage of needles did you find?

Why is accuracy not a good measurement of success for secrets detection?

The difference is subtle in their descriptions but can make a huge difference.

Going back to the needle analogy, take a group of 100 objects, 98 sticks and 2 needles, and create an algorithm to detect all the needles. Suppose that, after running, the algorithm has identified all the sticks correctly but only 1 needle. This algorithm failed 50% of the time at its core objective, yet because it detected the sticks correctly it still has a 99% accuracy rate.

So, what happened? Accuracy is a common measurement used in model evaluation, but in this case it gives us the least usable data. This is because there are a lot more sticks than there are needles in our haystack, and equal weight is applied to both false positives (the algorithm took a stick for a needle) and false negatives (the algorithm took a needle for a stick).

This is why accuracy is not a good measurement to determine success in secrets detection algorithms. Precision and recall look at the algorithm's primary objective and use this to evaluate its success, in this case, how many needles were identified correctly and how many needles were missed.

High Precision = low number of false alerts
High Recall = low number of secrets missed

It is really easy to create an algorithm with 100% recall: flag every commit as a secret. It is also really easy to create an algorithm with 100% precision: flag only once, for the one string you are most confident is indeed a secret. These two naive algorithms are obviously useless. The challenge lies in combining both precision and recall.
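
The two naive algorithms can be checked with a few lines of Python. This sketch reuses the 98 sticks and 2 needles from the analogy above; the helper `precision_recall` and the variable names are illustrative choices.

```python
def precision_recall(labels, flags):
    """Compute (precision, recall) from two parallel lists of booleans.

    labels[i] is True when item i really is a needle (a secret);
    flags[i] is True when the detector flagged item i.
    """
    pairs = list(zip(labels, flags))
    tp = sum(1 for is_needle, flagged in pairs if is_needle and flagged)
    fp = sum(1 for is_needle, flagged in pairs if not is_needle and flagged)
    fn = sum(1 for is_needle, flagged in pairs if is_needle and not flagged)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# 100 objects: 98 sticks (False) followed by 2 needles (True).
labels = [False] * 98 + [True] * 2

flag_everything = [True] * 100               # naive detector 1
flag_one_sure_thing = [False] * 99 + [True]  # naive detector 2

print("flag everything:", precision_recall(labels, flag_everything))
print("flag one item:  ", precision_recall(labels, flag_one_sure_thing))
```

Flagging everything reaches perfect recall at the cost of almost no precision, while flagging a single sure item reaches perfect precision but misses half of the needles.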

So how can we properly evaluate the performance of a secrets detection model?

Let’s take a hypothetical algorithm that scans 1,000 source code files for potential secrets.

In this example we will state:

975 files contained no secrets within the source code
25 files contained secrets within the source code

The algorithm detected

950 True Negatives (TN): No secrets detected where no secrets existed
25 False Positives (FP): Detected secrets that were not true secrets
15 True Positives (TP): Detected secrets where secrets exist
10 False Negatives (FN): Detected no secrets where secrets did exist

This can be displayed on a confusion matrix (below) which is a performance measurement tool for classification algorithms to help visualize data and calculate probabilities.

We can use this matrix to gather a range of different data including accuracy, precision and recall.

So what do these results show? We can calculate that our model has a 96.5% accuracy rate. This seems pretty good, and you may think that it means the model detects secrets 96.5% of the time.

This would be incorrect because this hypothetical model is really only beneficial for not detecting secrets that aren’t there. This is similar to an algorithm that's great at predicting car accidents that don’t happen.

If we look at metrics other than accuracy, we can see how this model begins to fail.

Precision = 37.5%
Recall = 60%

All of a sudden, the model doesn’t seem to be beneficial. It only returns 60% of the secrets, and only 37.5% of the secrets it returns are true positives!
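
The three metrics follow directly from the four counts in the example, using the standard definitions: accuracy = (TP + TN) / total, precision = TP / (TP + FP), recall = TP / (TP + FN). A short sketch to reproduce the calculation and lay the counts out as a confusion matrix:

```python
# Counts taken from the example above.
tn, fp, tp, fn = 950, 25, 15, 10

total = tn + fp + tp + fn
accuracy = (tp + tn) / total   # share of all files classified correctly
precision = tp / (tp + fp)     # share of flagged files that hold a real secret
recall = tp / (tp + fn)        # share of real secrets that were flagged

# Confusion matrix: rows are the truth, columns are the model's decision.
print(f"{'':>16}{'flagged':>10}{'not flagged':>14}")
print(f"{'secret present':>16}{tp:>10}{fn:>14}")
print(f"{'no secret':>16}{fp:>10}{tn:>14}")

print(f"accuracy  = {accuracy:.1%}")
print(f"precision = {precision:.1%}")
print(f"recall    = {recall:.1%}")
```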

Balancing the equation: achieving a high precision, high recall secrets detection algorithm

Balancing the equation to ensure that the highest possible number of secrets are captured without flagging too many false results is an intricate and extremely difficult challenge.

So difficult that GitGuardian dedicated an entire team to train secrets detection algorithms and find the correct balance of high recall and high precision.

This balance is essential: tuning a model for very high precision may lead to secret leaks going undetected, while low precision will create too many false alerts, rendering the tool useless.

There are no shortcuts when building and refining an algorithm. They need to be extensively trained with huge amounts of data and constant supervision.

When talking about why some models fail, Scott Robinson from Lucina Health describes three core failures when training an AI algorithm:

(black box systems are those that are so complex they become impenetrable)

It is important also to realize that when building algorithms for probabilistic scenarios, they will change over time. There is no perfect solution that can remain the same, trends will change, secrets will change, data will change, formats will change and therefore, your algorithm will need to change.

“People can create an algorithm, but the data really makes it useful” Kapil Raina

GitGuardian as an example

GitGuardian is the world leader in secrets detection, which has been achieved largely due to the vast amount of information that has gone through the algorithm.

Over 1 billion commits are scanned and evaluated every single year, and over 500k alerts have been sent to developers and security teams. We’ve collected a lot of explicit (alert marked as true or false) and implicit (commit or repository deleted after our alert) feedback. It is a great example of how effective retraining on data, particularly at this large scale, can be used to continuously improve an algorithm's precision and recall.

At the start in 2017, GitGuardian was detecting 200 secrets per day on GitHub, a benchmark set by other offerings on the market. With extensive model training, GitGuardian now detects over 3,000 per day with a 91% precision.

There are no shortcuts in building algorithms. We’ve battle-tested our algorithms on public GitHub with billions of commits (yes billions), and these algorithms can now be used to detect secrets in private repositories as well. It would have been impossible to launch detection in private repositories without doing so on public GitHub first.

=> Test out GitGuardian's automated secrets detection

Using Git hooks for automated secrets detection

Jean — Thu, 16 Apr 2020 12:46:14 +0000

Git hooks are extremely useful in the journey to replace as much of the human factor in the process of secure development as possible.

What are git hooks?

Git hooks are scripts that are triggered by certain actions in the software development process, like committing or pushing. By automatically pointing out issues in code, they allow reviewers not to waste time on mistakes that can be easily diagnosed by a machine.

There are client-side hooks, that execute locally on the developers’ workstation, and server-side hooks, that execute on the centralized version control system.

If you are interested in exploring git hooks further, here is a curation of the most useful git hooks on GitHub.

Why implement secret detection in your SDLC?

As a general security principle, where feasible, data should remain safe even if it leaves the devices, systems, infrastructure or networks that are under organization control, or if these are compromised. Assuming a breach helps prevent lateral movement after a hacker gains initial access.

In their everyday life, developers handle a trove of sensitive information that hackers could leverage. They rely on hundreds of secrets like API keys, database connection strings, private keys, or certificates to interconnect payment systems, databases, CRMs, messaging and notification systems, internal services… Too often, these secrets are hardcoded in source code or shared over Slack or emails. None of these systems are designed to store and share secrets, nor are internal wikis a good place to expose usernames and passwords.

Indeed, because of the very nature of software development, source code is made to be cloned on different workstations, deployed on multiple servers, distributed to customers, etc. In practice, you never know where your code is going to end up. If it contains secrets, it takes just one of these places to be compromised for all the secrets to be compromised as well. Same reasoning holds for all developers having access to source code: it takes one compromised developer account to compromise all the secrets they have access to.

On top of that, API keys and other secrets that are used to programmatically authenticate or authorize ourselves are unlike traditional usernames and passwords: because they are made to be programmatically used, they aren’t further secured by MFA (most of the time).

Pre-commit, pre-push, pre-receive, post-receive: where to implement secret detection?

Here are some general principles about fitting security in the software development process:

The earlier a security vulnerability is uncovered, the less costly it is to correct. Hardcoded secrets are no exception. If a secret is uncovered after it reaches the centralized version control server, it must be considered compromised, which requires rotating (revoking and redistributing) the exposed credential. This operation can be complex and can involve multiple stakeholders.
People bend the rules, often in an effort to collaborate better together and do their job. Security must not be a blocker. It should allow flexibility and foster information flows, yet enable visibility and control. Security measures will be bypassed, sometimes for the worst. But it is also good sometimes that the developer can take the responsibility to bypass them. Talking about secrets detection: algorithms achieve a tradeoff between not raising false alerts (high precision) and not missing keys (high recall). Secrets detection being probabilistic, even the best algorithms can fail and need human judgement.

These principles advocate for the following:

Client-side secrets detection early in the software development process is a nice to have: implement pre-commit or pre-push hooks when possible. The good thing with pre-commits is that the secret is never added to the local repository, which comes in handy since removing a secret from your git history can be very tricky. Whereas the good thing with pre-push is that you’ve got an Internet connection there, allowing you to make API calls for example. This is not necessarily the case when committing.
Server-side secrets detection is a must have: take into account that depending on the size of your organization, enforcing client-side secrets detection might not be an easy task, as this requires access to your developers’ workstations. We’ve heard many times from Application Security professionals that this is not something they felt confident to do. In any case, keep in mind that client-side hooks can (and must, secret detection being probabilistic) be easy to bypass, hence the absolute necessity for server-side checks where the ultimate threat lies.
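
A client-side check along these lines can be sketched as a pre-commit hook written in Python. This is a minimal illustration, not GitGuardian's implementation: the regular expression `SUSPECT` and the function names are assumptions, and a production hook would call a proper detection engine instead of a single pattern.

```python
#!/usr/bin/env python3
import re
import subprocess
import sys

# Hypothetical rule for this sketch: a name containing key/secret/token/password
# assigned a quoted value of at least 16 characters.
SUSPECT = re.compile(
    r"""(?i)\b\w*(key|secret|token|password)\w*\s*[:=]\s*['"][^'"]{16,}['"]"""
)


def added_lines(diff_text):
    """Yield the lines added by a unified diff, without the leading '+'."""
    for line in diff_text.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            yield line[1:]


def findings_in_diff(diff_text):
    """Return the added lines that look like hard-coded secrets."""
    return [line.strip() for line in added_lines(diff_text) if SUSPECT.search(line)]


def main():
    """Return 1 (block the commit) when a staged line looks like a secret."""
    diff = subprocess.run(
        ["git", "diff", "--cached", "--unified=0", "--no-color"],
        capture_output=True, text=True, check=True,
    ).stdout
    findings = findings_in_diff(diff)
    for finding in findings:
        print(f"possible secret staged: {finding}", file=sys.stderr)
    if findings:
        print("Commit blocked. Bypass with: git commit --no-verify", file=sys.stderr)
        return 1
    return 0


# When installed as .git/hooks/pre-commit, finish the file with:
#     sys.exit(main())
```

To try it, save the script as `.git/hooks/pre-commit`, add the final `sys.exit(main())` line and make the file executable. Because a pattern like this will sometimes be wrong, the developer can still override it with `git commit --no-verify`, which is exactly why a server-side check remains necessary.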

Secret detection has one extremely important peculiarity though: unlike cryptography weaknesses or SQL injection vulnerabilities, which only express themselves the moment the code is deployed, any secret reaching a version control system must be considered compromised and requires immediate attention, even if the code is not ready to be deployed yet. This implies that implementing secrets detection is not only about scanning the most recent version of your master branch before deployment. It is also about scanning through every single commit of your git history, covering every branch, even development ones.

Implementing GitGuardian

GitGuardian comes in the form of a dashboard centralizing policy breaks across your organization’s repositories.

It is natively integrated with your Version Control System in a post-receive fashion. When integrating a repository into your monitored perimeter, secret detection is enforced on every branch, without making any distinction between development and master branches. At every push, GitGuardian not only scans the current source code, as would be the case if we were looking at other security vulnerabilities such as vulnerable dependencies, but also goes through every incremental change that was made since the last push.

We also encourage you to run security checks early and often by using our API. Our API allows you to use our secrets detection as a service in pre-commits, pre-push, or in your CI (although CI builds aren’t typically enforced on all branches to scan every incremental change that was made). This would complement native integration with your Version Control System.

At GitGuardian, we are always keen to share the technical details of what we do, and all the subtleties we found in our journey to automate secrets detection. We are committed to doing so, even if it is not directly related to the objectives that we are trying to reach as a company. We do this in the spirit of Open Source, knowing that sharing technical details allows us to get more feedback from developers and Application Security professionals around the world, and ultimately create more value.

Try GitGuardian, sign up with GitHub

8 Steps to keep remote development teams secure

Jean — Tue, 07 Apr 2020 13:45:44 +0000

There is no doubt that the world's workforce is becoming more remote, particularly in tech as developers can now work from any location in the world. But there are a large number of new obstacles that come with this. The most pressing is security.

Take the current COVID-19 health crisis. From one day to the next, countries are going into quarantine and forcing companies and developers into working remotely. I for one am writing this from my home office in Paris, sipping filter coffee while looking onto the empty streets in a complete lock-down that started last week.

We are all trying to rapidly adjust to a new workflow from our makeshift workstations. As this crisis unfolds, we are seeing a drastic change in the remote infrastructure as it is pushed to its limit. This is creating an unprecedented security challenge that may not be a priority for many companies.

I recently got an alert to say that Microsoft Teams, a platform to help manage remote workers and collaboration, crashed under the new pressure of millions of users. Likewise the video streaming company Zoom, which is now facing serious security breaches, went from 10 million daily users in December to 200 million daily users in March! These are obvious signs of the strain we are putting on digital infrastructure as this crisis unfolds. Teams that are only familiar with being centralized on a secure network, are now at home on a private WiFi network trying their best to keep websites, services, platforms and apps functioning.

Image taken from Downdetector, 16 March 2020

Why does remote work present a security risk?

Remote working is not necessarily less secure, if done right. It is extremely important for remote teams to have operational frameworks and tools in place to deal with the challenges that are unique to remote teams. Especially when it comes to issues as urgent as security.

If you make secret management too painful for developers, it becomes all too tempting to circumvent the restrictions. Now let's add to this balancing act different geographies, time zones and unreachable co-workers. The temptation to send a secret over Slack, email or to quickly hardcode secrets into the source code becomes increasingly hard to resist.

In addition to internal operational control, we must now also consider external factors, such as unsecured personal wifi-networks and personal machines. These all add to the potential of secrets being leaked into unsecured locations.

The once clear delineation between personal and professional activity suddenly becomes blurred when you are sharing a single machine and mixing personal git repositories with professional repositories. You can easily find yourself in the worst case (yet common) scenario of mistakenly uploading secrets and corporate code into personal repositories. All these factors combined increase the attack surface of the organization, as secrets that were once secured in a central location are now sprawled across multiple platforms and tools.

Unsecured messaging systems like Slack and email are known to be high-value targets for attackers through phishing campaigns. Once an attacker has admin access to your company's Slack, they are a simple search away from gaining access to your sensitive information. The normal conditions that could help identify an external threat also begin to unravel, as IP addresses of all developers can change daily. Currently, the number of reported phishing campaigns has drastically increased with the COVID-19 crisis unfolding as attackers piggy-back onto public panic.

8 steps to making remote work more secure for development teams

1. Get visibility over secrets shared in your repositories

Repositories are not appropriate places to store secrets. It is important that you have complete visibility of not only public repositories but also your private repositories. Having secrets in private repositories:

increases the chances of a secret ending up on a public repository.
perhaps more importantly, exposes those secrets to malevolent actors: now that employees are increasingly working from home, it is more likely that company code will end up on unsecured personal workstations or transit through unsecured networks.

We recommend signing up to GitGuardian's Internal Repository Monitoring to scan your repositories. The tool is free for small teams and will allow you to scan your git history for secrets while continuously scanning incremental changes as well.

2. Establish best practices for sharing sensitive information

For a helpful guide on best practices for securing and sharing API keys, you can check out a GitHub cheatsheet here. It’s a great document to share with your staff to make sure they are using the best practices.

3. Know how to remediate compromised credentials

Make sure the steps to revoke a credential for each external service are documented, along with the impact that revoking each credential will have. You can find an article about revoking and removing compromised credentials here.

4. Invest in multi-factor authentication on all services

Two-factor authentication (2FA) drastically improves the security of the services and assets you want to protect. Make sure 2FA is set up on all “seed” accounts like company emails. (A seed account is an account that can grant access to other services. Email and social media accounts are examples of seed accounts, as in “log in with Gmail”.)

5. Work towards implementing a Zero Trust framework

Depending on the organization it may be appropriate to implement a Zero Trust framework beyond multi-factor authentication. Complete implementation is always a process and a considerable investment but introducing additional Zero Trust measures like SSO and introducing policies to block/limit/approve access decisions to enterprise resources can strengthen security significantly, particularly for remote teams.

6. Encourage a culture of self education

Providing clear and basic information, including how employees can protect their devices, will help you and your employees stay ahead of threats.

Remote workers have access to data, information, and your network. This increases the temptation for bad actors. Warn your employees to expect more phishing attempts, including targeted spear phishing aimed at high profile credentials. Now is a good time to be diligent: watch out for urgent requests that break company policy, use emotive language or have details that are slightly wrong, and provide guidance on where to report suspicious messages.

7. Practice good security hygiene

Good security practices always go back to basics. Practicing good security hygiene means rotating vulnerable access credentials regularly and restricting the permissions associated with your keys through role-based access controls.

8. Be prepared for a breach

Regardless of whether your team is working remotely or not, it is important to have an action plan in place in case secrets get shared. Make sure the whole team understands what that plan is and implement a culture of open, transparent communication.

We are certainly facing difficult times that are bringing new challenges at record speed. It is unfortunate that while we are adapting, there are people out there who are using this as an opportunity to exploit us. GitGuardian is, and always will be, free for developers and small teams, so if you are working with a remote team, consider implementing the free tools GitGuardian offers to help you continue thriving during the times ahead.

👉Try GitGuardian | automated secret detection

Free: $0/month
Monitor your public activity anywhere on GitHub
Real-time alerting by email

article originally published on GitGuardian's blog: 8 Steps to keep remote development teams secure

Exposing secrets on GitHub: What to do after leaking credentials and API keys

Jean — Wed, 25 Mar 2020 14:00:10 +0000

As a developer, if you have discovered that you have just exposed a sensitive file or secrets to a public git repository, there are some very important steps to follow.

What is a secret? In this document, when we use the term secret, we are referring to anything that is used to authenticate or authorize ourselves, most commonly API keys, database credentials or security certificates.

First, breathe: in most circumstances, if you follow this guide carefully, it will only take a few minutes to nullify most of the potential damage. This post will go through the four steps needed to remove the risk and make sure it doesn't happen again.

1. Revoke the secret or credentials
2. (Optional) Permanently delete all evidence of the leak
3. Check access logs for intruders
4. Implement future tools and best practices

If I delete the file or repository then I’m safe, right?

Unfortunately, no. If you make the repository private or delete the files, you can reduce the risk of someone new discovering your leak, but the reality is that your files will very likely still exist for those who know where to look. Git keeps a history of everything you do, so even if you delete the file, it will still exist in your git history. Even if you make the repository private, delete your history or delete the entire repository, your secrets are still compromised.

There are easy ways to monitor public git repositories, for instance, GitHub has a public API where you can monitor every single git commit that is made. That means that anyone can (and they do) monitor this API to find credentials and sensitive information within repositories. For this reason it is best to assume that if you have leaked a secret, it is compromised forever.

Leaking secrets onto GitHub and then removing them is like accidentally posting an embarrassing tweet, deleting it and just hoping no one saw it or took a screenshot.

Step 1. Revoke the secret and remove the risk

The first thing we need to do is make sure that the secret you have exposed is no longer active so no one can exploit it.

If you need specific advice on how to revoke keys you can see instructions within our API Best Practices Document.

If this key belongs to an organization you work for, then it is important to speak to the senior developers at your organization. It can be scary to let your company know you have leaked sensitive information, particularly if it has happened on a personal repository. But honesty is the best approach: the company may already have discovered the leak, and mistakes can be forgiven if the problem is resolved and you show that you genuinely care.

Don’t hope that the problem will go away - mistakes happen and it is best to simply be honest and upfront.

Did you know?
Slack keys are among the rare API tokens that have the ability to autorevoke themselves! As simple as using the auth.revoke endpoint!
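
As a rough illustration of the tip above, the sketch below prepares a call to Slack's auth.revoke Web API method using only the Python standard library. The helper names build_revoke_request and revoke are made up for this example, and passing the token as a Bearer header is one of the ways Slack accepts it; check Slack's API documentation before relying on the details.

```python
import json
import urllib.request

SLACK_REVOKE_URL = "https://slack.com/api/auth.revoke"


def build_revoke_request(token):
    """Prepare (but do not send) a POST request asking Slack to revoke `token`."""
    return urllib.request.Request(
        SLACK_REVOKE_URL,
        data=b"",  # empty body; the token travels in the Authorization header
        headers={"Authorization": f"Bearer {token}"},
        method="POST",
    )


def revoke(token):
    """Send the request and return Slack's decoded JSON response."""
    with urllib.request.urlopen(build_revoke_request(token), timeout=10) as resp:
        return json.load(resp)
```

To use it, read the leaked token from an environment variable (never hardcode it) and pass it to revoke(). Keep in mind that a successful call is irreversible: the token stops working for every integration that uses it.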

Step 2. (Optional) Get rid of the evidence

Once the secret has been revoked, it cannot be used anymore. Nonetheless, leaving any credential in your repository, even a revoked one, could look unprofessional and raise concerns. Additionally, there are some secrets that cannot be revoked (for example database records), and credentials that no one can guarantee were properly revoked (for example, SSH keys that can be used in many different places). So we will go through the steps to remove the secret from your history. Please note that this is not a trivial task, and it is advisable to seek the guidance of a senior developer.

a. Either delete your repository or make it private

It is often a good idea to buy yourself some time first: navigate to your GitHub repository then click "Settings".

Then scroll all the way down to the "Danger Zone" and click "Make Private" to hide the repository from the public.

Note: alternatively, you may wish to make a backup and then click on "Delete this repository". You can push the repository back later.

b. Rewrite git history

⚠️ Warning: before jumping in, be advised that rewriting git history is not a trivial act, especially if there are many developers contributing to your repository. You will either have to completely delete the repository and then push the cleaned version back, or git push --force to your initial repository. In either case, you’ll completely break other contributing developers’ workflow.

We are going to use the well-known BFG Repo-Cleaner

The BFG is a simpler, faster (10 - 720x faster) alternative to git-filter-branch for cleansing bad data out of your Git repository

Let's suppose you committed a sensitive file called config.py and this contains a secret key.

i. Make sure you have Java installed.

ii. Clone your repository

git clone https://github.com/YOUR-USER-NAME/YOUR-REPOSITORY.git

iii. Delete the sensitive file

The very latest commit on your current branch is protected by BFG, so you have to make sure it is clean. Delete "config.py" and commit your changes.

Branches other than the current one are not protected, so if "config.py" can be found on other branches, it will be cleaned by BFG.

git rm config.py
git commit -m "clean commit"

iv. Run BFG

Download the latest version of BFG from its website, place the .jar file in the directory that contains your repository folder, and run the command below from that directory.

Note: replace bfg-VERSION with the version you downloaded (for example, bfg-1.13.0).

java -jar bfg-VERSION.jar YOUR-REPOSITORY/.git --delete-files "config.py"

vi. Check your history.

You can use the git log -p command to show the difference (called a "patch") introduced in each commit. If you navigate through your different branches, you should no longer find config.py or its contents.

git log -p
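
Paging through git log -p on every branch can be tedious. As a minimal sketch (the function names are made up for this example), the Python snippet below asks git for every commit, reachable from any branch or tag, that still touches a given path. An empty result suggests the file is gone from the reachable history; it does not inspect unreachable objects or other people's clones.

```python
import subprocess


def parse_commit_hashes(log_output):
    """Turn the output of `git log --format=%H` into a list of commit hashes."""
    return [line.strip() for line in log_output.splitlines() if line.strip()]


def commits_touching(path, repo_dir="."):
    """Return hashes of all commits reachable from any ref that touch `path`."""
    result = subprocess.run(
        ["git", "log", "--all", "--format=%H", "--", path],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if git fails, e.g. repo_dir is not a repository
    )
    return parse_commit_hashes(result.stdout)
```

For the example in this guide you would call commits_touching("config.py", "YOUR-REPOSITORY") and expect an empty list once the cleanup has worked.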

vii. Push your repository back

Create a new repository and push your cleaned version to it. Make sure everybody deletes their old clones and uses your new version.

Did you know?
Even though git prevents you from overwriting the central repository’s history by rejecting non-fast-forward push requests, you can still push changes using the --force flag. This forces your remote branch to match your local one. Be careful though, this is a dangerous command!

Step 3. Check your access logs!

How important this step is depends on the keys that were leaked. Sometimes one leaked access key creates a domino effect and leads to the exposure of new secrets. For example, an access key to Slack may give a bad actor access to messages containing new credentials and access codes, so it is very important to make sure that there has been no suspicious activity!

How you check access logs depends on the type of credential that was leaked. For example, AWS logs are sometimes centralized in CloudWatch, and Slack has a dedicated API endpoint that allows access to audit logs. This is probably a good moment to reach out to your SRE or Application Security team to make sure everything’s fine!

What now?

So the credential has been revoked and the repository's history has been cleaned. What should you do now? Now that you have had a good scare, it is a good time to start implementing some good practices.

1. Get Protected with GitGuardian

GitGuardian is a "good guy" service that scans every single commit to public GitHub repositories in real time for leaks.

Sign up to GitGuardian

2. Review API Best Practices

To better protect your secrets in the future we advise that you look at our API Best practice guide. It has lots of helpful tips on how to make sure you don't accidentally leak secrets in the future.

Author's note: this post was originally written on GitGuardian's blog.

Automatically detect secrets in your internal repos

Jean — Wed, 19 Feb 2020 10:44:19 +0000

At GitGuardian, we’ve been monitoring every single commit pushed to public GitHub since July 2017. 2.5 years later, we’ve sent over 500k alerts to developers.

API keys, database connection strings, private keys, certificates, usernames and passwords, … As organizations embrace the power of cloud architectures, SaaS integrations and microservices, developers handle increasing amounts of sensitive information, more than ever before.

To add to that, companies are pushing for shorter release cycles to keep up with the competition, developers have many technologies to master, and the complexity of enforcing good security practices increases with the size of the organization, the number of repositories, the number of developer teams and their geographies…

As a result, secrets are spreading across organizations, particularly within the source code. The problem is so widespread that it even has a name: “secret sprawl”.

After months of product iteration with security teams and developers, we’re now proud to officially introduce GitGuardian for internal repositories!

Credentials in private repositories: how much should you care?

Storing secrets in version control systems (VCSs) is common practice today, yet VCSs are not a suitable place to store secrets for the following reasons:

Everyone who has access to the source code has access to the secrets it contains. This often includes too many developers. It would just take a single compromised developer’s account to compromise all the secrets they have access to!
You never know where your source code is going to end up. Because of the very nature of the git protocol, versioned code is made to be cloned in multiple places. It could end up on a compromised workstation, be inadvertently exposed on public GitHub, or released to customers.

Storing secrets in source code is a bit like storing unencrypted credit card numbers, or usernames and passwords in a Google Doc shared within the organization: good friends would not let you do this!

As a developer or security professional, what should I do after a secret has been pushed to centralized version control?

Every time I see a secret pushed to the git server, I consider it compromised...From one developer to another :)

When a secret reaches centralized version control, it is always a good practice to revoke it. At this point, depending on the size of your organization, remediating is often a shared responsibility between Development, Operations and Application Security teams.

Indeed, you might need some special rights and approval to revoke the secret, some secrets might be harder to revoke than others, plus you must make sure that the secret is properly rotated and redistributed without impacting your running systems.

Apart from that, depending on your organization’s policies, you might want to clean your git history as well. This will require a ‘git push --force’, which comes with some risks as well, so there is definitely a tradeoff to consider, with no correct answer!

(Hint: if your secret is buried deep in your code, BFG Repo-Cleaner is a great Open Source project to help you get rid of it without having to use the intimidating ‘git-filter-branch’ command. Plus it is in Scala! We have Roberto Tyley to thank for this.)

When should I do secret detection?

With the nature of git comes a unique challenge: whereas most security vulnerabilities only have the potential to express themselves in the current (and deployed) version of your source code, old commits can contain valid secrets, including secrets that were later deleted but went unnoticed during code reviews.

First, you want to make sure that you start on a clean basis by scanning existing code repositories in depth.

Then, you want to continuously scan all incremental changes, i.e. every new commit in every branch of every repository.

When to do incremental scanning?

In his presentation about “Improving your Security Posture with the Cloud”, Sébastien Stormacq, Developer Evangelist @ AWS, advocates implementing security checks post-event in every case, and pre-event when possible.

We at GitGuardian share Sébastien's views. You should always implement automated secrets detection server side, in your CI/CD for example or via a native integration with GitHub / GitLab / Bitbucket repositories. Also, it is good to encourage your fellow developers to implement pre-commit hooks, but we often hear that this is hardly scalable across an entire organization.
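
To make the idea of a pre-commit check concrete, here is a deliberately naive Python sketch that scans text for strings shaped like AWS access key IDs. The single regular expression and the names find_suspected_secrets and main are assumptions for illustration only. This is not GitGuardian's detection logic, and real detectors cover many more secret types.

```python
import re

# Illustrative pattern only: AWS access key IDs are commonly "AKIA" followed by
# 16 uppercase letters or digits. Real scanners use many more detectors.
AWS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_suspected_secrets(text):
    """Return (line_number, match) pairs for every suspicious string in `text`."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in AWS_KEY_ID.findall(line):
            findings.append((number, match))
    return findings


def main(paths):
    """Scan the given files; return 1 if anything suspicious is found, else 0."""
    exit_code = 0
    for path in paths:
        with open(path, encoding="utf-8", errors="ignore") as handle:
            for number, match in find_suspected_secrets(handle.read()):
                print(f"{path}:{number}: possible AWS key ID {match}")
                exit_code = 1
    return exit_code
```

Git aborts a commit when its pre-commit hook exits with a non-zero status, so a hook could pass the staged file names to main() and exit with its return value. The same function could run server-side in CI, which is where the check is harder to skip.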

Try it out!

Our product will allow you to scan existing code as well as incremental changes, and benefit from secrets detection algorithms that were battle-tested at scale on the whole public GitHub activity for over two years! GitGuardian has a native integration with GitHub (GitLab and Bitbucket coming soon), and there is an on-prem version available.

We offer a free version of our solution for individual developers and Open Source organizations, as well as a free trial for companies that you can access in SaaS here:
https://dashboard.gitguardian.com/auth/signup.