<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ferdinand Boas</title>
    <description>The latest articles on DEV Community by Ferdinand Boas (@ferdi05).</description>
    <link>https://dev.to/ferdi05</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F536105%2Fdf9be946-439a-403c-a358-dec5b108114d.jpeg</url>
      <title>DEV Community: Ferdinand Boas</title>
      <link>https://dev.to/ferdi05</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ferdi05"/>
    <language>en</language>
    <item>
      <title>Comparing Secrets Detection Solutions? Here’s Why You Should Use the F1 Score</title>
      <dc:creator>Ferdinand Boas</dc:creator>
      <pubDate>Mon, 03 Feb 2025 15:25:54 +0000</pubDate>
      <link>https://dev.to/gitguardian/comparing-secrets-detection-solutions-heres-why-you-should-use-the-f1-score-2ndi</link>
      <guid>https://dev.to/gitguardian/comparing-secrets-detection-solutions-heres-why-you-should-use-the-f1-score-2ndi</guid>
      <description>&lt;p&gt;As organizations increasingly adopt DevOps practices, the need for reliable secrets detection solutions has never been greater. However, not all tools are created equal. With so many options available, how can you determine which one best answers your needs? Comparing tools can be surprisingly complex.&lt;/p&gt;

&lt;p&gt;In this post, we’ll introduce the &lt;strong&gt;F1 score&lt;/strong&gt; as the most reliable metric for evaluating their performance. By balancing &lt;a href="https://blog.gitguardian.com/precision-recall-security-zines/" rel="noopener noreferrer"&gt;recall&lt;/a&gt; (how many valid secrets a tool catches) and &lt;a href="https://blog.gitguardian.com/precision-recall-security-zines/" rel="noopener noreferrer"&gt;precision&lt;/a&gt; (how often a flagged secret is actually valid), the F1 score provides a comprehensive and fair way to assess the effectiveness of secrets detection solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Comparing Secrets Detection Solutions is Hard
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://blog.gitguardian.com/secrets-detection-accuracy-precision-recall-explained/" rel="noopener noreferrer"&gt;When evaluating secrets detection solutions&lt;/a&gt;, two core metrics play a pivotal role:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Recall&lt;/strong&gt;: Measures how many valid secrets a tool successfully detects out of the total secrets that exist. A tool with high recall is less likely to miss exposed secrets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Precision&lt;/strong&gt;: Indicates how often a detected secret is actually valid. High precision means fewer false positives.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F90h7dk0bapuefx7itgmb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F90h7dk0bapuefx7itgmb.png" alt="Recall and Precision role in secrets detection" width="800" height="341"&gt;&lt;/a&gt;&lt;br&gt;Recall and Precision role in secrets detection.
  &lt;/p&gt;

&lt;p&gt;Achieving high recall without sacrificing precision is notoriously tricky. Tools with high recall but poor precision risk overwhelming users with false positives, creating inefficiencies, and eroding trust in the solution. On the other hand, tools with high precision but low recall might miss critical secrets, leaving organizations vulnerable to breaches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finding a solution that strikes the right balance between these two metrics is essential for effective secrets detection.&lt;/strong&gt;&lt;/p&gt;
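&lt;p&gt;In code, both metrics reduce to simple ratios over true positives (TP), false positives (FP), and false negatives (FN). Here is a minimal sketch (the helper names are illustrative, not from any particular library):&lt;/p&gt;

```python
def precision(tp: int, fp: int) -> float:
    # Share of flagged secrets that are actually valid.
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    # Share of the valid secrets in the dataset that were flagged.
    return tp / (tp + fn)

# A tool that flags 90 valid secrets and 10 invalid ones, while missing 135:
print(precision(90, 10))  # 0.9
print(recall(90, 135))    # 0.4
```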

&lt;h2&gt;
  
  
  Introducing the F1 Score
&lt;/h2&gt;

&lt;p&gt;Imagine a secrets detection solution with &lt;strong&gt;precision = 90%&lt;/strong&gt; and &lt;strong&gt;recall = 40%&lt;/strong&gt;. If we calculate the arithmetic mean of these two metrics to evaluate the overall performance of this tool, we get:

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Precision+Recall2=90+402=65\frac{\text{Precision} + \text{Recall}}{2} = \frac{90 + 40}{2} = 65% &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Precision&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Recall&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span 
class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;90&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;40&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;65&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
At first glance, this 65% score seems reasonable. However, it &lt;strong&gt;overemphasizes the higher metric (precision)&lt;/strong&gt; and &lt;strong&gt;ignores the gap in recall&lt;/strong&gt;, which reflects missed critical secrets.

&lt;p&gt;Other types of means exist. The harmonic mean, unlike the simple arithmetic mean, is particularly useful when metrics are imbalanced, such as when precision and recall differ significantly. Here, it would be:
&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Harmonic Mean=2×(Precision×Recall)Precision+Recall=2×(90×40)90+40=55.38%\text{Harmonic Mean} = \frac{2 \times (\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}} = \frac{2 \times (90 \times 40)}{90 + 40} = 55.38\% &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Harmonic Mean&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Precision&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Recall&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Precision&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span 
class="mord"&gt;Recall&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;90&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;40&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;90&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;40&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span 
class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;55.38%&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
This 55.38% score tells a very different story: you probably wouldn’t trust this solution.

&lt;p&gt;This score highlights the real-world challenge: &lt;strong&gt;high performance in one area cannot compensate for poor performance in another&lt;/strong&gt;.&lt;br&gt;
 &lt;br&gt;
That’s exactly how the F1 score was built. The F1 score is the &lt;strong&gt;harmonic mean of precision and recall&lt;/strong&gt;, widely used when evaluating predictive algorithms.
&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;F1=2×(Precision×Recall)Precision+Recall\text{F1} = \frac{2 \times (\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}} &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;F1&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Precision&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Recall&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Precision&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Recall&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span 
class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
By penalizing extreme differences between recall and precision, the F1 score allows organizations to compare solutions &lt;strong&gt;fairly&lt;/strong&gt; without overemphasizing one parameter at the expense of the other.
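&lt;p&gt;The gap between the two means is easy to verify numerically. A short sketch, reusing the 90%/40% example above (percentages handled as plain numbers):&lt;/p&gt;

```python
def arithmetic_mean(precision: float, recall: float) -> float:
    # Rewards the stronger metric and hides the weaker one.
    return (precision + recall) / 2

def f1_score(precision: float, recall: float) -> float:
    # Harmonic mean: collapses toward the weaker of the two metrics.
    return 2 * precision * recall / (precision + recall)

print(arithmetic_mean(90, 40))     # 65.0
print(round(f1_score(90, 40), 2))  # 55.38
```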

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt4wdz4nl2tdpr2rypov.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt4wdz4nl2tdpr2rypov.png" alt="Good F1 scores are in red, bad ones in blue. Being in the red zone requires both good precision and recall. The F1 score of the evaluated tool is plotted here, among the green scores, far away from the red scores" width="741" height="683"&gt;&lt;/a&gt;&lt;br&gt;Good F1 scores are in red, bad ones in blue. Being in the red zone requires both good precision and recall. The F1 score of the evaluated tool is plotted here, among the green scores, far away from the red scores.
  &lt;/p&gt;

&lt;p&gt;By balancing the risks of &lt;strong&gt;missed secrets&lt;/strong&gt; and &lt;strong&gt;team overload&lt;/strong&gt;, the F1 score provides a practical way to rank tools based on their overall performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  When You Might Prioritize Recall or Precision Over F1
&lt;/h2&gt;

&lt;p&gt;There are cases where one metric matters more than their combination:&lt;/p&gt;

&lt;h3&gt;
  
  
  Focusing on Recall
&lt;/h3&gt;

&lt;p&gt;In high-stakes scenarios like post-breach investigations or audits, missing even one secret can be catastrophic. Maximizing recall ensures every potential secret is flagged, even if it results in more false positives that require investigation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Focusing on Precision
&lt;/h3&gt;

&lt;p&gt;When reducing false positives is critical—such as in a SOC handling numerous alerts or workflows that automatically generate Jira tickets upon incident detection—high precision helps prevent alert fatigue and streamlines operations. Slightly lower recall is acceptable if it means focusing on actionable incidents without overwhelming teams.&lt;/p&gt;

&lt;p&gt;Ultimately, your priorities—avoiding noise or catching every secret—should guide whether precision or recall takes precedence. For most mature enterprises, both high recall and precision are required in the long run, &lt;strong&gt;meaning a high F1 score&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Compare F1 Scores and Performance of Solutions
&lt;/h2&gt;

&lt;p&gt;Evaluating the performance of secret detection solutions is inherently challenging because there is no perfect, hand-labeled dataset of secrets available for benchmarking. The main reasons include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data Sensitivity&lt;/strong&gt; – No organization will publicly share a dataset containing real secrets, as doing so would be a major security risk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dataset Limitations&lt;/strong&gt; – Any publicly available dataset will either be artificially generated or fail to replicate the real-world conditions of an organization.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As a result, benchmarks must rely on actual datasets from real-world scenarios. This introduces a fundamental limitation: we cannot determine the absolute recall of a tool because the total number of real secrets in the dataset is unknown. Labeling datasets at a scale large enough to perform unbiased comparison would be prohibitively time-consuming and resource-intensive.&lt;/p&gt;

&lt;p&gt;While this approach does not provide an absolute measure of performance, it allows for meaningful comparisons between solutions by focusing on their effectiveness in detecting real secrets under the same conditions. When a solution detects secrets, we can validate them by attempting to use them (ethically and in controlled environments) to confirm whether they work or, if feasible, by checking with the platform or service provider to confirm their validity.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Compute the F1 Score: An Example
&lt;/h2&gt;

&lt;p&gt;Let’s take an example of four different secret detection solutions evaluated on the same dataset of real-world data. From this evaluation, you can gather two key pieces of information:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The total number of secrets detected by each solution (their findings).&lt;/li&gt;
&lt;li&gt;The number of valid secrets detected (true positives), after testing and validating each finding.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here are the results from this experiment.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Solution&lt;/th&gt;
      &lt;th&gt;Total of secrets detected&lt;/th&gt;
      &lt;th&gt;Number of valid secrets detected (True Positives)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;b&gt;Solution A&lt;/b&gt;&lt;/td&gt;
      &lt;td&gt;3,000&lt;/td&gt;
      &lt;td&gt;2,700&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;b&gt;Solution B&lt;/b&gt;&lt;/td&gt;
      &lt;td&gt;9,000&lt;/td&gt;
      &lt;td&gt;6,500&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;b&gt;Solution C&lt;/b&gt;&lt;/td&gt;
      &lt;td&gt;7,000&lt;/td&gt;
      &lt;td&gt;6,000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;b&gt;Solution D&lt;/b&gt;&lt;/td&gt;
      &lt;td&gt;5,000&lt;/td&gt;
      &lt;td&gt;4,600&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 1: Calculating Precision
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Precision&lt;/strong&gt; measures how many of the detected secrets are valid. It is calculated using the formula  
&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Precision=True PositivesTrue Positives+False Positives\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Precision&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;True Positives&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;False Positives&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;True Positives&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
In other words, it’s the ratio of valid secrets to the total secrets detected by the tool.

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Solution&lt;/th&gt;
      &lt;th&gt;Number of valid secrets detected (True Positives)&lt;/th&gt;
      &lt;th&gt;Number of invalid secrets detected (False Positives)&lt;/th&gt;
      &lt;th&gt;Precision&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;b&gt;Solution A&lt;/b&gt;&lt;/td&gt;
      &lt;td&gt;2,700&lt;/td&gt;
      &lt;td&gt;300&lt;/td&gt;
      &lt;td&gt;90.0%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;b&gt;Solution B&lt;/b&gt;&lt;/td&gt;
      &lt;td&gt;6,500&lt;/td&gt;
      &lt;td&gt;2,500&lt;/td&gt;
      &lt;td&gt;72.2%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;b&gt;Solution C&lt;/b&gt;&lt;/td&gt;
      &lt;td&gt;6,000&lt;/td&gt;
      &lt;td&gt;1,000&lt;/td&gt;
      &lt;td&gt;85.7%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;b&gt;Solution D&lt;/b&gt;&lt;/td&gt;
      &lt;td&gt;4,600&lt;/td&gt;
      &lt;td&gt;400&lt;/td&gt;
      &lt;td&gt;&lt;b&gt;92.0%&lt;/b&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If we look at precision alone, Solution D appears to be the best, with 92% of its findings being valid. However, it also misses many valid secrets that were detected by Solutions B and C. That’s the reason why you need to consider their &lt;strong&gt;recall&lt;/strong&gt; as well.&lt;/p&gt;
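&lt;p&gt;The precision column can be reproduced directly from the raw counts in the first table. A quick sketch:&lt;/p&gt;

```python
# (total secrets detected, valid secrets detected) per solution
detections = {
    "A": (3000, 2700),
    "B": (9000, 6500),
    "C": (7000, 6000),
    "D": (5000, 4600),
}

for name, (total, valid) in detections.items():
    false_positives = total - valid
    precision = valid / total  # equivalently TP / (TP + FP)
    print(f"Solution {name}: {false_positives} false positives, precision {precision:.1%}")
```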

&lt;h3&gt;
  
  
  Step 2: Calculating Recall
&lt;/h3&gt;

&lt;p&gt;Recall measures how many of the actual secrets in the dataset are detected. It is defined by 
&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Recall=True PositivesTrue Positives+False Negatives\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Recall&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;True Positives&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;False Negatives&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;True Positives&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
It’s the number of valid secrets detected divided by the total number of actual secrets in the dataset.

&lt;p&gt;Here, we face a challenge: determining the &lt;strong&gt;total number of actual secrets&lt;/strong&gt; in datasets large enough for meaningful comparisons is prohibitively expensive. This number remains unknown in real-world scenarios because no hand-labeled dataset fully replicates actual data. So, how do we calculate recall?&lt;/p&gt;

&lt;p&gt;Let’s look at the distribution of secrets.&lt;br&gt;
When secret detection solutions analyze a dataset, they do not detect the identical set of secrets. Each solution has different detection capabilities, leading to variations in their findings. Some secrets may be detected by multiple solutions, while others might be unique to a single tool. This overlap, and the differences between findings, are key to understanding the dataset's composition.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fip4r30xxd5qk2kr52zy3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fip4r30xxd5qk2kr52zy3.png" alt="Example distribution of secrets detected by various solutions benchmarked" width="800" height="691"&gt;&lt;/a&gt;&lt;br&gt;Example distribution of secrets detected by various solutions benchmarked.
  &lt;/p&gt;

&lt;p&gt;Since we do not have a ground-truth dataset with pre-labeled secrets, the best assumption we can make is that the total number of actual secrets in the dataset corresponds to the &lt;strong&gt;union of all valid secrets detected across all solutions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this example, adding up every region of the distribution gives the total number of findings:
&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Total Findings=100+2700+400+600+300+4000+2000=10,100\text{Total Findings} = 100 + 2700 + 400 + 600 + 300 + 4000 + 2000 = 10,100 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Total Findings&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;100&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;2700&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;400&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;600&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;300&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;4000&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span 
class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;2000&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;10&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;100&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
To determine the actual number of secrets in the dataset, all detected findings need to be validated. While this approach is an approximation, it is the most reliable method available. Even when combining the findings from all solutions, some secrets may still be missed. However, this approximation provides a solid baseline for &lt;strong&gt;fairly comparing&lt;/strong&gt; the performance of each solution relative to the others.
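&lt;p&gt;On a benchmark you control, this union is a plain set operation over each tool’s validated findings. A toy sketch with made-up secret identifiers:&lt;/p&gt;

```python
# Validated (true-positive) findings per solution, as secret identifiers.
found_by = {
    "A": {"s1", "s2", "s3"},
    "B": {"s2", "s3", "s4", "s5"},
    "C": {"s1", "s4", "s6"},
}

# Best available estimate of the actual secrets in the dataset:
estimated_total = set().union(*found_by.values())
print(len(estimated_total))  # 6 distinct secrets

# Recall of each solution against that estimate:
for name, secrets in found_by.items():
    print(name, round(len(secrets) / len(estimated_total), 2))
```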

&lt;p&gt;For this example, we assume the dataset contains &lt;strong&gt;8,000 actual secrets&lt;/strong&gt;—a reasonable midpoint based on validation results.&lt;/p&gt;

&lt;p&gt;Interestingly, Solution B reported &lt;strong&gt;9,000 findings&lt;/strong&gt;, more than our estimated total of actual secrets. This suggests that Solution B’s detectors &lt;strong&gt;over-identified&lt;/strong&gt; secrets, likely misclassifying some non-sensitive data as secrets, generating even more false positives.&lt;/p&gt;

&lt;p&gt;Here’s the recall calculation for each solution.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Solution&lt;/th&gt;
      &lt;th&gt;Number of valid secrets detected (True Positives)&lt;/th&gt;
      &lt;th&gt;Number of valid secrets NOT detected (False Negatives)&lt;/th&gt;
      &lt;th&gt;Recall&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;b&gt;Solution A&lt;/b&gt;&lt;/td&gt;
      &lt;td&gt;2,700&lt;/td&gt;
      &lt;td&gt;5,300&lt;/td&gt;
      &lt;td&gt;33.8%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;b&gt;Solution B&lt;/b&gt;&lt;/td&gt;
      &lt;td&gt;6,500&lt;/td&gt;
      &lt;td&gt;1,500&lt;/td&gt;
      &lt;td&gt;&lt;b&gt;81.3%&lt;/b&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;b&gt;Solution C&lt;/b&gt;&lt;/td&gt;
      &lt;td&gt;6,000&lt;/td&gt;
      &lt;td&gt;2,000&lt;/td&gt;
      &lt;td&gt;75.0%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;b&gt;Solution D&lt;/b&gt;&lt;/td&gt;
      &lt;td&gt;4,600&lt;/td&gt;
      &lt;td&gt;3,400&lt;/td&gt;
      &lt;td&gt;57.5%&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If we look at recall alone, Solution B is the top performer, detecting 81.3% of the actual secrets. However, it comes with a drawback: &lt;strong&gt;2,500 false positives&lt;/strong&gt;, which could overwhelm security teams with invalid alerts.&lt;/p&gt;
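&lt;p&gt;The recall column can be reproduced with a short Python sketch. The true-positive and false-negative counts below are taken straight from the example table; nothing else is assumed:&lt;/p&gt;

```python
# Recall = TP / (TP + FN): the share of actual secrets a tool detects.
# Counts taken from the example table (8,000 actual secrets in total).
solutions = {
    "A": {"tp": 2700, "fn": 5300},
    "B": {"tp": 6500, "fn": 1500},
    "C": {"tp": 6000, "fn": 2000},
    "D": {"tp": 4600, "fn": 3400},
}

for name, s in solutions.items():
    recall = s["tp"] / (s["tp"] + s["fn"])
    print(f"Solution {name}: recall = {recall:.1%}")
```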

&lt;h3&gt;
  
  
  Step 3: Calculating the F1 Score
&lt;/h3&gt;

&lt;p&gt;Now, let’s compute the respective F1 score of all solutions.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Solution&lt;/th&gt;
      &lt;th&gt;Recall&lt;/th&gt;
      &lt;th&gt;Precision&lt;/th&gt;
      &lt;th&gt;F1 Score&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;b&gt;Solution A&lt;/b&gt;&lt;/td&gt;
      &lt;td&gt;33.8%&lt;/td&gt;
      &lt;td&gt;90.0%&lt;/td&gt;
      &lt;td&gt;49.1%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;b&gt;Solution B&lt;/b&gt;&lt;/td&gt;
      &lt;td&gt;81.3%&lt;/td&gt;
      &lt;td&gt;72.2%&lt;/td&gt;
      &lt;td&gt;76.5%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;b&gt;Solution C&lt;/b&gt;&lt;/td&gt;
      &lt;td&gt;75.0%&lt;/td&gt;
      &lt;td&gt;85.7%&lt;/td&gt;
      &lt;td&gt;&lt;b&gt;80.0%&lt;/b&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;b&gt;Solution D&lt;/b&gt;&lt;/td&gt;
      &lt;td&gt;57.5%&lt;/td&gt;
      &lt;td&gt;92.0%&lt;/td&gt;
      &lt;td&gt;70.8%&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
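&lt;p&gt;As a sanity check, the F1 column can be recomputed from the precision and recall columns: the F1 score is the harmonic mean of the two. A minimal Python sketch, with the precision/recall pairs taken from the tables above:&lt;/p&gt;

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision/recall pairs from the example tables.
scores = {
    "A": (0.900, 0.3375),
    "B": (0.722, 0.8125),
    "C": (0.857, 0.750),
    "D": (0.920, 0.575),
}
for name, (p, r) in scores.items():
    print(f"Solution {name}: F1 = {f1(p, r):.1%}")  # Solution C comes out on top
```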

&lt;p&gt;&lt;strong&gt;Solution C&lt;/strong&gt; has the highest F1 score in this example, making it the best overall solution. Although it doesn’t have the highest recall or precision, it strikes the best balance between the two, detecting many valid secrets while keeping false positives relatively low.&lt;/p&gt;

&lt;p&gt;This demonstrates the importance of looking beyond a single metric (precision or recall) and considering the F1 score when comparing secrets detection solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding Some Nuance: Why Some Secrets Matter More
&lt;/h2&gt;

&lt;p&gt;Not all secrets carry the same level of risk. &lt;strong&gt;Cloud provider credentials&lt;/strong&gt; and &lt;strong&gt;identity service secrets&lt;/strong&gt; are among the most critical for enterprises and other organizations, as they can grant attackers broad access to highly sensitive systems. In contrast, an &lt;strong&gt;API key for a music streaming service&lt;/strong&gt; may pose minimal risk to an organization.&lt;/p&gt;

&lt;p&gt;To assess a tool’s effectiveness, enterprises &lt;strong&gt;may prioritize evaluating its performance on the types of secrets that matter most to them&lt;/strong&gt;. This means focusing on categories of high-risk secrets—such as cloud credentials, database passwords, or private keys—rather than treating all detections equally.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Survive Low Precision with Efficient Prioritization
&lt;/h2&gt;

&lt;p&gt;A tool with &lt;strong&gt;poor precision&lt;/strong&gt; can still be effective if it has &lt;strong&gt;strong recall&lt;/strong&gt; and the right &lt;strong&gt;prioritization strategies&lt;/strong&gt;. Security teams, especially in large enterprises, focus on &lt;strong&gt;high-impact incidents first&lt;/strong&gt;, ensuring critical secrets are addressed before lower-priority findings.&lt;/p&gt;

&lt;h3&gt;
  
  
  Efficient Triage: Focusing on Critical Leaks
&lt;/h3&gt;

&lt;p&gt;Not all exposed secrets pose the same risk. A &lt;strong&gt;cloud provider API key&lt;/strong&gt; is far more dangerous than a test credential. &lt;strong&gt;Enterprises prioritize based on impact&lt;/strong&gt;, so a tool with high recall ensures critical leaks aren’t missed, even if some false positives need filtering.&lt;/p&gt;
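&lt;p&gt;As an illustration, a triage queue can be built by sorting findings on an impact score. The category names and weights below are purely hypothetical assumptions for the sketch, not any vendor's actual scoring model:&lt;/p&gt;

```python
# Hypothetical severity weights -- illustrative only, not a real scoring model.
SEVERITY = {
    "cloud_credential": 3,   # broad access to sensitive systems
    "database_password": 3,
    "private_key": 2,
    "test_credential": 0,    # minimal impact
}

findings = [
    {"id": 1, "category": "test_credential"},
    {"id": 2, "category": "cloud_credential"},
    {"id": 3, "category": "private_key"},
]

# Highest-impact findings first; unknown categories get a default weight of 1.
triaged = sorted(findings, key=lambda f: SEVERITY.get(f["category"], 1),
                 reverse=True)
print([f["category"] for f in triaged])
```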

&lt;h3&gt;
  
  
  Prioritization Helps But Isn’t a Silver Bullet
&lt;/h3&gt;

&lt;p&gt;Unlike provider-specific secrets, &lt;strong&gt;generic secrets lack explicit identifiers&lt;/strong&gt;, making them harder to classify automatically, so prioritization algorithms often fail to rank them correctly.&lt;/p&gt;

&lt;p&gt;Moreover, even with prioritization, poor precision leads to alert fatigue. The best defense is &lt;strong&gt;high recall with strong precision&lt;/strong&gt;—catching real secrets while minimizing noise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Looking for the Best Solution
&lt;/h2&gt;

&lt;p&gt;Choosing the best secrets detection tool isn’t just about finding the one with the highest recall or the fewest false positives—it’s about &lt;strong&gt;balancing precision and recall&lt;/strong&gt; to ensure real-world effectiveness. The &lt;strong&gt;F1 score&lt;/strong&gt; is the most reliable way to compare solutions fairly, as it accounts for both factors. The ideal solution finds the &lt;strong&gt;right trade-off&lt;/strong&gt;, maximizing true positives while minimizing noise. However, it's important to remember that even a single missed secret can lead to a breach, so solutions with poor recall should be approached with caution.&lt;/p&gt;

&lt;p&gt;It's also important to &lt;strong&gt;consider the context&lt;/strong&gt;. Organizations should tailor their evaluation based on the types of secrets that pose the highest risk to them. A solution that performs well on highly critical secrets, such as cloud access keys or authentication tokens, will provide more security than one that simply maximizes detections without differentiation.&lt;/p&gt;

&lt;p&gt;Finally, beware of tools that claim near &lt;strong&gt;100% recall&lt;/strong&gt;—such a claim &lt;a href="https://blog.gitguardian.com/should-we-target-zero-false-positives/" rel="noopener noreferrer"&gt;likely indicates a lack of real-world constraints&lt;/a&gt; and suggests the solution was not designed to perform efficiently on actual data. The best secrets detection solution isn’t the one that flags the most findings; it’s the one that helps your security team take the right actions.&lt;/p&gt;

</description>
      <category>cybersecurity</category>
      <category>security</category>
    </item>
    <item>
      <title>From False Positives to Potential Breaches: The Risks of Prematurely Closing Incidents</title>
      <dc:creator>Ferdinand Boas</dc:creator>
      <pubDate>Sun, 20 Oct 2024 17:36:33 +0000</pubDate>
      <link>https://dev.to/gitguardian/from-false-positives-to-potential-breaches-the-risks-of-prematurely-closing-incidents-hjo</link>
      <guid>https://dev.to/gitguardian/from-false-positives-to-potential-breaches-the-risks-of-prematurely-closing-incidents-hjo</guid>
      <description>&lt;p&gt;In the fast-paced world of software development, efficiently managing security incidents is crucial for maintaining a robust security posture. Automated secrets detection solutions are pivotal in identifying and alerting teams to exposed secrets within code repositories. &lt;strong&gt;Deciding whether to resolve or ignore these incidents can be challenging&lt;/strong&gt;. Ignoring an incident is reasonable when a secret is low-risk or a false positive. However, without careful consideration, &lt;strong&gt;this action can introduce significant risks, potentially leaving the organization vulnerable to security breaches&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In 2024, a publicly listed technology company experienced a breach when an attacker stole an employee's developer access token. It gave them access to a codebase containing an active credential that had been incorrectly labeled as a test credential. Thankfully, the damage was minimal: a forensic investigation revealed that the intruder only accessed a small portion of the production environment and obtained personal information on a limited number of individuals and dummy data.&lt;/p&gt;

&lt;p&gt;In this blog post, we'll explore the risks of ignoring an incident and discuss best practices to ensure such decisions do not compromise your organization's security.&lt;/p&gt;

&lt;h2&gt;
  
  
  The risks associated with ignoring an incident and premature closure
&lt;/h2&gt;

&lt;p&gt;Ignoring a secret typically occurs when the identified secret poses little or no security risk. Common reasons to ignore a secret include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The secret was used for testing purposes&lt;/strong&gt;: During development or testing phases, developers might use dummy or placeholder secrets to simulate real-world scenarios. These secrets are not meant to be used in production environments and are often considered harmless.&lt;br&gt;
&lt;strong&gt;The rationale for ignoring&lt;/strong&gt;: Since these secrets are not tied to sensitive data or production systems, they might be deemed safe to ignore.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The secret is a false positive&lt;/strong&gt;: GitGuardian may sometimes flag non-sensitive information as a secret due to pattern-matching algorithms. For instance, a string that looks like an API key or a password might actually be something entirely benign, like a configuration identifier or a comment in the code.&lt;br&gt;
&lt;strong&gt;The rationale for ignoring&lt;/strong&gt;: False positives can be ignored to reduce noise and focus on real security threats.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The secret has a low risk&lt;/strong&gt;: Not all secrets are created equal; some may have a low impact if exposed. For example, a secret might only grant access to a non-critical service or a sandbox environment with minimal risk to the organization.&lt;br&gt;
&lt;strong&gt;The rationale for ignoring&lt;/strong&gt;: If a secret's potential impact is determined to be negligible, it might be ignored to focus resources on more critical issues.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Even diligent teams can sometimes ignore or prematurely dismiss security alerts. Understanding and addressing the causes is crucial for improving incident management.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Alert fatigue&lt;/strong&gt;: GitGuardian may be just one of the numerous security tools security teams utilize. Some of these tools raise &lt;strong&gt;an overwhelming number of alerts&lt;/strong&gt;, leading to desensitization and the potential to overlook real threats.&lt;br&gt;
For instance, in 2017, &lt;a href="https://en.wikipedia.org/wiki/2017_Equifax_data_breach" rel="noopener noreferrer"&gt;Equifax overlooked some security alerts&lt;/a&gt;, resulting in the breach of around 150 million private records. After litigation, the total cost of the settlement exceeded $500 million.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lack of training&lt;/strong&gt;: Inadequate training or awareness among team members contributes to &lt;strong&gt;misclassifying incidents as low-risk or false positives&lt;/strong&gt;.&lt;br&gt;
Not all teams learn the same way; that's why GitGuardian provides continuous education and training on how to reduce these errors &lt;a href="https://blog.gitguardian.com/developer-education-gitguardian-supports-your-learning-style/" rel="noopener noreferrer"&gt;through different media&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resource constraints&lt;/strong&gt;: Limited resources—such as time, personnel, and budget—can lead to shortcuts in incident management and premature closure of incidents.&lt;br&gt;
When dealing with a pile of security debts, organizations can adopt &lt;a href="https://blog.gitguardian.com/a-practical-guide-to-prioritize-and-remediate-thousands-of-secrets-leaks-incidents/" rel="noopener noreferrer"&gt;strategies to manage these constraints without compromising security&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Human bias&lt;/strong&gt;: Cognitive biases, such as confirmation bias or optimism bias, might cause security teams to &lt;strong&gt;downplay certain alerts&lt;/strong&gt;. For example, if a team member has encountered similar alerts that turned out to be false positives in the past, they might prematurely dismiss a legitimate threat.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Out of about 300,000 ignored incidents by GitGuardian users, &lt;strong&gt;the secret was still valid in 4.7% of cases&lt;/strong&gt;. When taking into account all 650,000 closed incidents, &lt;strong&gt;this number increases to 6%&lt;/strong&gt;. This means &lt;strong&gt;tens of thousands of closed incidents could be potential vulnerabilities&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;Ignoring a security incident without thorough consideration can pose significant risks.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Overlooking emerging threats&lt;/strong&gt;: Cybersecurity threats are constantly evolving, and what might seem like a low-risk or false positive today could be &lt;strong&gt;exploited by attackers in new and unforeseen ways&lt;/strong&gt;. Teams might assume all problems are under control, with the risk of deploying compromised code into a production environment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Inconsistent security practices&lt;/strong&gt;: Regularly ignoring secrets without a standardized review process can lead to inconsistent security practices across the organization. Different &lt;strong&gt;team members might have varying thresholds&lt;/strong&gt; for what constitutes a low-risk secret. This inconsistency can create gaps in security defenses, where some risks are ignored without proper justification.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compliance and audit challenges&lt;/strong&gt;: Security audit trails are essential for maintaining an accurate historical record of security incidents and their resolutions. Failing to demonstrate thorough review and justification for ignored incidents can result in &lt;strong&gt;failed audits, regulatory fines&lt;/strong&gt;, the inability to trace the root cause of a security breach, or the need for costly remediation efforts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Missed opportunities for security improvement&lt;/strong&gt;: By ignoring incidents, organizations may miss the chance to &lt;strong&gt;refine their security processes, update playbooks&lt;/strong&gt;, or adjust scanning tools to reduce false positives, ultimately weakening overall security resilience.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Organizations should establish clear guidelines and best practices for incident review to mitigate these risks and ensure that all security incidents are carefully evaluated before being ignored or closed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Navigating incident closure in GitGuardian
&lt;/h2&gt;

&lt;p&gt;Properly resolving incidents strengthens security, while incorrectly ignoring or closing them can compromise the organization's defenses. Therefore, it is crucial to comprehend the context and consequences of each action &lt;strong&gt;to maintain a robust security posture&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.gitguardian.com/secrets-detection/remediate/remediate-incidents" rel="noopener noreferrer"&gt;When remediating an incident in GitGuardian&lt;/a&gt;, there are two ways to close an open incident:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resolve&lt;/strong&gt;: When an incident is resolved in GitGuardian, it means that the underlying security issue has been addressed, and &lt;strong&gt;the risk has been mitigated&lt;/strong&gt;. This usually occurs when an exposed secret is rotated. The resolution typically involves verifying that the issue no longer poses a threat and documenting the steps taken to fix it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ignore&lt;/strong&gt;: Ignoring an incident is appropriate when the identified issue is a known false positive or does not pose a significant security risk.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Users may mistakenly close incidents. That's why &lt;strong&gt;GitGuardian sends notifications to their team managers&lt;/strong&gt; (or the workspace manager if the user who closed the incident is not part of a team) when an incident is ignored while the secret is still valid.&lt;br&gt;
These notifications are meant to encourage &lt;strong&gt;further review and prevent incidents from being overlooked&lt;/strong&gt;, providing an extra level of protection against premature closures.&lt;/p&gt;

&lt;p&gt;Furthermore, you can &lt;strong&gt;audit closed incidents&lt;/strong&gt; using &lt;a href="https://docs.gitguardian.com/platform/collaboration-and-sharing/saved-views" rel="noopener noreferrer"&gt;Saved views&lt;/a&gt;: one of the default incident views is dedicated to closed incidents with valid secrets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13xaniz62i6tctvq1oup.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13xaniz62i6tctvq1oup.png" alt="Saved views" width="800" height="597"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What is true today may not be true tomorrow. Some services, such as AWS, may allow users to temporarily deactivate secrets.&lt;br&gt;
&lt;strong&gt;When a secret has been exposed, GitGuardian will continue to monitor its validity, even if the incident is closed&lt;/strong&gt;. An invalid secret may become valid later. If this occurs, GitGuardian will create new incidents to &lt;strong&gt;avoid unpleasant surprises&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best practices for closing incidents in GitGuardian
&lt;/h2&gt;

&lt;p&gt;GitGuardian is convinced that detection without remediation is just noise and wants to help you &lt;a href="https://blog.gitguardian.com/coordinating-and-prioritizing-remediation-with-gitguardian/" rel="noopener noreferrer"&gt;close incidents more efficiently&lt;/a&gt;. More specifically, you should follow these guidelines before ignoring or closing any incident:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Verify the incident's nature&lt;/strong&gt;: Confirm whether the incident is a real threat, a test, or a false positive.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use playbooks&lt;/strong&gt;: Consider employing a playbook to standardize the process of managing and closing incidents. GitGuardian offers a default remediation playbook and the option to create custom playbooks. A well-defined playbook ensures that all necessary steps are taken, &lt;strong&gt;reducing the risk of human error&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Team collaboration&lt;/strong&gt;: Encourage collaboration with team members and security experts when unsure how to handle an incident. Leveraging the collective knowledge of the team can lead to &lt;strong&gt;better decision-making and prevent oversights&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Assess the risk&lt;/strong&gt;: Evaluate the potential impact of the incident on the organization's security.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Implement remediation&lt;/strong&gt;: If the incident is valid, take the necessary steps to remediate the issue.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep detailed records&lt;/strong&gt;: Thoroughly document the reason for closing or ignoring an incident. This documentation is crucial for &lt;strong&gt;future reference&lt;/strong&gt;, especially during audits or similar incidents. Additionally, other team members can comprehend the rationale behind the closure if they revisit the incident later.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Double-check&lt;/strong&gt;: Before closing any incident, verify the issue has been fully resolved or correctly classified. This means confirming that exposed secrets have been rotated or ensuring that the incident's root cause has been addressed.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;While some secrets may be safely ignored, this decision should never be made lightly. &lt;strong&gt;The potential risks of ignoring genuine security threats far outweigh the convenience of reducing alert noise&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;By following best practices, you can balance efficiency and security, ensuring your codebase remains well-protected against current and emerging threats.&lt;/p&gt;

</description>
      <category>cybersecurity</category>
      <category>security</category>
    </item>
    <item>
      <title>How I almost won an NLP competition without knowing any Machine Learning</title>
      <dc:creator>Ferdinand Boas</dc:creator>
      <pubDate>Tue, 10 Aug 2021 13:51:28 +0000</pubDate>
      <link>https://dev.to/ferdi05/how-i-almost-won-an-nlp-competition-without-knowing-any-machine-learning-24la</link>
      <guid>https://dev.to/ferdi05/how-i-almost-won-an-nlp-competition-without-knowing-any-machine-learning-24la</guid>
      <description>&lt;p&gt;One of the cool things about Machine Learning is that you can see it as a competition. Your models can be evaluated with many performance indicators, and be ranked on various leaderboards. You can compete against other Machine Learning practitioners around the world, and your competitors can be a student in Malaysia or the largest AI lab at Stanford University.&lt;br&gt;
&lt;a href="https://www.kaggle.com/" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt; started as a platform to host such Machine Learning contests, and it gained a lot of attention from the data science community. The best data scientists exhibit on Kaggle their most sophisticated Machine Learning skills, craft the most elaborated models to reign over these competitions.&lt;br&gt;
Kaggle is now a broader platform, where you can enter these competitions but also learn data science, discuss it, and collaborate with fellow data scientists.&lt;/p&gt;

&lt;p&gt;Most of the Kaggle competitors are Machine Learning practitioners. Many software engineers do not enter these competitions, mostly because &lt;em&gt;they think&lt;/em&gt; that they do not have the needed skill set, tools, or time to be successful in them.&lt;/p&gt;

&lt;p&gt;Machine Learning can be hard to learn and use. It’s a very technical field.&lt;br&gt;
Running a Machine Learning project is complex: you will have to gather and clean data, choose a pre-trained model or train a model that suits your needs, fine-tune it for your curated dataset, and deploy the model in a production environment. You will also need to worry about monitoring, scalability, latency, reliability...&lt;br&gt;
This is usually a resource-intensive process, it takes time, knowledge, compute resources, and money. &lt;em&gt;This does not fit well with the regular activities of a software engineer.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At this stage, I need to point out that I am not a data scientist. &lt;br&gt;
&lt;strong&gt;You may now wonder how I ranked among the best data scientists in a Kaggle Natural Language Processing (NLP) challenge without using any Machine Learning.&lt;/strong&gt;&lt;br&gt;
This blog post explains how I successively leveraged &lt;a href="https://huggingface.co/" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt; 🤗 AutoNLP web interface and 🤗 Inference API to achieve this result.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Find all the scripts and assets used in this GitHub repository.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/ferdi05" rel="noopener noreferrer"&gt;
        ferdi05
      &lt;/a&gt; / &lt;a href="https://github.com/ferdi05/kaggle-disaster-tweet-competition" rel="noopener noreferrer"&gt;
        kaggle-disaster-tweet-competition
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Participating to a Kaggle competition without coding any Machine Learning
    &lt;/h3&gt;
  &lt;/div&gt;
&lt;/div&gt;


&lt;h2&gt;
  
  
  The Kaggle competition
&lt;/h2&gt;

&lt;p&gt;Entering a Kaggle competition is straightforward. You are asked to perform a task such as &lt;a href="https://en.wikipedia.org/wiki/Sentiment_analysis" rel="noopener noreferrer"&gt;sentiment analysis&lt;/a&gt; or &lt;a href="https://en.wikipedia.org/wiki/Object_detection" rel="noopener noreferrer"&gt;object detection&lt;/a&gt; that can be solved with Machine Learning. Kaggle provides a training dataset with examples of the task to achieve. You can use this dataset to train a Machine Learning model, then use the model to perform the same task on a test dataset (also provided by Kaggle). This is your attempt at solving the challenge. Once you submit your model's predictions for the test dataset, Kaggle evaluates them and ranks you in the competition you entered.&lt;/p&gt;

&lt;p&gt;You will find plenty of NLP competitions on the Kaggle website. I participated in the &lt;a href="https://www.kaggle.com/c/nlp-getting-started/" rel="noopener noreferrer"&gt;Natural Language Processing with Disaster Tweets&lt;/a&gt; competition as it is quite recent (7 months old at the time of writing) and has over 3,000 submissions from other teams.&lt;br&gt;
This competition challenged me to build a Machine Learning model that predicts if a tweet is about a real disaster or not.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-629010812728963072-339" src="https://platform.twitter.com/embed/Tweet.html?id=629010812728963072"&gt;
&lt;/iframe&gt;

&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This tweet is not about a real disaster&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Kaggle provides a &lt;a href="https://github.com/ferdi05/kaggle-disaster-tweet-competition/blob/main/assets/train_original.csv" rel="noopener noreferrer"&gt;training&lt;/a&gt; dataset of around 7,500 tweets (the input object) with their associated label (the desired output value). These labels tell if each tweet is about a disaster (its label is 1) or not (its label is 0). This dataset will be used to train a few Machine Learning models and evaluate them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9hjmvhmrdqlsuvhcaip.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9hjmvhmrdqlsuvhcaip.png" alt="Kaggle training dataset" width="800" height="110"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kaggle also provides a &lt;a href="https://github.com/ferdi05/kaggle-disaster-tweet-competition/blob/main/assets/test.csv" rel="noopener noreferrer"&gt;test&lt;/a&gt; dataset of around 3,200 tweets without any paired label. We will use the newly created Machine Learning model to predict if they are about a disaster, asking it to apply a label to each of these tweets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh0lb2urf4q5ppiiqk6nm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh0lb2urf4q5ppiiqk6nm.png" alt="Kaggle test dataset" width="800" height="116"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Both datasets also contain two other data columns that will not be used: a keyword and the location of the tweet.&lt;/p&gt;

&lt;h2&gt;
  
  
  🤗 AutoNLP web interface to the rescue
&lt;/h2&gt;

&lt;p&gt;The process of training a Machine Learning model is not straightforward. It requires collecting, cleaning, and formatting data, selecting a Machine Learning algorithm, tuning the algorithm's parameters, training the model, evaluating its performance, and iterating. And even this does not guarantee that performance will meet your expectations.&lt;br&gt;
This is a resource-intensive process. Fortunately, I used a web interface to do all the heavy lifting and save hours of Machine Learning-induced head-scratching.&lt;/p&gt;
&lt;h3&gt;
  
  
  What is 🤗 AutoNLP?
&lt;/h3&gt;

&lt;p&gt;Leveraging its experience with the most performant architectures of NLP, Hugging Face offers the &lt;a href="https://ui.autonlp.huggingface.co/" rel="noopener noreferrer"&gt;🤗 AutoNLP web interface&lt;/a&gt; to automatically train, evaluate and deploy state-of-the-art NLP models for different tasks. All you need to do is feed it your datasets.&lt;/p&gt;

&lt;p&gt;🤗 AutoNLP uses &lt;a href="https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/" rel="noopener noreferrer"&gt;supervised learning algorithms&lt;/a&gt; to train the candidate Machine Learning models. This means that these models will try to reproduce what they learned from examples that pair an input object and its desired output value. After their training, these models should successfully pair unseen input objects with their correct output values.&lt;/p&gt;

&lt;p&gt;🤗 AutoNLP will train a range of NLP models suitable for the task required by the competition, using a variety of configurations for each of them. Then each model’s performance will be automatically evaluated. I saved a lot of resources and money by avoiding their compute-intensive training.&lt;br&gt;
Later I selected the most performant model to make predictions for the Kaggle competition.&lt;/p&gt;
&lt;h3&gt;
  
  
  Training Machine Learning models with data only
&lt;/h3&gt;

&lt;p&gt;The competition requires labeling each tweet as related to a disaster or not. And binary text classification is one of the tasks achievable with the 🤗 AutoNLP web interface. So I started a new project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg7rgczuy00q3v18k2m85.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg7rgczuy00q3v18k2m85.png" alt="New AutoNLP project" width="600" height="724"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this competition, Kaggle provides a single labeled dataset, but you need two: one to train the models (the training dataset) and another to evaluate their performance (the validation dataset).&lt;br&gt;
I split the original Kaggle dataset in two, using the rule-of-thumb ratio of 80% for training and 20% for validation.&lt;/p&gt;
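&lt;p&gt;The split itself needs no Machine Learning either. A minimal standard-library sketch (the function name, the 80%-20% ratio, and the seed are my own choices):&lt;/p&gt;

```python
import random

def split_dataset(rows, ratio=0.8, seed=42):
    """Shuffle rows and split them into a training set and a validation set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # deterministic shuffle for reproducibility
    cut = int(len(rows) * ratio)
    return rows[:cut], rows[cut:]
```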

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7a2t14e2u2mn4w3tzrbg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7a2t14e2u2mn4w3tzrbg.png" alt="Column mapping" width="800" height="613"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The columns of both datasets need to be mapped. The &lt;em&gt;text&lt;/em&gt; column is the input object and the &lt;em&gt;target&lt;/em&gt; column is the desired output value. Here the input object is the tweet content, and the output value is its associated label.&lt;/p&gt;

&lt;p&gt;Then the web interface started the training and did its magic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd30g1p6jgyvrpv3c7bx7.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd30g1p6jgyvrpv3c7bx7.gif" alt="training models with AutoNLP" width="600" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After a few minutes, the models were trained, evaluated, and uploaded to the &lt;a href="https://huggingface.co/models" rel="noopener noreferrer"&gt;Hugging Face Hub&lt;/a&gt; (with private visibility). They were ready to serve, still without a single line of Machine Learning code, as you will see later.&lt;/p&gt;

&lt;p&gt;For this competition, Kaggle evaluates the performance of the predictions with their &lt;a href="https://en.wikipedia.org/wiki/F-score" rel="noopener noreferrer"&gt;F1 score&lt;/a&gt;, a metric that balances a model's precision and recall. So the best model was the one with the highest F1 score.&lt;/p&gt;
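&lt;p&gt;For binary labels, the F1 score is the harmonic mean of precision and recall. A minimal sketch of how it is computed (the function and variable names are mine, not Kaggle's):&lt;/p&gt;

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)  # how often a positive prediction is correct
    recall = tp / (tp + fn)     # how many actual positives were caught
    return 2 * precision * recall / (precision + recall)
```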

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flklovjp5gkspu5zosrxy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flklovjp5gkspu5zosrxy.png" alt="Model metrics" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kaggle sometimes evaluates results with more sophisticated metrics. Conveniently, the 🤗 AutoNLP web interface automatically uploads every trained model to the Hugging Face Hub with &lt;a href="https://huggingface.co/ferdinand/autonlp-kaggle-competition-6381329" rel="noopener noreferrer"&gt;its associated card&lt;/a&gt;. Each card includes the model's metrics (which you may combine according to your needs) and code snippets to use the model. There is even a widget to quickly experiment with the model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjiw0mde4zsxvnw62y95v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjiw0mde4zsxvnw62y95v.png" alt="Model card" width="800" height="591"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Solving the Kaggle challenge with the 🤗 Inference API&lt;/h2&gt;

&lt;p&gt;It is now time to use the most performant model on the test dataset provided by Kaggle.&lt;br&gt;
There are two different ways to use the model:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;the data scientist way: deploying the model on dedicated infrastructure or on a Machine Learning platform&lt;/li&gt;
&lt;li&gt;the &lt;em&gt;developer-friendly&lt;/em&gt; way: calling it through an API. This is the one I will describe here.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Serving Machine Learning models with the 🤗 Inference API&lt;/h3&gt;

&lt;p&gt;Using Machine Learning models in production is hard, even for Machine Learning engineers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you may have a difficult time handling large and complex models&lt;/li&gt;
&lt;li&gt;your tech architecture can be unoptimized&lt;/li&gt;
&lt;li&gt;your hardware may not meet your requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a result, your model may not deliver the scalability, reliability, or speed that you were expecting.&lt;/p&gt;

&lt;p&gt;So I relied on the &lt;a href="https://huggingface.co/inference-api" rel="noopener noreferrer"&gt;🤗 Inference API&lt;/a&gt; to use my model, still without writing any Machine Learning code. Thanks to many optimization techniques, the API can reach up to a &lt;a href="https://huggingface.co/blog/accelerated-inference" rel="noopener noreferrer"&gt;100x speedup&lt;/a&gt; compared to deploying the model locally or in the cloud. It also has built-in scalability, which makes it a perfect fit for a production workflow while keeping costs under control, as I do not need any extra infrastructure resources.&lt;/p&gt;
&lt;h3&gt;A few API calls to solve the challenge&lt;/h3&gt;

&lt;p&gt;Let’s call the 🤗 Inference API for each row of the test dataset, and write the output value in the submission file.&lt;br&gt;
I could have used the API via regular HTTP calls, but there is an alternate way: the &lt;a href="https://github.com/huggingface/huggingface_hub/" rel="noopener noreferrer"&gt;huggingface_hub library&lt;/a&gt; conveniently offers a wrapper client to handle these requests, and I used it to call the API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;huggingface_hub.inference_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InferenceApi&lt;/span&gt;

&lt;span class="n"&gt;inference&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InferenceApi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ferdinand/autonlp-kaggle-competition-6381329&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;API_TOKEN&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# URL of our model with our API token
&lt;/span&gt;&lt;span class="n"&gt;MODEL_MAX_LENGTH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt; &lt;span class="c1"&gt;# parameter of our model, can be seen in config.json at "max_position_embeddings"
&lt;/span&gt;
&lt;span class="n"&gt;fr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assets/test.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Kaggle test data
&lt;/span&gt;&lt;span class="n"&gt;csv_read&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;csv_read&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# skipping the header row
&lt;/span&gt;
&lt;span class="n"&gt;fw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assets/submission.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UTF8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# our predictions data
&lt;/span&gt;&lt;span class="n"&gt;csv_write&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;csv_write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writerow&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;target&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;# writing the header row
&lt;/span&gt;
&lt;span class="c1"&gt;#returns a label : about a disaster or not given a tweet content
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet_content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

   &lt;span class="c1"&gt;# calling the API, payload is the tweet content , possibly truncated to meet our model requirements
&lt;/span&gt;   &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;inference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tweet_content&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;MODEL_MAX_LENGTH&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

   &lt;span class="c1"&gt;# Determining which label to return according to the prediction with the highest score
&lt;/span&gt;   &lt;span class="c1"&gt;# example of an API call response: [[{'label': '0', 'score': 0.9159180521965027}, {'label': '1', 'score': 0.08408192545175552}]]
&lt;/span&gt;   &lt;span class="n"&gt;max_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
   &lt;span class="n"&gt;max_label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
   &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;dic&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
       &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dic&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
           &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dic&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
           &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;max_score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
               &lt;span class="n"&gt;max_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;
               &lt;span class="n"&gt;max_label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;
   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;max_label&lt;/span&gt;


&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;csv_read&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# call the API for each row
&lt;/span&gt;
   &lt;span class="c1"&gt;# writing in the submission file the tweet ID and its associated label: about a disaster or not
&lt;/span&gt;   &lt;span class="n"&gt;write_row&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;])]&lt;/span&gt; &lt;span class="c1"&gt;# row[0] is the tweet ID, row[3] is the tweet content
&lt;/span&gt;   &lt;span class="n"&gt;csv_write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writerow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;write_row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
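&lt;p&gt;For reference, the same request can also be made with plain HTTP and no wrapper at all. A standard-library sketch (the function names are mine; the endpoint follows the documented api-inference.huggingface.co/models/ URL pattern):&lt;/p&gt;

```python
import json
import urllib.request

# the model's inference endpoint on the Hugging Face Inference API
API_URL = "https://api-inference.huggingface.co/models/ferdinand/autonlp-kaggle-competition-6381329"

def build_request(text, api_token):
    """Build the authenticated POST request for one tweet."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps({"inputs": text}).encode("utf-8"),
        headers={"Authorization": "Bearer " + api_token},
    )

def query(text, api_token):
    """Send one tweet to the Inference API and return the parsed JSON answer."""
    with urllib.request.urlopen(build_request(text, api_token)) as response:
        return json.load(response)
```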



&lt;p&gt;After running the 🤗 Inference API on all the input data (it may take a while), I ended up with a file that I submitted to Kaggle for evaluation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4wlvodvndhx0evy3kqu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4wlvodvndhx0evy3kqu.png" alt="Kaggle evaluation" width="800" height="71"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This model made it to the top 15% of competitors with a 0.83 mean score!&lt;br&gt;
At first, I was surprised not to rank higher. Unfortunately, the test dataset and its associated labels are publicly available, so a few clever contestants simply submitted them and received a near-perfect 1.00 score, which is not realistic for a data science problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5w3nf4fvr6qgxkskm0kw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5w3nf4fvr6qgxkskm0kw.png" alt="Kaggle leaderboard" width="800" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Having a second look at the leaderboard, I saw that the best data science teams have a 0.85 score. This is very close to the score that I obtained, and another 🤗 AutoNLP run might give better results, depending on how lucky I am with the random variations of each model's parameters. Given how little time and resources I invested in solving this challenge, this is almost a win!&lt;/p&gt;

&lt;h2&gt;Do more with the AutoNLP Python package&lt;/h2&gt;

&lt;p&gt;With the 🤗 AutoNLP web interface, the 🤗 Inference API, and very few lines of code, NLP models were automatically created, deployed, and used to achieve a great ranking in an NLP competition, without learning or writing any Machine Learning code.&lt;/p&gt;

&lt;p&gt;🤗 AutoNLP can also be used as a Python package, which supports more Machine Learning tasks than the web interface does, though the interface is quickly catching up. You can use the package to perform tasks like speech recognition and enter even more Kaggle competitions!&lt;/p&gt;

&lt;p&gt;If you want to win a Kaggle competition or to train a model for your business or pleasure, you can get started with AutoNLP &lt;a href="https://huggingface.co/autonlp" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>kaggle</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
