DEV Community: r2c

Introducing Semgrep and r2c

Pablo Estrada — Thu, 29 Oct 2020 17:19:32 +0000

This post is by Isaac Evans, CEO and co-founder of r2c.

Free, fast, open-source, offline, customizable. These are not often words that describe code scanning tools, and that's a shame.

We founded r2c to bring world-class security tools to developers based on our conviction that software will run the most exciting parts of the future: everything from medical equipment to robots to autonomous cars. The security process should not be the foe but rather the enabler of rapid software development. If developers lack tooling that is easy to set up and understand—or if a developer has to convince their manager to spend a few million dollars on advanced security tools each time they change jobs, the future is bleak.

Before founding r2c, we worked on security and developer tools for large companies and governments. It was eye-opening to see that despite massive budgets, their security programs were generally a generation or more behind the tech giants. When it came to security tools for developers, most teams were jaded about scanning code for vulnerabilities; they hated the tools they had to use and usually ignored them beyond doing the minimum necessary to satisfy a compliance checkbox.

What about code scanning at places like Facebook, Apple, Amazon, Netflix, and Google? They don't generally use traditional commercial security tools which ask "how can we find every bug?" Instead, they focus on custom tooling that can build guardrails for developers. This doesn't require million-dollar tools, PhDs in program analysis, or days of compute time. It looks much more like unit tests for security.

We believe there is a gap between traditional compliance tools and simple linters that's ripe for a new approach, and we were fortunate to find partners from Redpoint Ventures and Sequoia Capital who agreed. With them, we raised a $13M Series A round of funding to build a security tool that developers might actually love. We've been working on it quietly for a while now, and we're finally ready to announce it to the world!

Semgrep

Semgrep, our open-source product, is specifically designed for eradicating bug classes.
Developers and security engineers can say "this is the safe pattern we always use for (e.g. parsing XML)", write a rule in a few minutes, and enforce that on every editor save, commit, and pull request.

Semgrep is ideal for building security guardrails: start by using frameworks designed with security in mind, then automatically flag code that strays from the secure-by-default path. This is an approach used by Google, Facebook, Amazon, Dropbox, Stripe, Netflix, and others—a topic Clint Gibler and I presented on at Global AppSec 2020. This approach increases developer productivity, reduces attack surface, minimizes the areas for human inspection and audit, and allows the security team to scalably protect code written by thousands of developers.

The idea behind Semgrep is simple: it feels like a regular search (grep) but is syntax-aware. You can learn Semgrep in a few minutes! And Semgrep can be used for more than just security issues: performance, internationalization, or just annoyances committed by accident.

$ semgrep -e foo(1) matches all equivalent variations. See a live example of matching exec calls

What's Next?

Semgrep started as an open-source project at Facebook and we're lucky to have its original author, Yoann Padioleau, on our team at r2c. Since we released the first post-Facebook version (0.4) earlier this year, we've released 25 new versions, added support for 8 new languages, reworked the parsers so we could collaborate with GitHub on tree-sitter, been joined by thousands of enthusiastic GitHub followers, and seen over 100K pulls of the Semgrep Docker image.

Our roadmap contains more program analysis features to support the sorts of secure-by-default enforcement that large technology companies are already leveraging so heavily (constant propagation, taint tracking, and more), as well as support for many more languages.

Batteries Included

Along with this release of Semgrep, we're announcing the availability of Semgrep Community, a free, hosted service for managing Semgrep CI, as well as Semgrep Teams, a paid service which adds features for managing Semgrep that are useful for enterprises. Both of these offerings provide SaaS infrastructure for operating a modern AppSec program. They enable central definition of code standards for your projects and show results where you already work: GitHub, GitLab, Slack, Jira, VS Code, and more.

We're also excited that Semgrep Registry already has 900+ rules written by r2c and the community—you can start running on your project right now! Or if you like to DIY, try writing your own.

Pain-free Custom Linting: Why I moved from ESLint and Bandit to Semgrep

Ulzii — Fri, 15 May 2020 21:41:48 +0000

tldr: Semgrep is an analysis tool that is easy to learn and easy to prototype rules with, and can be adopted across languages.

For anyone who is looking to write a rule or sophisticated analysis using a free analysis tool, I wanted to share my experience of writing AST-based visitor rules in contrast to Semgrep rules.

Having written multiple Flake8 rules for Python3, an ESLint plugin, and poked at Go-AST, I have gotten familiar with how many AST-based analysis engines and frameworks work. After writing about 10 AST-based visitors, I was struck by the non-intuitive nature of rule writing, regardless of whether it’s in Go, Python, or JavaScript. In contrast, I have written 40-50 rules in Semgrep in a matter of two months and I am still amazed at the ease of writing rules with it.

For full disclosure, I work at r2c, and we open sourced Semgrep and actively develop it.

Writing code to analyze code != Writing code

When starting with an analysis, one usually has to program an AST-based visitor. If you’re not familiar with what an AST is or what a visitor means, feel free to check out this excellent blog post.

After writing a few visitors, it becomes obvious that the way I write my program is very different from the way I write program analysis for my program. When I write a visitor, I am essentially writing a graph algorithm that visits nodes in that graph and does certain logic.

One of the core advantages for me in writing analysis with Semgrep is that I don’t have to be in that mental model of graph algorithms.

I can actually reason about my analysis in the way I write my code.
To clarify the difference in mental models, consider writing an analysis that matches a variable assignment like my_var = myvar().

In a typical AST-based analysis, I’ll write a function that visits each statement in the AST of the program and programmatically specifies when to fire the rule.

...
def visit_Assign(self, node: ast.Assign):
    # Logic of Flake8 rule goes here
...

...
module.exports = {
    create: function(context) {
        return {
            VariableDeclarator: function(node) {
                // Logic of the ESLint rule goes here
            },
        };
    }
};

With Semgrep, I write my analysis in the way I would write my code.

my_var = $Y

Given this similarity between the mental models for writing code and for writing its analysis, Semgrep lends itself to being an easy-to-learn and easy-to-prototype rule-writing engine.
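For comparison, here is roughly what the visitor side looks like once it is fleshed out. This is a minimal sketch using Python's standard ast module rather than a real Flake8 plugin; the class name AssignToMyVar, the helper find_my_var_assignments, and the sample source are hypothetical and exist only for illustration.

```python
import ast


class AssignToMyVar(ast.NodeVisitor):
    """Collect line numbers of assignments whose target is `my_var`."""

    def __init__(self):
        self.matches = []

    def visit_Assign(self, node: ast.Assign) -> None:
        # An assignment can have several targets (a = b = f()), so check each.
        for target in node.targets:
            if isinstance(target, ast.Name) and target.id == "my_var":
                self.matches.append(node.lineno)
        # Keep walking so nested nodes are visited too.
        self.generic_visit(node)


def find_my_var_assignments(source: str) -> list:
    visitor = AssignToMyVar()
    visitor.visit(ast.parse(source))
    return visitor.matches


SAMPLE = """\
my_var = myvar()
other = 1
def f():
    my_var = compute(2)
"""

print(find_my_var_assignments(SAMPLE))  # [1, 4]
```

Even for a one-line pattern, the visitor has to spell out node types, target handling, and traversal by hand.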

Overview of Semgrep

Without diving into the details, the core design decisions made in Semgrep are as follows:

Metavariables: used to track a variable across a specific code scope.
... (ellipsis) operator: it abstracts away sequences so I don’t have to sweat the details of a particular code pattern. This means that even my simple rules can match very complex code blocks. Less is more when writing Semgrep rules.
Smart matching: Semgrep uses different pattern matchers depending on the code pattern I write. If I target a function with a pattern like def $FOO(...): ..., it will match function declarations. If I match statements with a pattern like $FOO = exec(...), it will match only statements.

Personally speaking, Semgrep has the mantra of “learn once, write anywhere,” as I can very easily adapt my analysis for other languages. It’s worth noting that the core of the Semgrep engine was written at Facebook, a company that is known for the “learn once, write anywhere” mantra of React and React Native.

Semgrep vs AST based analysis frameworks

r2c previously talked about how hardcoded password checks are a common and noisy rule. While most rules optimize for completeness, we find that precision is just as important, if not more so.

For the sake of argument, let's say I were to write a rule to detect hardcoded passwords in Semgrep and compare the ease of development with other AST-based analysis frameworks.

Language written	Framework	Link	Lines of code
Python	Bandit	B105	144
Go	Gosec	G101	119
YAML	Semgrep	Prototype	16

Because I don’t have to write boilerplate code at all, the analysis written in Semgrep is significantly (5-10x) shorter. In addition, the expressive power of abstractions like metavariables and ellipsis operators saves the additional code I would need in other frameworks. And unlike other frameworks, because the matching engine of Semgrep determines the type of visitor to use, I don’t have to explicitly write out the types of nodes to visit. Given all of this, it’s easy to iterate and reduce false positive rates extremely quickly.

Lastly, just by changing the target language of my rule, I can adapt this Go rule to be used for Python or JavaScript. In contrast, if you were to adapt a Bandit rule for JavaScript, you’d most likely have to rewrite it from scratch.

Semgrep vs grep-based tools

Grep-based tools like Ripgrep have been used extensively in code analysis. However, the structure-agnostic nature of grep tools makes analysis prone to false positives.

For example, if I simply want to find instances of sensitive function calls like exec(...), the Semgrep pattern exec(...) matches exec() called with any arguments or across multiple lines, but not the string "exec" in comments or hard-coded strings, because Semgrep is aware of the code structure.

Having to specify grep patterns that only fire inside function calls would be very complicated to say the least, and impossible to say the worst.
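The difference is easy to reproduce with a small experiment. The sketch below is not Semgrep; it contrasts a plain regex search with a syntax-aware search built on Python's standard ast module, over a hypothetical three-statement sample. The function names grep_hits and ast_hits are made up for this illustration.

```python
import ast
import re

SOURCE = '''\
# never call exec on user input
note = "exec(...) is dangerous"
exec(
    user_code,
    {},
)
'''


def grep_hits(source):
    """Line numbers where the word 'exec' appears anywhere in the text."""
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), 1)
        if re.search(r"\bexec\b", line)
    ]


def ast_hits(source):
    """Line numbers of actual calls to a function named exec."""
    return [
        node.lineno
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "exec"
    ]


print(grep_hits(SOURCE))  # [1, 2, 3]
print(ast_hits(SOURCE))   # [3]
```

The text search fires on the comment and the string literal, while the syntax-aware search reports only the real call, even though that call spans several lines.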

Semgrep niceties

Beyond pattern matching, Semgrep offers a very robust set of features for complex analysis. These sets of features make it extremely easy to do robust static analysis in less than half the time it takes using other static analysis tools.

Types

For any metavariable I use, I’m able to further hone my analysis with type hints. Currently, I may use int, float, and string literals and formatted strings.

For example, this check will only fire on time.sleep($X: float), but not on time.sleep(foo()).
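To make the behavior concrete, here is a rough stand-in written with Python's ast module. It is not how Semgrep implements typed metavariables; it is an assumption-laden sketch that only recognizes float literals passed to time.sleep, and the function name and sample source are hypothetical.

```python
import ast


def sleep_calls_with_float_literal(source):
    """Line numbers of time.sleep(<float literal>) calls."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_time_sleep = (
            isinstance(func, ast.Attribute)
            and func.attr == "sleep"
            and isinstance(func.value, ast.Name)
            and func.value.id == "time"
        )
        if not is_time_sleep or len(node.args) != 1:
            continue
        arg = node.args[0]
        # Only a literal whose value is a float counts in this sketch.
        if isinstance(arg, ast.Constant) and isinstance(arg.value, float):
            hits.append(node.lineno)
    return hits


SAMPLE = """\
import time
time.sleep(0.5)
time.sleep(foo())
"""

print(sleep_calls_with_float_literal(SAMPLE))  # [2]
```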

Module path

Another advantage of Semgrep is that it’s smart about module paths, such that I can target the specific object I care about in my analysis.

For example, when I was writing a rule to target [HttpResponse](https://docs.djangoproject.com/en/3.0/ref/request-response/#httpresponse-subclasses) of the Django framework, I needed to not fire on usage of the vanilla Python HttpResponse. Semgrep module resolution lets me do this very easily.

return django.http.HttpResponse(...)

Custom post-analysis filtering

Another great feature of Semgrep is that, after the AST-based matching is done, I can narrow my analysis further based on the values of captured metavariables. This is very useful for analyses that need whitelisting or blacklisting logic over strings or other literal values.

The following is an example rule that takes advantage of post-analysis filtering.

patterns:
  - pattern-either:
      - pattern: |
          rsa.GenerateKey(..., $BITS)
      - pattern: |
          rsa.GenerateMultiPrimeKey(..., $BITS)
  - pattern-where-python: |
      int(vars['$BITS']) < 2048
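The filter is an ordinary Python expression evaluated against the captured metavariables. A minimal sketch of that evaluation, assuming the captured text is exposed as strings in a dict named vars (as the rule above suggests); the function weak_key and the sample bindings are hypothetical:

```python
def weak_key(vars):
    # Same expression as the pattern-where-python clause in the rule above.
    # The parameter is named `vars` on purpose, to mirror the rule.
    return int(vars['$BITS']) < 2048


# Hypothetical bindings, one per candidate match found by the patterns.
candidate_matches = [
    {'$BITS': '1024'},
    {'$BITS': '2048'},
    {'$BITS': '4096'},
]

findings = [match for match in candidate_matches if weak_key(match)]
print(findings)  # [{'$BITS': '1024'}]
```

Only the match with a key size below 2048 bits survives the filter, so the rule stays quiet on safe key sizes.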

Conclusion

Having written analyses in Flake8, ESLint, and Semgrep, I find that the amount of time Semgrep saves me is very significant. There’s no obvious degradation in the quality of analysis I can write, and the features built into Semgrep only amplify what I can express with my simple patterns. As a bonus, prototyping rules against real code using semgrep.live is very robust and functions like an IDE, which is a much better experience compared to https://astexplorer.net/ or https://python-ast-explorer.com/. Overall, without any bias or contention, I don’t want to go back to writing AST-based visitors now that I’ve found Semgrep.

Preventing SQL injection: a Django author's perspective

Pablo Estrada — Tue, 12 May 2020 21:42:44 +0000

This is a guest post co-authored by Jacob Kaplan-Moss, co-creator of Django, and Grayson Hardaway.

What’s SQL Injection?

SQL Injection (SQLi) is one of the most dangerous classes of web vulnerabilities. Thankfully, it’s becoming increasingly rare — thanks mostly to increasing use of database abstraction layers like Django’s ORM — but where it occurs it can be devastating.

SQLi happens when code incorrectly constructs SQL queries that contain user input. For example, imagine writing a search function without knowing about SQLi:


def search(request):
    query = request.GET['q']
    sql = f"SELECT * FROM some_table WHERE title LIKE '%{query}%';"

    cursor = db.cursor()
    cursor.execute(sql)
    ...

Can you spot the problem? Notice that the query comes from the browser: request.GET['q']. Think about what might happen if that query contains a single quote. What happens when the SQL string is constructed?

Consider if an attacker searches for ' OR 'a'='a. In this case the constructed SQL would become:

SELECT * FROM some_table WHERE title LIKE '%' OR 'a'='a%';

So that’s bad; LIKE '%' matches every row, so now we’re returning the entire contents of the table. This could be a data breach, or it could overwhelm your database server.

But it gets worse; imagine now that the attacker searches for '; DELETE FROM some_table; -- (the trailing -- comments out the leftover %'). Now, the constructed SQL becomes:

SELECT * FROM some_table WHERE title LIKE '%'; DELETE FROM some_table; --%';

Uh oh.
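You can reproduce the first attack locally. The sketch below uses Python's built-in sqlite3 module with an in-memory database in place of the db object above; the table contents and the helper names build_db and unsafe_search are made up for the demonstration.

```python
import sqlite3


def build_db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE some_table (title TEXT)")
    db.executemany(
        "INSERT INTO some_table (title) VALUES (?)",
        [("Django tips",), ("Flask notes",), ("Private draft",)],
    )
    return db


def unsafe_search(db, query):
    # Deliberately vulnerable: user input is interpolated into the SQL text.
    sql = f"SELECT title FROM some_table WHERE title LIKE '%{query}%'"
    return [row[0] for row in db.execute(sql)]


db = build_db()
print(unsafe_search(db, "Django"))             # one matching row
print(len(unsafe_search(db, "' OR 'a'='a")))   # every row in the table
```

An honest search returns the one matching title, while the crafted input returns all three rows.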

General concepts for preventing SQLi

We’ll get to Django specifics shortly, but first it’s important to really understand the fundamental rules of preventing SQL injection:

Never trust any data submitted by the user
Always use "parameterized statements" when directly constructing SQL queries

Anything that comes from the user could be maliciously constructed. Even things that seem safe, like browser headers (e.g., things like the user agent, request.META['HTTP_USER_AGENT'] in Django) are trivial to tamper with either directly in the browser or with tools like Burp or Charles.

Practically, in Django this means nearly anything that hangs off the HttpRequest object, i.e., the request parameter that’s passed as the first argument to view functions. Though there are some exceptions, it’s probably best to consider anything on request as fundamentally untrustworthy.

However, just because some piece of data isn’t attached to request right now doesn't mean that you can trust it. For example, consider something like an image caption. You might access it through an API that doesn’t mention a request:

image = Image.objects.get(...)
sql = f"SELECT * FROM images WHERE similarity(caption, '{image.caption}') > 0.5;"
...

But if that image caption was previously entered by a user…it’s still dangerous. So this brings us around to the second rule: always use parameterized statements.

Parameterized statements are a mechanism to pass any dynamic parameters separate from the SQL query. They’re either interpreted directly by the database or safely escaped before being added to the query. Almost every database client on the planet supports parameterized statements — and if yours doesn’t, find a different one.

Here’s what the search function from above would look like with parameterized statements:

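def search(request):
    cursor = db.cursor()
    # The wildcards belong in the parameter value: a placeholder inside
    # a quoted SQL literal is not treated as a parameter.
    cursor.execute(
        "SELECT * FROM some_table WHERE title LIKE ?",
        ['%' + request.GET['q'] + '%']
    )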

Notice the ? in the SQL string, and the second parameter to execute. This second argument is the parameter list; items in this list are safely injected into the query to replace the question marks.

PEP-249, the Python database API standard, requires parameterized statements, though different libraries may use different syntax for the placeholders (%-style parameters, :named parameters, numeric parameters, etc.).
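Here is a runnable sketch of two placeholder styles using Python's built-in sqlite3 module, which accepts both ? and :name placeholders. The table contents and the function names safe_search and safe_search_named are assumptions for illustration. Note that the LIKE wildcards are part of the bound value rather than the SQL text.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE some_table (title TEXT)")
db.executemany(
    "INSERT INTO some_table (title) VALUES (?)",
    [("Django tips",), ("Flask notes",)],
)


def safe_search(term):
    # Question-mark style: the value is bound separately from the SQL text.
    rows = db.execute(
        "SELECT title FROM some_table WHERE title LIKE ?",
        ["%" + term + "%"],
    )
    return [row[0] for row in rows]


def safe_search_named(term):
    # Named style: same query, with the parameter passed by name.
    rows = db.execute(
        "SELECT title FROM some_table WHERE title LIKE :pattern",
        {"pattern": "%" + term + "%"},
    )
    return [row[0] for row in rows]


print(safe_search("Django"))        # ['Django tips']
print(safe_search("' OR 'a'='a"))   # []
print(safe_search_named("Flask"))   # ['Flask notes']
```

The injection string from earlier is now treated as literal text to search for, so it matches nothing.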

You can use code analysis tools to check for SQL injections. Bento is one such tool that has several checks for common SQL injection problems. This can catch many common errors; but it’s still a best practice to use parameterized statements and one of the techniques below to completely prevent this attack.

Preventing SQLi in Django

Django’s ORM uses parameterized statements everywhere, so it is highly resistant to SQLi. Thus, if you’re using the ORM to make database queries you can be fairly confident that your app is safe.

However, there are still a few cases where you need to be aware of injection attacks; a very small minority of APIs are not 100% safe. These are where you should focus your auditing, and where your automated code analysis should focus its checks.

Raw Queries

Occasionally, the ORM isn’t expressive enough, and you need raw SQL. Before you do, consider whether there are ways to avoid it -- for example, building a Django model on top of a database view, or calling a stored procedure can help prevent the need to embed raw SQL in your Python.

But, sometimes raw SQL is unavoidable. There are several APIs for doing this, but all are somewhat dangerous. In order of desirability, these are the APIs that Django provides:

Raw queries, for example:

    sql = "... some complex SQL query here ..."
    qs = MyModel.objects.raw(sql, [param1, param2])
    # ^ note the parameterized statements in the line above

The RawSQL annotation, for example:

    from django.db.models.expressions import RawSQL

    sql = "... some complex subquery here ..."
    qs = MyModel.objects.annotate(val=RawSQL(sql, [param1]))
    # ^ note the parameterized statement in the line above

Use database cursors directly, for example:

    from django.db import connection
    sql = "... some complex query here ..."
    with connection.cursor() as cursor:
        cursor.execute(sql, [param1])
        # ^ again, note the parameterized statement in the line above

AVOID: QuerySet.extra() (no example: this is unsafe, so it's just included for completeness).

To use these APIs safely:

Read the first part of this article and make sure you understand parameterized statements before proceeding.
Don’t use extra(). It’s difficult (if not impossible) to use in a way that’s 100% safe, and should be considered deprecated.
Always pass parameterized statements — even if your parameter list is empty. That is, you should write something like:

    sql = 'SELECT * FROM something;'
    qs = MyModel.objects.raw(sql, [])

This is to remind you to later add parameters to this list, and to make it easier for automated tools like Bento to find potentially incorrect API usage.

The query itself should always be a static string, rather than one formed from concatenation or any other string processing. Again, this is to make it easier for automated tools to find incorrect API usage.

Automatic Prevention

It is good practice to use code analysis tools to catch preventable mistakes — to err is human, as the saying goes. Bento will automatically check Django code for SQL injection patterns. The following will check your codebase all at once for SQL injections caused by something hanging off of the request object.

pip3 install bento-cli && \
  bento init && \
  BENTO_REGISTRY=r/r2c.python.django.security.injection.sql bento check -t semgrep --all .

Better than checking your current code, however, is checking your future code! Bento is designed to be run as a pre-commit hook or in continuous integration (CI) environments. Bento is diff-aware and will only check commits, ensuring a speedy workflow while keeping your code secure. When you init Bento on your project, it will automatically set itself up to check commits.

This commit-based workflow is especially powerful for ensuring certain patterns never enter your codebase. To practically eliminate SQL injection from your codebase, you should automatically detect that your code:

Always uses parameterized queries.
Never uses .extra().

Bento can detect these patterns by using a different registry:

BENTO_REGISTRY=r/r2c.python.django.security.audit bento check -t semgrep --all .

This set of rules will highlight many more findings even when there is not a vulnerability. It is much stricter and can be overwhelming if you check your code all at once. However, you can also archive your findings with Bento, which will suppress findings until you’re ready to deal with them. This lets you continuously check your code for these patterns without being overwhelmed by findings.

Under the hood, Bento is powered by Semgrep. Semgrep is a tool for easily detecting and preventing bugs and anti-patterns in your codebase. It combines the convenience of grep with the correctness of syntactical and semantic search. This has advantages over normal grep — the most obvious one being that Semgrep is not thwarted by line boundaries.

Let’s say you wanted to detect the following SQL injection:

def search(request):
    search_term = request.GET['search_term']
    cursor = db.cursor()
    cursor.execute("SELECT * FROM table WHERE field=" + search_term)

This can be expressed in Semgrep like so:

$VAR = request.GET[...]
...
$CUR.execute("..." + $VAR)

Detecting a pattern like this in a commit-based workflow is invaluable because it effectively eliminates this pattern of SQL injection from your codebase! You can check this out in action at https://sgrep.live/0X5.
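To get a feel for what that three-line pattern saves you, here is a rough approximation of the same check written by hand with Python's ast module. It is a sketch under simplifying assumptions (only top-level statements of a function, only the literal-plus-variable shape) and not how Semgrep works internally; all function names are hypothetical.

```python
import ast


def is_request_get_subscript(node):
    """True for an expression shaped like request.GET[...]."""
    return (
        isinstance(node, ast.Subscript)
        and isinstance(node.value, ast.Attribute)
        and node.value.attr == "GET"
        and isinstance(node.value.value, ast.Name)
        and node.value.value.id == "request"
    )


def is_tainted_concat(arg, tainted):
    """True for "<string literal>" + <name assigned from request.GET>."""
    return (
        isinstance(arg, ast.BinOp)
        and isinstance(arg.op, ast.Add)
        and isinstance(arg.left, ast.Constant)
        and isinstance(arg.left.value, str)
        and isinstance(arg.right, ast.Name)
        and arg.right.id in tainted
    )


def find_concat_sqli(source):
    findings = []
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, ast.FunctionDef):
            continue
        tainted = set()
        # Walk statements in order: first "$VAR = request.GET[...]",
        # then any later "<something>.execute("..." + $VAR)".
        for stmt in func.body:
            if isinstance(stmt, ast.Assign) and is_request_get_subscript(stmt.value):
                tainted.update(t.id for t in stmt.targets if isinstance(t, ast.Name))
                continue
            for node in ast.walk(stmt):
                if (
                    isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Attribute)
                    and node.func.attr == "execute"
                    and node.args
                    and is_tainted_concat(node.args[0], tainted)
                ):
                    findings.append(node.lineno)
    return findings


VULNERABLE = '''\
def search(request):
    search_term = request.GET['search_term']
    cursor = db.cursor()
    cursor.execute("SELECT * FROM table WHERE field=" + search_term)
'''

SAFE = '''\
def search(request):
    search_term = request.GET['search_term']
    cursor = db.cursor()
    cursor.execute("SELECT * FROM table WHERE field=?", [search_term])
'''

print(find_concat_sqli(VULNERABLE))  # [4]
print(find_concat_sqli(SAFE))        # []
```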

Other ORMs

Finally, if you are continually finding that Django’s ORM isn't expressive enough, you may want to experiment with replacing Django’s ORM with SQLAlchemy, which is a more powerful and expressive ORM. You’ll lose out on many of Django’s conveniences like the admin, model forms, and model-based generic views, but will gain a more powerful and expressive API that’s still safe.

Custom ORM additions

There are also a few potentially-dangerous areas that may be unsafe even though you’re not using raw SQL directly. Django allows for the creation of custom aggregates and custom expressions -- e.g. a third-party library could write APIs such that something like Document.objects.filter(title__similar_to=other_title) would work.

Django’s core ORM -- the core expressions, annotations, and aggregations -- is mature and battle-hardened. The odds of a SQLi in the core parts of the ORM are very, very low. But ORM additions -- especially ones that you write yourself -- can still be a source of risk.

To mitigate the risk of injection from these advanced features, I suggest the following:

First, be cautious about including custom expressions/aggregates from third-party apps. You should audit those third-party apps carefully. Is the app mature, stable, and maintained? Are you confident that any security issues would be promptly fixed and responsibly disclosed? And, of course, be sure to pin your dependencies to prevent newer and potentially less secure versions from being installed without your explicit direction.

Similarly, be cautious about writing your own custom aggregates. Carefully read the beginning of this article, and Django's documentation about avoiding SQL injection in custom expressions. As the documentation shows, if possible you should avoid doing any string interpolation in custom expressions. If you can’t, you'll need to escape any expression parameters yourself. This is tricky to do right, and will depend on the specifics of your database engine and Python wrapper API. Consult an expert before diving in here!

The django.security.audit registry in Bento will detect if a custom ORM addition is defined in your codebase; you could also quickly audit third-party apps with this. The exploitation conditions are very nuanced, so if you find this in your project, be sure to consult that expert!

Wrapping up

Django was designed to be resilient against SQL injection (and other common web vulnerabilities). Most common uses of Django will be automatically protected, so SQLi vulnerabilities in real-world Django apps are thankfully rare.

However, when they occur, SQLi vulnerabilities are devastating. It’s well worth your time to audit your codebase to ensure you’re safe. Bento can help by flagging several common vulnerabilities. Now that you understand the concepts, and why certain errors are flagged, you should be better equipped to write safe code.

Our quest to make world-class security and bugfinding available to all developers, for free

Pablo Estrada — Wed, 05 Feb 2020 17:40:37 +0000

by Isaac Evans, CEO and co-founder @ r2c

This post was originally published on the Bento blog in late December 2019.

Why we’re building

One thing we’ve learned at r2c is that most Python or JavaScript developers have never heard of—let alone tried—the tools some devs use to find deep flaws in code: like Codenomicon, which found Heartbleed, or Zoncolan at Facebook, which finds more top-severity security issues than any human effort. Not only do these tools find severe issues, they save time by pointing out hundreds of thousands of issues before humans do.

We believe every developer deserves access to powerful tools, but most don’t know about or can’t afford them. r2c’s mission is to make those tools available to those who want to find bugs, discover security problems, and save time but don’t work for a giant company that prioritizes these problems with nearly unlimited resources.

That’s why we’re excited to release Bento! It’s a free and opinionated toolkit for easily adopting linters and program analysis in a codebase. It includes analysis we’ve written and packages fantastic community-created tools, all running offline (no code is ever shipped off your machine). Over the next few months we’ll release more novel checks and include existing tools; subscribe for updates.

Some members of our team wrote early versions of these tools at places like Facebook. r2c started by building infrastructure to make it easy to run static analysis tools at massive scale (see our paper co-published at USENIX) but our goal has always been to take the learnings from scaling analysis to benefit individual developers directly: folks helping small teams writing voter registration systems for their city, non-profits who serve communities targeted by powerful hostile actors, startups who handle sensitive data about fellow humans, or developers who just want to automate away code review.

How can I get Bento now?

Bento is in alpha, but you can try it right away:

pip3 install bento-cli

Here’s a short demo:
youtube: https://www.youtube.com/embed/rGwd1aEF8Yk

A lot of love from our small team has gone into Bento. Please try it on your Python or JavaScript projects and send us feedback!

But is this just a glorified linter?

Well yes, but actually, no; Bento is currently a union of curated AST-based lints, including new ones written by us, tuned to find bugs that matter. Our roadmap takes us far beyond AST-based linting though: finding SQL injection through taint analysis, detecting dangerous dependency upgrades, etc.

Linters have done a good job reaching developers and improving code consistency, especially style. But we want to surface issues and checks that are deep and avoid arguing about spaces vs tabs in code review. Bento ships with configurations that are tuned on real-world data and focus the findings on correctness and security. They are based on using our platform to analyze swathes of open-source repositories and see what checks developers turn on and off (Three Things Your Linter Shouldn’t Tell You). Our opinion is that you should forget about style and use a deterministic, zero-config formatter (Black for Python or Prettier for JavaScript).

As opposed to other tools that try to measure code-quality or concatenate linter output, we have skin in the analysis game; we’re already making some contributions back to the tools we include. We’re collaborating with a few linter authors already and we would love to offer free compute resources on our platform for measuring check quality to anyone else who might be interested (hello@r2c.dev).

Here’s what’s coming next

Our immediate focus is writing custom analysis tools to find security and other issues for users of the Flask web framework. If you or someone you know uses Flask and has ideas on what we might detect, send us a note or make an issue!

Bento core values

Our first releases are about making it easy to install, adopt, and get started before we ship everything on our roadmap.

Find bugs that matter
Bento automatically enables and configures relevant analysis based on your dependencies and frameworks, and it will never report style-related issues. You won’t have to painstakingly configure your tooling; we did that already!

Go fast
No one should have to dig through thousands of linter results and fix them before they can start using a tool. Bento ships with a built-in archiving feature that lets you establish a baseline without fixing all the issues at once and just look at any new problems entering the codebase.

This philosophy also applies to setup: Bento auto-configures in about 30 seconds, it’s easy to install in a Docker container, and it can even install itself as a pre-commit hook automatically.
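The archiving idea described under "Go fast" boils down to comparing current findings against a stored baseline. The following is a minimal sketch of that concept, not Bento's actual implementation; the finding tuples, check names, and the function new_findings are all hypothetical.

```python
def new_findings(current, archived):
    """Return findings from `current` that are not in the archived baseline."""
    baseline = set(archived)
    return [finding for finding in current if finding not in baseline]


# Each finding is a hypothetical (check_id, path, code snippet) tuple.
archived = [
    ("no-extra", "app/views.py", "qs.extra(where=[clause])"),
]
current = [
    ("no-extra", "app/views.py", "qs.extra(where=[clause])"),
    ("sql-concat", "app/search.py", "cursor.execute(sql + term)"),
]

for finding in new_findings(current, archived):
    print(finding)
```

Only the sql-concat finding is printed: the archived issue stays suppressed until someone chooses to deal with it, while anything new is reported right away.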

Get better over time
Bento automatically tailors itself to your project by enabling checks that correspond to your language, framework, and dependencies. As time goes on and based on community feedback, we’ll be writing and shipping new checks that you can adopt automatically. And we want your feedback!