cu_0xff 🇪🇺 for DeepCode.AI

Posted on • Originally published at Medium

DeepCode’s Top Findings#5: JavaScript Unsanitized Input is used to build RegEx

Language: JavaScript
Defect: Input Data/Sanitization (Category Security 1)
Diagnosis: Unsanitized user input flows from the document location and is used to build a regular expression in RegExp. This may result in a Regular expression Denial of Service attack (reDOS).

We found this bug in three.js. You can load mrdoob/three.js in the dashboard at deepcode.ai and follow along. I am using this example because I do not want to expose a real vulnerability but still want to make my point. As this is a local script, an attack should only affect the attacker's own machine.

Background

We will see two interesting things here: (1) How DeepCode uses a data flow analysis to perform what is called a taint analysis and (2) how it reasons about its findings.

OK, let us start. We have a function that checks whether the query part of the request URL (the part introduced by a question mark) contains ?q= and, if so, returns everything after the first three characters (lines 52 to 64 of the source file).

...
function extractQuery() {

    var p = window.location.search.indexOf( '?q=' );

    if ( p !== - 1 ) {

        return window.location.search.substr( 3 );

    }

    return '';

}
...

So, a query like https://my.web.server/site/page?q=/^(a|aa)+$/ could result in /^(a|aa)+$/ as the return value of this function (query strings are very forgiving). We will see why this matters in a bit.
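To make the return value concrete, here is a minimal sketch of the same logic as a pure function. The name extractQueryFrom and its search parameter are hypothetical; they only replace the read from window.location so the snippet can run outside a browser.

```javascript
// Hypothetical, testable variant of extractQuery(): the search string is
// passed in as a parameter instead of being read from window.location.
function extractQueryFrom(search) {
  // Same logic as the original: look for '?q=' and drop the first three characters.
  if (search.indexOf('?q=') !== -1) {
    return search.substring(3);
  }
  return '';
}

// Whatever the visitor typed after '?q=' comes back unchanged.
const tainted = extractQueryFrom('?q=(a|aa)+$');
console.log(tainted);
```

Nothing in this function inspects or transforms the value, which is why everything it returns has to be considered attacker-controlled.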

The result of this function is whatever the user supplies. DeepCode infers that this data comes directly from the user and is unsanitized, meaning it has not been transformed (e.g., by encoding or escaping) to make it trustworthy. Closely related is validation, where data that does not fit the desired pattern is rejected. This data is therefore treated as tainted. How does the DeepCode engine infer that this is incoming external user data? We apply machine learning algorithms to thousands of open source projects to identify sources (API functions) of user data. In the same vein, DeepCode learns how such data is typically sanitized before it can be used safely.
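The difference between sanitizing and validating can be sketched in a few lines. Both helpers below are illustrative assumptions and not part of three.js or DeepCode: escapeRegExp follows the widely used escaping idiom, and the allow-list in isValidFilter is an arbitrary example policy.

```javascript
// Sanitizing: transform the input so regex metacharacters lose their meaning.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Validating: reject anything outside an allow-list
// (word characters, spaces and hyphens, at most 64 characters in this example).
function isValidFilter(text) {
  return /^[\w -]{0,64}$/.test(text);
}

const input = '(a|aa)+$';
console.log(escapeRegExp(input));  // input with every metacharacter backslash-escaped
console.log(isValidFilter(input)); // false: the input contains characters outside the allow-list
```

Sanitizing keeps the input usable by neutralizing it, while validating refuses it outright; which one fits depends on whether the feature needs to accept such characters at all.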

In line 213, it continues...

        function updateFilter() {

            var v = filterInput.value;

            ...

            var exp = new RegExp( v, 'gi' );

            for ( var key in files ) {

                var section = files[ key ];

                for ( var i = 0; i < section.length; i ++ ) {

                    filterExample( section[ i ], exp );

                }

            }
            ...
        }

As you can see, the value property of the global filterInput is assigned to a local variable v, which is then used to build a new regular expression. That expression is passed to filterExample for every entry in files to select the matching examples. So far, so innocent.

In line 318, we find...

...
filterInput.value = extractQuery();
...
updateFilter();
...

And here it all finally comes together. The tainted result of extractQuery() is assigned to a property of a complex type (filterInput.value), which is then read inside updateFilter(), called later, without prior sanitization. So tainted data ends up being used to build a regular expression. An attacker could provide what is called an evil regex, one that can get stuck on certain inputs (a Denial of Service by overloading the machine). /^(a|aa)+$/g (remember our example from above?) is such a regex for the input aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa! Try it yourself on regexr.com. The necessary number of a's depends on your machine. Play around and see that the runtime roughly doubles with each additional a until it reaches 250ms, where regexr cuts off.
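The blow-up can also be observed locally with a short, hedged sketch. The helper timeMatch is a hypothetical name, the input lengths are kept small on purpose, and the timings will differ by machine and engine. Note that the RegExp constructor takes the pattern source without the surrounding slashes.

```javascript
// Pattern source as the RegExp constructor expects it: no surrounding slashes.
const evil = new RegExp('^(a|aa)+$');

// Times one match attempt against n times 'a' followed by '!'.
// The trailing '!' makes the match fail, which forces the engine to backtrack.
function timeMatch(n) {
  const input = 'a'.repeat(n) + '!';
  const start = Date.now();
  const matched = evil.test(input);
  return { n, matched, ms: Date.now() - start };
}

// Small lengths only; larger values can take a very long time.
for (const n of [20, 24, 28]) {
  console.log(timeMatch(n));
}
```

Increasing n by a few characters at a time shows the trend without freezing the process.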

DeepCode argues correctly and states its reasoning concisely:

Unsanitized user input flows from the document location and is used to build a regular expression in RegExp. This may result in a Regular expression Denial of Service attack (reDOS).

DeepCode Output

Note: The bold parts can be hovered over in the dashboard which will set the focus of the window on the object in question.

Knowing the source (here window.location) and the data sink (here RegExp), DeepCode formulates the correct argument. Finally, DeepCode provides a link to more information (here OWASP - Regular expression Denial of Service) and three examples from other open-source projects that show how they handled the same type of issue.

Using our machine learning approach, the system classifies sources of tainted data, possible sanitization functions, and data sinks (in this case, constructors of regular expressions). DeepCode is able to track tainted data flowing through your application even in cases like this one, where the value passes through several variables.
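The idea of following tainted data from a source to a sink can be illustrated with a toy model. This is only a sketch under strong assumptions and not how DeepCode is implemented: the flows list is written by hand, and findTaintedSinks is a hypothetical helper that propagates taint along those edges until nothing changes.

```javascript
// Toy model: each entry says "a value flows from `from` to `to`".
// The names mirror the example above; the representation is hypothetical.
const flows = [
  { from: 'window.location.search', to: 'extractQuery()' },
  { from: 'extractQuery()', to: 'filterInput.value' },
  { from: 'filterInput.value', to: 'v' },
  { from: 'v', to: 'new RegExp' },
];

// Propagates taint from the sources along the flows; a sanitizer node
// stops the propagation. Returns the sinks that tainted data can reach.
function findTaintedSinks(flows, sources, sinks, sanitizers = new Set()) {
  const tainted = new Set(sources);
  let changed = true;
  while (changed) {
    changed = false;
    for (const { from, to } of flows) {
      if (tainted.has(from) && !tainted.has(to) && !sanitizers.has(to)) {
        tainted.add(to);
        changed = true;
      }
    }
  }
  return sinks.filter((sink) => tainted.has(sink));
}

const reported = findTaintedSinks(flows, ['window.location.search'], ['new RegExp']);
console.log(reported); // [ 'new RegExp' ]
```

If a sanitizer sits anywhere on the path between source and sink, the sink is no longer reported in this model.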

Secure-coding conventions require sanitizing all user input before using it. Proven sanitization libraries are available (npm packages, for example).
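For a simple name filter like the one in this example, one option is to avoid compiling user input into a regular expression at all. The sketch below is an assumption about what such a filter could look like, not the fix applied in three.js; filterNames and the sample files object are hypothetical.

```javascript
// Hypothetical stand-in for the page's list of examples.
const files = {
  webgl: ['webgl_loader_gltf', 'webgl_materials'],
  css3d: ['css3d_sprites'],
};

// Case-insensitive substring match; the user text is never compiled into a regex.
function filterNames(files, query) {
  const needle = query.toLowerCase();
  const hits = [];
  for (const key of Object.keys(files)) {
    for (const name of files[key]) {
      if (name.toLowerCase().includes(needle)) {
        hits.push(name);
      }
    }
  }
  return hits;
}

console.log(filterNames(files, 'LOADER'));   // names containing "loader"
console.log(filterNames(files, '(a|aa)+$')); // treated as plain text, so nothing matches
```

If regular expression features are genuinely needed, escaping the input before passing it to RegExp is the alternative.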

CU

0xff

PS: A paper on taint analysis by colleagues from DeepCode
PPS: An earlier article on a similar issue by Victor Chibotaru
