<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Harshith Reddy Nalla</title>
    <description>The latest articles on DEV Community by Harshith Reddy Nalla (@harshithreddy01).</description>
    <link>https://dev.to/harshithreddy01</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3866716%2F940fb980-887c-45ef-8799-3b03dfc1b205.png</url>
      <title>DEV Community: Harshith Reddy Nalla</title>
      <link>https://dev.to/harshithreddy01</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/harshithreddy01"/>
    <language>en</language>
    <item>
      <title>I scanned 20 popular Python packages for dangerous regex patterns. Here is what I found.</title>
      <dc:creator>Harshith Reddy Nalla</dc:creator>
      <pubDate>Wed, 08 Apr 2026 01:43:13 +0000</pubDate>
      <link>https://dev.to/harshithreddy01/i-scanned-20-popular-python-packages-for-dangerous-regex-patterns-here-is-what-i-found-dg6</link>
      <guid>https://dev.to/harshithreddy01/i-scanned-20-popular-python-packages-for-dangerous-regex-patterns-here-is-what-i-found-dg6</guid>
      <description>&lt;p&gt;At 13:42 UTC on July 2, 2019, a Cloudflare engineer deployed a change to the ruleset used by their Web Application Firewall. In under three minutes, global traffic dropped by 80% and every CPU serving HTTP traffic on their network hit 100% load. The cause was a single regular expression, intended to detect XSS attacks, that contained the fragment &lt;code&gt;.*(?:.*=.*)&lt;/code&gt;. That fragment places &lt;code&gt;.*&lt;/code&gt; quantifiers next to each other, so they all compete for the same characters.&lt;/p&gt;

&lt;p&gt;That is what a ReDoS looks like in production.&lt;/p&gt;

&lt;p&gt;I wanted to know how common such patterns are in the Python libraries we use every day.&lt;/p&gt;

&lt;h2&gt;What is actually happening&lt;/h2&gt;

&lt;p&gt;A backtracking regex engine, such as the one in Python's &lt;code&gt;re&lt;/code&gt; module, tries the different ways a pattern can match a string. Most of the time this is fine, because the matcher either finds a match or rules out the alternatives quickly. The trouble starts when a pattern lets the same input characters be consumed by different parts of the pattern in many different ways. When the match fails, the matcher has to try every one of those combinations before giving up.&lt;/p&gt;

&lt;p&gt;The growth is not linear, either. These are actual measurements from our test harness, running the pattern &lt;code&gt;(a+)+&lt;/code&gt; against a non-matching string:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;n=10   -&amp;gt;  0.001s
n=20   -&amp;gt;  0.884s   (884x slower than n=10)
n=30   -&amp;gt;  hours
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The running time grows exponentially with the input length: ten extra characters made the match 884 times slower at n=20 than at n=10. By n=30 the growth has not levelled off; the run simply takes hours. An endpoint that applies this pattern to user input is a denial of service waiting to happen.&lt;/p&gt;
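
&lt;p&gt;A minimal sketch of how such a measurement could be reproduced. This is not the harness that produced the numbers above: the use of &lt;code&gt;fullmatch&lt;/code&gt; to force the failure, the trailing &lt;code&gt;!&lt;/code&gt; and the helper name &lt;code&gt;time_failed_match&lt;/code&gt; are assumptions made for illustration, and absolute timings depend on the machine and interpreter.&lt;/p&gt;

```python
import re
import time

# The nested-quantifier pattern from the table above.
PATTERN = re.compile(r"(a+)+")


def time_failed_match(n):
    """Time one failing fullmatch against n 'a' characters followed by '!'."""
    subject = "a" * n + "!"  # the trailing '!' guarantees the match fails
    start = time.perf_counter()
    result = PATTERN.fullmatch(subject)
    elapsed = time.perf_counter() - start
    return result, elapsed


if __name__ == "__main__":
    # Keep n small: every extra character roughly doubles the work.
    for n in (10, 14, 18, 20):
        result, elapsed = time_failed_match(n)
        print(f"n={n:2d}  matched={result is not None}  {elapsed:.4f}s")
```

&lt;p&gt;Raising n much beyond 25 in this sketch is likely to tie up the interpreter for a long time, which is exactly the failure mode described above.&lt;/p&gt;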

&lt;h2&gt;How redos-analyzer works&lt;/h2&gt;

&lt;p&gt;Almost all existing checkers test a pattern by feeding it malicious inputs and timing how long the match takes. This works, but it requires crafting an adversarial input for each specific pattern, which can be difficult.&lt;/p&gt;

&lt;p&gt;Our approach relies instead on the fact that when Python compiles a regex, it first parses it into an abstract syntax tree using the internal &lt;code&gt;sre_parse&lt;/code&gt; module. The tree consists of &lt;code&gt;(opcode, value)&lt;/code&gt; tuples that represent the pattern in structured form. If the tree contains a &lt;code&gt;MAX_REPEAT&lt;/code&gt; node wrapping a &lt;code&gt;SUBPATTERN&lt;/code&gt; that itself contains a &lt;code&gt;MAX_REPEAT&lt;/code&gt;, the nested quantifier can be detected without running the pattern against any input.&lt;/p&gt;
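
&lt;p&gt;The idea can be sketched in a few lines. This is an illustrative heuristic, not the code of redos-analyzer: the function names are hypothetical, only unbounded repeats are considered, and possessive quantifiers and atomic groups are not treated specially. On Python 3.11 and later the parser lives in the private module &lt;code&gt;re._parser&lt;/code&gt;, with &lt;code&gt;sre_parse&lt;/code&gt; kept as a deprecated alias, so the sketch tries both.&lt;/p&gt;

```python
try:
    # Python 3.11+: the parser is the private module re._parser
    from re import _parser as sre_parse
except ImportError:
    # Older interpreters expose it as the top-level sre_parse module
    import sre_parse

REPEATS = (sre_parse.MAX_REPEAT, sre_parse.MIN_REPEAT)


def iter_children(av):
    """Yield every SubPattern found anywhere inside an opcode's argument."""
    if isinstance(av, sre_parse.SubPattern):
        yield av
    elif isinstance(av, (tuple, list)):
        for item in av:
            yield from iter_children(item)


def has_nested_unbounded_repeat(nodes, inside_unbounded=False):
    """Return True if an unbounded repeat sits inside another unbounded repeat."""
    for op, av in nodes:
        # For repeat nodes av is (min, max, SubPattern); max is MAXREPEAT for + and *
        unbounded = op in REPEATS and av[1] == sre_parse.MAXREPEAT
        if unbounded and inside_unbounded:
            return True
        for child in iter_children(av):
            if has_nested_unbounded_repeat(child, inside_unbounded or unbounded):
                return True
    return False


def is_flagged(pattern):
    """Parse a pattern string and apply the heuristic above (hypothetical helper)."""
    return has_nested_unbounded_repeat(sre_parse.parse(pattern))


if __name__ == "__main__":
    for candidate in (r"(a+)+", r"a+b+", r"(ab)+", r"(?:\w+\s*)*"):
        print(f"{candidate!r:16} flagged={is_flagged(candidate)}")
```

&lt;p&gt;Because the check walks the parse tree rather than the pattern text, it does not depend on how the group is written (capturing, non-capturing, or named).&lt;/p&gt;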

&lt;p&gt;This lets us analyze the tree statically and detect four categories of potentially dangerous patterns: nested quantifiers, nulls within quantifiers, overlapping alternatives, and a fourth category that we had not expected to need.&lt;/p&gt;

&lt;h2&gt;What we found across 20 packages&lt;/h2&gt;

&lt;p&gt;We analyzed the source distributions of 20 of the most popular Python libraries: requests, flask, django, fastapi, sqlalchemy, pydantic, pytest, numpy, pandas, pillow, scrapy, celery, boto3, httpx, aiohttp, click, rich, typer, black, and mypy. The scanner walks every file ending in &lt;code&gt;.py&lt;/code&gt;, extracts all calls to &lt;code&gt;re.compile&lt;/code&gt; and &lt;code&gt;re.search&lt;/code&gt;, and runs the analysis pipeline on each pattern.&lt;/p&gt;
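
&lt;p&gt;One way to do this extraction step is with Python's standard &lt;code&gt;ast&lt;/code&gt; module. The sketch below is an assumption about how such a scanner could look, not the tool's actual implementation; the name &lt;code&gt;extract_patterns&lt;/code&gt; is hypothetical, and it only handles calls written literally as &lt;code&gt;re.compile("...")&lt;/code&gt; or &lt;code&gt;re.search("...")&lt;/code&gt; with a string literal as the first argument.&lt;/p&gt;

```python
import ast

TARGETS = {"compile", "search"}


def extract_patterns(source):
    """Return (line, function, pattern) for re.compile / re.search calls
    whose first argument is a string literal."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Only accept attribute access on the bare name "re", e.g. re.compile(...)
        if not (isinstance(func, ast.Attribute) and func.attr in TARGETS):
            continue
        if not (isinstance(func.value, ast.Name) and func.value.id == "re"):
            continue
        if not node.args:
            continue
        first = node.args[0]
        if isinstance(first, ast.Constant) and isinstance(first.value, str):
            found.append((node.lineno, func.attr, first.value))
    return sorted(found)


if __name__ == "__main__":
    sample = 'import re\nWORD = re.compile(r"(a+)+")\nhit = re.search("x|y", "xyz")\n'
    for line, name, pattern in extract_patterns(sample):
        print(f"line {line}: re.{name}({pattern!r})")
```

&lt;p&gt;Patterns built at runtime, imported under an alias, or passed in from another module are invisible to this kind of literal matching.&lt;/p&gt;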

&lt;p&gt;90 initial warnings were reported. After excluding test code, third-party libraries, and build tools, 23 warnings remained in runtime code.&lt;/p&gt;

&lt;p&gt;Of all the findings, the aiohttp one was the most interesting to follow up. The tool flagged &lt;code&gt;_WS_EXT_RE&lt;/code&gt;, which is used in the parser for the WebSocket extension header. The value it is applied to comes from &lt;code&gt;headers.get()&lt;/code&gt;, so the pattern runs on arbitrary header values. Measured with adversarial inputs under CPython 3.12, the worst case was about 0.8ms, well below any realistic threshold for an attack. The maintainer acknowledged that the pattern might be inefficient but confirmed that no security hole was involved. The exchange did establish one thing: an earlier audit had found &lt;code&gt;_COOKIE_PATTERN&lt;/code&gt; to be problematic (PR #11900), so aiohttp has dealt with this class of flaw before.&lt;/p&gt;

&lt;p&gt;Another interesting result came from &lt;code&gt;pytest&lt;/code&gt;: in &lt;code&gt;expression.py&lt;/code&gt;, line 113, the pattern contains &lt;code&gt;(:?\w|:|...)&lt;/code&gt;. In all likelihood the intended construct was &lt;code&gt;(?:...)&lt;/code&gt;, a non-capturing group. What is actually written is a capturing group whose first alternative begins with an optional colon. The &lt;code&gt;LIKELY_TYPO&lt;/code&gt; detector flags this because &lt;code&gt;(:?&lt;/code&gt; is valid syntax but is practically never written on purpose.&lt;/p&gt;
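
&lt;p&gt;The difference between the two spellings is easy to check, and a simple text search is enough to spot the suspicious form. The snippet below is a hypothetical stand-in for the &lt;code&gt;LIKELY_TYPO&lt;/code&gt; check rather than its real implementation; it ignores edge cases such as an escaped backslash directly before the parenthesis or a &lt;code&gt;(:?&lt;/code&gt; inside a character class.&lt;/p&gt;

```python
import re

# "(:?" at the start of the pattern or after any character other than a backslash.
TYPO_RE = re.compile(r"(?:^|[^\\])\(:\?")


def find_likely_typos(pattern):
    """Return the offset of each '(:?' that looks like a mistyped '(?:'."""
    return [m.end() - 3 for m in TYPO_RE.finditer(pattern)]


if __name__ == "__main__":
    # "(:?a)" is a capturing group with an optional colon; "(?:a)" captures nothing.
    print(re.compile(r"(:?a)").groups, re.compile(r"(?:a)").groups)
    print(find_likely_typos(r"(:?\w|:)+"))
    print(find_likely_typos(r"(?:\w|:)+"))
```

&lt;p&gt;Both spellings compile without error, which is why a mistake like this can sit in a codebase unnoticed.&lt;/p&gt;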

&lt;h2&gt;The automatic fix&lt;/h2&gt;

&lt;p&gt;Each finding comes with a suggested rewrite of the pattern based on atomic groups, which Python's &lt;code&gt;re&lt;/code&gt; module has supported since version 3.11.&lt;/p&gt;

&lt;p&gt;An atomic group &lt;code&gt;(?&amp;gt;...)&lt;/code&gt; works like a one-way door: once the engine has matched the group and moved past it, it cannot backtrack into the group to try a different match, even if it later turns out that the rest of the pattern cannot match. Because no backtracking into the group is allowed, the combinatorial explosion cannot happen.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BEFORE   (a+)+          dangerous
AFTER    (?&amp;gt;a+)+         safe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a Colab experiment with n=22, the fixed pattern ran in 0.000111 seconds while the original took 0.306 seconds. The tool can also run a semantic equivalence check to make sure the rewritten pattern behaves the same as the original.&lt;/p&gt;
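
&lt;p&gt;A rough way to try both claims on your own machine, assuming Python 3.11 or newer. The sample-based comparison below is only an illustration of what an equivalence check could look like, not the check the tool performs, and agreeing on a handful of samples does not prove two patterns are equivalent. The helper names are hypothetical.&lt;/p&gt;

```python
import re
import sys
import time

ORIGINAL = r"(a+)+"
FIXED = r"(?>a+)+"  # atomic group syntax, needs Python 3.11 or newer
SUPPORTS_ATOMIC = sys.version_info >= (3, 11)


def same_verdicts(original, fixed, samples):
    """Return True if both patterns accept and reject exactly the same samples."""
    return all(
        bool(re.fullmatch(original, s)) == bool(re.fullmatch(fixed, s))
        for s in samples
    )


def time_fullmatch(pattern, subject):
    """Return the seconds spent on a single fullmatch call."""
    compiled = re.compile(pattern)
    start = time.perf_counter()
    compiled.fullmatch(subject)
    return time.perf_counter() - start


if SUPPORTS_ATOMIC:
    samples = ["", "a", "aaaa", "aab", "b", "aaaa!"]
    print("same verdicts:", same_verdicts(ORIGINAL, FIXED, samples))
    attack = "a" * 18 + "!"  # small n so the slow pattern still finishes quickly
    print(f"original: {time_fullmatch(ORIGINAL, attack):.6f}s")
    print(f"fixed:    {time_fullmatch(FIXED, attack):.6f}s")
```

&lt;p&gt;The comparison matters because an atomic rewrite is not always behavior-preserving: &lt;code&gt;a+a&lt;/code&gt; matches the string &lt;code&gt;aa&lt;/code&gt;, but &lt;code&gt;(?&amp;gt;a+)a&lt;/code&gt; does not, since the atomic group keeps both characters and leaves nothing for the final &lt;code&gt;a&lt;/code&gt;.&lt;/p&gt;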

&lt;h2&gt;How to use it&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;redos-analyzer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It also works on Google Colab with &lt;code&gt;!pip install redos-analyzer&lt;/code&gt;. The core API consists of two calls: &lt;code&gt;analyze()&lt;/code&gt; returns a list of warnings for a pattern, and &lt;code&gt;suggest_fix()&lt;/code&gt; proposes a fix for the problematic regex, given the pattern and a warning.&lt;/p&gt;

&lt;p&gt;The source code for the package is at &lt;a href="https://github.com/HarshithReddy01/redos-analyzer" rel="noopener noreferrer"&gt;https://github.com/HarshithReddy01/redos-analyzer&lt;/a&gt; and the Zenodo citation is at &lt;a href="https://doi.org/10.5281/zenodo.19462441" rel="noopener noreferrer"&gt;https://doi.org/10.5281/zenodo.19462441&lt;/a&gt;. A reproducible notebook with all timing experiments from this post is at &lt;a href="https://github.com/HarshithReddy01/Algorithms-Practice/blob/master/Redosprojectlegit.ipynb" rel="noopener noreferrer"&gt;https://github.com/HarshithReddy01/Algorithms-Practice/blob/master/Redosprojectlegit.ipynb&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The next step is to extend the analysis to cases where a pattern is defined in one place in the codebase and used in another file, as in the aiohttp case. The current implementation looks at an 8-line window around every call site, so it catches simple cases of this kind but cannot follow dataflow across modules.&lt;/p&gt;

&lt;p&gt;If you find a problematic pattern in your own codebase, please open an issue on GitHub.&lt;/p&gt;

</description>
      <category>python</category>
      <category>security</category>
      <category>opensource</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
