Obfuscation is the process of deliberately making information more complex and harder to understand. Deobfuscation, on the other hand, is the method of reversing obfuscation to reveal the original information.
Because browsers must download a page's HTML, CSS, and JavaScript in order to render it, that source code is visible to anyone who visits the page. To compensate, developers often apply obfuscation techniques that make the code harder to read. This, in turn, makes life harder for web scrapers trying to pull useful content from such pages.
In this article, you’ll learn about obfuscating and deobfuscating web pages and practical tips for scraping data from obfuscated content.
How Web Pages are Obfuscated
Common methods for obfuscating web pages include:
CSS Obfuscation
Modern front-end build tools like Webpack, together with CSS Modules, can rewrite the class names and IDs used in your HTML and CSS. They do this by generating unique or hashed names that are difficult to predict. For example, a simple CSS class-based styling might look like this before obfuscation:
.button {
  color: blue;
}
After obfuscation with Webpack, the class might look like this:
._3iJ7sK {
  color: blue;
}
Now, in the HTML, a button that previously used class="button" might use class="_3iJ7sK", which limits scrapers that rely on known class names to locate elements.
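To make the idea concrete, here is a minimal sketch of how a build step might derive such a name. The hashClassName helper and the file paths are hypothetical and only illustrate the general approach; real tools use their own naming schemes and hash functions.

```javascript
const crypto = require("node:crypto");

// Hypothetical helper: derive a short, stable, hard-to-guess class name
// from the original class name and the file it was declared in.
function hashClassName(className, filePath) {
  const digest = crypto
    .createHash("sha256")
    .update(`${filePath}:${className}`)
    .digest("hex");
  // Prefix with "_" so the generated name never starts with a digit.
  return `_${digest.slice(0, 6)}`;
}

// The same input always yields the same name, so CSS and HTML stay in sync.
const buttonClass = hashClassName("button", "src/Button.css");
console.log(buttonClass);

// The same class declared in another (hypothetical) file gets its own name.
console.log(hashClassName("button", "src/Card.css"));
```

Because the name depends on the inputs to the hash, any change to the file layout or build configuration can produce a different class name, which is why scrapers that hard-code these names tend to break.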
JavaScript Obfuscation with btoa
Browsers and recent Node.js versions also provide a built-in btoa() function that converts a string into a Base64-encoded string. Base64 is an encoding rather than encryption, so this is a simple form of obfuscation, but it does make raw text less readable. For example:
const originalString = "Hello, World!";
const encodedString = btoa(originalString);
console.log(encodedString); // Outputs: "SGVsbG8sIFdvcmxkIQ=="
This technique is mostly used in HTML attributes, where encoded data is stored directly. For example, instead of seeing plain text in a data-* attribute, you might find something like data-value="U29tZSBzZWNyZXQgdGV4dA==". Here, btoa has been used to encode "Some secret text", which makes it harder to identify at a glance.
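As a rough illustration, a page's JavaScript might produce such an attribute along these lines. The toDataValueAttribute helper is hypothetical and exists only for this example.

```javascript
// Hypothetical helper: build a data-value attribute holding Base64 text.
function toDataValueAttribute(text) {
  // btoa() only accepts characters in the Latin-1 range (code points 0-255).
  return `data-value="${btoa(text)}"`;
}

console.log(toDataValueAttribute("Some secret text"));
// data-value="U29tZSBzZWNyZXQgdGV4dA=="
```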
Deobfuscation Techniques in Web Scraping
There are multiple strategies for scraping obfuscated web pages. Some common methods I've found useful include:
CSS Selectors with Substring Matching
When CSS classes are obfuscated, matching the exact class name can be tricky, as they may look like random strings that change frequently. To handle this, you can use substring matching in your CSS selectors. This way, you can match any class name containing a specific pattern, even if the entire name isn’t consistent.
For example, let’s say you’re scraping a site where the "buy" button has a class name that always includes the substring _buyBtn_, but the full class name might look something like _buyBtn_3iJ7sK or _buyBtn_x9L2kT. Instead of trying to match the whole class name exactly, you can use:
[class*="_buyBtn_"]
This selector matches any element whose class attribute contains _buyBtn_. You can implement it in your scraper, as shown in the Node.js and Cheerio example below.
const axios = require("axios");
const cheerio = require("cheerio");
// Sample URL
const url = "http://example.com";
axios.get(url).then((response) => {
  const $ = cheerio.load(response.data);
  // Find the button using substring matching
  const buyButton = $('[class*="_buyBtn_"]');
  // Output the text inside the matched button element
  console.log(buyButton.text());
}).catch((error) => {
  console.error("Error fetching the page:", error);
});
This way, you can focus on a stable part of the class name (_buyBtn_) and reliably locate the necessary elements, even with obfuscation.
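One caveat is that [class*="..."] tests the entire class attribute as a single string, so it can also match unrelated class names that happen to contain the same text. If that becomes a problem, a stricter check on individual class tokens can help. The sketch below uses a hypothetical hasClassWithPrefix helper and made-up class values to show the difference.

```javascript
// Hypothetical helper: true if any whitespace-separated class token
// starts with the given stable prefix.
function hasClassWithPrefix(classAttr, prefix) {
  return classAttr
    .split(/\s+/)
    .filter(Boolean)
    .some((token) => token.startsWith(prefix));
}

// A token that begins with the stable prefix is accepted.
console.log(hasClassWithPrefix("card _buyBtn_3iJ7sK", "_buyBtn_")); // true

// An unrelated class that merely contains the text is rejected...
console.log(hasClassWithPrefix("nav_buyBtn_wrapper", "_buyBtn_")); // false

// ...even though a plain substring test, like [class*=...], would accept it.
console.log("nav_buyBtn_wrapper".includes("_buyBtn_")); // true
```

In a Cheerio scraper, you could apply a check like this inside a .filter() callback on the elements returned by the broader substring selector.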
XPath with Wildcards
XPath is another way to locate elements in HTML, similar to CSS substring matching; however, XPath allows for more complex logic. You can search for elements based on attributes, match elements by text content, navigate the document structure, and much more.
For example, if you’re targeting an element with an obfuscated class attribute, you can do a partial match with the contains() function, which works much like a wildcard:
//div[contains(@class, '_3iJ7sK')]
This XPath expression matches any <div> element where the class attribute includes _3iJ7sK. You can then implement the expression in a Puppeteer and Node.js scraper, as shown below.
const puppeteer = require("puppeteer");
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Sample URL
  const url = "http://example.com";
  await page.goto(url);
  // Find a div element with a class containing "_3iJ7sK".
  // Note: page.$x() is only available in older Puppeteer releases; newer ones
  // removed it in favor of page.$$() with an "xpath/" prefix on the expression.
  const [targetElement] = await page.$x("//div[contains(@class, '_3iJ7sK')]");
  if (targetElement) {
    // Extract and print the text content
    const text = await page.evaluate((element) => element.textContent, targetElement);
    console.log(text);
  } else {
    console.log("Element not found");
  }
  await browser.close();
})();
As mentioned earlier, XPath can also select elements based on any attribute, not just class:
// Searches for a button element with type='submit'
page.$x("//button[@type='submit']")
It can also select elements based on their text content:
// Selects an <a> tag with the exact text 'Click here'
page.$x("//a[text()='Click here']")
XPath’s advanced logic makes it useful for scraping complex HTML structures. Some of its features, such as matching elements by their text content, have no equivalent in standard CSS selectors.
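These conditions can also be combined in a single expression with the XPath "and" operator. The snippet below is a small sketch that only assembles the expression string; the buildXPath helper is hypothetical, and it assumes the values you pass in contain no single quotes.

```javascript
// Hypothetical helper: join several XPath predicates with "and".
// It only builds a string; it does not run the query.
function buildXPath(tag, conditions) {
  return `//${tag}[${conditions.join(" and ")}]`;
}

const expression = buildXPath("button", [
  "contains(@class, '_buyBtn_')",
  "@type='submit'",
]);

console.log(expression);
// //button[contains(@class, '_buyBtn_') and @type='submit']
```

The resulting string can be passed to whichever XPath query method your scraping library provides.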
Decoding Base64 with atob
If you come across content encoded with btoa(), you can easily decode it using the atob() function, as shown below:
const encodedString = "SGVsbG8sIFdvcmxkIQ==";
const decodedString = atob(encodedString);
console.log(decodedString); // Outputs: "Hello, World!"
While this might not be useful for scraping selectors, it can help you better understand the targeted content’s markup structure, especially in data attributes.
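For instance, you could decode every Base64 data attribute in a fetched page in one pass. The sketch below assumes the attribute is named data-value, as in the earlier example, and that the decoded bytes are UTF-8 text; both helper functions are hypothetical. It uses a regular expression for brevity, whereas a real scraper would more likely read the attribute through its HTML parser.

```javascript
// Decode Base64 into text, treating the decoded bytes as UTF-8.
function decodeBase64Utf8(encoded) {
  // atob() returns one character per byte, so map each back to its byte value.
  const bytes = Uint8Array.from(atob(encoded), (ch) => ch.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

// Hypothetical helper: collect and decode every data-value attribute
// found in an HTML string.
function decodeDataValues(html) {
  const pattern = /data-value="([A-Za-z0-9+/=]+)"/g;
  return [...html.matchAll(pattern)].map((match) => decodeBase64Utf8(match[1]));
}

const html = '<span data-value="U29tZSBzZWNyZXQgdGV4dA=="></span>';
console.log(decodeDataValues(html)); // [ 'Some secret text' ]
```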
Conclusion
Obfuscation is becoming increasingly common in web development, creating challenges for web scrapers that rely on structured data. This tutorial explored how these obfuscation techniques are implemented and provided methods to help overcome them.
Cover photo by Chris Stein on Unsplash