How to reliably extract data from Instagram (and similar sites) by intercepting network calls in a Chrome extension—robust, minimal, and production-ready.
Why Network Interception?
Traditional DOM scraping is fragile: UI changes, dynamic loading, and anti-bot measures make it unreliable. Instead, intercepting network calls lets you capture the raw data the site uses—before it even hits the DOM.
Advantages:
- Immune to UI/layout changes
- Access to complete, structured data
- Faster (no DOM parsing)
- Stealthier (less likely to trigger anti-bot)
- Works for any site using XHR/fetch APIs
Architecture Overview
┌─ Page Context (Script Injection) ─┐
│ Patch fetch/XHR, intercept API │
│ Dispatch CustomEvent w/ data │
└───────────────────────────────────┘
│
▼
┌─ Content Script ──────────────────┐
│ Listen for events, forward data │
│ to background via messaging │
└───────────────────────────────────┘
│
▼
┌─ Background Script ───────────────┐
│ Process, deduplicate, store, │
│ or forward to external API │
└───────────────────────────────────┘
Minimal, Robust Implementation
1. Manifest Setup (MV3)
{
"manifest_version": 3,
"name": "Instagram Network Interceptor",
"version": "1.0.0",
"permissions": ["scripting", "storage"],
"host_permissions": ["https://*.instagram.com/*"],
"background": { "service_worker": "background.js" },
"content_scripts": [
{
"matches": ["https://*.instagram.com/*"],
"js": ["content.js"],
"run_at": "document_start"
}
],
"web_accessible_resources": [
{
"resources": ["public/page-interceptor.js"],
"matches": ["<all_urls>"]
}
]
}
2. Page Interceptor (public/page-interceptor.js)
Runs in the page context, patches fetch/XHR, and dispatches a CustomEvent with the response data.
(function () {
if (window.__instaInterceptor) return;
window.__instaInterceptor = true;
const originalFetch = window.fetch;
window.fetch = async function (input, init) {
const resp = await originalFetch.apply(this, arguments);
try {
const url = typeof input === 'string' ? input : input.url || '';
if (url.includes('/graphql') || url.includes('/api/')) {
const cloned = resp.clone();
const text = await cloned.text();
window.dispatchEvent(new CustomEvent('instaApi', {
detail: { url, body: text, status: resp.status, method: init?.method || 'GET' }
}));
}
} catch (err) { /* ignore */ }
return resp;
};
const originalXHROpen = XMLHttpRequest.prototype.open;
const originalXHRSend = XMLHttpRequest.prototype.send;
XMLHttpRequest.prototype.open = function (method, url) {
this._method = method;
this._url = url;
return originalXHROpen.apply(this, arguments);
};
XMLHttpRequest.prototype.send = function () {
this.addEventListener('load', function () {
if (this._url && this._url.includes('/api/')) {
window.dispatchEvent(new CustomEvent('instaApi', {
detail: { url: this._url, body: this.responseText, status: this.status, method: this._method }
}));
}
});
return originalXHRSend.apply(this, arguments);
};
})();
3. Content Script (content.js)
Injects the interceptor and relays events to the background script.
// Inject the page script
const url = chrome.runtime.getURL('public/page-interceptor.js');
const script = document.createElement('script');
script.src = url;
script.type = 'text/javascript';
document.head.appendChild(script);
// Listen for intercepted data
window.addEventListener('instaApi', (evt) => {
chrome.runtime.sendMessage({ type: 'API_PAYLOAD', payload: evt.detail });
});
4. Background Script (background.js)
Processes, deduplicates, and stores or forwards the data.
const processedShortcodes = new Set();
chrome.runtime.onMessage.addListener((msg, sender) => {
if (msg.type === 'API_PAYLOAD') {
try {
const json = JSON.parse(msg.payload.body);
// Example: extract Instagram post shortcode
const shortcode = extractShortcode(json);
if (!processedShortcodes.has(shortcode)) {
processedShortcodes.add(shortcode);
// Store, process, or forward to external API here
chrome.storage.local.set({ [Date.now()]: json });
}
} catch (e) {
// Non-JSON or irrelevant response
}
}
});
function extractShortcode(json) {
// Minimal example: adapt to your API structure
return json?.data?.shortcode || '';
}
Best Practices & Gotchas
- Always clone() responses before reading body.
- Deduplicate by post ID/shortcode to avoid reprocessing.
- Rate limit if forwarding to external APIs.
- Handle CSP by injecting a file, not inline code.
- Respect privacy and ToS: only process public data, implement delays, and avoid logging sensitive info.
Extending to Other Platforms
You can adapt the same pattern for Twitter, LinkedIn, etc., by changing the URL detection logic in the interceptor.
const isTargetApi = (url) =>
url.includes('instagram.com/graphql') ||
url.includes('twitter.com/i/api') ||
url.includes('linkedin.com/voyager');
Conclusion
Network interception in the page context is the most robust, future-proof way to extract data from modern web apps. With a minimal Chrome extension, you can capture structured data directly from the source—no more brittle DOM scraping.
Top comments (0)