DEV Community

Jer Catallo

OSINT Content Discovery: Why You Need to Know What's Publicly Exposed About Your Web Assets

Passive content discovery helps you map attack surfaces with little or no direct contact with target systems. You can use public search engines, browser extensions, web archives, code repositories, and cloud storage references to find exposed assets. This guide covers five methods you can apply in your own authorized security assessments.

Ethical Considerations

Only use these methods on assets you own or have clear written permission to test.

  • Get written permission before you target any domain, repo, or cloud resource.
  • Follow all laws, platform terms, and bug bounty scope rules.
  • Do not try to access accounts, use found credentials, steal data, or leave backdoors.
  • Stop and report right away if you find sensitive data.
  • Do not proceed if you are not sure about your authorization.

Google Dorking

Google search operators let you filter results to specific domains, file types, URL paths, and page titles. These operators are passive and use only public indexed data.

Step 1: Use site: to Scope Your Search

The site: operator limits results to one domain or hosting platform.

site:<target-domain> "<keyword>"

This query shows only pages from the target domain that contain your keyword. You can use it to find public pages hosted on a specific platform.

For example, scoping a search to github.io returns indexed GitHub Pages sites that match the keyword, which shows how site: limits results to one hosting domain.
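If you run many scoped searches, it can help to build the query string in code so the quoting stays consistent. The sketch below assembles the site: query from Step 1 and URL-encodes it. The function name and the example.com domain are placeholders for illustration, not part of any tool.

```python
from urllib.parse import quote_plus


def site_search_url(domain: str, keyword: str) -> str:
    """Return a Google search URL for the query: site:<domain> "<keyword>"."""
    query = f'site:{domain} "{keyword}"'
    # quote_plus encodes ':' and '"' and turns spaces into '+'.
    return "https://www.google.com/search?q=" + quote_plus(query)


if __name__ == "__main__":
    # example.com is a placeholder; use a domain you are authorized to assess.
    print(site_search_url("example.com", "staging login"))
```

Open the printed URL in your browser by hand. The sketch only builds the link and does not send any request itself.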

Step 2: Use filetype: to Find Exposed Documents

The filetype: operator filters results by file extension.

"<target-phrase>" filetype:<extension>

This query finds indexed files of a specific type that contain your target phrase. You can use it to map exposed documents and artifacts.

For example, searching with the ipynb extension returns public Jupyter notebooks that may hold code, data samples, or analysis work.

Remediation: Treat found documents as sensitive even if they are public. Do not copy or share private content. Report exposure through approved channels only.

Step 3: Use inurl: for Path-Based Discovery

The inurl: operator targets pages with specific words in the URL path.

inurl:<path-keyword> "<target-phrase>"

This query finds pages with your keyword in the URL path. You can use it to find specific page types.

For example, a path keyword such as about returns personal or professional about pages that give more context about the target.

Remediation: Avoid personal targeting, doxxing, or profiling. Collect only the data you need for your security task.

Step 4: Use intitle: for Title-Based Discovery

The intitle: operator matches pages with specific text in the HTML title tag.

intitle:"<title-text>" "<keyword1>" "<keyword2>"

This query finds pages with your text in the title plus extra keywords. You can use it to find project pages tied to certain technologies.

For example, a title search combined with technology keywords returns developer portfolio pages that list their technology stack in the page title.

Remediation: Keep searches within approved scope. Do not use findings to target hobby or student projects.
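The four operators from Steps 1 to 4 can be combined in one query. Here is a minimal sketch of a small helper that composes them in a fixed order. The Dork class and its field names are hypothetical and only illustrate how the pieces fit together.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Dork:
    """Hypothetical container for the search operators covered above."""
    site: Optional[str] = None
    filetype: Optional[str] = None
    inurl: Optional[str] = None
    intitle: Optional[str] = None
    phrases: List[str] = field(default_factory=list)

    def build(self) -> str:
        parts = []
        if self.site:
            parts.append(f"site:{self.site}")
        if self.filetype:
            parts.append(f"filetype:{self.filetype}")
        if self.inurl:
            parts.append(f"inurl:{self.inurl}")
        if self.intitle:
            parts.append(f'intitle:"{self.intitle}"')
        # Exact-match phrases are wrapped in double quotes.
        parts.extend(f'"{phrase}"' for phrase in self.phrases)
        return " ".join(parts)


if __name__ == "__main__":
    # example.com is a placeholder for an in-scope domain.
    print(Dork(site="example.com", filetype="pdf", phrases=["internal use only"]).build())
```

Keeping each operator as a separate field makes it easy to log exactly which filters produced a finding when you write up your report.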

Wappalyzer Technology Fingerprinting

Wappalyzer detects web technologies from the browser. It reads HTTP headers, HTML, JavaScript files, and loaded resources to identify frameworks, CDNs, and services.

Step 5: Fingerprint OWASP Juice Shop Stack

Open the target URL in a browser with Wappalyzer installed.

https://juice-shop.github.io

The extension scans the page and shows detected technologies in its panel.

Wappalyzer found front-end libraries, CDN providers, and hosting indicators on juice-shop.github.io. You can use this stack data to plan your next assessment steps.

Remediation: Only fingerprint where recon is allowed. Do not assume you can attack just because you see stack details. Use this data for defensive testing.
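To make the idea of fingerprinting concrete, the sketch below matches a few signatures against response headers and HTML that you have already captured. The signatures are simplified examples made up for this post and are not Wappalyzer's real rule set, which is much larger and more precise.

```python
import re

# Illustrative signatures only. These are NOT Wappalyzer's actual rules.
HEADER_SIGNATURES = {
    "nginx": ("server", re.compile(r"nginx", re.I)),
    "Express": ("x-powered-by", re.compile(r"express", re.I)),
}
HTML_SIGNATURES = {
    "Angular": re.compile(r"ng-version=", re.I),
    "jQuery": re.compile(r"jquery[^\"']*\.js", re.I),
}


def fingerprint(headers: dict, html: str) -> list:
    """Return a sorted list of technology names whose signature matches."""
    # Header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    found = []
    for tech, (header, pattern) in HEADER_SIGNATURES.items():
        if pattern.search(normalized.get(header, "")):
            found.append(tech)
    for tech, pattern in HTML_SIGNATURES.items():
        if pattern.search(html):
            found.append(tech)
    return sorted(found)


if __name__ == "__main__":
    # Sample data typed in by hand; nothing is fetched from the network.
    sample_headers = {"Server": "nginx/1.24.0", "X-Powered-By": "Express"}
    sample_html = '<app-root ng-version="15.2.0"></app-root>'
    print(fingerprint(sample_headers, sample_html))
```

The sketch works on saved data only, so you can experiment with signatures without sending extra traffic to a target.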

Step 6: Analyze GitHub Technology Profile

Apply Wappalyzer to large platforms to see their technology footprint.

https://github.com

The extension finds frameworks, analytics tools, CDN providers, and cloud services.

Wappalyzer found React, React Router, GSAP, AWS-related services, and more on github.com. This shows how fingerprinting works on large applications.

Remediation: Follow platform terms and rate limits. Do not scrape data in an abusive way. Use collected data only for authorized tasks.

Wayback Machine Archive Analysis

The Wayback Machine stores historical snapshots of web pages. You can use it to find old URLs, retired endpoints, and content versions no longer on the live site.

Step 7: Search for Historical Snapshots

Enter the target domain into the Wayback Machine search.

https://web.archive.org/web/*/<target-domain>

Browse the calendar timeline to see archived snapshots from different dates.

Use this view as your entry point for historical content analysis. Open snapshots from different dates and compare their pages and links with the live site to spot URLs and endpoints that have since been removed.

Remediation: Just because content is archived does not mean you can test current systems. Do not use archived findings to access restricted areas without approval. Check ownership and scope before you test any found endpoint.
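Besides the calendar view, the Wayback Machine offers a CDX endpoint that returns captured URLs as JSON, with a header row followed by one row per capture. The sketch below builds such a query and turns a response into snapshot links. The helper names are made up for this example, and the sample rows are typed in by hand rather than fetched.

```python
from urllib.parse import urlencode


def cdx_query_url(domain: str, limit: int = 50) -> str:
    """Build a CDX query that lists captured URLs under a domain."""
    params = {
        "url": f"{domain}/*",          # every path under the domain
        "output": "json",
        "fl": "timestamp,original,statuscode",
        "collapse": "urlkey",          # one row per unique URL
        "limit": str(limit),
    }
    return "https://web.archive.org/cdx/search/cdx?" + urlencode(params)


def parse_cdx_rows(rows: list) -> list:
    """Convert the JSON rows (header row first) into a list of dicts."""
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def snapshot_url(capture: dict) -> str:
    """Return the archive link for one parsed capture."""
    return f"https://web.archive.org/web/{capture['timestamp']}/{capture['original']}"


if __name__ == "__main__":
    print(cdx_query_url("example.com"))
    # Hand-written sample in the CDX JSON shape, not real archive data.
    sample = [
        ["timestamp", "original", "statuscode"],
        ["20200101000000", "http://example.com/old-admin/", "200"],
    ]
    for capture in parse_cdx_rows(sample):
        print(snapshot_url(capture))
```

Querying the archive does not touch the target, but the Remediation note above still applies before you visit any endpoint you find on the live site.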

GitHub OSINT

GitHub search helps you find code references, config files, and metadata. Public repos often contain clues about infrastructure, dependencies, and potential misconfigurations.

Step 8: Search GitHub for Target Artifacts

Use the GitHub search page with targeted queries.

https://github.com/search?q=<target-query>

You can find repos, code snippets, and config files in your assessment scope.

Step 9: Use GitHub Dork Patterns for Credential Discovery

Organization-scoped searches limit results to one company's public repos.

org:<company-name> <secret-keyword>

Add keywords like "password", "token", "api_key", or "secret" to check for credential exposure.
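A quick way to cover every keyword is to generate one org-scoped search link per keyword. This is a rough sketch: the function name and the example-org value are placeholders, and the links are meant to be opened by hand in a signed-in browser.

```python
from urllib.parse import urlencode

# Keywords taken from the step above.
SECRET_KEYWORDS = ["password", "token", "api_key", "secret"]


def org_code_search_urls(org: str, keywords=SECRET_KEYWORDS) -> dict:
    """Map each org-scoped query string to its GitHub code search URL."""
    urls = {}
    for keyword in keywords:
        query = f'org:{org} "{keyword}"'
        urls[query] = "https://github.com/search?" + urlencode(
            {"q": query, "type": "code"}
        )
    return urls


if __name__ == "__main__":
    # example-org is a placeholder for an organization you may assess.
    for query, url in org_code_search_urls("example-org").items():
        print(f"{query}\n  {url}")
```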

# =========================================================
# GITHUB OSINT: HIGH-VALUE TARGET DORKS
# =========================================================

# --- Cloud & Infrastructure Secrets ---

# Searches for AWS Access Key IDs within PEM certificate files
"AKIA" extension:pem

# Locates exposed AWS credential configuration files
"AWS_ACCESS_KEY_ID" filename:credentials

# Finds unprotected SSH private keys for server access
"BEGIN OPENSSH PRIVATE KEY" filename:id_rsa

# Discovers Google Cloud Platform (GCP) service account credentials
filename:config "google_application_credentials"


# --- Database & Authentication Leaks ---

# Finds hardcoded MongoDB connection strings in JavaScript files
"mongodb+srv://" extension:js

# Searches for Java/MySQL database connection strings with passwords
"jdbc:mysql://" "password"

# Locates WordPress configuration files containing database passwords
filename:wp-config.php "DB_PASSWORD"

# Finds PostgreSQL password files for local database instances
filename:.pgpass "localhost:5432"


# --- API Keys & Tokens ---

# Hunts for hardcoded Bearer authentication tokens in Python scripts
"authorization: bearer" extension:py

# Locates exposed Django/Python web framework secret keys
filename:settings.py "SECRET_KEY="

# Finds live Stripe payment processing API keys
"api.stripe.com" "sk_live_"

# Discovers exposed Slack webhook URLs
"hooks.slack.com/services/" extension:js


# --- Targeted Corporate Recon ---
# (Replace 'companyname' with your target organization)

# Searches a specific organization's repos for internal Jira passwords
org:companyname "jira_password"

# Finds Atlassian/Confluence access tokens for a specific target domain
"companyname.atlassian.net" "token"

# Locates terminal history files showing SSH connections to a target
filename:.bash_history "ssh user@companyname"

# Discovers internal corporate network routing or configuration files
"corp.companyname.internal" extension:conf

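The cheat sheet above groups high-value search patterns for cloud credentials, database leaks, and API tokens, plus org-scoped searches like org:companyname for focused recon. The filename: and extension: qualifiers come from GitHub's legacy code search. If a query returns nothing in the current code search, try expressing the same filter with path: (for example, path:*.pem).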

Remediation: Only use this in authorized training, internal audits, or approved bug bounty scopes. Never use found secrets or credentials. Report exposed credentials through approved incident channels right away.
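You can also turn these markers around and check your own files before they are pushed. The sketch below scans text for a few of the patterns from the cheat sheet and redacts each match so the report never contains a usable secret. The regular expressions are simplified assumptions, not a complete detection rule set, and the sample key is the dummy value AWS uses in its documentation.

```python
import re

# Simplified patterns based on the markers in the cheat sheet above.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "stripe_live_secret_key": re.compile(r"\bsk_live_[0-9a-zA-Z]{10,}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:OPENSSH|RSA|EC) PRIVATE KEY-----"),
    "slack_webhook": re.compile(r"https://hooks\.slack\.com/services/[A-Za-z0-9/]+"),
}


def redact(value: str, keep: int = 4) -> str:
    """Keep the first few characters and mask the rest."""
    return value[:keep] + "*" * (len(value) - keep)


def scan_text(text: str) -> list:
    """Return (line number, pattern name, redacted match) for each hit."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            for match in pattern.finditer(line):
                findings.append((line_no, name, redact(match.group())))
    return findings


if __name__ == "__main__":
    # AKIAIOSFODNN7EXAMPLE is the documented AWS example key, not a real one.
    sample = "aws_access_key_id = AKIAIOSFODNN7EXAMPLE\nname = demo\n"
    for finding in scan_text(sample):
        print(finding)
```

Redacting at scan time means the findings can be shared in a ticket or report without spreading the exposed value further.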

S3 Bucket Discovery

Amazon S3 buckets often show up in public references through naming patterns, source code, and config files. You can find them using search operators and verify access with AWS CLI.

Step 10: Find S3 Buckets Through Public References

Search for public S3 bucket references with Google dorking.

site:s3.amazonaws.com "<target-company>"

You can also search GitHub for bucket names in source code and config files.
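Once you have source code or config text from an in-scope repo, you can pull bucket names out of it with a few patterns. The sketch below handles three common reference styles: virtual-hosted URLs, path-style URLs, and s3:// URIs. The regular expressions are simplified assumptions and will miss unusual endpoints, and the sample text is invented.

```python
import re

# Bucket names are 3-63 characters: lowercase letters, digits, dots, hyphens.
BUCKET = r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]"

# https://<bucket>.s3.amazonaws.com or https://<bucket>.s3.<region>.amazonaws.com
VIRTUAL_HOSTED = re.compile(rf"\b({BUCKET})\.s3(?:[.-][a-z0-9-]+)?\.amazonaws\.com")
# https://s3.amazonaws.com/<bucket> or https://s3.<region>.amazonaws.com/<bucket>
# The lookbehind stops this from matching the tail of a virtual-hosted URL.
PATH_STYLE = re.compile(rf"(?<![\w.-])s3(?:[.-][a-z0-9-]+)?\.amazonaws\.com/({BUCKET})")
# s3://<bucket>
S3_URI = re.compile(rf"s3://({BUCKET})")


def extract_bucket_names(text: str) -> list:
    """Return the sorted, de-duplicated bucket names referenced in text."""
    names = set()
    for pattern in (VIRTUAL_HOSTED, PATH_STYLE, S3_URI):
        names.update(pattern.findall(text))
    return sorted(names)


if __name__ == "__main__":
    # Invented sample lines using the placeholder names from this post.
    sample = (
        'ASSET_URL = "https://companyname-assets.s3.amazonaws.com/img/logo.png"\n'
        'backup = "https://s3.us-west-2.amazonaws.com/companyname-backup/db.tar.gz"\n'
        "aws s3 sync ./site s3://companyname-www\n"
    )
    print(extract_bucket_names(sample))
```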

Step 11: Check Bucket Access with AWS CLI

Use the AWS CLI to check if a bucket allows public listing without credentials.

aws s3 ls s3://<bucket-name> --no-sign-request

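Next, request the bucket ACL to see which access permissions are granted. A successful anonymous response means the ACL itself is publicly readable. Look in the Grants list for the AllUsers or AuthenticatedUsers groups to see what access the bucket allows.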

aws s3api get-bucket-acl --bucket <bucket-name> --no-sign-request
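The get-bucket-acl command prints JSON with an Owner and a list of Grants. As a rough sketch, the helper below reads that JSON and reports grants made to the two public group URIs. The function name is made up, and the sample ACL is written by hand to mirror the response shape.

```python
import json

# Predefined S3 group URIs that indicate broad access.
ALL_USERS = "http://acs.amazonaws.com/groups/global/AllUsers"
AUTHENTICATED_USERS = "http://acs.amazonaws.com/groups/global/AuthenticatedUsers"


def public_grants(acl_json: str) -> list:
    """Return (group, permission) pairs for grants to public groups."""
    acl = json.loads(acl_json)
    results = []
    for grant in acl.get("Grants", []):
        uri = grant.get("Grantee", {}).get("URI")
        if uri in (ALL_USERS, AUTHENTICATED_USERS):
            group = uri.rsplit("/", 1)[-1]
            results.append((group, grant.get("Permission")))
    return results


if __name__ == "__main__":
    # Hand-written sample in the shape of a get-bucket-acl response.
    sample_acl = json.dumps({
        "Owner": {"ID": "owner-id-placeholder"},
        "Grants": [
            {"Grantee": {"Type": "CanonicalUser", "ID": "owner-id-placeholder"},
             "Permission": "FULL_CONTROL"},
            {"Grantee": {"Type": "Group", "URI": ALL_USERS},
             "Permission": "READ"},
        ],
    })
    print(public_grants(sample_acl))
```

An empty result does not prove the bucket is private, because bucket policies can also grant access. Treat it as one signal among several.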

The cheat sheet below groups the S3 discovery patterns and access checks from this section.

# =========================================================
# S3 BUCKET OSINT & RECONNAISSANCE
# =========================================================

# --- 1. Passive Discovery (Google & GitHub Dorks) ---

# Google Dork: Finds publicly indexed S3 buckets for a target
site:s3.amazonaws.com intitle:"index of" "companyname"

# Google Dork: Searches for exposed bucket URLs in target's documents
"s3.amazonaws.com" ext:pdf "companyname"

# GitHub Dork: Locates bucket URLs hardcoded in a company's repositories
"s3.amazonaws.com" org:companyname

# GitHub Dork: Finds custom S3 endpoints mapped to a target domain
"companyname.s3.amazonaws.com"


# --- 2. Active Enumeration (Brute-Force Naming Conventions) ---
# Common permutations used in automated tools (e.g., ffuf, Gobuster)
# Format: https://{target}-{keyword}.s3.amazonaws.com

companyname-assets
companyname-public
companyname-private
companyname-dev
companyname-backup
companyname-staging
companyname-prod
companyname-www


# --- 3. Access Verification (AWS CLI) ---
# Testing for insecure permissions (Requires AWS CLI installed)

# Attempt to list the contents of a bucket anonymously (No credentials)
aws s3 ls s3://companyname-assets --no-sign-request

# Attempt to copy a file from the bucket to the local machine
# (authorized scope only; do not download data you are not cleared to handle)
aws s3 cp s3://companyname-backup/db_dump.sql . --no-sign-request

# Attempt to write a harmless file to test for insecure "Write" permissions
# (authorized scope only; this modifies the bucket)
aws s3 cp test_file.txt s3://companyname-public/ --no-sign-request

The cheat sheet above shows passive discovery patterns for S3 references using Google and GitHub queries, followed by naming permutation examples and CLI commands to check bucket permissions. The permutation and CLI sections are active checks, not passive ones.

Remediation: Cloud enumeration needs explicit permission from the asset owner. Do not list, download, upload, or change bucket content unless you have written authorization. If you find an exposed bucket, stop testing and report it with minimal proof.
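For an authorized audit of your own organization, the naming permutations from the cheat sheet can be generated locally and filtered against the basic S3 naming rules. The sketch below only builds candidate names and URLs and sends no requests. The function names are placeholders, and the validity check covers length and allowed characters only.

```python
import re

# Keywords taken from the permutation list in the cheat sheet.
KEYWORDS = ["assets", "public", "private", "dev", "backup", "staging", "prod", "www"]

# Basic naming rule: 3-63 characters, lowercase letters, digits, dots, hyphens,
# starting and ending with a letter or digit.
VALID_BUCKET = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def candidate_buckets(target: str, keywords=KEYWORDS) -> list:
    """Return <target>-<keyword> names that pass the basic naming rule."""
    base = target.strip().lower()
    names = [f"{base}-{keyword}" for keyword in keywords]
    return [name for name in names if VALID_BUCKET.match(name)]


def bucket_url(name: str) -> str:
    """Return the virtual-hosted URL for a candidate bucket name."""
    return f"https://{name}.s3.amazonaws.com"


if __name__ == "__main__":
    # companyname is the placeholder used throughout this post.
    for name in candidate_buckets("companyname"):
        print(bucket_url(name))
```

Keep the generated list with your scope document so the asset owner can confirm which names actually belong to them before anything is tested.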

Summary

You now have five content discovery methods you can use in authorized assessments. Google dorking helps you map indexed content with targeted operators. Wappalyzer gives fast technology stack details. The Wayback Machine reveals historical web data and retired endpoints. GitHub OSINT uncovers code references and config metadata. S3 recon shows cloud storage discovery patterns. The search engine, archive, and GitHub methods are passive, while Wappalyzer loads the target page and the AWS CLI checks send requests to the bucket, so treat those two as active steps. Use every method only within authorized scopes with clear permission.


If you found this helpful, drop a like and share it with someone learning security. If you have questions, ran into something different in your own lab, or want to share your results, leave a comment below. Always happy to connect and talk about security, recon techniques, or anything AppSec related.

Feel free to connect with me on LinkedIn

Always open to connecting with people in security, development, or both. Whether you are building something, breaking something, or just getting started, feel free to reach out.
