Manual content discovery is a core skill in application security testing. Instead of relying only on automated scanners, you can use simple HTTP requests and browser tools to find exposed files, hidden paths, and technology fingerprints. This post covers techniques such as checking robots.txt, fingerprinting favicons, reading sitemap.xml, inspecting HTTP headers, and spotting framework markers in HTML source.
These methods help you understand a target's structure and find information disclosure issues early, before running heavy scanning tools.
Ethical Considerations
- Only test systems you own or have explicit written permission to assess.
- Follow the defined scope, timing, and rules of engagement set by the owner.
- Stop immediately if you find data outside scope and report it through approved channels.
- Use findings for defense and remediation, not exploitation.
- Treat discovered paths like admin or staff portals as sensitive data. Do not brute-force or abuse them.
- Do not publish sensitive headers, tokens, or internal values outside approved reports.
Robots.txt Analysis
The robots.txt file tells web crawlers which paths to avoid. It can accidentally reveal sensitive routes like admin panels or staff portals.
curl -s https://<target-domain>/robots.txt
This command fetches the robots.txt file so you can check Disallow and Allow directives for hidden paths.
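To pull out just the restricted paths, you can filter the output. A minimal sketch using grep:
curl -s https://<target-domain>/robots.txt | grep -i "disallow"
Each matching line points to a path the owner asked crawlers to skip, which often doubles as a list of routes worth a closer, authorized look.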
The response shows a Disallow: /staff-portal directive under User-agent: *. This means the site owner does not want crawlers to index the staff portal, but the path is still visible to anyone who checks this file.
Result: The /staff-portal route is exposed through robots.txt. While this does not mean the path is vulnerable, it gives you a starting point for further authorized testing.
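Before digging further, it is worth checking whether the route is actually protected or merely hidden. A quick status check, assuming the /staff-portal path from the robots.txt output above:
curl -s -o /dev/null -w "%{http_code}\n" https://<target-domain>/staff-portal
A 401 or 403 suggests access control is in place; a 200 means the page is reachable by anyone who finds the path.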
Remediation: Remove sensitive paths from robots.txt. Use proper authentication and authorization controls to protect those routes instead. Security through obscurity is not a reliable protection.
Favicon Fingerprinting
Favicons are small icons that browsers display in tabs. Different frameworks and products use unique favicon files, so you can calculate a hash and match it against known databases to identify the technology.
curl -s https://<target-domain>/favicon.ico | md5sum
This downloads the favicon and calculates its MD5 hash for comparison.
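As a side note, different databases key favicons with different hashes. The MD5 above works for lists like the OWASP database, while Shodan indexes an MMH3 hash of the base64-encoded icon. A minimal sketch, assuming Python 3 with the third-party mmh3 package installed:
curl -s https://<target-domain>/favicon.ico -o favicon.ico
python3 -c "import mmh3, base64; print(mmh3.hash(base64.encodebytes(open('favicon.ico','rb').read())))"
Searching Shodan for http.favicon.hash:<value> then surfaces other hosts serving the same icon.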
The browser network tab confirms a successful HTTP 200 response for favicon.ico.
The calculated MD5 hash is f276b19aabcb4ae8cda4d22625c6735f.
Searching this hash in the OWASP favicon database returns a match for cgiirc (0.5.9).
Result: The favicon hash maps to cgiirc (0.5.9), an IRC web client. This suggests the target may reuse assets from this product or host an instance of it somewhere. You can use this information to check for known issues with this version.
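One quick way to check for published issues is a local Exploit-DB search, assuming the searchsploit utility from the exploitdb package is installed:
searchsploit cgiirc
Any hits should only be validated against in-scope systems with explicit permission.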
Remediation: Replace default framework or third-party favicons with a custom one. This prevents passive technology identification through favicon hashing.
Sitemap.xml Enumeration
The sitemap.xml file lists pages that the site wants search engines to index. It often reveals old routes, API endpoints, or parameterized URLs you might not find through normal browsing.
curl -s https://<target-domain>/sitemap.xml
This retrieves the sitemap to find discoverable paths and endpoints.
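Raw XML is tedious to read, so it helps to extract just the URLs. A minimal sketch, assuming GNU grep with PCRE support (-P):
curl -s https://<target-domain>/sitemap.xml | grep -oP "(?<=<loc>)[^<]+"
This prints one URL per line, which makes it easy to spot patterns and feed paths into later testing steps.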
The sitemap contains multiple URL entries, including /news/, /contact, and parameterized article paths with sequential IDs such as /news/article?id=1, /news/article?id=2, and /news/article?id=3.
Result: The sitemap exposes several routes and a pattern for article IDs. You can use this to map out the content structure and check for IDOR or other parameter-based issues on these endpoints.
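To see how the parameterized endpoint behaves across the observed range, a small loop over the sequential IDs works well. A minimal sketch, limited to the IDs already listed in the sitemap and only for use against in-scope targets:
for i in 1 2 3; do curl -s -o /dev/null -w "id=$i -> %{http_code}\n" "https://<target-domain>/news/article?id=$i"; done
Unexpected 200 responses for IDs outside the published range, or content belonging to other users, would be worth flagging for IDOR review.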
Remediation: Avoid listing sensitive or internal endpoints in sitemap.xml. Only include public-facing, intended content. For parameterized URLs, validate and authorize each request server-side.
HTTP Header Inspection
HTTP response headers contain metadata about the server, security configuration, and sometimes version information. Missing security headers or verbose server details can reveal weaknesses.
curl -I https://<target-domain>
This sends a HEAD request to get only the response headers without the full page body.
The headers show Server: nginx/1.18.0 (Ubuntu) and a custom X-FLAG: THM{HEADER_FLAG} header.
Result: The Server header leaks the exact web server version and operating system, which helps you narrow down version-specific issues. The response also lacks important security headers such as Content-Security-Policy, Strict-Transport-Security, and X-Frame-Options, which can leave the site open to issues like clickjacking and protocol downgrade attacks.
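A quick way to check which of these headers are present is to filter the HEAD response directly. A minimal sketch using extended grep:
curl -sI https://<target-domain> | grep -iE "content-security-policy|strict-transport-security|x-frame-options|x-content-type-options"
Empty output means none of these common security headers are set.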
Remediation: Configure your web server to suppress or mask the Server header. Add security headers like Content-Security-Policy, Strict-Transport-Security, X-Frame-Options, and X-Content-Type-Options. You can use tools like securityheaders.com to check your current header posture.
Framework Stack Identification
Web frameworks often leave markers in HTML source code, such as generator comments or meta tags. These markers reveal the technology stack and sometimes the exact version.
curl -s https://<target-domain> | grep -i "generated\|framework"
This fetches the homepage HTML and filters for framework-related comments.
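Generator markers do not always contain the word "framework", so it can also help to look for meta generator tags and HTML comments. A minimal sketch with extended grep; the patterns are common markers, not an exhaustive list:
curl -s https://<target-domain> | grep -iE "<meta[^>]*generator|<!--"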
The HTML source contains a comment showing the page was generated using the THM Framework.
Visiting the framework reference URL confirms it is the THM Web Framework with visible version details.
Result: The source comment reveals THM Framework v1.2 as the underlying technology. You can now research this framework for known misconfigurations, default paths, or version-specific vulnerabilities.
Remediation: Strip generator comments and version markers from production HTML before deployment. Configure your build pipeline or template engine to exclude debug and version metadata from rendered output.
Summary
Manual content discovery gives you a clear picture of a target without heavy tooling. You can see how robots.txt can leak sensitive paths, favicon hashes can identify technologies, sitemap.xml can map out hidden routes, HTTP headers can expose server versions and missing security controls, and HTML source comments can reveal framework details. These techniques work well as a first step before running automated scanners and help build a stronger picture of the target's attack surface.
If you found this helpful, drop a like and share it with someone learning security. If you have questions, ran into something different in your own lab, or want to share your results, leave a comment below. Always happy to connect and talk about security, recon techniques, or anything AppSec related.
Feel free to connect with me on LinkedIn.
Always open to connecting with people in security, development, or both. Whether you are building something, breaking something, or just getting started, feel free to reach out.