Priyanshu Kumar Sinha

Posted on Jan 2

Think That Website Looks Safe? Meet WebShield, Your Cybersecurity Ally!

Cybersecurity is not a product, it's a process. – Bruce Schneier

Have you ever wondered how safe the websites you visit are? That’s the question we aimed to tackle with WebShield, our cybersecurity project from the recent hackathon.

WebShield is designed to detect suspicious websites by analyzing multiple layers of their data, such as IP addresses, domain details, SSL certificates, and much more.

It combines technical prowess with user-friendly insights to make the internet a safer space for everyone.

Whether you’re a tech enthusiast or someone simply curious about cybersecurity, this post will guide you through WebShield’s workings, its glossary, and its next-level potential with the integration of Large Language Models (LLMs). Let’s dive in!

About Me

Hi, I’m Priyanshu Kumar Sinha, currently pursuing my B.Tech in Computer Science and Business Systems at Dayananda Sagar College of Engineering. I’ve always been passionate about solving real-world problems through technology.

The idea for WebShield arose from a recurring issue we noticed: many suspicious websites utilize services like Cloudflare to mask their hosting details.

Despite contacting providers like Cloudflare, their response often clarified that they only offered services like SSL certificates and were not responsible for hosting, leaving us without accurate information about the website’s origin. This motivated us to design a system capable of bypassing such hurdles.

Hackathon Experience: The Journey to Pondicherry

I, along with my teammates Sneha, Vishrutha, and Adithi
participated in this hackathon in Pondicherry to create WebShield. We traveled all the way from Bangalore to Pondicherry, which was an adventure in itself! The hackathon provided a perfect environment for collaboration, brainstorming, and a race against time to turn our idea into a functional application.

Interestingly, during the initial stages of exploring phishing threats, I stumbled upon a website while using Adithi’s laptop that installed some kind of virus. This was a wake-up call and further strengthened our resolve to create a robust solution. To make things engaging, we thought of including a screenshot of the malicious application right on the front page of WebShield, so users can immediately recognize such threats.

Here’s a snapshot of our system architecture, showcasing how each component seamlessly integrates to deliver results:

Glossary: Making Cybersecurity Terms Accessible

Understanding cybersecurity requires grappling with some technical jargon. Here’s a quick glossary of terms central to WebShield:

CDNs (Content Delivery Networks):

Think of a CDN as a super-efficient delivery truck. It speeds up website loading times by hosting data closer to you. However, bad actors sometimes exploit CDNs like Cloudflare to hide their website’s real location, making detection trickier.
APIs (Application Programming Interfaces):

APIs act like messengers. They allow our app to communicate with external services, such as WHOIS or Shodan, to fetch relevant data about websites.
DNS (Domain Name System):

DNS serves as the internet’s address book. When you type a website’s URL, DNS translates it into its corresponding IP address (e.g., 192.168.1.1).
WHOIS Data:

This is essentially a website’s birth certificate. It provides information about the domain owner, registration date, and more.
SSL Certificates:

Ever noticed the padlock icon in your browser? It indicates that the website uses SSL (Secure Sockets Layer) to encrypt data, ensuring secure communication.
Reputation Score:

A metric calculated based on various factors like SSL validity, DNS details, and WHOIS data to assess a website’s trustworthiness.

How Does WebShield Work?

WebShield is a multi-step system combining various data analysis methods to evaluate website safety. Here’s how it works:

Step 1: User Inputs a Website

You start by entering a domain name (e.g., suspicious-site.com) into WebShield’s interface.

Step 2: Backend Fetches Data

The backend retrieves detailed information about the website using APIs like:

DNS: Resolves the website’s IP address.
WHOIS: Fetches domain registration and ownership details.
Shodan: Analyzes open ports and server information.
SSL Checker: Verifies the website’s SSL certificate.
VirusTotal: Checks the website against a database of known malicious URLs.
any many more ...

Step 3: Data Analysis and Scoring

This step involves analyzing the gathered data and calculating a reputation score based on various factors. For instance:

Valid HTTPS: +2 points
Recent WHOIS data: +1 point
No suspicious patterns in VirusTotal: +2.5 points

Example Code: Calculating Reputation Score

let reputation = 0;
if (sslCheckerData.result.cert_valid) {
  reputation += 2.8; // Bonus for valid HTTPS
}
if (whoisData["Creation Date"]) {
  reputation += 2.2; // Bonus for WHOIS availability
}
console.log("Reputation Score:", reputation);

Step 4: The Final Verdict

Based on the reputation score, WebShield classifies the website into categories:

Safe: No red flags detected.
Suspicious: Requires caution.
Malicious: Likely harmful.

Challenges and Solutions

Challenges:

Many suspicious websites use CDNs like Cloudflare, which mask their actual hosting details, making it difficult to trace their origins.
Even after contacting CDN providers, the responses typically only confirm the use of services like SSL without revealing hosting information.

Solutions:

Bypassing intermediary services like Cloudflare to retrieve accurate hosting information, including the real IP address and hosting provider.
Utilizing advanced techniques such as reverse DNS lookups and historical data analysis to uncover hidden hosting details.
Developing a robust scoring mechanism that combines raw data with contextual insights to enhance detection accuracy.

Taking It to the Next Level with LLMs

While WebShield is already effective, integrating a Large Language Model (LLM) like GPT-4 can elevate its capabilities. Here’s how:

1. Analyze Complex Patterns

LLMs can interpret subtle correlations within raw data—for example, identifying unusual patterns in IP changes or mismatched WHOIS information.

2. Provide Explanations

Instead of just flagging a website, the LLM could explain why it’s considered risky. For instance: “The website’s SSL certificate is expired, and the WHOIS data suggests frequent domain transfers.”

3. Dynamic Scoring

LLMs can weigh factors dynamically, improving the reputation score’s accuracy.

Sample Code: LLM Integration

const axios = require("axios");

const prompt = `
Analyze the following website data:
- IP Address: ${ipinfoData.ip}
- WHOIS: ${JSON.stringify(whoisData)}
- SSL Certificate: ${sslCheckerData.result.cert_valid ? "Valid" : "Invalid"}

Is the website malicious? Why?
`;

const response = await axios.post("https://api.openai.com/v1/chat/completions", {
  model: "gpt-4",
  messages: [{ role: "user", content: prompt }],
  headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
});

console.log("LLM Analysis:", response.data.choices[0].message.content);

Why This Matters

"Security is an investment, not an expense." – Anonymous

Cybersecurity is more than just a technical field; it’s a critical layer of trust in today’s digital age. WebShield addresses this by simplifying complex analyses and delivering actionable insights to users.

With LLM integration, WebShield could:

Empower non-technical users with clear explanations of risks.
Offer adaptive scoring for more nuanced detection.
Bridge the gap between raw data and user understanding.

What’s Next for WebShield?

We envision a future where WebShield evolves into a comprehensive cybersecurity toolkit. Future plans include:

Real-time Monitoring: Adding live scanning capabilities for continuous safety checks.
Browser Extensions: Integrating WebShield directly into browsers for instant feedback.
Community Reports: Allowing users to report and review flagged websites, fostering a crowdsourced defense system.