<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: zerOiQ</title>
    <description>The latest articles on DEV Community by zerOiQ (@medamyyne).</description>
    <link>https://dev.to/medamyyne</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1118600%2Faddb3b91-a44b-4a89-b20e-3e5d32744455.png</url>
      <title>DEV Community: zerOiQ</title>
      <link>https://dev.to/medamyyne</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/medamyyne"/>
    <language>en</language>
    <item>
      <title>Dark Web Scraping Using AI : Tools, Techniques, and Challenges</title>
      <dc:creator>zerOiQ</dc:creator>
      <pubDate>Sun, 13 Jul 2025 16:48:53 +0000</pubDate>
      <link>https://dev.to/medamyyne/dark-web-scraping-using-ai-tools-techniques-and-challenges-48c0</link>
      <guid>https://dev.to/medamyyne/dark-web-scraping-using-ai-tools-techniques-and-challenges-48c0</guid>
      <description>&lt;h2&gt;
  
  
  The Dark Web has a lot of useful information, especially in hidden forums and marketplaces. But getting to that data isn’t always easy. You’ll often run into things like CAPTCHAs and tools that block web scraping. And writing your own code to get past these issues can take a lot of time and effort.
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;In this blog, I’ll show you an easier way. Using Llama, we can collect, understand, and even chat about Dark Web data without writing hundreds of lines of code or getting stuck on these common problems.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Can you believe it? Amazing, right? Let’s dive in!&lt;/p&gt;

&lt;p&gt;As always, our go-to tool for building powerful applications is Python🐍!&lt;br&gt;
To get started, let’s explore the necessary requirements and dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#requirements.txt

streamlit 
langchain 
langchain_ollama
selenium
beautifulsoup4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Streamlit&lt;/strong&gt; lets us build interactive web apps quickly. &lt;strong&gt;LangChain&lt;/strong&gt; provides a framework for building workflows with language models for tasks like Q&amp;amp;A and summarization, and &lt;strong&gt;LangChain_Ollama&lt;/strong&gt; enables integration with Ollama models. &lt;strong&gt;Selenium&lt;/strong&gt; automates browser actions for testing and scraping dynamic pages, and &lt;strong&gt;BeautifulSoup4&lt;/strong&gt; parses data from HTML and XML.&lt;/p&gt;

&lt;p&gt;I recommend creating a Python Virtual Environment to keep all our project requirements organized; here’s how:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;python3 -m venv &amp;lt;venv_name&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Then activate it. On Windows:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;venv_name&amp;gt;\Scripts\activate&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;or on macOS/Linux:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;source &amp;lt;venv_name&amp;gt;/bin/activate&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now, let’s create a file called &lt;code&gt;requirements.txt&lt;/code&gt;, and install all the requirements with:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip3 install -r requirements.txt&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;With everything ready, we’ll create our main Python file to build the app’s interface, so we’ll use &lt;code&gt;Streamlit&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#main.py

import streamlit as st

st.title("zerOiQ Scraper")

url = st.text_input("Search For Website")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And we could add something unique, like a banner, for example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;st.markdown(
    """
    ```


    _________________/\\\\\___________________/\\\_________________         
  _______________/\\\///\\\______________/\\\\/\\\\______________        
   _____________/\\\/__\///\\\___/\\\___/\\\//\////\\\____________       
    ____________/\\\______\//\\\_\///___/\\\______\//\\\___________      
     ___________\/\\\_______\/\\\__/\\\_\//\\\______/\\\____________     
      ___________\//\\\______/\\\__\/\\\__\///\\\\/\\\\/_____________    
       ____________\///\\\__/\\\____\/\\\____\////\\\//_______________   
        ______________\///\\\\\/_____\/\\\_______\///\\\\\\____________  
         ________________\/////_______\///__________\//////_____________



     ```
""",
    unsafe_allow_html=True
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the result should appear after running:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;streamlit run main.py --server.port &amp;lt;Port&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now, let’s move ahead and build our scraping function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#scrape.py

import selenium.webdriver as webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options

def scrapWebsite(website):

    options = Options()

    #headless Browsing
    options.add_argument("--headless")

    # Path to the firefox WebDriver and Profile
    options.profile = r" FireFox Profile "
    firefoxDriver_path = "./geckodriver.exe"

    service = Service(firefoxDriver_path) 
    driver = webdriver.Firefox(service=service, options=options)

    try:
        driver.get(website)
        print("page loaded ..")
        html = driver.page_source
        return html

    except Exception as e:
        print(f"An error occurred: {str(e)}")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;To retrieve your Firefox profile path, open Firefox, type about:profiles in the address bar, find the profile you’re using, and copy its Root Directory path.&lt;/p&gt;

&lt;p&gt;In summary, this code loads a website in headless Firefox and returns its HTML source code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Got it! Now that we have the website loaded, let’s create a function to extract data from it using &lt;code&gt;BeautifulSoup&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#scrape.py
..
def extract_body_content(html_content):
    soup = BeautifulSoup(html_content, "html.parser")
    body_content = soup.body
    if body_content:
        return str(body_content)
    return "Nothing"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;extract_body_content&lt;/code&gt; function extracts the main content within the &lt;code&gt;&amp;lt;body&amp;gt;&lt;/code&gt; tag of an HTML page.&lt;/p&gt;

&lt;p&gt;It takes &lt;code&gt;html_content&lt;/code&gt; as input, uses &lt;code&gt;BeautifulSoup&lt;/code&gt; to parse it, and then finds the &lt;code&gt;&amp;lt;body&amp;gt;&lt;/code&gt; tag. If the tag exists, it returns its content as a string; if not, it returns "Nothing".&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This function is useful for isolating the primary content of a webpage .&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After extracting the data, we need to clean it up and retain only the relevant information.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#scrape.py
..
def clean_body_content(body_content):
    soup = BeautifulSoup(body_content, "html.parser")

    for script_or_style in soup(["script", "style"]):
        script_or_style.extract()

    cleaned_content = soup.get_text(separator="\n")
    cleaned_content = "\n".join(
        line.strip() for line in cleaned_content.splitlines() if line.strip()
    )

    return cleaned_content

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;clean_body_content&lt;/code&gt; function is designed to filter out unnecessary elements and keep only relevant text from the HTML body content.&lt;/p&gt;

&lt;p&gt;First, it takes &lt;code&gt;body_content&lt;/code&gt; as input and parses it with BeautifulSoup. It removes all &lt;code&gt;&amp;lt;script&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;style&amp;gt;&lt;/code&gt; tags to eliminate JavaScript and CSS content, which aren’t typically relevant.&lt;/p&gt;

&lt;p&gt;Then, it retrieves the remaining text and formats it by stripping out extra whitespace: it splits the text into lines, removes empty lines, and joins the cleaned lines back together with line breaks, leaving compact, readable text.&lt;/p&gt;
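
&lt;p&gt;As a quick sanity check of the two helper functions, here’s a minimal sketch run on a made-up HTML snippet (assuming &lt;code&gt;beautifulsoup4&lt;/code&gt; from requirements.txt is installed):&lt;/p&gt;

```python
# Minimal sanity check of extract_body_content and clean_body_content;
# the sample HTML below is made up for illustration.
from bs4 import BeautifulSoup

def extract_body_content(html_content):
    soup = BeautifulSoup(html_content, "html.parser")
    body_content = soup.body
    if body_content:
        return str(body_content)
    return "Nothing"

def clean_body_content(body_content):
    soup = BeautifulSoup(body_content, "html.parser")
    # drop JavaScript and CSS before extracting text
    for script_or_style in soup(["script", "style"]):
        script_or_style.extract()
    cleaned_content = soup.get_text(separator="\n")
    return "\n".join(
        line.strip() for line in cleaned_content.splitlines() if line.strip()
    )

html = """<html><head><title>t</title></head>
<body>
<script>var x = 1;</script>
<h1>Hidden Forum</h1>
<p>  Listing one  </p>
</body></html>"""

# the script tag and surrounding whitespace are stripped away
print(clean_body_content(extract_body_content(html)))
```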

&lt;p&gt;And the &lt;code&gt;split_dom_content()&lt;/code&gt; function breaks a large string of HTML or text content into smaller, manageable chunks, each with a maximum length set by &lt;code&gt;max_length&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#scrape.py
..
def split_dom_content(dom_content, max_length=6000):
    return [
        dom_content[i : i + max_length] for i in range(0, len(dom_content), max_length)
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It does this by iterating over &lt;code&gt;dom_content&lt;/code&gt; in increments of &lt;code&gt;max_length&lt;/code&gt;, creating a list where each element is a slice of &lt;code&gt;dom_content&lt;/code&gt; no longer than &lt;code&gt;max_length&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This function is especially useful for handling long text data in parts, such as when processing or analyzing content in limited-size segments.&lt;/p&gt;
&lt;/blockquote&gt;
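
&lt;p&gt;A quick check of the chunking behavior with a tiny &lt;code&gt;max_length&lt;/code&gt; (the real default of 6000 works the same way):&lt;/p&gt;

```python
# Sanity check of the chunking logic with a small max_length.
def split_dom_content(dom_content, max_length=6000):
    return [
        dom_content[i : i + max_length] for i in range(0, len(dom_content), max_length)
    ]

chunks = split_dom_content("abcdefghij", max_length=4)
print(chunks)  # ['abcd', 'efgh', 'ij']
```

&lt;p&gt;Note that joining the chunks back together reproduces the original string, so no content is lost, only split.&lt;/p&gt;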

&lt;p&gt;Now let’s integrate the functions we created into the main function and call them in sequence to process the webpage content.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#main.py
..
from Scrape import (
    scrapWebsite , 
    split_dom_content , 
    clean_body_content , 
    extract_body_content
    )

st.title ("AI Scraper")

url = st.text_input("Search For Website")

if st.button("Start Scraping"):
    if url :
        st.write("Scraping...")
        result = scrapWebsite(url)  
        # print(result)
        bodyContent = extract_body_content(result)
        cleanedContent = clean_body_content(bodyContent)

        st.session_state.dom_content = cleanedContent

        with st.expander ("View All Content") :
            st.text_area("Content" , cleanedContent, height=300)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a simple Streamlit interface to scrape and display webpage content: the user enters a URL in a text input field &lt;code&gt;st.text_input&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When the "Start Scraping" button is clicked, it checks that the URL field has content; if so, it starts the scraping process by calling &lt;code&gt;scrapWebsite(url)&lt;/code&gt;, which retrieves the raw HTML. The &lt;code&gt;extract_body_content&lt;/code&gt; function then isolates the main &lt;code&gt;&amp;lt;body&amp;gt;&lt;/code&gt; content, and &lt;code&gt;clean_body_content&lt;/code&gt; filters out unnecessary elements like &lt;code&gt;script&lt;/code&gt; and &lt;code&gt;style&lt;/code&gt; tags.&lt;/p&gt;

&lt;p&gt;The cleaned content is stored in &lt;code&gt;st.session_state.dom_content&lt;/code&gt; for session persistence. Lastly, a text area displays the cleaned content, allowing the user to view the extracted text.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Until now, we’ve done a great job creating the web scraper and extracting all data from our target website.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flv2fn4ykayyyj7gom3vl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flv2fn4ykayyyj7gom3vl.png" alt="Screenshot" width="800" height="582"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We’ve successfully built our web scraper, extracted the main content from our target website, and even cleaned up the data to focus on relevant information. We now have a tool that can dynamically retrieve and organize webpage content with just a few clicks.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We are now moving on to integrating Llama for content analysis.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#main.py
..
if "dom_content" in st.session_state:
    parse_description = st.text_area("Describe what's on your mind..")

    if st.button("Parse Content"):
        if parse_description:
            st.write("Parsing the content...")

            # Parse the content with Ollama
            dom_chunks = split_dom_content(st.session_state.dom_content)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First, we check if &lt;code&gt;dom_content&lt;/code&gt; is available in &lt;code&gt;st.session_state&lt;/code&gt;, indicating that we have scraped data ready for processing. Then, we display a text area for the user to input a description of what they want to analyze.&lt;/p&gt;

&lt;p&gt;When the Parse Content button is clicked, we ensure that the user has entered a description and proceed to parse the content.&lt;/p&gt;

&lt;p&gt;We split the scraped data into smaller chunks using &lt;code&gt;split_dom_content&lt;/code&gt; to make it easier for Llama to handle.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This integration will allow us to interactively analyze and interpret the scraped content.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To integrate Llama into our code, we can create a new Python file called &lt;code&gt;Ollama.py&lt;/code&gt;, for example, and import the libraries needed to connect to and interact with Llama.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Ollama.py

from langchain_ollama import OllamaLLM
from langchain_core.prompts import ChatPromptTemplate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;OllamaLLM&lt;/code&gt; allows connecting to and interacting with the Ollama language model for tasks like text processing and analysis, and &lt;code&gt;ChatPromptTemplate&lt;/code&gt; helps create templates for structured, chat-based prompts to send to the language model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Ollama.py
..
model = OllamaLLM(model="llama3")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates an instance of the &lt;code&gt;OllamaLLM&lt;/code&gt; class, which connects to the llama3 model.&lt;/p&gt;

&lt;p&gt;Then we need to define this template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Ollama.py
..
template = (
    "You are tasked with extracting specific information from the following text content: {dom_content}. "
    "Please follow these instructions carefully:\n\n"
    "1. **Extract Information:** Only extract the information that directly matches the provided description: {parseDescription}.\n"
    "2. **No Extra Content:** Do not include any additional text, comments, or explanations in your response.\n"
    "3. **Empty Response:** If no information matches the description, return an empty string ('').\n"
    "4. **Direct Data Only:** Your output should contain only the data that is explicitly requested, with no other text."
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By creating a template with clear instructions for the Ollama model, we tell it to extract specific information from the &lt;code&gt;dom_content&lt;/code&gt; based on the &lt;code&gt;parseDescription&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The model is instructed to only return the information that matches the description, avoid adding extra details, return an empty string if nothing matches, and provide only the requested data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This ensures the extraction is focused and accurate.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
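
&lt;p&gt;The template is an ordinary Python format string with two placeholders; this shows, without calling the model, how they get filled in. The template is abridged here and the sample values are made up:&lt;/p&gt;

```python
# Abridged version of the prompt template; the sample dom_content and
# parseDescription values below are illustrative only.
template = (
    "You are tasked with extracting specific information from the following text content: {dom_content}. "
    "Only extract the information that directly matches the provided description: {parseDescription}."
)

prompt_text = template.format(
    dom_content="Vendor: acme | Price: 30 USD",
    parseDescription="all prices",
)
print(prompt_text)
```

&lt;p&gt;&lt;code&gt;ChatPromptTemplate.from_template&lt;/code&gt; performs this same substitution for us when the chain is invoked with a dictionary of values.&lt;/p&gt;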

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Ollama.py
..
def parseUsingOllama (domChunks , parseDescription) :
    prompt = ChatPromptTemplate.from_template(template)
    chain = prompt | model
    finalResult = []
    for counter , chunk in enumerate (domChunks , start=1) :
        result = chain.invoke(
                                {"dom_content": chunk , "parseDescription": parseDescription}
                              )
        print(f"Parsed Batch {counter} of {len(domChunks)}")
        finalResult.append(result)
    return "\n".join(finalResult)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;parseUsingOllama&lt;/code&gt; function takes two inputs: &lt;code&gt;domChunks&lt;/code&gt; (the parts of the content) and &lt;code&gt;parseDescription&lt;/code&gt; (which describes what information to extract). It creates a prompt with instructions for the model and then processes each chunk of content one by one.&lt;/p&gt;

&lt;p&gt;For each chunk, the function asks the model to extract the relevant information based on the description.&lt;/p&gt;

&lt;p&gt;It stores the results and shows which chunk is being processed. Finally, it returns all the results combined into one string.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This function makes it easier to extract specific details from large content by breaking it down into smaller parts.&lt;/p&gt;
&lt;/blockquote&gt;
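
&lt;p&gt;The same batching-and-joining pattern can be dry-run with a stand-in for the LLM chain, so the flow is testable without Ollama running. The &lt;code&gt;fake_chain&lt;/code&gt; function below is a placeholder for illustration, not part of the real code:&lt;/p&gt;

```python
# Dry run of the parseUsingOllama flow with a fake model chain, so the
# batching and joining logic can be checked without Ollama installed.
def fake_chain(chunk, parse_description):
    # pretend the model extracted one matching line per chunk
    return f"extracted from: {chunk}"

def parse_chunks(dom_chunks, parse_description):
    final_result = []
    for counter, chunk in enumerate(dom_chunks, start=1):
        result = fake_chain(chunk, parse_description)
        print(f"Parsed Batch {counter} of {len(dom_chunks)}")
        final_result.append(result)
    return "\n".join(final_result)

out = parse_chunks(["chunk-1", "chunk-2"], "all prices")
print(out)
```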

&lt;p&gt;To use the Llama model, we’ll need to install it from the official Ollama website, and here’s a helpful guide to assist you with the installation and running process [&lt;a href="https://www.youtube.com/watch?v=PuJMmzGYZcY" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=PuJMmzGYZcY&lt;/a&gt;].&lt;/p&gt;

&lt;p&gt;If you encounter any issues, check out the GitHub repository for additional useful commands and detailed instructions [&lt;a href="https://github.com/ollama/ollama" rel="noopener noreferrer"&gt;https://github.com/ollama/ollama&lt;/a&gt;].&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ylcw2hu49wptmh62303.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ylcw2hu49wptmh62303.png" alt="screenshot" width="800" height="196"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To integrate the Ollama parsing function back into our main application, we need to import the &lt;code&gt;parseUsingOllama&lt;/code&gt; function from the &lt;code&gt;Ollama.py&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#main.py
..
from Ollama import parseUsingOllama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once imported, we can call the &lt;code&gt;parseUsingOllama&lt;/code&gt; function within our main function to process the content with the provided description.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#main.py
..
            parsed_result = parseUsingOllama(dom_chunks, parse_description)
            st.write(parsed_result)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, the only thing left is to run our program and see it in action!&lt;/p&gt;

&lt;p&gt;If everything looks good, we’ll know our program is ready and working as planned!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In summary, we created a streamlined web scraper that retrieves, cleans, and displays relevant content from a target website. We then integrated Llama via Ollama to help us analyze and extract specific information from the scraped data using a custom template. By breaking the content into manageable chunks and applying clear instructions, we ensure that only the most relevant data is displayed based on the user’s input.&lt;br&gt;
This approach makes the data extraction process efficient, accurate, and user-friendly, and provides a solid foundation for further analysis and interaction with scraped web content.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>python</category>
      <category>coding</category>
    </item>
  </channel>
</rss>
