
Samuel S


Fake news detection with Keras and Python

In the digital age, where information spreads faster than ever, fake news has become a real threat. Whether it's misleading headlines or entirely fabricated stories, false information can influence public opinion, sway elections, or even endanger lives. To combat this, I built a sample machine learning-based fake news detection model using TensorFlow and natural language processing (NLP).

Overview

This project demonstrates a practical implementation of Fake News Detection using TensorFlow, leveraging Natural Language Processing (NLP) techniques and deep learning. The model is trained to classify whether a given news article is real or fake, which is a crucial task in today's information-driven society.

Technologies Used

  • TensorFlow 2.x – for building and training the neural network
  • Keras – high-level API for defining the model
  • Natural Language Processing (NLP) – tokenization, stopword removal, vectorization
  • Scikit-learn – for preprocessing, splitting datasets, and evaluating model accuracy
  • Pandas & NumPy – for data manipulation
  • Matplotlib – for result visualization

Features

  • Binary classification: Fake (0) vs. Real (1)
  • Dataset preprocessing: stopwords, stemming, tokenization
  • Deep learning model built with Keras (the sample below uses a simple embedding-based classifier; LSTM layers are a natural extension)
  • Training and validation performance tracking
  • Organized Jupyter notebook for easy understanding
  • Model evaluation with confusion matrix, accuracy, and loss visualization
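The preprocessing steps listed above can be sketched as follows. This is a simplified, self-contained illustration: the stopword set and suffix-stripping stemmer are stand-ins for NLTK's full stopword list and PorterStemmer, which the actual notebook would use.

```python
import re

# Tiny stand-in stopword set; in practice use nltk.corpus.stopwords.words('english')
STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "in", "on", "of", "to", "and"}

def simple_stem(word):
    # Naive stand-in for a real stemmer (e.g. NLTK's PorterStemmer):
    # strip a few common suffixes when the remaining stem is long enough.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z']+", text.lower())        # tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]   # remove stopwords
    return [simple_stem(t) for t in tokens]              # stem
```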

Dataset

The model uses a labeled dataset of news articles consisting of:

  • Title and Text
  • Label: 0 for fake news, 1 for real news

You can easily swap in another dataset or extend it with more complex sources for multilingual or multi-category classification.
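Swapping in another dataset mostly means mapping its columns onto the text/label schema the training script expects. Here is a hedged sketch; the "headline" and "verdict" column names are hypothetical, not from a real dataset.

```python
import pandas as pd

def normalize_dataset(df, text_col="headline", label_col="verdict"):
    # Map string labels onto the script's convention: 0 = fake, 1 = real.
    label_map = {"fake": 0, "real": 1}
    out = pd.DataFrame({
        "text": df[text_col].astype(str),
        "label": df[label_col].str.lower().map(label_map),
    })
    # Drop rows whose label didn't match the mapping.
    return out.dropna(subset=["label"])
```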

Model Performance

After training with appropriate preprocessing and regularization, the model reliably distinguishes real from fake articles. Detailed metrics and visual plots are available in the notebook for transparency and fine-tuning.
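The evaluation step mentioned above might look like this sketch, which turns the sigmoid outputs into hard 0/1 predictions and computes the accuracy and confusion matrix with scikit-learn. It assumes you have the held-out y_test and the model's predicted probabilities from the training script.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

def evaluate(y_test, probs, threshold=0.5):
    # Threshold the sigmoid outputs to get 0/1 predictions.
    preds = (np.asarray(probs) >= threshold).astype(int)
    # Rows are true labels (0 = fake, 1 = real), columns are predictions.
    cm = confusion_matrix(y_test, preds, labels=[0, 1])
    return accuracy_score(y_test, preds), cm
```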

Here is the sample code for detect_fake_news.py:

import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
import json
import os

MAX_LENGTH = 500
VOCAB_SIZE = 10000
USER_DATA_PATH = "user_training_data.csv"
FAKE_CSV_PATH = "data/Fake.csv"

def train_model(extra_texts=None, extra_labels=None):
    # Load base datasets
    df_fake = pd.read_csv(FAKE_CSV_PATH)
    df_fake['label'] = 0

    df_true = pd.read_csv("data/True.csv")
    df_true['label'] = 1

    df = pd.concat([df_fake[['text', 'label']], df_true[['text', 'label']]])

    # Load and include user data if available
    if os.path.exists(USER_DATA_PATH):
        user_df = pd.read_csv(USER_DATA_PATH)
        df = pd.concat([df, user_df], ignore_index=True)

    # Append new user data if provided
    if extra_texts and extra_labels:
        new_data = pd.DataFrame({"text": extra_texts, "label": extra_labels})
        df = pd.concat([df, new_data], ignore_index=True)

        # Append new fake-labeled rows to Fake.csv (note: the column order must
        # match Fake.csv's existing schema for this append to round-trip cleanly)
        fake_data = new_data[new_data['label'] == 0]  # Only append if label is fake
        if not fake_data.empty:
            fake_data.to_csv(FAKE_CSV_PATH, mode='a', header=False, index=False)

        # Save new data to user_training_data.csv
        new_data.to_csv(USER_DATA_PATH, mode='a', header=not os.path.exists(USER_DATA_PATH), index=False)

    df.dropna(inplace=True)

    # Tokenize and train model
    tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token="<OOV>")
    tokenizer.fit_on_texts(df['text'])
    sequences = tokenizer.texts_to_sequences(df['text'])
    padded = pad_sequences(sequences, maxlen=MAX_LENGTH, padding='post', truncating='post')
    X_train, X_test, y_train, y_test = train_test_split(padded, df['label'], test_size=0.2, random_state=42)

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(VOCAB_SIZE, 16, input_length=MAX_LENGTH),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(24, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])

    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=5, validation_data=(X_test, y_test))

    model.save('fake_or_true_news_model.keras')
    with open("tokenizer.json", "w") as f:
        f.write(tokenizer.to_json())

    print("✅ Model trained and saved successfully.")
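Once trained, the saved model and tokenizer could be used for prediction along these lines. This is a sketch, not part of the script above: predict_news and interpret are hypothetical helpers, and the file names simply reuse the paths the training script saves to.

```python
def interpret(score, threshold=0.5):
    # The training script labels real news 1 and fake news 0.
    return "Real" if score >= threshold else "Fake"

def predict_news(text, model_path="fake_or_true_news_model.keras",
                 tokenizer_path="tokenizer.json", max_length=500):
    # TensorFlow is imported lazily so the small helpers above work without it.
    import tensorflow as tf
    from tensorflow.keras.preprocessing.text import tokenizer_from_json
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    model = tf.keras.models.load_model(model_path)
    with open(tokenizer_path) as f:
        tokenizer = tokenizer_from_json(f.read())

    seq = tokenizer.texts_to_sequences([text])
    padded = pad_sequences(seq, maxlen=max_length, padding="post", truncating="post")
    score = float(model.predict(padded, verbose=0)[0][0])
    return interpret(score), score
```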


Full Sample Code
