<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kehinde Abe</title>
    <description>The latest articles on DEV Community by Kehinde Abe (@_ken0x).</description>
    <link>https://dev.to/_ken0x</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F150895%2Fdcb28a86-3698-4c85-a7e7-cbc4a9e7b444.png</url>
      <title>DEV Community: Kehinde Abe</title>
      <link>https://dev.to/_ken0x</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/_ken0x"/>
    <language>en</language>
    <item>
      <title>Building a Sarcasm Detection System with LSTM and GloVe: A Complete Guide</title>
      <dc:creator>Kehinde Abe</dc:creator>
      <pubDate>Thu, 02 Jan 2025 10:28:53 +0000</pubDate>
      <link>https://dev.to/_ken0x/building-a-sarcasm-detection-system-with-lstm-and-glove-a-complete-guide-m1h</link>
      <guid>https://dev.to/_ken0x/building-a-sarcasm-detection-system-with-lstm-and-glove-a-complete-guide-m1h</guid>
      <description>&lt;p&gt;Before We Begin :)&lt;/p&gt;

&lt;p&gt;Detecting sarcasm is more than just spotting ironic statements. It involves understanding tone, context, and sometimes even cultural nuances. Sarcasm can be difficult for machines to detect in social media posts, news headlines, or everyday conversations because it contradicts the literal meaning of words. Yet, modern NLP techniques can pick up on these subtleties better than ever with the right approach and data preprocessing.&lt;/p&gt;

&lt;p&gt;Below, you’ll find a detailed, step-by-step guide on how to build your own sarcasm detection model using LSTM (Long Short-Term Memory) networks and GloVe embeddings. From data cleaning and preprocessing to model deployment in a Streamlit application, this post covers every element you need to create a robust sarcasm detection system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;Tools and Environment Setup&lt;/li&gt;
&lt;li&gt;Importing Libraries&lt;/li&gt;
&lt;li&gt;Loading and Inspecting Data&lt;/li&gt;
&lt;li&gt;Data Cleaning&lt;/li&gt;
&lt;li&gt;Removing Special Characters&lt;/li&gt;
&lt;li&gt;Additional Noise Removal (URLs, HTML, Non-ASCII, Punctuation)&lt;/li&gt;
&lt;li&gt;Handling Slang, Acronyms, and Common Abbreviations&lt;/li&gt;
&lt;li&gt;Stopword Removal and Lemmatization&lt;/li&gt;
&lt;li&gt;Using GloVe Embeddings&lt;/li&gt;
&lt;li&gt;Creating the Embedding Matrix&lt;/li&gt;
&lt;li&gt;Creating Feature Vectors&lt;/li&gt;
&lt;li&gt;Building the LSTM Model&lt;/li&gt;
&lt;li&gt;Preparing Data for the LSTM Model&lt;/li&gt;
&lt;li&gt;Defining the LSTM Architecture&lt;/li&gt;
&lt;li&gt;Training the Model&lt;/li&gt;
&lt;li&gt;Saving the Model and Tokenizer&lt;/li&gt;
&lt;li&gt;Deployment with Streamlit&lt;/li&gt;
&lt;li&gt;Putting It All Together&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;li&gt;Next Steps&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;Sarcasm detection is a fascinating natural language processing (NLP) challenge. Sarcastic statements often convey the opposite of their literal meaning, making them tricky for machines to identify. For instance, the sentence “I love getting stuck in traffic for hours” may say you enjoy traffic, but in reality, you mean the opposite. Automated sarcasm detection requires models that can glean subtle contextual cues. In this post, we’ll train an LSTM model on a sarcasm headlines dataset and deploy it using Streamlit to create a friendly, interactive web interface.&lt;/p&gt;

&lt;p&gt;This step-by-step guide will show you how to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Load and analyze a sarcasm dataset&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clean and preprocess the text data&lt;/strong&gt; (removing special characters, URLs, punctuation, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use GloVe embeddings&lt;/strong&gt; (a widely used word embedding technique) to represent our text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build and train an LSTM model&lt;/strong&gt; to classify whether a sentence is sarcastic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy our trained model on Streamlit&lt;/strong&gt; for a user-friendly web interface.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By the end of this tutorial, you will have a functional sarcasm-detection application!&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Tools and Environment Setup
&lt;/h2&gt;

&lt;p&gt;To run this project, you’ll need a few key tools and libraries:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Python 3.x&lt;/strong&gt; (preferably 3.7+)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pip&lt;/strong&gt; or &lt;strong&gt;conda&lt;/strong&gt; for installing packages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TensorFlow&lt;/strong&gt; and &lt;strong&gt;Keras&lt;/strong&gt; for building deep learning models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NLTK&lt;/strong&gt; for text processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GloVe embeddings&lt;/strong&gt; – specifically &lt;code&gt;glove.6B.100d.txt&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streamlit&lt;/strong&gt; for deployment&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Recommended steps to set up a virtual environment&lt;/strong&gt; (using &lt;code&gt;pip&lt;/code&gt; and &lt;code&gt;venv&lt;/code&gt; as an example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create and activate virtual environment
python -m venv sarcasm-env
source sarcasm-env/bin/activate  # Linux/Mac
# or:
sarcasm-env\Scripts\activate  # Windows

# Install necessary libraries
pip install numpy pandas matplotlib plotly scikit-learn nltk tensorflow joblib streamlit xgboost lightgbm catboost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, &lt;strong&gt;download&lt;/strong&gt; the GloVe file (&lt;code&gt;glove.6B.100d.txt&lt;/code&gt;) and place it in a folder named &lt;code&gt;dataset/&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Importing Libraries
&lt;/h2&gt;

&lt;p&gt;We start by importing the libraries we need for data manipulation, NLP tasks, and building deep learning models. Below is the snippet we use in our Jupyter Notebook or Python script.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

import joblib
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import re
import itertools    
import wordcloud

# For data preprocessing
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import WordNetLemmatizer

# For building our Models
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Conv1D, Bidirectional, SpatialDropout1D, Dropout
from tensorflow.keras import Sequential
from tensorflow.keras.optimizers import Adam
from sklearn.svm import LinearSVC, SVC
from sklearn.model_selection import cross_val_score, KFold, cross_val_predict, cross_validate

from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from tensorflow.keras.models import Model

# For Lazy Predict (commented out)
# from lazypredict.Supervised import LazyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# For hyperparameter tuning
from scipy.stats import uniform

# Reduce dimensions to 2 for faster training
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# For creating vocabulary dictionary
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# For model evaluation
from sklearn.model_selection import LearningCurveDisplay, learning_curve
from sklearn.metrics import confusion_matrix, classification_report, log_loss, make_scorer, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, auc, DetCurveDisplay, RocCurveDisplay, roc_curve, ConfusionMatrixDisplay
from sklearn.model_selection import GridSearchCV

# For processing texts
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What’s happening here?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;warnings.filterwarnings('ignore')&lt;/code&gt;: Suppresses any warnings for cleaner output.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pandas&lt;/code&gt;, &lt;code&gt;numpy&lt;/code&gt;, &lt;code&gt;matplotlib&lt;/code&gt;, &lt;code&gt;plotly&lt;/code&gt;, &lt;strong&gt;etc&lt;/strong&gt;.: Data handling, visualization libraries.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nltk&lt;/code&gt;, &lt;code&gt;WordNetLemmatizer&lt;/code&gt;, &lt;code&gt;stopwords&lt;/code&gt;: Common NLP libraries for text processing.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tensorflow&lt;/code&gt;, &lt;code&gt;keras&lt;/code&gt;: Building and training our deep learning model (LSTM).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sklearn&lt;/code&gt;: Traditional machine learning tools, plus utilities for train/test split, hyperparameter tuning, etc.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;joblib&lt;/code&gt;: For saving and loading models/tokenizers.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4. Loading and Inspecting Data
&lt;/h2&gt;

&lt;p&gt;Here, we load a CSV file that contains sarcasm headlines data. Let’s see how many records we have and if there are duplicates.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = pd.read_csv('dataset/sarcasm_headlines.csv')

def check_duplicates(data):
    duplicate = data.duplicated().sum()
    return duplicate

print(check_duplicates(data))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pd.read_csv(...)&lt;/code&gt; loads our CSV data into a DataFrame named &lt;code&gt;data&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;check_duplicates&lt;/code&gt; function sums up duplicated rows (if any).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;print(check_duplicates(data))&lt;/code&gt; shows how many duplicate entries exist.&lt;/li&gt;
&lt;/ul&gt;
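&lt;p&gt;If the check reports repeated rows, a single pandas call removes them before further processing. A minimal sketch, using a tiny stand-in DataFrame in place of the real headlines data:&lt;/p&gt;

```python
import pandas as pd

# Tiny stand-in frame; in the article, `data` is the sarcasm headlines DataFrame
data = pd.DataFrame({
    "text": ["a headline", "a headline", "another one"],
    "is_sarcastic": [1, 1, 0],
})

print(data.duplicated().sum())  # 1 duplicate row in this toy frame
data = data.drop_duplicates().reset_index(drop=True)
print(data.duplicated().sum())  # 0
```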

&lt;h2&gt;
  
  
  5. Data Cleaning
&lt;/h2&gt;

&lt;p&gt;Sarcasm detection often depends on subtle textual cues, and clean, standardized input can significantly improve model performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.1 Removing Special Characters
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Show the special characters in the text column
data[data['text'].str.contains(r'[^A-Za-z0-9 ]', regex=True)]

# Function to remove special characters
def remove_special_characters(text):
    text = re.sub(r'[^A-Za-z0-9 ]', '', text)
    return text

# Apply function to remove special characters
data['text'] = data['text'].apply(remove_special_characters)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;data['text'].str.contains(...)&lt;/code&gt;: Check if special characters exist in each row’s text.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;remove_special_characters&lt;/code&gt;: Uses a regular expression to replace all non-alphanumeric characters (excluding space) with nothing.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;apply(...)&lt;/code&gt;: Applies the cleaning function to each entry in the &lt;code&gt;text&lt;/code&gt; column.&lt;/li&gt;
&lt;/ul&gt;
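&lt;p&gt;A quick sanity check of the cleaning function on a sample headline (the input string is illustrative):&lt;/p&gt;

```python
import re

def remove_special_characters(text):
    # Same pattern as above: keep only letters, digits, and spaces
    return re.sub(r'[^A-Za-z0-9 ]', '', text)

print(remove_special_characters("Wow... another 'great' Monday!!!"))
# Wow another great Monday
```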

&lt;p&gt;&lt;strong&gt;Further Checks&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def special_characters(data):
    special = data.str.contains(r'[^A-Za-z0-9 ]', regex=True).sum()
    return special

special_characters(data['text'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This confirms whether any special characters remain after removal.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.2 Additional Noise Removal (URLs, HTML, Non-ASCII, Punctuation)
&lt;/h3&gt;

&lt;p&gt;Here, we define small utility functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def remove_URL(text):
    return re.sub(r"https?://\S+|www\.\S+", "", text)

def remove_html(text):
    html = re.compile(r"&amp;lt;.*?&amp;gt;|&amp;amp;([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});")
    return re.sub(html, "", text)

def remove_non_ascii(text):
    return re.sub(r'[^\x00-\x7f]',r'', text)

def remove_punct(text):
    return re.sub(r'[]!"$%&amp;amp;\'()*+,./:;=#@?[\\^_`{|}~-]+', "", text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;remove_URL&lt;/code&gt;: Removes URLs.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;remove_html&lt;/code&gt;: Removes HTML tags or entities.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;remove_non_ascii&lt;/code&gt;: Removes non-ASCII characters.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;remove_punct&lt;/code&gt;: Removes punctuation.&lt;/li&gt;
&lt;/ul&gt;
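&lt;p&gt;Two of these helpers can be sanity-checked on a noisy sample string (a sketch with an illustrative input):&lt;/p&gt;

```python
import re

def remove_URL(text):
    return re.sub(r"https?://\S+|www\.\S+", "", text)

def remove_non_ascii(text):
    return re.sub(r'[^\x00-\x7f]', '', text)

# Illustrative noisy input: a URL, an em dash, and an emoji
sample = "check this out https://example.com \u2014 so exciting \U0001F644"
cleaned = remove_non_ascii(remove_URL(sample))
print(cleaned)  # URL, dash, and emoji are all stripped
```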

&lt;p&gt;&lt;strong&gt;Additional Slang, Acronyms, and Common Abbreviations&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def other_clean(text):
    # Contains dictionaries of slang, acronyms, abbreviations
    # Replaces them with their expanded forms
    ...
    return text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function corrects typos and expands acronyms like “wtf” to “what the fuck” to make the text more standardized.&lt;/p&gt;
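&lt;p&gt;The dictionaries are elided in the snippet above; a minimal sketch of the idea, with a tiny illustrative slang map rather than the article’s full lists:&lt;/p&gt;

```python
# Tiny illustrative slang map; the real other_clean uses much larger dictionaries
SLANG = {"u": "you", "r": "are", "btw": "by the way", "idk": "I do not know"}

def other_clean(text):
    # Replace whole-word slang tokens with their expanded forms
    return " ".join(SLANG.get(w.lower(), w) for w in text.split())

print(other_clean("idk btw u r funny"))
# I do not know by the way you are funny
```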

&lt;p&gt;Finally, we apply them all:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data["text"] = data["text"].apply(lambda x: remove_URL(x))
data["text"] = data["text"].apply(lambda x: remove_html(x))
data["text"] = data["text"].apply(lambda x: remove_non_ascii(x))
data["text"] = data["text"].apply(lambda x: remove_punct(x))
data["text"] = data["text"].apply(lambda x: other_clean(x))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stopword Removal and Lemmatization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stopwords&lt;/strong&gt; like “the,” “is,” “at” often don’t contribute much to classification. &lt;strong&gt;Lemmatization&lt;/strong&gt; ensures words are reduced to their base form.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nltk.download('stopwords')
nltk.download('wordnet')
stop = stopwords.words('english')

data['removed_stopwords'] = data['text'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in (stop)])
)

def lemmatized_text(corpus):
    lemmatizer = WordNetLemmatizer()
    return [' '.join([lemmatizer.lemmatize(word) for word in review.split()]) for review in corpus]

data['lemmatized_texts'] = lemmatized_text(data['removed_stopwords'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, each headline is clean, de-noised, and lemmatized, forming our final text data.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Using GloVe Embeddings
&lt;/h2&gt;

&lt;p&gt;We’ll use &lt;strong&gt;GloVe&lt;/strong&gt; (Global Vectors for Word Representation) as our word embeddings, specifically the &lt;code&gt;glove.6B.100d.txt&lt;/code&gt; file. You must have this file in your &lt;code&gt;dataset/&lt;/code&gt; directory for the code to work.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os

file_path = "dataset/glove.6B.100d.txt"
print("File exists:", os.path.isfile(file_path))

def load_glove_embeddings(file_path):
    embeddings_index = {}
    with open(file_path, encoding="utf8") as file:
        for line in file:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = vector
    return embeddings_index

if os.path.isfile(file_path):
    embeddings_index = load_glove_embeddings(file_path)
else:
    raise FileNotFoundError("GloVe file not found. Please check the file path.")

print("Number of words in GloVe embeddings:", len(embeddings_index))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;load_glove_embeddings&lt;/strong&gt;: Reads the GloVe file line by line, splitting each line into a &lt;code&gt;word&lt;/code&gt; and its corresponding &lt;code&gt;vector&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The dictionary &lt;code&gt;embeddings_index&lt;/code&gt; holds a mapping from words -&amp;gt; 100-dimensional float vectors.&lt;/li&gt;
&lt;/ul&gt;
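&lt;p&gt;To see the file format the loader expects, here is a toy stand-in with 3-dimensional vectors in place of the real 100-dimensional ones:&lt;/p&gt;

```python
import io
import numpy as np

# Toy stand-in for glove.6B.100d.txt: one word per line, followed by its vector
fake_glove = "good 0.1 0.2 0.3\nbad 0.4 0.5 0.6\n"

embeddings_index = {}
for line in io.StringIO(fake_glove):
    values = line.split()
    embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

print(len(embeddings_index))           # 2
print(embeddings_index["good"].shape)  # (3,)
```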

&lt;h3&gt;
  
  
  6.1 Creating the Embedding Matrix
&lt;/h3&gt;

&lt;p&gt;We’ll &lt;strong&gt;tokenize&lt;/strong&gt; our texts and build an embedding matrix with shape (vocab_size, 100):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(data['lemmatized_texts'])
word_index = tokenizer.word_index

embedding_dim = 100
vocab_size = len(word_index) + 1
embedding_matrix = np.zeros((vocab_size, embedding_dim))

for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

X_seq = tokenizer.texts_to_sequences(data['lemmatized_texts'])
X_pad = pad_sequences(X_seq, maxlen=100)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;tokenizer&lt;/code&gt;: Converts words to integer indices.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;num_words=5000&lt;/code&gt;: We limit ourselves to the top 5,000 words in the dataset.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;embedding_matrix&lt;/code&gt;: Each row corresponds to a word in our vocabulary; columns are the vector values (100).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;texts_to_sequences&lt;/code&gt;: Replaces words in each text with their integer representation.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pad_sequences&lt;/code&gt;: Ensures all sequences are of equal length (here, maxlen=100).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can also create a &lt;strong&gt;feature matrix&lt;/strong&gt; by averaging the embeddings of each token in a sequence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def create_feature_matrix(sequences, embedding_matrix):
    features = np.zeros((sequences.shape[0], embedding_matrix.shape[1]))
    for i, seq in enumerate(sequences):
        features[i] = np.mean(embedding_matrix[seq], axis=0)
    return features

X_features = create_feature_matrix(X_pad, embedding_matrix)
print("Shape of feature matrix:", X_features.shape)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;For each sequence, we average the GloVe embeddings of all tokens to get a single vector representation.&lt;/li&gt;
&lt;li&gt;This yields &lt;code&gt;(num_samples, embedding_dim)&lt;/code&gt; shape in &lt;code&gt;X_features&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This can be used for certain ML models, though for an LSTM, we typically feed the &lt;strong&gt;sequences&lt;/strong&gt; themselves.&lt;/p&gt;
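&lt;p&gt;As a quick baseline sketch, the averaged vectors can feed a classical classifier such as logistic regression. Synthetic arrays stand in for &lt;code&gt;X_features&lt;/code&gt; and the labels here:&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 200 samples of 100-dim averaged embeddings, binary labels
rng = np.random.default_rng(0)
X_features = rng.normal(size=(200, 100))
y = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_features, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("baseline accuracy:", clf.score(X_te, y_te))
```

With random labels this baseline hovers around chance; on the real averaged embeddings it gives a useful reference point before reaching for the LSTM.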

&lt;h2&gt;
  
  
  7. Building the LSTM Model
&lt;/h2&gt;

&lt;p&gt;We now define and train our LSTM model. LSTM is a special Recurrent Neural Network (RNN) well-suited for long sequences and capturing context across words.&lt;/p&gt;

&lt;h3&gt;
  
  
  7.1 Preparing Data for the LSTM Model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;texts = data['text'].tolist()
labels = data['is_sarcastic'].tolist()

MAX_SEQUENCE_LENGTH = 30
EMBEDDING_DIM = 100

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(texts)
X_data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
y_data = np.array(labels)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We fit a fresh tokenizer specifically for the model, ensuring we have the correct sequence length and vocabulary for training.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;MAX_SEQUENCE_LENGTH = 30&lt;/code&gt;: We cap each headline at 30 tokens; shorter sequences are padded and longer ones truncated.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;y_data&lt;/code&gt;: The binary labels (0 for not sarcastic, 1 for sarcastic).&lt;/li&gt;
&lt;/ul&gt;
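&lt;p&gt;The article relies on &lt;code&gt;validation_split&lt;/code&gt; during fitting; if you also want a held-out test set, a stratified split (sketch, with stand-in arrays) can come first:&lt;/p&gt;

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-ins for the padded sequences and labels built above
X_data = np.zeros((100, 30), dtype="int32")
y_data = np.array([0, 1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X_data, y_data, test_size=0.1, stratify=y_data, random_state=42
)
print(X_train.shape, X_test.shape)  # (90, 30) (10, 30)
```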

&lt;p&gt;&lt;strong&gt;Loading GloVe Again (For the Model)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def load_glove_embeddings(file_path, embedding_dim):
    embeddings_index = {}
    with open(file_path, 'r', encoding='utf8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    return embeddings_index

def create_embedding_matrix(word_index, embeddings_index, embedding_dim):
    embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

embeddings_index = load_glove_embeddings(file_path, EMBEDDING_DIM)
embedding_matrix = create_embedding_matrix(word_index, embeddings_index, EMBEDDING_DIM)
vocab_size = len(word_index) + 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We repeat the creation of &lt;code&gt;embeddings_index&lt;/code&gt; and &lt;code&gt;embedding_matrix&lt;/code&gt; for the new tokenizer’s vocabulary.&lt;/p&gt;

&lt;h3&gt;
  
  
  7.2 Defining the LSTM Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def LSTM_RNN(vocab_size, embed_dim, embed_matrix, max_seq_len):
    embedding_layer = Embedding(vocab_size, embed_dim, weights=[embed_matrix], 
                                input_length=max_seq_len, trainable=False)

    sequence_input = Input(shape=(max_seq_len,), dtype='int32')
    embedding_sequences = embedding_layer(sequence_input)
    x = Dropout(0.2)(embedding_sequences)
    x = Conv1D(64, 5, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.01))(x)
    x = Bidirectional(LSTM(64, dropout=0.3, recurrent_dropout=0.2))(x)
    x = Dense(256, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.01))(x)
    x = Dropout(0.5)(x)
    x = Dense(128, activation='relu')(x)
    outputs = Dense(1, activation='sigmoid')(x)
    model = tf.keras.Model(sequence_input, outputs)

    model.compile(optimizer=Adam(learning_rate=1e-4), loss='binary_crossentropy', metrics=['accuracy'])
    model.summary()
    return model

lstm_model = LSTM_RNN(vocab_size, EMBEDDING_DIM, embedding_matrix, MAX_SEQUENCE_LENGTH)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key Layers include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Embedding(...)&lt;/code&gt;: Initializes an embedding layer with our GloVe vectors.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;trainable=False&lt;/code&gt;: This means we do not update these embeddings during training.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Dropout(...)&lt;/code&gt;: Randomly sets input units to 0 to reduce overfitting.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Conv1D(...)&lt;/code&gt;: A 1D convolution to capture local patterns in the text.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Bidirectional(LSTM(64, ...))&lt;/code&gt;: Our main LSTM layer with 64 units, reading the text forward and backward.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Dense(256, activation='relu')&lt;/code&gt;: A fully connected layer for learning higher-level features.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Dense(1, activation='sigmoid')&lt;/code&gt;: Outputs a probability of sarcasm (between 0 and 1).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;model.compile(...)&lt;/code&gt;: Uses the &lt;strong&gt;Adam&lt;/strong&gt; optimizer with a learning rate of &lt;code&gt;1e-4&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  7.3 Training the Model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;batch_size = 100
epochs = 10

early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=3, min_lr=1e-5)
model_checkpoint = ModelCheckpoint('best_model.keras', save_best_only=True, monitor='val_loss')

history = lstm_model.fit(
    X_data, y_data,
    epochs=epochs,
    batch_size=batch_size,
    validation_split=0.2,
    verbose=1,
    callbacks=[early_stopping, reduce_lr, model_checkpoint]
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;batch_size&lt;/code&gt;: How many samples we process in one go (set to 100).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;epochs&lt;/code&gt;: The number of times we iterate over the entire dataset (set to 10).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;EarlyStopping&lt;/code&gt;: Stops training if validation loss doesn’t improve for 5 epochs.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ReduceLROnPlateau&lt;/code&gt;: Lowers the learning rate by a factor of 0.2 if no improvement in 3 epochs.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ModelCheckpoint&lt;/code&gt;: Saves the best model parameters to &lt;code&gt;best_model.keras&lt;/code&gt; based on validation loss.&lt;/li&gt;
&lt;/ul&gt;
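&lt;p&gt;After fitting, &lt;code&gt;history.history&lt;/code&gt; holds per-epoch metrics that are worth plotting to spot overfitting. A sketch with toy values standing in for a real training run:&lt;/p&gt;

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Stand-in for history.history returned by model.fit(); values are illustrative
history_dict = {
    "loss":     [0.69, 0.55, 0.48, 0.44],
    "val_loss": [0.68, 0.58, 0.52, 0.50],
}

plt.plot(history_dict["loss"], label="train loss")
plt.plot(history_dict["val_loss"], label="val loss")
plt.xlabel("epoch")
plt.ylabel("binary cross-entropy")
plt.legend()
plt.savefig("training_curve.png")
```

If the validation curve flattens while the training curve keeps falling, the callbacks above (early stopping, learning-rate reduction) are doing their job.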

&lt;h3&gt;
  
  
  7.4 Saving the Model and Tokenizer
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;joblib.dump(tokenizer, 'tokenizer.joblib')
lstm_model.save('best_model.h5')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;We save the &lt;code&gt;tokenizer&lt;/code&gt; with &lt;code&gt;joblib&lt;/code&gt; so the app can reproduce the exact word-to-index mapping used during training.&lt;/li&gt;
&lt;li&gt;We also save the final model weights to &lt;code&gt;best_model.h5&lt;/code&gt;. Note that &lt;code&gt;ModelCheckpoint&lt;/code&gt; already saved the best-performing weights to &lt;code&gt;best_model.keras&lt;/code&gt;, which is the file the Streamlit app loads.&lt;/li&gt;
&lt;/ul&gt;
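&lt;p&gt;A quick round-trip check confirms the tokenizer survives serialization (a sketch; the same pattern applies to the saved model file):&lt;/p&gt;

```python
import joblib
from tensorflow.keras.preprocessing.text import Tokenizer

# Fit a small tokenizer, persist it, and reload it
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(["a sarcastic headline", "a sincere headline"])

joblib.dump(tokenizer, "tokenizer.joblib")
restored = joblib.load("tokenizer.joblib")

# The restored tokenizer reproduces the same integer sequences
print(tokenizer.texts_to_sequences(["a headline"]))
print(restored.texts_to_sequences(["a headline"]))
```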

&lt;h2&gt;
  
  
  8. Deploying with Streamlit
&lt;/h2&gt;

&lt;p&gt;Now that we have a trained model, we can serve it via &lt;strong&gt;Streamlit&lt;/strong&gt; for a simple web interface.&lt;/p&gt;

&lt;p&gt;Below is the &lt;code&gt;app.py&lt;/code&gt; (or any Python file you’ll run with &lt;code&gt;streamlit run app.py&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import streamlit as st
import joblib
import json 
import tensorflow as tf
from tensorflow.keras.models import load_model, save_model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np

# Load the trained model and tokenizer
@st.cache_resource
def load_resources():
    try:
        model = tf.keras.models.load_model('best_model.keras')
        tokenizer = joblib.load('tokenizer.joblib')
        return model, tokenizer
    except Exception as e:
        print(f"Error loading model: {e}")
        raise

model, tokenizer = load_resources()

# Define constants
MAX_SEQUENCE_LENGTH = 30

# Title
st.title("Sarcasm Detection Model")

# Input form
input_text = st.text_input("Enter a sentence to check for sarcasm:", "")

if st.button("Predict"):
    if input_text.strip():
        # Preprocess the input text
        sequences = tokenizer.texts_to_sequences([input_text])
        padded_seq = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

        # Predict using the model
        prediction = model.predict(padded_seq)[0][0]

        # Output prediction
        st.subheader("Prediction:")
        if prediction &amp;gt; 0.5:
            st.write(f"**Sarcastic** with a probability of {prediction:.2f}")
        else:
            st.write(f"**Not Sarcastic** with a probability of {1 - prediction:.2f}")
    else:
        st.warning("Please enter a valid sentence.")

st.markdown("This app uses an LSTM model trained on a sarcasm dataset.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;@st.cache_resource&lt;/code&gt;: Caches the model and tokenizer so they’re loaded only once.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;load_model('best_model.keras')&lt;/code&gt;: Loads the best model saved during training.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User Input&lt;/strong&gt;: &lt;code&gt;st.text_input(...)&lt;/code&gt; collects user text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preprocessing&lt;/strong&gt;: Convert the text input into sequences, pad them to the same length as the training data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prediction&lt;/strong&gt;: Model outputs a probability, and we decide “Sarcastic” if &amp;gt; 0.5.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streamlit UI&lt;/strong&gt;: We display the result using &lt;code&gt;st.write(...)&lt;/code&gt; and a neat probability format.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  9. Putting It All Together
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Clean and preprocess&lt;/strong&gt; your dataset in a Python script or notebook.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Train&lt;/strong&gt; the LSTM model to detect sarcasm, making sure you save &lt;code&gt;best_model.keras&lt;/code&gt; (or &lt;code&gt;best_model.h5&lt;/code&gt;) and &lt;code&gt;tokenizer.joblib&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Create a &lt;strong&gt;Streamlit&lt;/strong&gt; file (&lt;code&gt;app.py&lt;/code&gt;) for deployment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run&lt;/strong&gt; the following command in your terminal:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;streamlit run app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Open the local &lt;strong&gt;URL&lt;/strong&gt; displayed (usually &lt;code&gt;http://localhost:8501&lt;/code&gt;). You’ll see a text box where you can enter sentences, and the app will tell you whether each one is sarcastic!&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Here is the link to the complete code on&lt;/strong&gt; &lt;a href="https://github.com/kennyOlakunle/sarcasm_detection" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  10. Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Congratulations!&lt;/strong&gt; You’ve built a &lt;strong&gt;Sarcasm Detection&lt;/strong&gt; application from scratch.&lt;/p&gt;

&lt;p&gt;In this blog post, we walked through &lt;strong&gt;end-to-end sarcasm detection&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data ingestion and detailed &lt;strong&gt;cleaning&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding&lt;/strong&gt; the text using GloVe vectors.&lt;/li&gt;
&lt;li&gt;Building and &lt;strong&gt;training&lt;/strong&gt; an LSTM network with additional convolutional layers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploying&lt;/strong&gt; to an interactive interface via &lt;strong&gt;Streamlit&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  11. Next Steps
&lt;/h2&gt;

&lt;p&gt;This process demonstrates &lt;strong&gt;end-to-end NLP&lt;/strong&gt; for a challenging classification task. You can extend this workflow by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Experiment with more advanced embeddings&lt;/strong&gt; like &lt;strong&gt;BERT&lt;/strong&gt; or &lt;strong&gt;ELMo&lt;/strong&gt; to capture deeper contextual information.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hyperparameter tuning&lt;/strong&gt;: Adjust LSTM units, batch size, or layer configurations for better accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data augmentation&lt;/strong&gt;: If the dataset is small, consider collecting more sarcastic/non-sarcastic samples or using advanced augmentation strategies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expand the domain&lt;/strong&gt;: Apply this approach to other text sources, such as tweets or product reviews.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add interpretability&lt;/strong&gt;: Use libraries like SHAP or LIME to see which words most influence your model’s decisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Address class imbalance&lt;/strong&gt;: Add more data or apply class weighting if one class is much more frequent than the other.&lt;/li&gt;
&lt;/ol&gt;
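&lt;p&gt;One of the next steps above, handling class imbalance, can be addressed by passing class weights to Keras. A minimal sketch of computing balanced weights (the label list below is illustrative):&lt;/p&gt;

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency:
    w_c = n_samples / (n_classes * count_c), scikit-learn's 'balanced' rule."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Illustrative labels: 6 non-sarcastic (0) vs. 2 sarcastic (1) headlines
weights = balanced_class_weights([0, 0, 0, 0, 0, 0, 1, 1])
print(weights)  # the rarer class 1 gets the larger weight (2.0)
```

&lt;p&gt;The resulting dictionary can be passed straight to &lt;code&gt;model.fit(..., class_weight=weights)&lt;/code&gt; in Keras.&lt;/p&gt;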

&lt;p&gt;I hope this tutorial helps you confidently build your own &lt;strong&gt;NLP&lt;/strong&gt; applications for sarcasm detection or any other text classification problem!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thank you for reading&lt;/strong&gt;. Happy coding, and may your model’s sense of sarcasm improve daily! Feel free to comment or reach out if you have any questions or suggestions.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection" rel="noopener noreferrer"&gt;Sarcasm Headlines Dataset (original Kaggle dataset)&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://nlp.stanford.edu/projects/glove/" rel="noopener noreferrer"&gt;GloVe: Global Vectors for Word Representation:&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://numpy.org/" rel="noopener noreferrer"&gt;Numpy&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://pandas.pydata.org/" rel="noopener noreferrer"&gt;Pandas&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.nltk.org/" rel="noopener noreferrer"&gt;NLTK (Natural Language Toolkit)&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.tensorflow.org/" rel="noopener noreferrer"&gt;TensorFlow&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://keras.io/" rel="noopener noreferrer"&gt;Keras API reference&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://streamlit.io/" rel="noopener noreferrer"&gt;Streamlit&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://scikit-learn.org/stable/" rel="noopener noreferrer"&gt;Scikit-learn&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://joblib.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;Joblib&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>nlp</category>
      <category>lstm</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Downloading and Converting YouTube Videos to MP3 using yt-dlp in Python</title>
      <dc:creator>Kehinde Abe</dc:creator>
      <pubDate>Sun, 06 Oct 2024 22:14:48 +0000</pubDate>
      <link>https://dev.to/_ken0x/downloading-and-converting-youtube-videos-to-mp3-using-yt-dlp-in-python-20c5</link>
      <guid>https://dev.to/_ken0x/downloading-and-converting-youtube-videos-to-mp3-using-yt-dlp-in-python-20c5</guid>
      <description>&lt;p&gt;The popularity of video content on websites like YouTube has increased user demand for audio extraction, particularly for podcasts, lectures, and music files. While plenty of web-based options are available for downloading audio, they frequently have restrictions on access, intrusive advertisements, or a decrease. This post explains how to download and convert YouTube videos into MP3 files using Python and yt-dlp, guaranteeing a high-quality result.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview of &lt;code&gt;yt-dlp&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;yt-dlp&lt;/code&gt; is a powerful, community-driven fork of &lt;code&gt;youtube-dl&lt;/code&gt;. It includes numerous enhancements and optimizations and supports downloading from a wide range of websites, with a primary focus on YouTube. The tool provides extensive options for video/audio quality, output formats, and file handling. In this article, we'll focus on using &lt;code&gt;yt-dlp&lt;/code&gt; to extract the best possible audio from YouTube videos and save it in MP3 format.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before diving into the code, you’ll need to ensure that your environment is set up properly.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Python&lt;/strong&gt;: Make sure you have Python 3 installed on your machine. You can download it from &lt;a href="https://www.python.org/" rel="noopener noreferrer"&gt;Python.org&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;yt-dlp&lt;/strong&gt;: Install &lt;code&gt;yt-dlp&lt;/code&gt; using &lt;code&gt;pip&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code&gt;pip install yt-dlp&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;FFmpeg&lt;/strong&gt;: For post-processing (i.e., converting audio formats), &lt;code&gt;FFmpeg&lt;/code&gt; must be installed. It is the core tool used for converting video and audio formats. You can install it via your package manager:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;On macOS (using Homebrew):
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;brew install ffmpeg

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Windows: Download the executable from &lt;a href="https://ffmpeg.org/download.html" rel="noopener noreferrer"&gt;FFmpeg's official site&lt;/a&gt; and follow the installation instructions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Python Script to Download YouTube Videos and Convert to MP3
&lt;/h2&gt;

&lt;p&gt;Here is the Python script that performs the download and conversion of YouTube videos into MP3 format using &lt;code&gt;yt-dlp&lt;/code&gt; and &lt;code&gt;FFmpeg&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import yt_dlp
playlist_url = 'https://www.youtube.com/playlist?list=YOUR_PLAYLIST_ID'  
save_path = 'downloads/' 
def download_best_audio_as_mp3(video_url, save_path=save_path):
    ydl_opts = {
        'outtmpl': save_path + '/%(title)s.%(ext)s',  # Save path and file name
        'postprocessors': [{  # Post-process to convert to MP3
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',  # Convert to mp3
            'preferredquality': '0',  # '0' means best quality, auto-determined by source
        }],
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([video_url])
download_best_audio_as_mp3(video_url, save_path)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Components of the Script:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;yt_dlp.YoutubeDL&lt;/code&gt; &lt;strong&gt;Object&lt;/strong&gt;: This is the main object provided by &lt;code&gt;yt-dlp&lt;/code&gt; that allows us to configure how the download is performed. The &lt;code&gt;ydl_opts&lt;/code&gt; dictionary contains various options:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;'outtmpl'&lt;/code&gt;: Specifies the output template, including the directory &lt;code&gt;(save_path)&lt;/code&gt; and filename &lt;code&gt;(%(title)s)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;'postprocessors'&lt;/code&gt;: This option enables post-processing of the downloaded content, in this case, extracting audio and converting it to MP3 format.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;'preferredcodec'&lt;/code&gt;: Specifies the codec for the audio conversion, which is set to MP3.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;'preferredquality'&lt;/code&gt;: A value of &lt;code&gt;'0'&lt;/code&gt; ensures that the best possible quality is selected automatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Downloading and Conversion&lt;/strong&gt;: The &lt;code&gt;download&lt;/code&gt; method accepts a list of URLs, allowing multiple videos to be processed. In our example, we pass a single URL; a playlist URL works as well, since &lt;code&gt;yt-dlp&lt;/code&gt; expands it automatically.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MP3 Conversion&lt;/strong&gt;: After the &lt;code&gt;download&lt;/code&gt;, the postprocessor &lt;code&gt;(FFmpegExtractAudio)&lt;/code&gt; converts the audio to MP3 format using &lt;code&gt;FFmpeg&lt;/code&gt;. The &lt;code&gt;preferredquality&lt;/code&gt; option ensures the highest available audio quality is used.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Customizing the Script
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Adjusting the Save Path&lt;/strong&gt;: You can modify the &lt;code&gt;save_path&lt;/code&gt; variable to store the downloaded MP3 files in a different location:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;save_path = '/path/to/your/folder/'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Downloading a Playlist&lt;/strong&gt;: To download a full playlist, simply provide the playlist URL instead of a single video URL. &lt;code&gt;yt-dlp&lt;/code&gt; will automatically handle the playlist and download each video sequentially:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;video_url = 'https://www.youtube.com/playlist?list=PLxyz...'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Choosing a Different Audio Format&lt;/strong&gt;:
If you want a different format (e.g., AAC or WAV), you can change the &lt;code&gt;'preferredcodec'&lt;/code&gt; value:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
'preferredcodec': 'aac',  # Convert to AAC format

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
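&lt;p&gt;If you switch codecs often, the options dictionary can be produced by a small helper function (a hypothetical convenience wrapper for this tutorial, not part of &lt;code&gt;yt-dlp&lt;/code&gt; itself):&lt;/p&gt;

```python
def make_audio_opts(save_path, codec='mp3', quality='0'):
    """Build a yt-dlp options dict that downloads the best audio stream
    and post-processes it into the requested codec via FFmpeg."""
    return {
        'format': 'bestaudio/best',
        'outtmpl': save_path + '/%(title)s.%(ext)s',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': codec,
            'preferredquality': quality,
        }],
    }

# Reuse the same helper for MP3, AAC, or WAV downloads:
opts = make_audio_opts('downloads', codec='aac')
print(opts['postprocessors'][0]['preferredcodec'])  # aac
```

&lt;p&gt;The returned dictionary can be passed directly to &lt;code&gt;yt_dlp.YoutubeDL(opts)&lt;/code&gt;.&lt;/p&gt;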



&lt;p&gt;The full code can be found through this &lt;a href="https://github.com/kennyOlakunle/youtubedownload" rel="noopener noreferrer"&gt;GitHub page&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By combining &lt;code&gt;yt-dlp&lt;/code&gt; and &lt;code&gt;FFmpeg&lt;/code&gt;, you can easily automate the process of downloading YouTube videos and converting them into high-quality MP3 files with minimal effort. This approach offers flexibility and power over web-based solutions, making it ideal for developers and power users who want to streamline their media processing workflows.&lt;br&gt;
This script can be expanded to include more advanced functionalities like batch processing, downloading subtitles, or selecting specific audio streams. With its modular design and strong community support, &lt;code&gt;yt-dlp&lt;/code&gt; continues to be one of the most reliable tools for managing multimedia content from the web.&lt;/p&gt;

</description>
      <category>python</category>
      <category>automation</category>
      <category>webdev</category>
      <category>coding</category>
    </item>
    <item>
      <title>TinyLlama LLM: A Step-by-Step Guide to Implementing the 1.1B Model on Google Colab</title>
      <dc:creator>Kehinde Abe</dc:creator>
      <pubDate>Sat, 06 Jan 2024 18:13:42 +0000</pubDate>
      <link>https://dev.to/_ken0x/tinyllama-llm-a-step-by-step-guide-to-implementing-the-11b-model-on-google-colab-1pjh</link>
      <guid>https://dev.to/_ken0x/tinyllama-llm-a-step-by-step-guide-to-implementing-the-11b-model-on-google-colab-1pjh</guid>
      <description>&lt;p&gt;LLM, or Large Language Model, is an advanced artificial intelligence program designed for understanding, generating, and working with human language. It's trained on a vast array of text data, allowing it to assist in tasks like answering questions, writing essays, translating languages, and even creative writing. They can be used in various applications, from chatbots to research tools, due to their ability to understand context and generate coherent and contextually relevant responses.&lt;/p&gt;

&lt;p&gt;In this tutorial, you’ll learn the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Introduction and Overview of TinyLlama&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Understanding the System requirements&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Step-by-Step implementation&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;TinyLlama emerges as a standout choice in the rapidly evolving landscape of language model technology. This guide is crafted for data scientists, AI enthusiasts, and curious learners, aiming to demystify the deployment of the 1.1 billion parameter TinyLlama model. With its unique blend of power and compactness, TinyLlama reshapes expectations in machine learning, offering versatility for both local and cloud-based environments like Google Colab. &lt;br&gt;
After reading this article, you'll gain insights into setting up and leveraging TinyLlama on Google Colab for your projects or explorations. We'll provide a detailed roadmap for maximizing TinyLlama's capabilities across different use cases. All resources and tools referenced are linked at the conclusion for your convenience.&lt;/p&gt;
&lt;h3&gt;
  
  
  Overview of the TinyLlama and its significance
&lt;/h3&gt;

&lt;p&gt;TinyLlama is more than just an AI model; it's a beacon of innovation in generative AI. Trained on a staggering 3 trillion tokens, it showcases a seamless integration with numerous projects based on the Llama framework. TinyLlama's compact yet robust architecture, featuring only 1.1 billion parameters, makes it an ideal solution for applications with limited computational resources. &lt;br&gt;
Notably, it shares the same architecture and tokenizer as Llama 2, ensuring high-quality and consistent performance. One notable use case of TinyLlama is in content generation, where its efficiency and accuracy have been greatly valued.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The TinyLlama project aims to pre-train a 1.1B Llama model on 3 trillion tokens. We can achieve this with proper optimization within "just" 90 days using 16 A100-40G GPUs 🚀🚀 - TinyLlama Team.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Understanding the Technical Requirements
&lt;/h3&gt;

&lt;p&gt;Before diving into Large Language Models (LLMs) with TinyLlama, let's cover the basic tools and environment setup. I'll use macOS for this tutorial (I'm on a MacBook Pro M1), but the installations are similar on other platforms. &lt;br&gt;
Let me walk you through some key installations and their purposes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Firstly, Python&lt;/strong&gt;. This versatile programming language is the backbone of many applications, including data analysis and machine learning. The easiest way to install Python on macOS is through Homebrew. Simply open Terminal and type &lt;code&gt;brew install python&lt;/code&gt;. If Homebrew isn't your thing, you can directly download Python from &lt;a href="https://www.python.org/downloads/" rel="noopener noreferrer"&gt;python.org&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next up, Jupyter Notebook&lt;/strong&gt;. This is a fantastic tool for anyone involved in coding, data science, or just wanting to experiment with Python. It lets you create documents with live code and visualizations. I installed Jupyter on my Mac using Python's package manager by running a &lt;code&gt;pip install notebook&lt;/code&gt; in the Terminal. Launching it is as simple as typing &lt;code&gt;jupyter notebook&lt;/code&gt;. Another way to install and use python is through &lt;a href="https://docs.anaconda.com/free/anaconda/getting-started/index.html" rel="noopener noreferrer"&gt;Anaconda&lt;/a&gt;. Click on the link and follow the steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google Colab is another gem&lt;/strong&gt;. It's a cloud service offering a Jupyter Notebook environment without any setup. It's particularly handy for sharing projects and accessing GPUs for free. Access it by visiting &lt;a href="//colab.research.google.com"&gt;Google Colab&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Some other technical requirements:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;System Memory:&lt;/strong&gt; 550MB minimum.&lt;br&gt;
&lt;strong&gt;RAM:&lt;/strong&gt; Up to 3.8GB for optimal performance.&lt;br&gt;
&lt;strong&gt;Platform:&lt;/strong&gt; Compatible with Google Colab and local setups using VScode with Jupyter Notebook or a Python file.&lt;br&gt;
&lt;strong&gt;Google Account:&lt;/strong&gt; Required for Google Colab access.&lt;br&gt;
&lt;strong&gt;Google Colab Versions:&lt;/strong&gt; Free version for development (CPU &amp;amp; GPU) and a Pro version for intensive computation.&lt;/p&gt;

&lt;p&gt;With these tools and tips, setting up a robust and flexible environment for working with LLM on a macOS system becomes a smooth and efficient process.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step-by-Step Implementation Guide
&lt;/h3&gt;

&lt;p&gt;Using Colab, I will start with the CPU. To switch between runtime environments, go to the right-hand side of the interface and click &lt;strong&gt;Connect&lt;/strong&gt;. You will see the menu shown below; select &lt;strong&gt;Change runtime type&lt;/strong&gt;, and a modal will pop up where you can choose the runtime you need, either CPU or GPU. Either works for this tutorial.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqv1iw71qirfp3qfi8tg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqv1iw71qirfp3qfi8tg.png" alt="Change runtime environment" width="688" height="606"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyiatkg16qveliep428g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyiatkg16qveliep428g.png" alt="Select CPU or T4 GPU environment" width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After this, we can install the tools and libraries we need.&lt;/p&gt;
&lt;h4&gt;
  
  
  Implementation/Method 1
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!pip3 install huggingface-hub

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Note: the &lt;code&gt;CMAKE_ARGS&lt;/code&gt; prefix on the first command matters. It makes &lt;code&gt;pip&lt;/code&gt; build &lt;code&gt;llama.cpp&lt;/code&gt; from source using CMake and your system’s C compiler (required), installing the library alongside the Python package. llama-cpp-python supports several backends, which can be enabled by setting the &lt;code&gt;CMAKE_ARGS&lt;/code&gt; environment variable before installing.&lt;/p&gt;

&lt;p&gt;Let's install the model needed for this task. We are using the recently launched 1.1B parameter V1.0 Chat Completion model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next step is to import the Python Bindings for &lt;code&gt;llama.cpp&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_cpp import Llama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next is configuring the LLM from Llama, including the model path and other required parameters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;llm = Llama(model_path="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
            n_ctx=2048,
            n_threads=8,
            n_gpu_layers=35)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The final step for this method is to call and use the chat completion function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
llm.create_chat_completion(
      messages = [
        {
          "role": "system",
          "content": "You are a story writing assistant"

        },
        {
          "role": "user",
          "content": "Write an extensive story about Life as a young Adult"
        }
      ]
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it! The implementation is as simple as that.&lt;/p&gt;

&lt;p&gt;If you followed this tutorial step by step, running the above should return the response in the image below, including the full content returned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjkrm0pshwgqfsmct0tb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjkrm0pshwgqfsmct0tb.png" alt="Response" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The response will also include other details of the result.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'finish_reason': 'length'}],
'usage': {'prompt_tokens': 36,
'completion_tokens': 476, 'total_tokens': 512}}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
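&lt;p&gt;Since &lt;code&gt;create_chat_completion&lt;/code&gt; returns an OpenAI-style dictionary, the generated text and the token counts can be pulled out with a couple of key lookups. Here is a sketch that runs against a stubbed-down response (the stub mirrors the field names shown in the snippet above):&lt;/p&gt;

```python
def extract_reply(response):
    """Pull the assistant's text and total token usage out of an
    OpenAI-style chat-completion response dict."""
    text = response['choices'][0]['message']['content']
    total = response['usage']['total_tokens']
    return text, total

# Stub mirroring the shape of llm.create_chat_completion(...)'s output
stub = {
    'choices': [{'message': {'role': 'assistant',
                             'content': 'Once upon a time...'},
                 'finish_reason': 'length'}],
    'usage': {'prompt_tokens': 36,
              'completion_tokens': 476,
              'total_tokens': 512},
}
text, total = extract_reply(stub)
print(total)  # 512
```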



&lt;p&gt;The full code is below, and I will add the GitHub link to the project at the end of this article.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6g0pi6ri4a510w66vah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6g0pi6ri4a510w66vah.png" alt="Full code to method 1" width="800" height="944"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Implementation/Method 2
&lt;/h4&gt;

&lt;p&gt;I will add all the code needed here, and then you can run it locally or in Google Colab. It's almost the same process with a slight difference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#There is always an issue of "ImportError: Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate: `pip install accelerate`"
#so use the command below to install accelerare from the package manager.


!pip -qqq install bitsandbytes accelerate


import torch
from transformers import pipeline

pipe = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.bfloat16, device_map="auto")

# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
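&lt;p&gt;For reference, TinyLlama-1.1B-Chat-v1.0 uses a Zephyr-style chat template, so &lt;code&gt;apply_chat_template&lt;/code&gt; produces a prompt string roughly like the one built below. This hand-rolled approximation is for illustration only; in real use, rely on the tokenizer’s own template:&lt;/p&gt;

```python
def zephyr_format(messages, add_generation_prompt=True):
    """Approximate the Zephyr-style chat template used by TinyLlama-Chat:
    each turn becomes '<|role|>' + newline + content + '</s>' + newline,
    optionally followed by an assistant header to prompt generation."""
    parts = [f"<|{m['role']}|>\n{m['content']}</s>\n" for m in messages]
    if add_generation_prompt:
        parts.append("<|assistant|>\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "Tell me a joke."},
]
print(zephyr_format(messages))
```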



&lt;p&gt;If you follow the process above, you should get a result with the entire content that looks like the one below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgk0sbpvy3wewupmd4i1u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgk0sbpvy3wewupmd4i1u.png" alt="Method 2 result" width="800" height="268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kennyOlakunle/TinyLlama_local_implementation" rel="noopener noreferrer"&gt;GitHub link&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In conclusion, TinyLlama reflects the rapid advancements in AI, balancing high performance with resource efficiency. Its 1.1 billion parameters make it suitable for diverse applications, from Google Colab to local setups, ensuring user-friendly implementation for newcomers and veterans alike. The minimal hardware requirements enhance its accessibility, opening doors to innovative AI interactions and applications in various fields. This guide has highlighted TinyLlama's practical utility in chat completions and other AI tasks, supported by a robust community on platforms like GitHub and HuggingFace. As we continue to witness the evolution of AI, TinyLlama exemplifies a practical and powerful tool in this journey, making advanced machine-learning models more accessible to a broader audience.&lt;/p&gt;

&lt;p&gt;Thank you for reading! 🦙🦙🦙🚀🚀🚀&lt;/p&gt;

&lt;p&gt;Keep learning and Happy Coding!&lt;/p&gt;

&lt;p&gt;You can find me on &lt;a href="https://www.linkedin.com/in/kehindeabe/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; and &lt;a href="https://twitter.com/_Ken0x" rel="noopener noreferrer"&gt;Twitter(X)&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/jzhang38/TinyLlama?tab=readme-ov-file" rel="noopener noreferrer"&gt;TinyLlama GitHub project&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0" rel="noopener noreferrer"&gt;TinyLlama-1.1B-Chat-v1.0 on HuggingFace&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T" rel="noopener noreferrer"&gt;TinyLlama Base&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/abetlen/llama-cpp-python" rel="noopener noreferrer"&gt;Python Bindings for llama.cpp&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/spaces/PY007/TinyLlama-Chat" rel="noopener noreferrer"&gt;TinyLlama Playground&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
