<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kehinde Abe</title>
    <description>The latest articles on DEV Community by Kehinde Abe (@_ken0x).</description>
    <link>https://dev.to/_ken0x</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F150895%2Fdcb28a86-3698-4c85-a7e7-cbc4a9e7b444.png</url>
      <title>DEV Community: Kehinde Abe</title>
      <link>https://dev.to/_ken0x</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/_ken0x"/>
    <language>en</language>
    <item>
      <title>Building a Sarcasm Detection System with LSTM and GloVe: A Complete Guide</title>
      <dc:creator>Kehinde Abe</dc:creator>
      <pubDate>Thu, 02 Jan 2025 10:28:53 +0000</pubDate>
      <link>https://dev.to/_ken0x/building-a-sarcasm-detection-system-with-lstm-and-glove-a-complete-guide-m1h</link>
      <guid>https://dev.to/_ken0x/building-a-sarcasm-detection-system-with-lstm-and-glove-a-complete-guide-m1h</guid>
      <description>&lt;p&gt;Before We Begin :)&lt;/p&gt;

&lt;p&gt;Detecting sarcasm is more than just spotting ironic statements. It involves understanding tone, context, and sometimes even cultural nuances. Sarcasm can be difficult for machines to detect in social media posts, news headlines, or everyday conversations because it contradicts the literal meaning of words. Yet, modern NLP techniques can pick up on these subtleties better than ever with the right approach and data preprocessing.&lt;/p&gt;

&lt;p&gt;Below, you’ll find a detailed, step-by-step guide on how to build your own sarcasm detection model using LSTM (Long Short-Term Memory) networks and GloVe embeddings. From data cleaning and preprocessing to model deployment in a Streamlit application, this post covers every element you need to create a robust sarcasm detection system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;Tools and Environment Setup&lt;/li&gt;
&lt;li&gt;Importing Libraries&lt;/li&gt;
&lt;li&gt;Loading and Inspecting Data&lt;/li&gt;
&lt;li&gt;Data Cleaning&lt;/li&gt;
&lt;li&gt;Removing Special Characters&lt;/li&gt;
&lt;li&gt;Additional Noise Removal (URLs, HTML, Non-ASCII, Punctuation)&lt;/li&gt;
&lt;li&gt;Handling Slang, Acronyms, and Common Abbreviations&lt;/li&gt;
&lt;li&gt;Stopword Removal and Lemmatization&lt;/li&gt;
&lt;li&gt;Using GloVe Embeddings&lt;/li&gt;
&lt;li&gt;Creating the Embedding Matrix&lt;/li&gt;
&lt;li&gt;Creating Feature Vectors&lt;/li&gt;
&lt;li&gt;Building the LSTM Model&lt;/li&gt;
&lt;li&gt;Preparing Data for the LSTM Model&lt;/li&gt;
&lt;li&gt;Defining the LSTM Architecture&lt;/li&gt;
&lt;li&gt;Training the Model&lt;/li&gt;
&lt;li&gt;Saving the Model and Tokenizer&lt;/li&gt;
&lt;li&gt;Deployment with Streamlit&lt;/li&gt;
&lt;li&gt;Putting It All Together&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;li&gt;Next Steps&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;Sarcasm detection is a fascinating natural language processing (NLP) challenge. Sarcastic statements often convey the opposite of their literal meaning, making them tricky for machines to identify. For instance, the sentence “I love getting stuck in traffic for hours” may say you enjoy traffic, but in reality, you mean the opposite. Automated sarcasm detection requires models that can glean subtle contextual cues. In this post, we’ll train an LSTM model on a sarcasm headlines dataset and deploy it using Streamlit to create a friendly, interactive web interface.&lt;/p&gt;

&lt;p&gt;This step-by-step guide will show you how to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Load and analyze a sarcasm dataset&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clean and preprocess the text data&lt;/strong&gt; (removing special characters, URLs, punctuation, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use GloVe embeddings&lt;/strong&gt; (a widely used word embedding technique) to represent our text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build and train an LSTM model&lt;/strong&gt; to classify whether a sentence is sarcastic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy our trained model on Streamlit&lt;/strong&gt; for a user-friendly web interface.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By the end of this tutorial, you will have a functional sarcasm-detection application!&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Tools and Environment Setup
&lt;/h2&gt;

&lt;p&gt;To run this project, you’ll need a few key tools and libraries:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Python 3.x&lt;/strong&gt; (preferably 3.7+)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pip&lt;/strong&gt; or &lt;strong&gt;conda&lt;/strong&gt; for installing packages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TensorFlow&lt;/strong&gt; and &lt;strong&gt;Keras&lt;/strong&gt; for building deep learning models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NLTK&lt;/strong&gt; for text processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GloVe embeddings&lt;/strong&gt; – specifically &lt;code&gt;glove.6B.100d.txt&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streamlit&lt;/strong&gt; for deployment&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Recommended steps to set up a virtual environment&lt;/strong&gt; (using &lt;code&gt;pip&lt;/code&gt; and &lt;code&gt;venv&lt;/code&gt; as an example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create and activate virtual environment
python -m venv sarcasm-env
source sarcasm-env/bin/activate  # Linux/Mac
# or:
sarcasm-env\Scripts\activate  # Windows

# Install necessary libraries
pip install numpy pandas matplotlib plotly scikit-learn nltk tensorflow joblib streamlit xgboost lightgbm catboost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, &lt;strong&gt;download&lt;/strong&gt; the GloVe file (&lt;code&gt;glove.6B.100d.txt&lt;/code&gt;) and place it in a folder named &lt;code&gt;dataset/&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Importing Libraries
&lt;/h2&gt;

&lt;p&gt;We start by importing the libraries we need for data manipulation, NLP tasks, and building deep learning models. Below is the snippet we use in our Jupyter Notebook or Python script.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

import joblib
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import re
import itertools    
import wordcloud

# For data preprocessing
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import WordNetLemmatizer

# For building our Models
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Conv1D, Bidirectional, SpatialDropout1D, Dropout
from tensorflow.keras import Sequential
from tensorflow.keras.optimizers import Adam
from sklearn.svm import LinearSVC, SVC
from sklearn.model_selection import cross_val_score, KFold, cross_val_predict, cross_validate

from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from tensorflow.keras.models import Model

# For Lazy Predict (commented out)
# from lazypredict.Supervised import LazyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# For hyperparameter tuning
from scipy.stats import uniform

# Reduce dimensions to 2 for faster training
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# For creating vocabulary dictionary
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# For model evaluation
from sklearn.model_selection import LearningCurveDisplay, learning_curve
from sklearn.metrics import confusion_matrix, classification_report, log_loss, make_scorer, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, auc, DetCurveDisplay, RocCurveDisplay, roc_curve, ConfusionMatrixDisplay
from sklearn.model_selection import GridSearchCV

# For processing texts
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What’s happening here?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;warnings.filterwarnings('ignore')&lt;/code&gt;: Suppresses any warnings for cleaner output.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pandas&lt;/code&gt;, &lt;code&gt;numpy&lt;/code&gt;, &lt;code&gt;matplotlib&lt;/code&gt;, &lt;code&gt;plotly&lt;/code&gt;, &lt;strong&gt;etc&lt;/strong&gt;.: Data handling, visualization libraries.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nltk&lt;/code&gt;, &lt;code&gt;WordNetLemmatizer&lt;/code&gt;, &lt;code&gt;stopwords&lt;/code&gt;: Common NLP libraries for text processing.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tensorflow&lt;/code&gt;, &lt;code&gt;keras&lt;/code&gt;: Building and training our deep learning model (LSTM).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sklearn&lt;/code&gt;: Traditional machine learning tools, plus utilities for train/test split, hyperparameter tuning, etc.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;joblib&lt;/code&gt;: For saving and loading models/tokenizers.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4. Loading and Inspecting Data
&lt;/h2&gt;

&lt;p&gt;Here, we load a CSV file that contains sarcasm headlines data. Let’s see how many records we have and if there are duplicates.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = pd.read_csv('dataset/sarcasm_headlines.csv')

def check_duplicates(data):
    duplicate = data.duplicated().sum()
    return duplicate

print(check_duplicates(data))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pd.read_csv(...)&lt;/code&gt; loads our CSV data into a DataFrame named &lt;code&gt;data&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;check_duplicates&lt;/code&gt; function sums up duplicated rows (if any).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;print(check_duplicates(data))&lt;/code&gt; shows how many duplicate entries exist.&lt;/li&gt;
&lt;/ul&gt;
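&lt;p&gt;If the check reports repeated rows, a single pandas call removes them before further processing. A minimal sketch, using a tiny stand-in DataFrame in place of the real headlines data:&lt;/p&gt;

```python
import pandas as pd

# Tiny stand-in frame; in the article, `data` is the sarcasm headlines DataFrame
data = pd.DataFrame({
    "text": ["a headline", "a headline", "another one"],
    "is_sarcastic": [1, 1, 0],
})

print(data.duplicated().sum())  # 1 duplicate row in this toy frame
data = data.drop_duplicates().reset_index(drop=True)
print(data.duplicated().sum())  # 0
```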

&lt;h2&gt;
  
  
  5. Data Cleaning
&lt;/h2&gt;

&lt;p&gt;Sarcasm detection often depends on subtle textual cues, and clean, standardized input can significantly improve model performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.1 Removing Special Characters
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Show the special characters in the text column
data[data['text'].str.contains(r'[^A-Za-z0-9 ]', regex=True)]

# Function to remove special characters
def remove_special_characters(text):
    text = re.sub(r'[^A-Za-z0-9 ]', '', text)
    return text

# Apply function to remove special characters
data['text'] = data['text'].apply(remove_special_characters)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;data['text'].str.contains(...)&lt;/code&gt;: Check if special characters exist in each row’s text.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;remove_special_characters&lt;/code&gt;: Uses a regular expression to replace all non-alphanumeric characters (excluding space) with nothing.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;apply(...)&lt;/code&gt;: Applies the cleaning function to each entry in the &lt;code&gt;text&lt;/code&gt; column.&lt;/li&gt;
&lt;/ul&gt;
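&lt;p&gt;A quick sanity check of the cleaning function on a sample headline (the input string is illustrative):&lt;/p&gt;

```python
import re

def remove_special_characters(text):
    # Same pattern as above: keep only letters, digits, and spaces
    return re.sub(r'[^A-Za-z0-9 ]', '', text)

print(remove_special_characters("Wow... another 'great' Monday!!!"))
# Wow another great Monday
```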

&lt;p&gt;&lt;strong&gt;Further Checks&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def special_characters(data):
    special = data.str.contains(r'[^A-Za-z0-9 ]', regex=True).sum()
    return special

special_characters(data['text'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This confirms whether any special characters remain after removal.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.2 Additional Noise Removal (URLs, HTML, Non-ASCII, Punctuation)
&lt;/h3&gt;

&lt;p&gt;Here, we define small utility functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def remove_URL(text):
    return re.sub(r"https?://\S+|www\.\S+", "", text)

def remove_html(text):
    html = re.compile(r"&amp;lt;.*?&amp;gt;|&amp;amp;([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});")
    return re.sub(html, "", text)

def remove_non_ascii(text):
    return re.sub(r'[^\x00-\x7f]',r'', text)

def remove_punct(text):
    return re.sub(r'[]!"$%&amp;amp;\'()*+,./:;=#@?[\\^_`{|}~-]+', "", text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;remove_URL&lt;/code&gt;: Removes URLs.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;remove_html&lt;/code&gt;: Removes HTML tags or entities.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;remove_non_ascii&lt;/code&gt;: Removes non-ASCII characters.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;remove_punct&lt;/code&gt;: Removes punctuation.&lt;/li&gt;
&lt;/ul&gt;
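&lt;p&gt;Two of these helpers can be sanity-checked on a noisy sample string (a sketch with an illustrative input):&lt;/p&gt;

```python
import re

def remove_URL(text):
    return re.sub(r"https?://\S+|www\.\S+", "", text)

def remove_non_ascii(text):
    return re.sub(r'[^\x00-\x7f]', '', text)

# Illustrative noisy input: a URL, an em dash, and an emoji
sample = "check this out https://example.com \u2014 so exciting \U0001F644"
cleaned = remove_non_ascii(remove_URL(sample))
print(cleaned)  # URL, dash, and emoji are all stripped
```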

&lt;p&gt;&lt;strong&gt;Additional Slang, Acronyms, and Common Abbreviations&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def other_clean(text):
    # Contains dictionaries of slang, acronyms, abbreviations
    # Replaces them with their expanded forms
    ...
    return text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function corrects typos and expands acronyms like “wtf” to “what the fuck” to make the text more standardized.&lt;/p&gt;
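&lt;p&gt;The dictionaries are elided in the snippet above; a minimal sketch of the idea, with a tiny illustrative slang map rather than the article’s full lists:&lt;/p&gt;

```python
# Tiny illustrative slang map; the real other_clean uses much larger dictionaries
SLANG = {"u": "you", "r": "are", "btw": "by the way", "idk": "I do not know"}

def other_clean(text):
    # Replace whole-word slang tokens with their expanded forms
    return " ".join(SLANG.get(w.lower(), w) for w in text.split())

print(other_clean("idk btw u r funny"))
# I do not know by the way you are funny
```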

&lt;p&gt;Finally, we apply them all:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data["text"] = data["text"].apply(lambda x: remove_URL(x))
data["text"] = data["text"].apply(lambda x: remove_html(x))
data["text"] = data["text"].apply(lambda x: remove_non_ascii(x))
data["text"] = data["text"].apply(lambda x: remove_punct(x))
data["text"] = data["text"].apply(lambda x: other_clean(x))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stopword Removal and Lemmatization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stopwords&lt;/strong&gt; like “the,” “is,” “at” often don’t contribute much to classification. &lt;strong&gt;Lemmatization&lt;/strong&gt; ensures words are reduced to their base form.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nltk.download('stopwords')
nltk.download('wordnet')
stop = stopwords.words('english')

data['removed_stopwords'] = data['text'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in (stop)])
)

def lemmatized_text(corpus):
    lemmatizer = WordNetLemmatizer()
    return [' '.join([lemmatizer.lemmatize(word) for word in review.split()]) for review in corpus]

data['lemmatized_texts'] = lemmatized_text(data['removed_stopwords'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, each headline is clean, de-noised, and lemmatized, forming our final text data.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Using GloVe Embeddings
&lt;/h2&gt;

&lt;p&gt;We’ll use &lt;strong&gt;GloVe&lt;/strong&gt; (Global Vectors for Word Representation) as our word embeddings, specifically the &lt;code&gt;glove.6B.100d.txt&lt;/code&gt; file. You must have this file in your &lt;code&gt;dataset/&lt;/code&gt; directory for the code to work.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os

file_path = "dataset/glove.6B.100d.txt"
print("File exists:", os.path.isfile(file_path))

def load_glove_embeddings(file_path):
    embeddings_index = {}
    with open(file_path, encoding="utf8") as file:
        for line in file:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = vector
    return embeddings_index

if os.path.isfile(file_path):
    embeddings_index = load_glove_embeddings(file_path)
else:
    raise FileNotFoundError("GloVe file not found. Please check the file path.")

print("Number of words in GloVe embeddings:", len(embeddings_index))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;load_glove_embeddings&lt;/strong&gt;: Reads the GloVe file line by line, splitting each line into a &lt;code&gt;word&lt;/code&gt; and its corresponding &lt;code&gt;vector&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The dictionary &lt;code&gt;embeddings_index&lt;/code&gt; holds a mapping from words -&amp;gt; 100-dimensional float vectors.&lt;/li&gt;
&lt;/ul&gt;
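&lt;p&gt;To see the file format the loader expects, here is a toy stand-in with 3-dimensional vectors in place of the real 100-dimensional ones:&lt;/p&gt;

```python
import io
import numpy as np

# Toy stand-in for glove.6B.100d.txt: one word per line, followed by its vector
fake_glove = "good 0.1 0.2 0.3\nbad 0.4 0.5 0.6\n"

embeddings_index = {}
for line in io.StringIO(fake_glove):
    values = line.split()
    embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

print(len(embeddings_index))           # 2
print(embeddings_index["good"].shape)  # (3,)
```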

&lt;h3&gt;
  
  
  6.1 Creating the Embedding Matrix
&lt;/h3&gt;

&lt;p&gt;We’ll &lt;strong&gt;tokenize&lt;/strong&gt; our texts and build an embedding matrix with shape (vocab_size, 100):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(data['lemmatized_texts'])
word_index = tokenizer.word_index

embedding_dim = 100
vocab_size = len(word_index) + 1
embedding_matrix = np.zeros((vocab_size, embedding_dim))

for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

X_seq = tokenizer.texts_to_sequences(data['lemmatized_texts'])
X_pad = pad_sequences(X_seq, maxlen=100)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;tokenizer&lt;/code&gt;: Converts words to integer indices.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;num_words=5000&lt;/code&gt;: We limit ourselves to the top 5,000 words in the dataset.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;embedding_matrix&lt;/code&gt;: Each row corresponds to a word in our vocabulary; columns are the vector values (100).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;texts_to_sequences&lt;/code&gt;: Replaces words in each text with their integer representation.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pad_sequences&lt;/code&gt;: Ensures all sequences are of equal length (here, maxlen=100).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can also create a &lt;strong&gt;feature matrix&lt;/strong&gt; by averaging the embeddings of each token in a sequence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def create_feature_matrix(sequences, embedding_matrix):
    features = np.zeros((sequences.shape[0], embedding_matrix.shape[1]))
    for i, seq in enumerate(sequences):
        features[i] = np.mean(embedding_matrix[seq], axis=0)
    return features

X_features = create_feature_matrix(X_pad, embedding_matrix)
print("Shape of feature matrix:", X_features.shape)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;For each sequence, we average the GloVe embeddings of all tokens to get a single vector representation.&lt;/li&gt;
&lt;li&gt;This yields &lt;code&gt;(num_samples, embedding_dim)&lt;/code&gt; shape in &lt;code&gt;X_features&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This can be used for certain ML models, though for an LSTM, we typically feed the &lt;strong&gt;sequences&lt;/strong&gt; themselves.&lt;/p&gt;
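&lt;p&gt;As a quick baseline sketch, the averaged vectors can feed a classical classifier such as logistic regression. Synthetic arrays stand in for &lt;code&gt;X_features&lt;/code&gt; and the labels here:&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 200 samples of 100-dim averaged embeddings, binary labels
rng = np.random.default_rng(0)
X_features = rng.normal(size=(200, 100))
y = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_features, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("baseline accuracy:", clf.score(X_te, y_te))
```

With random labels this baseline hovers around chance; on the real averaged embeddings it gives a useful reference point before reaching for the LSTM.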

&lt;h2&gt;
  
  
  7. Building the LSTM Model
&lt;/h2&gt;

&lt;p&gt;We now define and train our LSTM model. LSTM is a special Recurrent Neural Network (RNN) well-suited for long sequences and capturing context across words.&lt;/p&gt;

&lt;h3&gt;
  
  
  7.1 Preparing Data for the LSTM Model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;texts = data['text'].tolist()
labels = data['is_sarcastic'].tolist()

MAX_SEQUENCE_LENGTH = 30
EMBEDDING_DIM = 100

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(texts)
X_data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
y_data = np.array(labels)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We fit a fresh tokenizer specifically for the model, ensuring we have the correct sequence length and vocabulary for training.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;MAX_SEQUENCE_LENGTH = 30&lt;/code&gt;: We cap each headline at 30 tokens; shorter sequences are padded and longer ones truncated.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;y_data&lt;/code&gt;: The binary labels (0 for not sarcastic, 1 for sarcastic).&lt;/li&gt;
&lt;/ul&gt;
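&lt;p&gt;The article relies on &lt;code&gt;validation_split&lt;/code&gt; during fitting; if you also want a held-out test set, a stratified split (sketch, with stand-in arrays) can come first:&lt;/p&gt;

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-ins for the padded sequences and labels built above
X_data = np.zeros((100, 30), dtype="int32")
y_data = np.array([0, 1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X_data, y_data, test_size=0.1, stratify=y_data, random_state=42
)
print(X_train.shape, X_test.shape)  # (90, 30) (10, 30)
```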

&lt;p&gt;&lt;strong&gt;Loading GloVe Again (For the Model)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def load_glove_embeddings(file_path, embedding_dim):
    embeddings_index = {}
    with open(file_path, 'r', encoding='utf8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    return embeddings_index

def create_embedding_matrix(word_index, embeddings_index, embedding_dim):
    embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

embeddings_index = load_glove_embeddings(file_path, EMBEDDING_DIM)
embedding_matrix = create_embedding_matrix(word_index, embeddings_index, EMBEDDING_DIM)
vocab_size = len(word_index) + 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We repeat the creation of &lt;code&gt;embeddings_index&lt;/code&gt; and &lt;code&gt;embedding_matrix&lt;/code&gt; for the new tokenizer’s vocabulary.&lt;/p&gt;

&lt;h3&gt;
  
  
  7.2 Defining the LSTM Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def LSTM_RNN(vocab_size, embed_dim, embed_matrix, max_seq_len):
    embedding_layer = Embedding(vocab_size, embed_dim, weights=[embed_matrix], 
                                input_length=max_seq_len, trainable=False)

    sequence_input = Input(shape=(max_seq_len,), dtype='int32')
    embedding_sequences = embedding_layer(sequence_input)
    x = Dropout(0.2)(embedding_sequences)
    x = Conv1D(64, 5, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.01))(x)
    x = Bidirectional(LSTM(64, dropout=0.3, recurrent_dropout=0.2))(x)
    x = Dense(256, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.01))(x)
    x = Dropout(0.5)(x)
    x = Dense(128, activation='relu')(x)
    outputs = Dense(1, activation='sigmoid')(x)
    model = tf.keras.Model(sequence_input, outputs)

    model.compile(optimizer=Adam(learning_rate=1e-4), loss='binary_crossentropy', metrics=['accuracy'])
    model.summary()
    return model

lstm_model = LSTM_RNN(vocab_size, EMBEDDING_DIM, embedding_matrix, MAX_SEQUENCE_LENGTH)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key Layers include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Embedding(...)&lt;/code&gt;: Initializes an embedding layer with our GloVe vectors.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;trainable=False&lt;/code&gt;: This means we do not update these embeddings during training.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Dropout(...)&lt;/code&gt;: Randomly sets input units to 0 to reduce overfitting.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Conv1D(...)&lt;/code&gt;: A 1D convolution to capture local patterns in the text.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Bidirectional(LSTM(64, ...))&lt;/code&gt;: Our main LSTM layer with 64 units, reading the text forward and backward.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Dense(256, activation='relu')&lt;/code&gt;: A fully connected layer for learning higher-level features.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Dense(1, activation='sigmoid')&lt;/code&gt;: Outputs a probability of sarcasm (between 0 and 1).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;model.compile(...)&lt;/code&gt;: Uses the &lt;strong&gt;Adam&lt;/strong&gt; optimizer with a learning rate of &lt;code&gt;1e-4&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  7.3 Training the Model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;batch_size = 100
epochs = 10

early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=3, min_lr=1e-5)
model_checkpoint = ModelCheckpoint('best_model.keras', save_best_only=True, monitor='val_loss')

history = lstm_model.fit(
    X_data, y_data,
    epochs=epochs,
    batch_size=batch_size,
    validation_split=0.2,
    verbose=1,
    callbacks=[early_stopping, reduce_lr, model_checkpoint]
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;batch_size&lt;/code&gt;: How many samples we process in one go (set to 100).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;epochs&lt;/code&gt;: The number of times we iterate over the entire dataset (set to 10).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;EarlyStopping&lt;/code&gt;: Stops training if validation loss doesn’t improve for 5 epochs.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ReduceLROnPlateau&lt;/code&gt;: Lowers the learning rate by a factor of 0.2 if no improvement in 3 epochs.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ModelCheckpoint&lt;/code&gt;: Saves the best model parameters to &lt;code&gt;best_model.keras&lt;/code&gt; based on validation loss.&lt;/li&gt;
&lt;/ul&gt;
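&lt;p&gt;After fitting, &lt;code&gt;history.history&lt;/code&gt; holds per-epoch metrics that are worth plotting to spot overfitting. A sketch with toy values standing in for a real training run:&lt;/p&gt;

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Stand-in for history.history returned by model.fit(); values are illustrative
history_dict = {
    "loss":     [0.69, 0.55, 0.48, 0.44],
    "val_loss": [0.68, 0.58, 0.52, 0.50],
}

plt.plot(history_dict["loss"], label="train loss")
plt.plot(history_dict["val_loss"], label="val loss")
plt.xlabel("epoch")
plt.ylabel("binary cross-entropy")
plt.legend()
plt.savefig("training_curve.png")
```

If the validation curve flattens while the training curve keeps falling, the callbacks above (early stopping, learning-rate reduction) are doing their job.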

&lt;h3&gt;
  
  
  7.4 Saving the Model and Tokenizer
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;joblib.dump(tokenizer, 'tokenizer.joblib')
lstm_model.save('best_model.h5')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;We save the &lt;code&gt;tokenizer&lt;/code&gt; with &lt;code&gt;joblib&lt;/code&gt; so the app can reproduce the exact word-to-index mapping used during training.&lt;/li&gt;
&lt;li&gt;We also save the final model weights to &lt;code&gt;best_model.h5&lt;/code&gt;. Note that &lt;code&gt;ModelCheckpoint&lt;/code&gt; already saved the best-performing weights to &lt;code&gt;best_model.keras&lt;/code&gt;, which is the file the Streamlit app loads.&lt;/li&gt;
&lt;/ul&gt;
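&lt;p&gt;A quick round-trip check confirms the tokenizer survives serialization (a sketch; the same pattern applies to the saved model file):&lt;/p&gt;

```python
import joblib
from tensorflow.keras.preprocessing.text import Tokenizer

# Fit a small tokenizer, persist it, and reload it
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(["a sarcastic headline", "a sincere headline"])

joblib.dump(tokenizer, "tokenizer.joblib")
restored = joblib.load("tokenizer.joblib")

# The restored tokenizer reproduces the same integer sequences
print(tokenizer.texts_to_sequences(["a headline"]))
print(restored.texts_to_sequences(["a headline"]))
```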

&lt;h2&gt;
  
  
  8. Deploying with Streamlit
&lt;/h2&gt;

&lt;p&gt;Now that we have a trained model, we can serve it via &lt;strong&gt;Streamlit&lt;/strong&gt; for a simple web interface.&lt;/p&gt;

&lt;p&gt;Below is the &lt;code&gt;app.py&lt;/code&gt; (or any Python file you’ll run with &lt;code&gt;streamlit run app.py&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import streamlit as st
import joblib
import json 
import tensorflow as tf
from tensorflow.keras.models import load_model, save_model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np

# Load the trained model and tokenizer
@st.cache_resource
def load_resources():
    try:
        model = tf.keras.models.load_model('best_model.keras')
        tokenizer = joblib.load('tokenizer.joblib')
        return model, tokenizer
    except Exception as e:
        print(f"Error loading model: {e}")
        raise

model, tokenizer = load_resources()

# Define constants
MAX_SEQUENCE_LENGTH = 30

# Title
st.title("Sarcasm Detection Model")

# Input form
input_text = st.text_input("Enter a sentence to check for sarcasm:", "")

if st.button("Predict"):
    if input_text.strip():
        # Preprocess the input text
        sequences = tokenizer.texts_to_sequences([input_text])
        padded_seq = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

        # Predict using the model
        prediction = model.predict(padded_seq)[0][0]

        # Output prediction
        st.subheader("Prediction:")
        if prediction &amp;gt; 0.5:
            st.write(f"**Sarcastic** with a probability of {prediction:.2f}")
        else:
            st.write(f"**Not Sarcastic** with a probability of {1 - prediction:.2f}")
    else:
        st.warning("Please enter a valid sentence.")

st.markdown("This app uses an LSTM model trained on a sarcasm dataset.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;@st.cache_resource&lt;/code&gt;: Caches the model and tokenizer so they’re loaded only once.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;load_model('best_model.keras')&lt;/code&gt;: Loads the best model saved during training.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User Input&lt;/strong&gt;: &lt;code&gt;st.text_input(...)&lt;/code&gt; collects user text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preprocessing&lt;/strong&gt;: Convert the text input into sequences, pad them to the same length as the training data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prediction&lt;/strong&gt;: Model outputs a probability, and we decide “Sarcastic” if &amp;gt; 0.5.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streamlit UI&lt;/strong&gt;: We display the result using &lt;code&gt;st.write(...)&lt;/code&gt; and a neat probability format.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  9. Putting It All Together
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Clean and preprocess&lt;/strong&gt; your dataset in a Python script or notebook.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Train&lt;/strong&gt; the LSTM model to detect sarcasm, making sure you save &lt;code&gt;best_model.keras&lt;/code&gt; (or &lt;code&gt;best_model.h5&lt;/code&gt;) and &lt;code&gt;tokenizer.joblib&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Create a &lt;strong&gt;Streamlit&lt;/strong&gt; file (&lt;code&gt;app.py&lt;/code&gt;) for deployment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run&lt;/strong&gt; the following command in your terminal:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;streamlit run app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Open the local &lt;strong&gt;URL&lt;/strong&gt; displayed (usually &lt;code&gt;http://localhost:8501&lt;/code&gt;). You’ll see a text box where you can enter sentences, and the app will tell you whether each one is sarcastic!&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Here is the link to the complete code on&lt;/strong&gt; &lt;a href="https://github.com/kennyOlakunle/sarcasm_detection" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  10. Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Congratulations!&lt;/strong&gt; You’ve built a &lt;strong&gt;Sarcasm Detection&lt;/strong&gt; application from scratch.&lt;/p&gt;

&lt;p&gt;In this blog post, we walked through &lt;strong&gt;end-to-end sarcasm detection&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data ingestion and detailed &lt;strong&gt;cleaning&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding&lt;/strong&gt; the text using GloVe vectors.&lt;/li&gt;
&lt;li&gt;Building and &lt;strong&gt;training&lt;/strong&gt; an LSTM network with additional convolutional layers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploying&lt;/strong&gt; to an interactive interface via &lt;strong&gt;Streamlit&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  11. Next Steps
&lt;/h2&gt;

&lt;p&gt;This process demonstrates &lt;strong&gt;end-to-end NLP&lt;/strong&gt; for a challenging classification task. You can extend this workflow by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Experiment with more advanced embeddings&lt;/strong&gt; like &lt;strong&gt;BERT&lt;/strong&gt; or &lt;strong&gt;ELMo&lt;/strong&gt; to capture deeper contextual information.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hyperparameter tuning&lt;/strong&gt;: Adjust LSTM units, batch size, or layer configurations for better accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data augmentation&lt;/strong&gt;: If the dataset is small, consider collecting more sarcastic/non-sarcastic samples or using advanced augmentation strategies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expand the domain&lt;/strong&gt;: Apply this approach to other text sources, such as tweets or product reviews.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add interpretability&lt;/strong&gt;: Use libraries like SHAP or LIME to see which words most influence your model’s decisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Address class imbalance&lt;/strong&gt;: Add more data or apply class weighting if one class is much more frequent than the other.&lt;/li&gt;
&lt;/ol&gt;
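&lt;p&gt;One of the next steps above, handling class imbalance, can be addressed by passing class weights to Keras. A minimal sketch of computing balanced weights (the label list below is illustrative):&lt;/p&gt;

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency:
    w_c = n_samples / (n_classes * count_c), scikit-learn's 'balanced' rule."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Illustrative labels: 6 non-sarcastic (0) vs. 2 sarcastic (1) headlines
weights = balanced_class_weights([0, 0, 0, 0, 0, 0, 1, 1])
print(weights)  # the rarer class 1 gets the larger weight (2.0)
```

&lt;p&gt;The resulting dictionary can be passed straight to &lt;code&gt;model.fit(..., class_weight=weights)&lt;/code&gt; in Keras.&lt;/p&gt;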

&lt;p&gt;I hope this tutorial helps you confidently build your own &lt;strong&gt;NLP&lt;/strong&gt; applications for sarcasm detection or any other text classification problem!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thank you for reading&lt;/strong&gt;. Happy coding, and may your model’s sense of sarcasm improve daily! Feel free to comment or reach out if you have any questions or suggestions.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection" rel="noopener noreferrer"&gt;Sarcasm Headlines Dataset (original Kaggle dataset)&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://nlp.stanford.edu/projects/glove/" rel="noopener noreferrer"&gt;GloVe: Global Vectors for Word Representation:&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://numpy.org/" rel="noopener noreferrer"&gt;Numpy&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://pandas.pydata.org/" rel="noopener noreferrer"&gt;Pandas&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.nltk.org/" rel="noopener noreferrer"&gt;NLTK (Natural Language Toolkit)&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.tensorflow.org/" rel="noopener noreferrer"&gt;TensorFlow&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://keras.io/" rel="noopener noreferrer"&gt;Keras API reference&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://streamlit.io/" rel="noopener noreferrer"&gt;Streamlit&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://scikit-learn.org/stable/" rel="noopener noreferrer"&gt;Scikit-learn&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://joblib.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;Joblib&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>nlp</category>
      <category>lstm</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Downloading and Converting YouTube Videos to MP3 using yt-dlp in Python</title>
      <dc:creator>Kehinde Abe</dc:creator>
      <pubDate>Sun, 06 Oct 2024 22:14:48 +0000</pubDate>
      <link>https://dev.to/_ken0x/downloading-and-converting-youtube-videos-to-mp3-using-yt-dlp-in-python-20c5</link>
      <guid>https://dev.to/_ken0x/downloading-and-converting-youtube-videos-to-mp3-using-yt-dlp-in-python-20c5</guid>
      <description>&lt;p&gt;The popularity of video content on websites like YouTube has increased user demand for audio extraction, particularly for podcasts, lectures, and music files. While plenty of web-based options are available for downloading audio, they frequently have restrictions on access, intrusive advertisements, or a decrease. This post explains how to download and convert YouTube videos into MP3 files using Python and yt-dlp, guaranteeing a high-quality result.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview of &lt;code&gt;yt-dlp&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;yt-dlp&lt;/code&gt; is a powerful, community-driven fork of &lt;code&gt;youtube-dl&lt;/code&gt;. It includes numerous enhancements and optimizations and supports downloading from a wide range of websites, with a primary focus on YouTube. The tool provides extensive options for video/audio quality, output formats, and file handling. In this article, we'll focus on using &lt;code&gt;yt-dlp&lt;/code&gt; to extract the best possible audio from YouTube videos and save it in MP3 format.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before diving into the code, you’ll need to ensure that your environment is set up properly.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Python&lt;/strong&gt;: Make sure you have Python 3 installed on your machine. You can download it from &lt;a href="https://www.python.org/" rel="noopener noreferrer"&gt;Python.org&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;yt-dlp&lt;/strong&gt;: Install &lt;code&gt;yt-dlp&lt;/code&gt; using &lt;code&gt;pip&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code&gt;pip install yt-dlp&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;FFmpeg&lt;/strong&gt;: For post-processing (i.e., converting audio formats), &lt;code&gt;FFmpeg&lt;/code&gt; must be installed. It is the core tool used for converting video and audio formats. You can install it via your package manager:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;On macOS (using Homebrew):
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;brew install ffmpeg

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Windows: Download the executable from &lt;a href="https://ffmpeg.org/download.html" rel="noopener noreferrer"&gt;FFmpeg's official site&lt;/a&gt; and follow the installation instructions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Python Script to Download YouTube Videos and Convert to MP3
&lt;/h2&gt;

&lt;p&gt;Here is the Python script that performs the download and conversion of YouTube videos into MP3 format using &lt;code&gt;yt-dlp&lt;/code&gt; and &lt;code&gt;FFmpeg&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import yt_dlp
playlist_url = 'https://www.youtube.com/playlist?list=YOUR_PLAYLIST_ID'  
save_path = 'downloads/' 
def download_best_audio_as_mp3(video_url, save_path=save_path):
    ydl_opts = {
        'outtmpl': save_path + '/%(title)s.%(ext)s',  # Save path and file name
        'postprocessors': [{  # Post-process to convert to MP3
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',  # Convert to mp3
            'preferredquality': '0',  # '0' means best quality, auto-determined by source
        }],
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([video_url])
download_best_audio_as_mp3(video_url, save_path)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Components of the Script:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;yt_dlp.YoutubeDL&lt;/code&gt; &lt;strong&gt;Object&lt;/strong&gt;: This is the main object provided by &lt;code&gt;yt-dlp&lt;/code&gt; that allows us to configure how the download is performed. The &lt;code&gt;ydl_opts&lt;/code&gt; dictionary contains various options:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;'outtmpl'&lt;/code&gt;: Specifies the output template, including the directory &lt;code&gt;(save_path)&lt;/code&gt; and filename &lt;code&gt;(%(title)s)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;'postprocessors'&lt;/code&gt;: This option enables post-processing of the downloaded content, in this case, extracting audio and converting it to MP3 format.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;'preferredcodec'&lt;/code&gt;: Specifies the codec for the audio conversion, which is set to MP3.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;'preferredquality'&lt;/code&gt;: A value of &lt;code&gt;'0'&lt;/code&gt; ensures that the best possible quality is selected automatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Downloading and Conversion&lt;/strong&gt;: The &lt;code&gt;download&lt;/code&gt; method accepts a list of URLs, allowing multiple videos to be processed. In our example, we pass a single URL; a playlist URL works as well, since &lt;code&gt;yt-dlp&lt;/code&gt; expands it automatically.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MP3 Conversion&lt;/strong&gt;: After the &lt;code&gt;download&lt;/code&gt;, the postprocessor &lt;code&gt;(FFmpegExtractAudio)&lt;/code&gt; converts the audio to MP3 format using &lt;code&gt;FFmpeg&lt;/code&gt;. The &lt;code&gt;preferredquality&lt;/code&gt; option ensures the highest available audio quality is used.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Customizing the Script
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Adjusting the Save Path&lt;/strong&gt;: You can modify the &lt;code&gt;save_path&lt;/code&gt; variable to store the downloaded MP3 files in a different location:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;save_path = '/path/to/your/folder/'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Downloading a Playlist&lt;/strong&gt;: To download a full playlist, simply provide the playlist URL instead of a single video URL. &lt;code&gt;yt-dlp&lt;/code&gt; will automatically handle the playlist and download each video sequentially:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;video_url = 'https://www.youtube.com/playlist?list=PLxyz...'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Choosing a Different Audio Format&lt;/strong&gt;:
If you want a different format (e.g., AAC or WAV), you can change the &lt;code&gt;'preferredcodec'&lt;/code&gt; value:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
'preferredcodec': 'aac',  # Convert to AAC format

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
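&lt;p&gt;If you switch codecs often, the options dictionary can be produced by a small helper function (a hypothetical convenience wrapper for this tutorial, not part of &lt;code&gt;yt-dlp&lt;/code&gt; itself):&lt;/p&gt;

```python
def make_audio_opts(save_path, codec='mp3', quality='0'):
    """Build a yt-dlp options dict that downloads the best audio stream
    and post-processes it into the requested codec via FFmpeg."""
    return {
        'format': 'bestaudio/best',
        'outtmpl': save_path + '/%(title)s.%(ext)s',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': codec,
            'preferredquality': quality,
        }],
    }

# Reuse the same helper for MP3, AAC, or WAV downloads:
opts = make_audio_opts('downloads', codec='aac')
print(opts['postprocessors'][0]['preferredcodec'])  # aac
```

&lt;p&gt;The returned dictionary can be passed directly to &lt;code&gt;yt_dlp.YoutubeDL(opts)&lt;/code&gt;.&lt;/p&gt;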



&lt;p&gt;The full code can be found through this &lt;a href="https://github.com/kennyOlakunle/youtubedownload" rel="noopener noreferrer"&gt;GitHub page&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By combining &lt;code&gt;yt-dlp&lt;/code&gt; and &lt;code&gt;FFmpeg&lt;/code&gt;, you can easily automate the process of downloading YouTube videos and converting them into high-quality MP3 files with minimal effort. This approach offers flexibility and power over web-based solutions, making it ideal for developers and power users who want to streamline their media processing workflows.&lt;br&gt;
This script can be expanded to include more advanced functionalities like batch processing, downloading subtitles, or selecting specific audio streams. With its modular design and strong community support, &lt;code&gt;yt-dlp&lt;/code&gt; continues to be one of the most reliable tools for managing multimedia content from the web.&lt;/p&gt;

</description>
      <category>python</category>
      <category>automation</category>
      <category>webdev</category>
      <category>coding</category>
    </item>
    <item>
      <title>TinyLlama LLM: A Step-by-Step Guide to Implementing the 1.1B Model on Google Colab</title>
      <dc:creator>Kehinde Abe</dc:creator>
      <pubDate>Sat, 06 Jan 2024 18:13:42 +0000</pubDate>
      <link>https://dev.to/_ken0x/tinyllama-llm-a-step-by-step-guide-to-implementing-the-11b-model-on-google-colab-1pjh</link>
      <guid>https://dev.to/_ken0x/tinyllama-llm-a-step-by-step-guide-to-implementing-the-11b-model-on-google-colab-1pjh</guid>
      <description>&lt;p&gt;LLM, or Large Language Model, is an advanced artificial intelligence program designed for understanding, generating, and working with human language. It's trained on a vast array of text data, allowing it to assist in tasks like answering questions, writing essays, translating languages, and even creative writing. They can be used in various applications, from chatbots to research tools, due to their ability to understand context and generate coherent and contextually relevant responses.&lt;/p&gt;

&lt;p&gt;In this tutorial, you’ll learn the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Introduction and Overview of TinyLlama&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Understanding the System requirements&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Step-by-Step implementation&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;TinyLlama emerges as a standout choice in the rapidly evolving landscape of language model technology. This guide is crafted for data scientists, AI enthusiasts, and curious learners, aiming to demystify the deployment of the 1.1 billion parameter TinyLlama model. With its unique blend of power and compactness, TinyLlama reshapes expectations in machine learning, offering versatility for both local and cloud-based environments like Google Colab. &lt;br&gt;
After reading this article, you'll gain insights into setting up and leveraging TinyLlama on Google Colab for your projects or explorations. We'll provide a detailed roadmap for maximizing TinyLlama's capabilities across different use cases. All resources and tools referenced are linked at the conclusion for your convenience.&lt;/p&gt;
&lt;h3&gt;
  
  
  Overview of the TinyLlama and its significance
&lt;/h3&gt;

&lt;p&gt;TinyLlama is more than just an AI model; it's a beacon of innovation in generative AI. Trained on a staggering 3 trillion tokens, it showcases a seamless integration with numerous projects based on the Llama framework. TinyLlama's compact yet robust architecture, featuring only 1.1 billion parameters, makes it an ideal solution for applications with limited computational resources. &lt;br&gt;
Notably, it shares the same architecture and tokenizer as Llama 2, ensuring high-quality and consistent performance. One notable use case of TinyLlama is in content generation, where its efficiency and accuracy have been greatly valued.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The TinyLlama project aims to pre-train a 1.1B Llama model on 3 trillion tokens. We can achieve this with proper optimization within "just" 90 days using 16 A100-40G GPUs 🚀🚀 - TinyLlama Team.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Understanding the Technical Requirements
&lt;/h3&gt;

&lt;p&gt;Before diving into Large Language Models (LLMs) with TinyLlama, let's cover the basic tools and environment setup. I'll use macOS for this tutorial (I'm on a MacBook Pro M1), but the installations are similar on other platforms. &lt;br&gt;
Let me walk you through some key installations and their purposes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Firstly, Python&lt;/strong&gt;. This versatile programming language is the backbone of many applications, including data analysis and machine learning. The easiest way to install Python on macOS is through Homebrew. Simply open Terminal and type &lt;code&gt;brew install python&lt;/code&gt;. If Homebrew isn't your thing, you can directly download Python from &lt;a href="https://www.python.org/downloads/" rel="noopener noreferrer"&gt;python.org&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next up, Jupyter Notebook&lt;/strong&gt;. This is a fantastic tool for anyone involved in coding, data science, or just wanting to experiment with Python. It lets you create documents with live code and visualizations. I installed Jupyter on my Mac using Python's package manager by running a &lt;code&gt;pip install notebook&lt;/code&gt; in the Terminal. Launching it is as simple as typing &lt;code&gt;jupyter notebook&lt;/code&gt;. Another way to install and use python is through &lt;a href="https://docs.anaconda.com/free/anaconda/getting-started/index.html" rel="noopener noreferrer"&gt;Anaconda&lt;/a&gt;. Click on the link and follow the steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google Colab is another gem&lt;/strong&gt;. It's a cloud service offering a Jupyter Notebook environment without any setup. It's particularly handy for sharing projects and accessing GPUs for free. Access it by visiting &lt;a href="//colab.research.google.com"&gt;Google Colab&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Some other technical requirements:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;System Memory:&lt;/strong&gt; 550MB minimum.&lt;br&gt;
&lt;strong&gt;RAM:&lt;/strong&gt; Up to 3.8GB for optimal performance.&lt;br&gt;
&lt;strong&gt;Platform:&lt;/strong&gt; Compatible with Google Colab and local setups using VScode with Jupyter Notebook or a Python file.&lt;br&gt;
&lt;strong&gt;Google Account:&lt;/strong&gt; Required for Google Colab access.&lt;br&gt;
&lt;strong&gt;Google Colab Versions:&lt;/strong&gt; Free version for development (CPU &amp;amp; GPU) and a Pro version for intensive computation.&lt;/p&gt;

&lt;p&gt;With these tools and tips, setting up a robust and flexible environment for working with LLM on a macOS system becomes a smooth and efficient process.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step-by-Step Implementation Guide
&lt;/h3&gt;

&lt;p&gt;Using Colab, I will start with the CPU. To switch between runtime environments, go to the right-hand side of the interface and click &lt;strong&gt;Connect&lt;/strong&gt;. You will see the menu shown below; select &lt;strong&gt;Change runtime type&lt;/strong&gt;, and a modal will pop up where you can choose the runtime you need, either CPU or GPU. Either works for this tutorial.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqv1iw71qirfp3qfi8tg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqv1iw71qirfp3qfi8tg.png" alt="Change runtime environment" width="688" height="606"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyiatkg16qveliep428g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyiatkg16qveliep428g.png" alt="Select CPU or T4 GPU environment" width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After this, we can install the tools and libraries we need.&lt;/p&gt;
&lt;h4&gt;
  
  
  Implementation/Method 1
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!pip3 install huggingface-hub

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Note: the &lt;code&gt;CMAKE_ARGS&lt;/code&gt; prefix on the first command matters. It makes &lt;code&gt;pip&lt;/code&gt; build &lt;code&gt;llama.cpp&lt;/code&gt; from source using CMake and your system’s C compiler (required), installing the library alongside the Python package. llama-cpp-python supports several backends, which can be enabled by setting the &lt;code&gt;CMAKE_ARGS&lt;/code&gt; environment variable before installing.&lt;/p&gt;

&lt;p&gt;Let's install the model needed for this task. We are using the recently launched 1.1B parameter V1.0 Chat Completion model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next step is to import the Python Bindings for &lt;code&gt;llama.cpp&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_cpp import Llama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next is configuring the LLM from Llama, including the model path and other required parameters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;llm = Llama(model_path="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
            n_ctx=2048,
            n_threads=8,
            n_gpu_layers=35)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The final step for this method is to call and use the chat completion function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
llm.create_chat_completion(
      messages = [
        {
          "role": "system",
          "content": "You are a story writing assistant"

        },
        {
          "role": "user",
          "content": "Write an extensive story about Life as a young Adult"
        }
      ]
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it! The implementation is as simple as that.&lt;/p&gt;

&lt;p&gt;If you followed this tutorial step by step, running the above should return the response in the image below, including the full content returned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjkrm0pshwgqfsmct0tb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjkrm0pshwgqfsmct0tb.png" alt="Response" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The response will also include other details of the result.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'finish_reason': 'length'}],
'usage': {'prompt_tokens': 36,
'completion_tokens': 476, 'total_tokens': 512}}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
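&lt;p&gt;Since &lt;code&gt;create_chat_completion&lt;/code&gt; returns an OpenAI-style dictionary, the generated text and the token counts can be pulled out with a couple of key lookups. Here is a sketch that runs against a stubbed-down response (the stub mirrors the field names shown in the snippet above):&lt;/p&gt;

```python
def extract_reply(response):
    """Pull the assistant's text and total token usage out of an
    OpenAI-style chat-completion response dict."""
    text = response['choices'][0]['message']['content']
    total = response['usage']['total_tokens']
    return text, total

# Stub mirroring the shape of llm.create_chat_completion(...)'s output
stub = {
    'choices': [{'message': {'role': 'assistant',
                             'content': 'Once upon a time...'},
                 'finish_reason': 'length'}],
    'usage': {'prompt_tokens': 36,
              'completion_tokens': 476,
              'total_tokens': 512},
}
text, total = extract_reply(stub)
print(total)  # 512
```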



&lt;p&gt;The full code is below, and I will add the GitHub link to the project at the end of this article.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6g0pi6ri4a510w66vah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6g0pi6ri4a510w66vah.png" alt="Full code to method 1" width="800" height="944"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Implementation/Method 2
&lt;/h4&gt;

&lt;p&gt;I will add all the code needed here, and then you can run it locally or in Google Colab. It's almost the same process with a slight difference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#There is always an issue of "ImportError: Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate: `pip install accelerate`"
#so use the command below to install accelerare from the package manager.


!pip -qqq install bitsandbytes accelerate


import torch
from transformers import pipeline

pipe = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.bfloat16, device_map="auto")

# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
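&lt;p&gt;For reference, TinyLlama-1.1B-Chat-v1.0 uses a Zephyr-style chat template, so &lt;code&gt;apply_chat_template&lt;/code&gt; produces a prompt string roughly like the one built below. This hand-rolled approximation is for illustration only; in real use, rely on the tokenizer’s own template:&lt;/p&gt;

```python
def zephyr_format(messages, add_generation_prompt=True):
    """Approximate the Zephyr-style chat template used by TinyLlama-Chat:
    each turn becomes '<|role|>' + newline + content + '</s>' + newline,
    optionally followed by an assistant header to prompt generation."""
    parts = [f"<|{m['role']}|>\n{m['content']}</s>\n" for m in messages]
    if add_generation_prompt:
        parts.append("<|assistant|>\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "Tell me a joke."},
]
print(zephyr_format(messages))
```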



&lt;p&gt;If you follow the process above, you should get a result with the entire content that looks like the one below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgk0sbpvy3wewupmd4i1u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgk0sbpvy3wewupmd4i1u.png" alt="Method 2 result" width="800" height="268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kennyOlakunle/TinyLlama_local_implementation" rel="noopener noreferrer"&gt;GitHub link&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In conclusion, TinyLlama reflects the rapid advancements in AI, balancing high performance with resource efficiency. Its 1.1 billion parameters make it suitable for diverse applications, from Google Colab to local setups, ensuring user-friendly implementation for newcomers and veterans alike. The minimal hardware requirements enhance its accessibility, opening doors to innovative AI interactions and applications in various fields. This guide has highlighted TinyLlama's practical utility in chat completions and other AI tasks, supported by a robust community on platforms like GitHub and HuggingFace. As we continue to witness the evolution of AI, TinyLlama exemplifies a practical and powerful tool in this journey, making advanced machine-learning models more accessible to a broader audience.&lt;/p&gt;

&lt;p&gt;Thank you for reading! 🦙🦙🦙🚀🚀🚀&lt;/p&gt;

&lt;p&gt;Keep learning and Happy Coding!&lt;/p&gt;

&lt;p&gt;You can find me on &lt;a href="https://www.linkedin.com/in/kehindeabe/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; and &lt;a href="https://twitter.com/_Ken0x" rel="noopener noreferrer"&gt;Twitter(X)&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/jzhang38/TinyLlama?tab=readme-ov-file" rel="noopener noreferrer"&gt;TinyLlama GitHub project&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0" rel="noopener noreferrer"&gt;TinyLlama-1.1B-Chat-v1.0 on HuggingFace&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T" rel="noopener noreferrer"&gt;TinyLlama Base&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/abetlen/llama-cpp-python" rel="noopener noreferrer"&gt;Python Bindings for llama.cpp&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/spaces/PY007/TinyLlama-Chat" rel="noopener noreferrer"&gt;TinyLlama Playground&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
