<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mohd Uwaish</title>
    <description>The latest articles on DEV Community by Mohd Uwaish (@mohduwaish59).</description>
    <link>https://dev.to/mohduwaish59</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3628003%2F83b58ddf-e57b-4700-88fa-c22120b7f0cc.png</url>
      <title>DEV Community: Mohd Uwaish</title>
      <link>https://dev.to/mohduwaish59</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mohduwaish59"/>
    <language>en</language>
    <item>
      <title>FairSample: Because Class Overlap Is Harder Than Class Imbalance</title>
      <dc:creator>Mohd Uwaish</dc:creator>
      <pubDate>Tue, 10 Feb 2026 21:06:58 +0000</pubDate>
      <link>https://dev.to/mohduwaish59/fairsample-because-class-overlap-is-harder-than-class-imbalance-4pc5</link>
      <guid>https://dev.to/mohduwaish59/fairsample-because-class-overlap-is-harder-than-class-imbalance-4pc5</guid>
      <description>&lt;h2&gt;
  
  
  The Overlooked Problem in Classification
&lt;/h2&gt;

&lt;p&gt;Everyone talks about class imbalance. But there's a more insidious problem lurking in your data: &lt;strong&gt;class overlap&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.sciencedirect.com/science/article/abs/pii/S1566253522001099" rel="noopener noreferrer"&gt;Santos et al.&lt;/a&gt; argue that class overlap is a more significant impediment to classifier performance than imbalance alone. Yet most practitioners don't have tools to diagnose or address it.&lt;/p&gt;

&lt;p&gt;During my research on overlap-handling techniques, I investigated how different methods affect global structural complexity. The findings led me to build FairSample—a package specifically designed for the class overlap problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Class Overlap?
&lt;/h2&gt;

&lt;p&gt;Class overlap occurs when instances from different classes share similar feature values. Your classifier sees:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Instance A: [feature1=5.2, feature2=3.1, feature3=1.4] → Class 0
Instance B: [feature1=5.1, feature2=3.2, feature3=1.3] → Class 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These look almost identical, but belong to different classes. This confuses classifiers and degrades performance—&lt;strong&gt;even when your classes are perfectly balanced&lt;/strong&gt;.&lt;/p&gt;
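
&lt;p&gt;To make this concrete, here is a minimal standard-library sketch, not FairSample's implementation. It generates two perfectly balanced synthetic classes and scores them with the leave-one-out 1-nearest-neighbour error rate, which is the usual definition of the N3 measure. The helper names &lt;code&gt;n3_error&lt;/code&gt; and &lt;code&gt;make_classes&lt;/code&gt; are made up for this illustration.&lt;/p&gt;

```python
import math
import random


def n3_error(points, labels):
    """Leave-one-out 1-NN error rate (the usual definition of N3)."""
    errors = 0
    for i, p in enumerate(points):
        # Index of the nearest neighbour among all *other* points
        j = min((k for k in range(len(points)) if k != i),
                key=lambda k: math.dist(p, points[k]))
        if labels[j] != labels[i]:
            errors += 1
    return errors / len(points)


def make_classes(rng, centre_gap, n=100):
    """Two balanced 2-D Gaussian classes whose centres are centre_gap apart."""
    pts = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]
    pts += [(rng.gauss(centre_gap, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]
    return pts, [0] * n + [1] * n


rng = random.Random(0)
separated = n3_error(*make_classes(rng, centre_gap=10.0))
overlapping = n3_error(*make_classes(rng, centre_gap=1.0))
print(f"N3 separated:   {separated:.3f}")
print(f"N3 overlapping: {overlapping:.3f}")
```

&lt;p&gt;Both datasets have a 100/100 class ratio, so any difference between the two scores comes from overlap alone. The well-separated pair should score close to zero and the overlapping pair noticeably higher.&lt;/p&gt;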

&lt;h2&gt;
  
  
  Why Overlap Is Harder Than Imbalance
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Imbalance:&lt;/strong&gt; You have 100 instances of Class A, 10 of Class B&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; Resample (oversample the minority or undersample the majority) to balance the ratio&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outcome:&lt;/strong&gt; Classifier sees both classes equally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Overlap:&lt;/strong&gt; Instances of different classes occupy the same region of feature space&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; Not straightforward, because any fix changes the structure of the data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outcome:&lt;/strong&gt; Depends on how you handle it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why overlap requires more sophisticated analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  FairSample's Approach
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Quantify Overlap First
&lt;/h3&gt;

&lt;p&gt;Before fixing anything, measure the problem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fairsample.complexity&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ComplexityMeasures&lt;/span&gt;

&lt;span class="n"&gt;cm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ComplexityMeasures&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Get comprehensive overlap analysis
&lt;/span&gt;&lt;span class="n"&gt;all_measures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_all_complexity_measures&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;measures&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Focus on instance overlap
&lt;/span&gt;&lt;span class="n"&gt;instance_overlap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_all_complexity_measures&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;measures&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;N3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;N4&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;kDN&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CM&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Structural complexity
&lt;/span&gt;&lt;span class="n"&gt;structural&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_all_complexity_measures&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;measures&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;T1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;LSC&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DBC&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt; Different overlap patterns require different solutions.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. 14+ Overlap-Handling Techniques
&lt;/h3&gt;

&lt;p&gt;FairSample implements algorithms specifically designed for overlap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;EHSO&lt;/strong&gt; - Evolutionary Hybrid Sampling in Overlap&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RFCL&lt;/strong&gt; - Repetitive Forward Class Learning
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NBUS&lt;/strong&gt; - Neighbourhood-Based Undersampling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;URNS&lt;/strong&gt; - Undersampling by Removing Noisy Samples&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SVDDWSMOTE&lt;/strong&gt; - Support Vector Data Description-based oversampling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OSM&lt;/strong&gt; - Overlap-based Sampling Method&lt;/li&gt;
&lt;li&gt;And more...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All from peer-reviewed research (2014-2024).&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Multi-Dimensional Evaluation
&lt;/h3&gt;

&lt;p&gt;Evaluate how techniques affect your overlap:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fairsample.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;compare_techniques&lt;/span&gt;

&lt;span class="c1"&gt;# Compare multiple overlap-handling techniques
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compare_techniques&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;techniques&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RFCL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;EHSO&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NBUS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;complexity_measures&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;basic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# See impact on overlap metrics
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;technique&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;N3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;T1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;training_time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Before/After Validation
&lt;/h3&gt;

&lt;p&gt;Verify that overlap was actually reduced:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fairsample&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;EHSO&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fairsample.complexity&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;compare_pre_post_overlap&lt;/span&gt;

&lt;span class="c1"&gt;# Apply overlap-handling technique
&lt;/span&gt;&lt;span class="n"&gt;sampler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EHSO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X_resampled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_resampled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sampler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Measure structural changes
&lt;/span&gt;&lt;span class="n"&gt;comparison&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compare_pre_post_overlap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_resampled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_resampled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Overlap Reduction:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;comparison&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;improvements&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Research Insights Applied
&lt;/h2&gt;

&lt;p&gt;My research investigated how overlap-handling techniques affect data structure. Key insights:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Insight 1:&lt;/strong&gt; Reducing overlap doesn't always improve classification&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some techniques reduce overlap but fragment class structure&lt;/li&gt;
&lt;li&gt;Always validate with classification metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Insight 2:&lt;/strong&gt; Different techniques, different structural effects&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some improve instance overlap but worsen structural complexity&lt;/li&gt;
&lt;li&gt;Others balance both&lt;/li&gt;
&lt;li&gt;Trade-offs vary by dataset&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Insight 3:&lt;/strong&gt; Context matters&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No universal solution exists&lt;/li&gt;
&lt;li&gt;Measure your specific overlap profile first&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These insights shaped FairSample's design—diagnostic tools before treatment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Example: Medical Diagnosis
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fairsample&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RFCL&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;classification_report&lt;/span&gt;

&lt;span class="c1"&gt;# Dataset with overlapping symptoms:
# different diseases, similar presentations.
# X_train, X_test, y_train, y_test are assumed to come from an earlier train/test split.
&lt;/span&gt;
&lt;span class="c1"&gt;# Step 1: Quantify overlap
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fairsample.complexity&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ComplexityMeasures&lt;/span&gt;
&lt;span class="n"&gt;cm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ComplexityMeasures&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Instance Overlap (N3): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze_overlap&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;N3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Handle overlap
&lt;/span&gt;&lt;span class="n"&gt;sampler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RFCL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X_resampled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_resampled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sampler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Train classifier
&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_resampled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_resampled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 4: Evaluate
&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;classification_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  40+ Complexity Measures
&lt;/h2&gt;

&lt;p&gt;FairSample provides comprehensive overlap quantification:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature Overlap:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;F1, F1v, F2, F3, F4 - How well individual features discriminate between classes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Instance Overlap:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;N3, N4, kDN, CM, R-value - How much instances overlap in feature space&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Structural Complexity:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;T1, LSC, DBC - How complex the decision boundary needs to be&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Multiresolution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Purity, MRCA, C1, C2 - Multi-scale overlap analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each reveals different aspects of your overlap problem.&lt;/p&gt;
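
&lt;p&gt;As an illustration of what a feature-overlap measure captures, the sketch below computes the classic two-class Fisher discriminant ratio per feature and takes the maximum, which is the quantity F1 is built on. This is a simplified standard-library version with made-up helper names (&lt;code&gt;fisher_ratio&lt;/code&gt;, &lt;code&gt;max_fisher_ratio&lt;/code&gt;). FairSample's own F1 may normalise or invert the value, so treat the numbers as illustrative only.&lt;/p&gt;

```python
import statistics


def fisher_ratio(values0, values1):
    """Two-class Fisher discriminant ratio for a single feature."""
    gap = statistics.fmean(values0) - statistics.fmean(values1)
    spread = statistics.pvariance(values0) + statistics.pvariance(values1)
    return gap ** 2 / spread


def max_fisher_ratio(class0, class1):
    """Largest per-feature ratio; rows are instances, columns are features."""
    return max(fisher_ratio(col0, col1)
               for col0, col1 in zip(zip(*class0), zip(*class1)))


# Toy data: feature 0 separates the classes, feature 1 does not
class0 = [(1.0, 5.0), (2.0, 6.0), (3.0, 7.0)]
class1 = [(11.0, 5.5), (12.0, 6.5), (13.0, 7.5)]

ratio_f0 = fisher_ratio([row[0] for row in class0], [row[0] for row in class1])
ratio_f1 = fisher_ratio([row[1] for row in class0], [row[1] for row in class1])
print(f"feature 0: {ratio_f0:.4f}")
print(f"feature 1: {ratio_f1:.4f}")
print(f"max ratio: {max_fisher_ratio(class0, class1):.4f}")
```

&lt;p&gt;On this toy data the ratio is about 75 for feature 0 and about 0.19 for feature 1. A high ratio means the class means are far apart relative to the within-class spread, so the feature overlaps very little.&lt;/p&gt;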

&lt;h2&gt;
  
  
  Installation &amp;amp; Quick Start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;fairsample
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Complete workflow:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fairsample&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;EHSO&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fairsample.complexity&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ComplexityMeasures&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# Load data with class overlap
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;overlapping_data.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;target&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;target&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Diagnose overlap
&lt;/span&gt;&lt;span class="n"&gt;cm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ComplexityMeasures&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Overlap Analysis:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Instance Overlap (N3): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze_overlap&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;N3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Handle overlap
&lt;/span&gt;&lt;span class="n"&gt;sampler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EHSO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X_resampled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_resampled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sampler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Validate reduction
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fairsample.complexity&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;compare_pre_post_overlap&lt;/span&gt;
&lt;span class="n"&gt;comparison&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compare_pre_post_overlap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_resampled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_resampled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Overlap Reduction:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;comparison&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;improvements&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Use in classification
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;
&lt;span class="n"&gt;clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_resampled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_resampled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  When Overlap Matters Most
&lt;/h2&gt;

&lt;p&gt;Class overlap is particularly problematic in:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;Why Overlap Occurs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Medical Diagnosis&lt;/td&gt;
&lt;td&gt;Overlapping symptoms between diseases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fraud Detection&lt;/td&gt;
&lt;td&gt;Fraudsters mimic legitimate behavior&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Software Defect Prediction&lt;/td&gt;
&lt;td&gt;Similar code metrics for faulty/non-faulty modules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network Intrusion&lt;/td&gt;
&lt;td&gt;Attacks disguised as normal traffic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image Classification&lt;/td&gt;
&lt;td&gt;Visually similar objects in different categories&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In these domains, addressing overlap is crucial for performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overlap vs. Imbalance: A Comparison
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Scenario 1: Only Imbalance (No Overlap)
# Class 0: 1000 instances, features [0-5]
# Class 1: 100 instances, features [10-15]
# Solution: Simple resampling works well ✓
&lt;/span&gt;
&lt;span class="c1"&gt;# Scenario 2: Only Overlap (Balanced)
# Class 0: 500 instances, features [0-10]
# Class 1: 500 instances, features [5-15]
# Solution: Need overlap-handling techniques ⚠️
&lt;/span&gt;
&lt;span class="c1"&gt;# Scenario 3: Both Imbalance + Overlap
# Class 0: 1000 instances, features [0-10]
# Class 1: 100 instances, features [5-15]
# Solution: FairSample's specialized techniques ✓✓
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
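
&lt;p&gt;The three scenarios above can be simulated with a rough one-dimensional sketch. It draws each class uniformly from the stated range and reports the fraction of instances that fall inside the value range covered by both classes. The helpers &lt;code&gt;shared_fraction&lt;/code&gt; and &lt;code&gt;uniform_class&lt;/code&gt; are hypothetical names for this example and are not part of FairSample. Real overlap measures work in the full feature space rather than on a single range.&lt;/p&gt;

```python
import random


def shared_fraction(class0, class1):
    """Fraction of all instances lying in the value range both classes cover."""
    low = max(min(class0), min(class1))
    high = min(max(class0), max(class1))
    combined = class0 + class1
    # An empty shared range (high below low) matches nothing, giving 0.0
    inside = sum(1 for v in combined if high >= v >= low)
    return inside / len(combined)


def uniform_class(rng, n, low, high):
    """n feature values drawn uniformly from [low, high]."""
    return [rng.uniform(low, high) for _ in range(n)]


rng = random.Random(42)
scenarios = {
    "imbalance only": (uniform_class(rng, 1000, 0, 5),
                       uniform_class(rng, 100, 10, 15)),
    "overlap only": (uniform_class(rng, 500, 0, 10),
                     uniform_class(rng, 500, 5, 15)),
    "imbalance + overlap": (uniform_class(rng, 1000, 0, 10),
                            uniform_class(rng, 100, 5, 15)),
}

fractions = {name: shared_fraction(c0, c1)
             for name, (c0, c1) in scenarios.items()}
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.2f} of instances fall in the shared range")
```

&lt;p&gt;In the imbalance-only scenario nothing falls in a shared range, so rebalancing the counts is enough. In the other two scenarios roughly half of the instances sit in contested territory whatever the class ratio, which is the part that resampling alone does not resolve.&lt;/p&gt;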



&lt;h2&gt;
  
  
  The Research Foundation
&lt;/h2&gt;

&lt;p&gt;FairSample implements techniques from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vuttipittayamongkol &amp;amp; Elyan (2020)&lt;/strong&gt; - EHSO, NBUS - &lt;em&gt;Information Sciences&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Das et al. (2014)&lt;/strong&gt; - RFCL - &lt;em&gt;IEEE TKDE&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Santos et al. (2023)&lt;/strong&gt; - Overlap analysis framework - &lt;em&gt;Artificial Intelligence Review&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lorena et al. (2019)&lt;/strong&gt; - Complexity measures - &lt;em&gt;ACM Computing Surveys&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Full citations: &lt;a href="https://github.com/mohdUwaish59/fairsample/blob/main/CITATIONS.md" rel="noopener noreferrer"&gt;CITATIONS.md&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;📖 &lt;strong&gt;Documentation&lt;/strong&gt;: &lt;a href="https://mohduwaish59.github.io/fairsample/" rel="noopener noreferrer"&gt;https://mohduwaish59.github.io/fairsample/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💻 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/mohdUwaish59/fairsample" rel="noopener noreferrer"&gt;https://github.com/mohdUwaish59/fairsample&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📝 &lt;strong&gt;Issues&lt;/strong&gt;: Report bugs or request features&lt;/li&gt;
&lt;li&gt;🔧 &lt;strong&gt;PRs&lt;/strong&gt;: Contribute techniques or improvements&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Class overlap is often more harmful than class imbalance. Yet most tools focus solely on balancing class ratios.&lt;/p&gt;

&lt;p&gt;FairSample provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Diagnostic tools&lt;/strong&gt; - Quantify overlap with 40+ measures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Treatment options&lt;/strong&gt; - 14+ research-backed techniques&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validation methods&lt;/strong&gt; - Verify overlap reduction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All specifically designed for the overlap problem.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Have overlap problems in your data?&lt;/strong&gt; Try FairSample and share your results!&lt;/p&gt;

&lt;p&gt;In which domains have you encountered severe class overlap? Drop a comment below 👇&lt;/p&gt;

&lt;p&gt;#python #machinelearning #datascience #opensource #classoverlap&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>datascience</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I Built a Chrome Extension to Extract YouTube Transcripts in Bulk</title>
      <dc:creator>Mohd Uwaish</dc:creator>
      <pubDate>Mon, 24 Nov 2025 21:19:45 +0000</pubDate>
      <link>https://dev.to/mohduwaish59/i-built-a-chrome-extension-to-extract-youtube-transcripts-in-bulk-and-its-been-a-game-changer-59g9</link>
      <guid>https://dev.to/mohduwaish59/i-built-a-chrome-extension-to-extract-youtube-transcripts-in-bulk-and-its-been-a-game-changer-59g9</guid>
      <description>&lt;p&gt;Hey folks! 👋&lt;/p&gt;

&lt;p&gt;So, I had this problem. I was working on a personal project where I wanted to build an information retrieval system over the transcripts of about 300 YouTube videos. Sounds fun, right? Wrong. Try manually clicking "Show transcript" → Copy → Paste → Save as file... 300 times. Yeah, I made it through about 5 videos before I said "nope, there's gotta be a better way."&lt;/p&gt;

&lt;p&gt;Spoiler alert: there wasn't. At least not one that did exactly what I needed. So I built one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem Was Real
&lt;/h2&gt;

&lt;p&gt;Here's the thing - YouTube has transcripts for most videos (thank you, auto-captions!), but getting them out is... tedious. Sure, you can click and copy one at a time, but when you're dealing with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An entire playlist of educational content&lt;/li&gt;
&lt;li&gt;All videos from a specific channel&lt;/li&gt;
&lt;li&gt;A curated list of videos for research&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;...you're looking at hours of repetitive clicking. And let's be honest, we became developers specifically to avoid repetitive clicking.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;Meet the &lt;strong&gt;YouTube Transcript Extractor&lt;/strong&gt; - a Chrome extension that does three main things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Single video extraction&lt;/strong&gt; - One click, get a JSON file with the transcript&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Playlist/Channel scraping&lt;/strong&gt; - Grab all video IDs from a playlist or channel
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch processing&lt;/strong&gt; - Process dozens (or hundreds) of videos automatically&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The best part? It handles all the annoying stuff automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clicks the "Show transcript" button for you&lt;/li&gt;
&lt;li&gt;Waits for transcripts to load&lt;/li&gt;
&lt;li&gt;Adds smart delays to avoid rate limiting&lt;/li&gt;
&lt;li&gt;Retries failed extractions&lt;/li&gt;
&lt;li&gt;Gives you real-time progress updates&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How It Actually Works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  For a Single Video
&lt;/h3&gt;

&lt;p&gt;It's stupid simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You're on a YouTube video&lt;/li&gt;
&lt;li&gt;Click the extension icon&lt;/li&gt;
&lt;li&gt;Click "Extract Transcript"&lt;/li&gt;
&lt;li&gt;Boom - JSON file downloads&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The output looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"channel_username"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"veritasium"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"video_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dQw4w9WgXcQ"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"transcript"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Full transcript text here..."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Perfect for feeding into your text analysis pipeline, building datasets, or just archiving content you care about.&lt;/p&gt;

&lt;h3&gt;
  
  
  For Entire Playlists
&lt;/h3&gt;

&lt;p&gt;This is where it gets fun. You give it a playlist URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://www.youtube.com/playlist?list=PLxxxxxx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The extension:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto-scrolls through the entire playlist&lt;/li&gt;
&lt;li&gt;Extracts all video IDs&lt;/li&gt;
&lt;li&gt;Saves them to your browser storage&lt;/li&gt;
&lt;li&gt;Lets you download them as a text file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then you can either process them immediately or save them for later. I've found this super useful for tracking new uploads from channels I follow.&lt;/p&gt;
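&lt;p&gt;To make the "extracts all video IDs" step concrete, here is a minimal sketch of how IDs could be pulled out of the playlist's links. The function name &lt;code&gt;extractVideoIds&lt;/code&gt; is hypothetical and this is not the extension's actual code. It assumes each video link carries the ID in its &lt;code&gt;v&lt;/code&gt; query parameter:&lt;/p&gt;

```javascript
// Hypothetical helper (not the extension's actual code): collect unique
// video IDs from a list of link hrefs. Relative hrefs are resolved against
// youtube.com; the ID is read from the "v" query parameter.
function extractVideoIds(hrefs) {
  const ids = new Set();
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, "https://www.youtube.com");
    } catch (err) {
      continue; // skip anything that is not a parseable URL
    }
    const id = url.searchParams.get("v");
    if (id) ids.add(id);
  }
  return [...ids]; // a Set keeps first-seen order, so playlist order is preserved
}
```

&lt;p&gt;In a content script, the hrefs could come from something like &lt;code&gt;document.querySelectorAll('a[href*="watch?v="]')&lt;/code&gt;, though that selector is an assumption about YouTube's markup and may need adjusting.&lt;/p&gt;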

&lt;h3&gt;
  
  
  Batch Processing (The Real MVP)
&lt;/h3&gt;

&lt;p&gt;Here's the workflow that saves hours:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Load your video IDs (from playlist extraction or manual paste)&lt;/li&gt;
&lt;li&gt;Set your batch size (I usually go with 15-20 videos)&lt;/li&gt;
&lt;li&gt;Click "Start Batch Process"&lt;/li&gt;
&lt;li&gt;Go grab coffee ☕&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The extension will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Navigate to each video automatically&lt;/li&gt;
&lt;li&gt;Extract the transcript&lt;/li&gt;
&lt;li&gt;Download it as JSON&lt;/li&gt;
&lt;li&gt;Wait 5-15 seconds (random delay to be nice to YouTube)&lt;/li&gt;
&lt;li&gt;Move to the next one&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You get real-time updates like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Processing video 23/150 (Batch 2/10)
✅ Success: 22 | ⏭️ Skipped: 1 | ❌ Failed: 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
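&lt;p&gt;The batching, the random 5-15 second wait, and the progress line above can be sketched with three small helpers. All names here (&lt;code&gt;chunk&lt;/code&gt;, &lt;code&gt;randomDelayMs&lt;/code&gt;, &lt;code&gt;formatProgress&lt;/code&gt;) are hypothetical and only illustrate the idea. The example assumes a batch size of 15:&lt;/p&gt;

```javascript
// Hypothetical helpers illustrating the batch workflow; not the extension's code.

// Split a list of video IDs into batches of at most `size` items.
function chunk(ids, size) {
  const batches = [];
  for (let i = 0; ids.length > i; i += size) {
    batches.push(ids.slice(i, i + size));
  }
  return batches;
}

// Random integer delay in milliseconds, inclusive of both bounds.
function randomDelayMs(minMs = 5000, maxMs = 15000) {
  return minMs + Math.floor(Math.random() * (maxMs - minMs + 1));
}

// Build the status line for the 1-based video position `index`.
function formatProgress(index, total, batchSize) {
  const batch = Math.ceil(index / batchSize);
  const batchCount = Math.ceil(total / batchSize);
  return `Processing video ${index}/${total} (Batch ${batch}/${batchCount})`;
}

console.log(formatProgress(23, 150, 15));
// prints "Processing video 23/150 (Batch 2/10)"
```

&lt;p&gt;Between videos, the loop would wait &lt;code&gt;randomDelayMs()&lt;/code&gt; milliseconds before navigating to the next one.&lt;/p&gt;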



&lt;h2&gt;
  
  
  The Technical Bits (For Fellow Nerds)
&lt;/h2&gt;

&lt;p&gt;Built with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Manifest V3&lt;/strong&gt; (because V2 is being phased out)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chrome's Side Panel API&lt;/strong&gt; (way better UX than popups)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content Scripts&lt;/strong&gt; for DOM manipulation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chrome Storage API&lt;/strong&gt; for persistence&lt;/li&gt;
&lt;li&gt;Vanilla JavaScript (keeping it simple)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some challenges I ran into:&lt;/p&gt;

&lt;h3&gt;
  
  
  Challenge 1: The Transcript Button
&lt;/h3&gt;

&lt;p&gt;YouTube doesn't always show transcripts immediately. Sometimes you need to click a button first. My solution? The extension automatically finds and clicks it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;transcriptButton&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[aria-label*="transcript"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;transcriptButton&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;transcriptButton&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="c1"&gt;// Wait for transcript to load&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Challenge 2: Rate Limiting
&lt;/h3&gt;

&lt;p&gt;YouTube isn't thrilled when you hit their servers 100 times in 5 minutes. Fair enough. So I added:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Random delays (5-15 seconds between requests)&lt;/li&gt;
&lt;li&gt;Configurable batch sizes&lt;/li&gt;
&lt;li&gt;Automatic retry logic with exponential backoff&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Haven't been rate-limited since. 🎉&lt;/p&gt;
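&lt;p&gt;For readers unfamiliar with exponential backoff, here is a minimal sketch of the retry idea. The names &lt;code&gt;backoffDelay&lt;/code&gt; and &lt;code&gt;withRetry&lt;/code&gt; and the default values (1 second base, 30 second cap, 3 retries) are assumptions for illustration, not the extension's actual settings:&lt;/p&gt;

```javascript
// Hypothetical sketch of retry with exponential backoff; defaults are illustrative.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Delay before retry number `attempt` (0-based): base, 2x base, 4x base, ...
// capped at maxMs so the wait never grows without bound.
function backoffDelay(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Run an async `task`, retrying up to `maxRetries` times after failures.
// Rethrows the last error if every attempt fails.
async function withRetry(task, maxRetries = 3, baseMs = 1000) {
  let lastError;
  for (let attempt = 0; maxRetries >= attempt; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) break; // out of retries, give up
      await sleep(backoffDelay(attempt, baseMs));
    }
  }
  throw lastError;
}
```

&lt;p&gt;A transcript extraction call could then be wrapped as &lt;code&gt;withRetry(() =&gt; extractOne(videoId))&lt;/code&gt;, where &lt;code&gt;extractOne&lt;/code&gt; stands in for whatever function does the extraction.&lt;/p&gt;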

&lt;h3&gt;
  
  
  Challenge 3: Playlist Pagination
&lt;/h3&gt;

&lt;p&gt;Playlists don't load all videos at once - you have to scroll to trigger lazy loading. The extension handles this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;autoScroll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;scrollCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;maxScrolls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Safety limit&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;interval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;setInterval&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scrollBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;scrollCount&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

      &lt;span class="c1"&gt;// Check if we've reached the bottom&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scrollCount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nx"&gt;maxScrolls&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nf"&gt;isAtBottom&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;clearInterval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
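&lt;p&gt;The snippet above calls &lt;code&gt;isAtBottom()&lt;/code&gt; without showing it. One plausible way to write that check is sketched below as a pure function, here named &lt;code&gt;reachedBottom&lt;/code&gt; (a hypothetical name), so it can be tested outside the browser. The 2-pixel tolerance is an assumption to absorb rounding:&lt;/p&gt;

```javascript
// Hypothetical bottom-of-page check; not necessarily how the extension does it.
// True when the bottom edge of the viewport is within `tolerancePx` of the
// end of the scrollable content.
function reachedBottom(scrollTop, viewportHeight, contentHeight, tolerancePx = 2) {
  return scrollTop + viewportHeight >= contentHeight - tolerancePx;
}

// In a browser, isAtBottom() could then be written as:
// const isAtBottom = () => reachedBottom(
//   window.scrollY,
//   window.innerHeight,
//   document.documentElement.scrollHeight
// );
```

&lt;p&gt;Because lazy loading can extend the page right after the bottom is reached, a stricter variant might also confirm that the page height has stopped growing between two checks.&lt;/p&gt;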



&lt;h2&gt;
  
  
  Real-World Use Cases
&lt;/h2&gt;

&lt;p&gt;This can be used for:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Information retrieval
&lt;/h3&gt;

&lt;p&gt;I worked on this use case myself. I collected transcripts from 300+ videos, extracted the information in each transcript into question-answer format, and loaded the results into a vector database behind a chatbot interface. It would have taken days manually; it took 40 minutes with the extension.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Content Monitoring
&lt;/h3&gt;

&lt;p&gt;Track new uploads from favorite tech channels. Run it once a week, compare video IDs, process only new content. Built a simple notification system around it.&lt;/p&gt;
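&lt;p&gt;The "compare video IDs" step is a simple set difference. A minimal sketch, with the hypothetical name &lt;code&gt;findNewVideoIds&lt;/code&gt;, assuming last week's IDs and this week's IDs are both available as arrays:&lt;/p&gt;

```javascript
// Hypothetical helper: return the IDs in `currentIds` that were not present
// in `previousIds`, keeping the order they appear in `currentIds`.
function findNewVideoIds(previousIds, currentIds) {
  const seen = new Set(previousIds);
  return currentIds.filter((id) => !seen.has(id));
}

const lastWeek = ["a1", "b2", "c3"];
const thisWeek = ["d4", "a1", "b2", "c3", "e5"];
console.log(findNewVideoIds(lastWeek, thisWeek));
// prints [ 'd4', 'e5' ]
```

&lt;p&gt;Only the returned IDs then need to go through batch processing.&lt;/p&gt;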

&lt;h3&gt;
  
  
  3. Podcast Transcription Analysis
&lt;/h3&gt;

&lt;p&gt;Many podcasts are on YouTube now. Grabbed transcripts from entire podcast series to analyze conversation patterns and topics.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Language Learning
&lt;/h3&gt;

&lt;p&gt;Downloaded transcripts from language-learning channels in my target language. Now I have a searchable corpus of natural conversation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gotchas
&lt;/h2&gt;

&lt;p&gt;Not everything is perfect (yet):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Some videos don't have transcripts&lt;/strong&gt; - The extension will skip these and note them in the log&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;YouTube's rate limits are real&lt;/strong&gt; - Don't try to process 500 videos in one go&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-generated transcripts aren't perfect&lt;/strong&gt; - Expect some "lol" instead of "LOL" situations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It only works in Chrome&lt;/strong&gt; - Firefox support is on my TODO list&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Want to Try It?
&lt;/h2&gt;

&lt;p&gt;The extension is open source! Here's how to get started:&lt;/p&gt;

&lt;h3&gt;
  
  
  Installation (2 minutes)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone the repo&lt;/span&gt;
git clone https://github.com/yourusername/youtube-transcript-extractor.git

&lt;span class="c"&gt;# Open Chrome&lt;/span&gt;
chrome://extensions/

&lt;span class="c"&gt;# Enable Developer Mode (top right)&lt;/span&gt;
&lt;span class="c"&gt;# Click "Load unpacked"&lt;/span&gt;
&lt;span class="c"&gt;# Select the extension folder&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it! &lt;/p&gt;

&lt;h3&gt;
  
  
  Quick Test
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Go to any YouTube video&lt;/li&gt;
&lt;li&gt;Click the extension icon&lt;/li&gt;
&lt;li&gt;Click "Extract Transcript"&lt;/li&gt;
&lt;li&gt;Check your downloads folder&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You should see a JSON file. If you do, you're ready to rock!&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;I'm actively working on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Firefox support&lt;/strong&gt; - Because not everyone uses Chrome&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Export formats&lt;/strong&gt; - SRT, VTT, plain text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp preservation&lt;/strong&gt; - Keep the timing data from transcripts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better error handling&lt;/strong&gt; - More descriptive error messages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Progress persistence&lt;/strong&gt; - Resume batch processing after browser crash&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Contributing
&lt;/h2&gt;

&lt;p&gt;This project started as a personal tool, but I'd love to make it better with your help! Whether it's:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bug reports&lt;/li&gt;
&lt;li&gt;Feature suggestions&lt;/li&gt;
&lt;li&gt;Code contributions&lt;/li&gt;
&lt;li&gt;Documentation improvements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All are welcome! Check out the &lt;a href="https://github.com/mohdUwaish59/Automated-Youtube-vidoes-transcripts-scrapper" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; and feel free to open issues or PRs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Talk: Why Build This?
&lt;/h2&gt;

&lt;p&gt;I probably could have found &lt;em&gt;something&lt;/em&gt; that did parts of what I needed. Maybe some Python script, maybe some paid service. But here's what I learned building this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Sometimes the best tool is the one you build&lt;/strong&gt; - It does exactly what you need, nothing more&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Side projects teach you stuff&lt;/strong&gt; - I learned a ton about Chrome extension APIs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation is worth it&lt;/strong&gt; - Even if building it takes 10 hours, saving 20 hours is worth it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open source feels good&lt;/strong&gt; - Knowing others might find this useful is cool&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Plus, it's just satisfying watching the extension churn through 300 videos while you do literally anything else.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;If you ever find yourself manually copying YouTube transcripts, give this extension a shot. It's not perfect, but it's saved me countless hours, and I hope it does the same for you.&lt;/p&gt;

&lt;p&gt;Got questions? Drop them in the comments! Found a bug? Please let me know - I promise I don't bite. 😊&lt;/p&gt;

&lt;p&gt;And if you build something cool with the transcripts you extract, I'd love to hear about it!&lt;/p&gt;

&lt;p&gt;Happy automating the boring stuff! 🎬📝&lt;/p&gt;

&lt;p&gt;&lt;em&gt;P.S. - If you found this useful, a star on GitHub would make my day! ⭐&lt;/em&gt;&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>automation</category>
      <category>rag</category>
      <category>frontend</category>
    </item>
  </channel>
</rss>
