<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ganiyu Olalekan</title>
    <description>The latest articles on DEV Community by Ganiyu Olalekan (@ganiyuolalekan).</description>
    <link>https://dev.to/ganiyuolalekan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F829496%2Fa9ddc9be-177d-4d35-8ee9-23dfe00dece6.jpeg</url>
      <title>DEV Community: Ganiyu Olalekan</title>
      <link>https://dev.to/ganiyuolalekan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ganiyuolalekan"/>
    <language>en</language>
    <item>
      <title>Extracting Gold from Conversations: The Hidden Challenges of Transcript Analysis</title>
      <dc:creator>Ganiyu Olalekan</dc:creator>
      <pubDate>Fri, 05 Dec 2025 09:21:16 +0000</pubDate>
      <link>https://dev.to/ganiyuolalekan/extracting-gold-from-conversations-the-hidden-challenges-of-transcript-analysis-44h7</link>
      <guid>https://dev.to/ganiyuolalekan/extracting-gold-from-conversations-the-hidden-challenges-of-transcript-analysis-44h7</guid>
      <description>&lt;p&gt;Did you know that analyzing a transcript conversation isn’t straightforward? Well, neither did I! 🤷🏽‍♂️ When I first started building analysis and evaluation products at Insight7, I quickly realized that working with conversational data presented a plethora of challenges that required more than just technical know-how. So grab your favorite cup of coffee, and let’s dive into the gold mine that is transcript analysis!&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Transcript Analysis Is Harder Than It Looks
&lt;/h2&gt;

&lt;p&gt;Conversational data is rich with insights but is often messy and unstructured. It may seem like a straightforward process—record a conversation, get a transcript, and voilà! But the reality is far more complicated. Here are some of the hidden challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compartmentalization&lt;/strong&gt;: There’s no one-size-fits-all approach to transcripts. Different types require different handling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lack of Numerical Data&lt;/strong&gt;: Conversations are text-heavy, and extracting quantifiable data is no small feat.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disjointed Transcripts&lt;/strong&gt;: Sometimes, you’ll encounter transcripts where the information is scattered, making it difficult to analyze.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Common Misconceptions About Transcript Analysis
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsn2awc6n90kmxjv34m1m.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsn2awc6n90kmxjv34m1m.jpg" alt="Misconceptions with conversation analysis" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Many sales and customer service teams harbor misconceptions about transcript analysis that can lead to missed opportunities. Here are a few:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AI Can Do It All&lt;/strong&gt;: A prevalent belief is that AI can extract insights without any preprocessing. However, no model performs well on disjointed and unstructured data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All Transcripts Are the Same&lt;/strong&gt;: Each conversation is unique. For instance, internal calls differ significantly from client calls, requiring separate handling.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwhqm5tbrac974qlh4tf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwhqm5tbrac974qlh4tf.jpg" alt="What people miss with getting quality results from LLMs" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Readability Equals Accuracy&lt;/strong&gt;: Just because a transcript looks clean doesn’t mean the insights derived from it are accurate. The system's interpretation can differ from human understanding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Misunderstanding Quotes&lt;/strong&gt;: Users often assume that any given quote can represent the data accurately, but the selection and structure matter greatly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Readable Transcripts Guarantee Insights&lt;/strong&gt;: A transcript can be perfectly readable to a human yet still poorly structured for the model; how the system segments and perceives the text matters as much as surface readability.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Nature of Conversational Data
&lt;/h2&gt;

&lt;p&gt;Conversational data is inherently complex. Unlike structured data, which fits neatly into rows and columns, conversations are fluid and often contain nuances that can be easily overlooked. Here are some common problems with raw transcripts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ambiguity&lt;/strong&gt;: Names can be misidentified or coded as letters (e.g., ‘A’ for ‘InsightLeader’), complicating analysis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disorganized Format&lt;/strong&gt;: From PDFs to voice recordings, the format can vary greatly, impacting how you extract valuable insights.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Core Pipeline: Clean → Process → Identify
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxy94t2jlj4kh9uxol3e9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxy94t2jlj4kh9uxol3e9.png" alt="Processing conversation data" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To tackle the messiness of conversational data, we often follow a core pipeline:&lt;/p&gt;

&lt;h3&gt;
  
  
  Cleaning
&lt;/h3&gt;

&lt;p&gt;This is the first step where standard data cleaning procedures come into play. You need to ensure that the text is free from noise—think filler words, background chatter, or irrelevant comments. &lt;/p&gt;

&lt;h3&gt;
  
  
  Processing
&lt;/h3&gt;

&lt;p&gt;Once cleaned, the next step is to preprocess the data. This involves segmenting the transcript into coherent parts, making it easier to manage. For instance, separating comments by users allows for clearer analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Identification
&lt;/h3&gt;

&lt;p&gt;This step involves identifying the speakers and the context of the conversation. Are you dealing with a focus group, a tutorial, or a one-on-one interview? The answer shapes how you approach the analysis. &lt;/p&gt;
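
&lt;p&gt;Putting the three steps together, here is a minimal sketch in Python. The function names, filler-word pattern, and speaker-count heuristic are illustrative assumptions of mine, not Insight7’s actual implementation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch of the clean -&gt; process -&gt; identify pipeline.
# All names and heuristics here are illustrative assumptions.
import re

FILLERS = re.compile(r"\b(um+|uh+|you know)\b", re.IGNORECASE)


def clean(text):
    """Strip filler words and collapse extra whitespace."""
    return re.sub(r"\s+", " ", FILLERS.sub("", text)).strip()


def process(transcript):
    """Segment a 'Speaker: utterance' transcript into per-speaker turns."""
    turns = []
    for line in transcript.splitlines():
        if ":" in line:
            speaker, utterance = line.split(":", 1)
            turns.append({"speaker": speaker.strip(), "text": clean(utterance)})
    return turns


def identify(turns):
    """Guess the conversation type from a simple speaker-count heuristic."""
    speakers = {turn["speaker"] for turn in turns}
    if len(speakers) &gt; 3:
        return "focus_group"
    return "interview" if len(speakers) == 2 else "tutorial/monologue"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;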

&lt;h2&gt;
  
  
  Solving Transcript Problems With Practical Techniques
&lt;/h2&gt;

&lt;p&gt;Now that we've laid the groundwork, let’s explore some practical techniques for overcoming common transcript challenges:&lt;/p&gt;

&lt;h3&gt;
  
  
  Detecting Conversation Types
&lt;/h3&gt;

&lt;p&gt;Identifying call types helps in processing different transcripts effectively. For example, insights gleaned from a focus group can differ significantly from those derived from a tutorial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using AI + Analysis Models for Metadata Extraction
&lt;/h3&gt;

&lt;p&gt;Leveraging AI models allows us to glean essential metadata from conversations—like identifying customers, their company size, or even specific sentiments expressed during the call. &lt;/p&gt;
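
&lt;p&gt;The article doesn’t tie this to a specific provider, so here is a hypothetical sketch assuming the OpenAI Python client; the model name and prompt are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch of LLM-based metadata extraction.
# Provider, model name, and prompt are assumptions, not Insight7's actual stack.
import json

from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Return JSON with the keys customer_name, company_size and overall_sentiment, "
    "based only on the transcript below.\n\n{transcript}"
)


def extract_metadata(transcript):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": PROMPT.format(transcript=transcript)}],
    )
    return json.loads(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;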

&lt;h3&gt;
  
  
  Structuring Transcripts With Index Parsing
&lt;/h3&gt;

&lt;p&gt;I developed an index parsing approach that manipulates text to create a structured format, making it easier to analyze and retrieve information.&lt;/p&gt;
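
&lt;p&gt;The production format isn’t shown here, but the core idea can be sketched as tagging every turn with a stable index so quotes can be cited and retrieved by position later (reusing the &lt;code&gt;turns&lt;/code&gt; structure from the pipeline sketch above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative take on index parsing: give every utterance a stable index.
def index_parse(turns):
    indexed = {}
    for i, turn in enumerate(turns):
        indexed[f"[{i:04d}]"] = {"speaker": turn["speaker"], "text": turn["text"]}
    return indexed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;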

&lt;h3&gt;
  
  
  Hybrid Named Entity Recognition (NER)
&lt;/h3&gt;

&lt;p&gt;A mix of LLMs (Large Language Models) and rule-based methods can tackle the challenge of identifying speakers—even when names are outliers or coded.&lt;/p&gt;
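
&lt;p&gt;As a rough sketch of what “hybrid” can mean here: a cheap rule pass resolves the easy cases, and only unresolved labels fall back to an LLM (stubbed below; the code table and name regex are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal hybrid NER sketch: rules first, LLM fallback for the leftovers.
import re

SPEAKER_MAP = {"A": "InsightLeader"}  # known letter codes, per the example above
NAME_RULE = re.compile(r"^[A-Z][a-z]+( [A-Z][a-z]+)?$")  # plausible real name


def llm_resolve(label):
    """Stub for the LLM fallback (not shown here)."""
    return f"UNKNOWN({label})"


def resolve_speaker(label):
    if label in SPEAKER_MAP:    # rule 1: explicit code table
        return SPEAKER_MAP[label]
    if NAME_RULE.match(label):  # rule 2: already looks like a name
        return label
    return llm_resolve(label)   # fallback: hand the outlier to an LLM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;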

&lt;h3&gt;
  
  
  Handling Disjointed Transcripts
&lt;/h3&gt;

&lt;p&gt;Disjointed conversations can be tricky. The best technique I’ve found involves using an LLM to process the entire conversation. While it’s a costly approach, it tends to yield the most accurate results.&lt;/p&gt;
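
&lt;p&gt;In code, the whole-conversation approach is simply one large call instead of many chunked ones. A sketch, again assuming the OpenAI client as a stand-in:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the whole-conversation approach: pass the full transcript in one
# call and let the model reassemble it. Costly, but context is never split.
from openai import OpenAI

client = OpenAI()


def reassemble(disjointed_text):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": "Reorder and merge these scattered transcript fragments "
                       "into coherent speaker turns:\n\n" + disjointed_text,
        }],
    )
    return response.choices[0].message.content
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;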

&lt;h2&gt;
  
  
  Real-World Impact of Transcript Analysis
&lt;/h2&gt;

&lt;p&gt;In dozens of real-world cases working with Insight7, transcript analysis didn’t just save time — it revealed patterns and opportunities that teams acted on immediately:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sales&lt;/strong&gt;: Teams discovered that customers were dropping off not because of price, but due to integration and implementation concerns, prompting demo and onboarding changes that boosted close rates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customer service&lt;/strong&gt;: Operations exposed frustration not with response speed but with repeated handoffs and conflicting answers, leading to the adoption of an owner-agent model and higher CSAT scores.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coaching&lt;/strong&gt;: Managers used transcript-driven metrics (talk ratio, missed value-recaps, failure to “ask next step”) to give precise feedback, resulting in improved call quality and more predictable follow-ups.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Product&lt;/strong&gt;: Teams used recurring customer complaints to drive roadmap changes, showcasing how Insight7 makes analyzing interviews faster and more impactful.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Can You Extract Goals From Transcripts? Absolutely.
&lt;/h2&gt;

&lt;p&gt;With a refined system that adequately identifies various conversation types, we can effectively analyze and evaluate transcripts. This capability empowers CEOs and project managers to make insightful decisions based on their data.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Insight7 Makes This Entire Process Automatic
&lt;/h3&gt;

&lt;p&gt;At Insight7, we’ve developed cutting-edge tools that automate the transcription and analysis of conversations in over 60 languages. Here’s how we deliver value:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Clear Actionable Insights&lt;/strong&gt;: We surface recurring themes, sentiment, pain points, and meaningful quotes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visualization&lt;/strong&gt;: Our dashboards, journey maps, and scorecards help visualize findings for easy interpretation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collaboration and Reporting&lt;/strong&gt;: Designed for product, sales, CX, and research teams, our platform supports collaboration and evidence-based decision-making—all while ensuring enterprise-grade security.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In sales and customer service, understanding conversations isn't just about transcripts; it’s about transforming unstructured data into actionable insights. By embracing the challenges of transcript analysis, we can extract the gold nuggets that lie within conversations and drive informed decision-making.&lt;/p&gt;

&lt;p&gt;Original Post: &lt;a href="https://insight7.io/extracting-gold-from-conversations-the-hidden-challenges-of-transcript-analysis/" rel="noopener noreferrer"&gt;https://insight7.io/extracting-gold-from-conversations-the-hidden-challenges-of-transcript-analysis/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>nlp</category>
      <category>productivity</category>
      <category>architecture</category>
    </item>
    <item>
      <title>A Week, an Idea, and an AI Evaluation System: What I Learned Along the Way</title>
      <dc:creator>Ganiyu Olalekan</dc:creator>
      <pubDate>Wed, 03 Dec 2025 11:51:22 +0000</pubDate>
      <link>https://dev.to/ganiyuolalekan/a-week-an-idea-and-an-ai-evaluation-system-what-i-learned-along-the-way-4hl1</link>
      <guid>https://dev.to/ganiyuolalekan/a-week-an-idea-and-an-ai-evaluation-system-what-i-learned-along-the-way-4hl1</guid>
      <description>&lt;h2&gt;
  
  
  How the Project Started
&lt;/h2&gt;

&lt;p&gt;I remember the moment the evaluation request landed in my Slack. The excitement was palpable—a chance to delve into a challenge that was rarely explored. The goal? To create a system that could evaluate the performance of human agents during conversations. It was like embarking on a treasure hunt, armed with nothing but a week’s worth of time and a wild idea. Little did I know, this project would not only test my technical skills but also push the boundaries of what I thought was possible in AI evaluation.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Rarely Explored Problem Space
&lt;/h2&gt;

&lt;p&gt;Conversations are nuanced; they’re filled with emotions, tones, and subtle cues that a machine often struggles to decipher. This project was an opportunity to explore a domain that needed attention—a chance to bridge the gap between human conversation and machine understanding.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Needed to Be Built
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F94jziioupd2kwox0q6l3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F94jziioupd2kwox0q6l3.jpg" alt="Building an agent evaluation system" width="800" height="711"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the clock ticking, the mission was clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Create a conversation evaluation framework&lt;/strong&gt; capable of scoring agents based on predefined criteria.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provide evidence&lt;/strong&gt; of performance to build trust in the evaluation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ensure that the system could adapt&lt;/strong&gt; to various conversational styles and tones.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What made this mission so thrilling was the challenge of designing a system that could accurately evaluate the intricacies of human dialogue—all within just one week.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Made the Work Hard (and Exciting)
&lt;/h2&gt;

&lt;p&gt;This project was both daunting and exhilarating. I was tasked with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Understanding the nuances of human conversation:&lt;/strong&gt; How do you capture the essence of a chat filled with sarcasm or hesitation?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developing a scoring rubric:&lt;/strong&gt; A clear, structured approach was essential to avoid ambiguity in evaluations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterating quickly:&lt;/strong&gt; With a week-long deadline, every hour counted, and quick feedback loops became my best friends.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Despite the challenges, the thrill of creating something groundbreaking kept me motivated. The feeling of something new always excites me—it’s unpredictable, and there was a chance we would fail.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3uekdf6vc3urte317ken.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3uekdf6vc3urte317ken.jpg" alt="Key metrics to quality in evaluations" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned While Building the Evaluation Framework
&lt;/h2&gt;

&lt;p&gt;Through the highs and lows of this intense week, I gleaned valuable insights that I want to share with fellow learners and solution finders:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quality isn’t an afterthought—it's a system.&lt;/strong&gt; Building a reliable evaluation pipeline requires clear rubrics, structured scoring, and consistent measurement rules that remove ambiguity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human nuance is harder than model logic.&lt;/strong&gt; Evaluating conversations means dealing with tone shifts, emotions, sarcasm, hesitation, filler words, incomplete sentences, and even misspellings from transcriptions. Teaching an AI to understand that required deeper work than I expected.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Criteria must be precise or the AI will drift.&lt;/strong&gt; Any vague or loosely defined rubric leads to inconsistent scoring. I learned the importance of turning human expectations into measurable, testable standards.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq3stlife68jt1k05inka.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq3stlife68jt1k05inka.jpg" alt="Key decisions to quality in evaluations" width="800" height="543"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Evidence-based scoring builds trust.&lt;/strong&gt; It wasn’t enough for the system to score the agent—we also had to show why it scored that way. Extracting high-quality evidence became a core pillar of the system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation is iterative.&lt;/strong&gt; Early versions looked “okay,” but actual conversations exposed weaknesses immediately. Each iteration sharpened the model’s accuracy, detection skills, and ability to generalize.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge cases are the real teachers.&lt;/strong&gt; Background noise, overlapping speakers, low empathy, sudden escalations, or overly long pauses pushed the evaluation system to become more robust.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time pressure forces clarity.&lt;/strong&gt; With just one week, I had to prioritize essentials, design fast feedback loops, and build only what truly mattered. That constraint was actually a strength.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A good evaluation system becomes a product.&lt;/strong&gt; What started as a one-week project evolved into one of our most popular services because quality, clarity, and trust are universal needs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How the System Works (High-Level Overview)
&lt;/h2&gt;

&lt;p&gt;The evaluation system I built operates on a multi-faceted approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data Collection:&lt;/strong&gt; Conversations are transcribed and analyzed in over 60 languages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation on Rubrics:&lt;/strong&gt; The AI analyzes each transcript and evaluates performance against each sub-criterion using our Evaluation Data Model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scoring Mechanism:&lt;/strong&gt; Agents are evaluated against predefined rubrics, with evidence provided to justify scores. Each criterion is scored out of 100, and sub-criteria are weighted accordingly (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance Summary and Breakdown:&lt;/strong&gt; Each evaluation includes a summary of performance, a breakdown of scores, and quotes from the transcript that support the evaluation.&lt;/li&gt;
&lt;/ol&gt;
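
&lt;p&gt;The Evaluation Data Model itself isn’t public, so here is only a toy illustration of the weighted roll-up in step 3; the criterion, weights, and quotes are invented for the example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy illustration of weighted rubric scoring, not the actual Evaluation Data Model.
from dataclasses import dataclass


@dataclass
class SubScore:
    name: str
    score: float   # 0 to 100
    weight: float  # weights within a criterion should sum to 1.0
    evidence: str  # transcript quote justifying the score


def criterion_score(sub_scores):
    """Weighted average of sub-criteria scores, out of 100."""
    return sum(s.score * s.weight for s in sub_scores)


empathy = [
    SubScore("acknowledged frustration", 80, 0.6, "I completely understand."),
    SubScore("offered reassurance", 60, 0.4, "We will sort this out today."),
]
print(round(criterion_score(empathy), 1))  # 72.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;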

&lt;p&gt;This approach not only streamlines the evaluation process but also empowers teams to make informed decisions quickly—a necessity in today’s world.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Impact — How Teams Use It
&lt;/h2&gt;

&lt;p&gt;Since launching the evaluation system, teams across various sectors—product, sales, customer experience, and research—have leveraged it to enhance their operations. The feedback has been overwhelmingly positive. Teams are now able to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identify strengths and weaknesses in AI interactions.&lt;/li&gt;
&lt;li&gt;Provide targeted training to improve agent performance.&lt;/li&gt;
&lt;li&gt;Foster a culture of continuous improvement driven by data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real impact lies in how this project has enabled teams to transform conversations into actionable insights, ultimately leading to better customer experiences and business outcomes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion — From One-Week Sprint to Flagship Product
&lt;/h2&gt;

&lt;p&gt;What started as a one-week sprint has now evolved into a flagship product that continues to grow and adapt. The journey taught me that the intersection of human conversation and AI evaluation is not just a technical endeavor; it’s about understanding the essence of communication itself.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“I build intelligent systems that help humans make sense of data, discover insights, and act smarter.” &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This project was a testament to that philosophy.&lt;/p&gt;

&lt;p&gt;If you’re a learner or solution finder, remember that every challenge is an opportunity for growth. Embrace the journey, stay curious, and keep pushing the boundaries of what’s possible. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Original Post: &lt;a href="https://insight7.io/a-week-an-idea-and-an-ai-evaluation-system-what-i-learned-along-the-way/" rel="noopener noreferrer"&gt;https://insight7.io/a-week-an-idea-and-an-ai-evaluation-system-what-i-learned-along-the-way/&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devjournal</category>
      <category>learning</category>
    </item>
    <item>
      <title>Steps Involved in Selecting a Model (Model Selection)</title>
      <dc:creator>Ganiyu Olalekan</dc:creator>
      <pubDate>Tue, 15 Mar 2022 10:29:59 +0000</pubDate>
      <link>https://dev.to/ganiyuolalekan/steps-involved-in-selecting-a-model-model-selection-1d9n</link>
      <guid>https://dev.to/ganiyuolalekan/steps-involved-in-selecting-a-model-model-selection-1d9n</guid>
      <description>&lt;p&gt;Model selection is a key ingredient in the long and essential series of steps involved in creating a machine learning (ML) model that would be deployed into production.&lt;/p&gt;

&lt;p&gt;This article aims to act as a guide to machine learning engineers new to the process of model selection in machine learning (ML).&lt;/p&gt;

&lt;p&gt;We’ll start by understanding what model selection is:&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Model Selection
&lt;/h2&gt;

&lt;p&gt;Model selection is the task (or process) of &lt;strong&gt;selecting&lt;/strong&gt; a statistical model from a &lt;strong&gt;set of candidate models&lt;/strong&gt;, given data. &lt;a href="https://en.wikipedia.org/wiki/Model_selection"&gt;Wikipedia&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;What this implies is that model selection is the activity of undergoing a series of tasks or processes. This series of activities helps us determine which statistical model (among several candidates) is best suited to make predictions for a task.&lt;/p&gt;

&lt;p&gt;In selecting a model we start by inspecting our dataset because everything we do afterward only matters when we know the kind of data we’re working with.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is the dataset clean?
&lt;/h2&gt;

&lt;p&gt;To begin with, we look into the dataset for issues like missing data, incorrectly formatted values, etc. This process is called &lt;strong&gt;data cleaning&lt;/strong&gt;: the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. &lt;a href="https://www.tableau.com/learn/articles/what-is-data-cleaning"&gt;tableau&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Trust me! &lt;strong&gt;Data Cleaning&lt;/strong&gt; is a very lengthy and tiring process. It is a whole subject of its own, and valuable materials to assist those new to it are available in the &lt;strong&gt;further reading&lt;/strong&gt; section below.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What is the size of the dataset?
&lt;/h2&gt;

&lt;p&gt;The next thing we look into is the size of the data. How big is it? Is it big enough to be split into 3 sets (train, validation, and test), or is it so small that we can’t even extract a good enough test set (for example, the iris dataset)?&lt;/p&gt;

&lt;p&gt;Let’s start by identifying how we can address the small dataset.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do we define a small dataset?
&lt;/h2&gt;

&lt;p&gt;A dataset of around 1,000 rows or fewer can be considered small. A dataset larger than 1,000 rows can still be considered small depending on the problem you’re trying to solve.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;if you try to process a small data set naively, it will still work. If you try to process a large data set naively, it will take orders of magnitude longer than acceptable (and possibly exhaust your computing resources as well). ~&lt;a href="https://www.bi.wygroup.net/digital-transformation/what-is-the-difference-between-big-data-large-data-set-data-stream-and-streaming-data/"&gt;Carlos Barge&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I consider the metric by &lt;a href="https://www.bi.wygroup.net/digital-transformation/what-is-the-difference-between-big-data-large-data-set-data-stream-and-streaming-data/"&gt;Carlos Barge&lt;/a&gt; more appropriate for distinguishing a small dataset from a large one. What constitutes a large dataset isn’t just the number of rows but also the number of columns.&lt;/p&gt;

&lt;p&gt;After defining a dataset as small, various steps should be taken to select a model for that dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: When performing a model evaluation, consider the rule of thumb for training a model.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Your model should train on at least an order of magnitude more examples than trainable parameters &lt;a href="https://developers.google.com/machine-learning/data-prep/construct/collect/data-size-quality#the-size-of-a-data-set"&gt;developers.google.com&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These steps include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Transform categorical columns to numeric (If any)&lt;/li&gt;
&lt;li&gt;Perform a k-fold cross-validation&lt;/li&gt;
&lt;li&gt;Elect candidate models&lt;/li&gt;
&lt;li&gt;Perform Model Evaluation&lt;/li&gt;
&lt;li&gt;Model selection&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To explain this better, I’ll make use of the &lt;a href="https://www.lac.inpe.br/~rafael.santos/Docs/CAP394/WholeStory-Iris.html"&gt;iris dataset&lt;/a&gt; to examine the measures listed above. The complete notebook on the model selection process for the iris dataset can be found on my &lt;a href="https://www.kaggle.com/ganiyuolalekan/model-selection-for-small-dataset"&gt;Kaggle page&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Transform categorical columns to numeric
&lt;/h2&gt;

&lt;p&gt;Most machine learning models are unable to interpret non-numeric values, so before proceeding, all non-numeric columns need to be transformed to numeric values.&lt;/p&gt;

&lt;p&gt;In most cases, columns that would need to be transformed to numeric values would be categorical columns like &lt;code&gt;[low, medium, high]&lt;/code&gt; or &lt;code&gt;[Yes, No]&lt;/code&gt; or &lt;code&gt;[Male, Female]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scikit-learn.org/stable/"&gt;Scikit-learn&lt;/a&gt; is a toolbox that was built to handle these conversions: they include the &lt;code&gt;LabelEncoder&lt;/code&gt;, &lt;code&gt;OrdinalEncoder&lt;/code&gt;, &lt;code&gt;OneHotEncoder&lt;/code&gt;, etc. All this is available in &lt;code&gt;sklearn.preprocessing&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Resources to articles that provide clarification on these tools can be found in the &lt;strong&gt;further reading&lt;/strong&gt; section of this article.&lt;/p&gt;
&lt;/blockquote&gt;
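
&lt;p&gt;As a quick illustration of those encoders on a toy column (note that &lt;code&gt;LabelEncoder&lt;/code&gt; is meant for 1-D target labels, not feature columns):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Quick example of the sklearn.preprocessing encoders mentioned above.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

sizes = [["low"], ["medium"], ["high"], ["medium"]]

# OrdinalEncoder keeps one column, mapping each category to an integer
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
print(ordinal.fit_transform(sizes).ravel())  # [0. 1. 2. 1.]

# OneHotEncoder expands the column into one binary column per category
one_hot = OneHotEncoder()
print(one_hot.fit_transform(sizes).toarray())

# LabelEncoder is for target labels (1-D), not features
print(LabelEncoder().fit_transform(["Yes", "No", "Yes"]))  # [1 0 1]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;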

&lt;h2&gt;
  
  
  Perform a k-fold cross-validation
&lt;/h2&gt;

&lt;p&gt;The k-fold cross-validation is a procedure used to estimate the skill of the model on new data. &lt;a href="https://machinelearningmastery.com/k-fold-cross-validation/"&gt;machine learning mastery&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IDswFj9j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eatuhgoccc7xwsykykrs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IDswFj9j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eatuhgoccc7xwsykykrs.png" alt="K-fold Cross-Validation" width="607" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;K-fold cross-validation works by splitting the dataset into a specified number of folds (say 5) and then shifting the position of the test set to a different fold at each iteration (as illustrated above).&lt;/p&gt;

&lt;p&gt;After performing the k-fold cross-validation, we end up with N different train/test splits of the same dataset (where N is the number of folds).&lt;/p&gt;

&lt;p&gt;There are two (2) ways to use k-fold cross-validation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Using k-fold cross-validation for evaluating a model’s performance&lt;/li&gt;
&lt;li&gt;Using k-fold cross-validation for hyper-parameter tuning&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;There’s a lovely article by &lt;a href="https://rukshanpramoditha.medium.com/"&gt;Rukshan Pramoditha&lt;/a&gt; titled &lt;a href="https://towardsdatascience.com/k-fold-cross-validation-explained-in-plain-english-659e33c0bc0"&gt;k-fold cross-validation explained in plain English&lt;/a&gt; which explains both. We would however use k-fold for evaluating model performance in this test case.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="s"&gt;"""
Creating a K cross validation fold with sklearn using the iris dataset
"""&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_iris&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KFold&lt;/span&gt;


&lt;span class="c1"&gt;# Loads iris dataset
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_iris&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;return_X_y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Splits dataset into 5 folds
&lt;/span&gt;&lt;span class="n"&gt;iris_kf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KFold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_splits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# List to store dataset across the the various folds
&lt;/span&gt;&lt;span class="n"&gt;kf_data_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;train_index&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;test_index&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
        &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;train_index&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
        &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;test_index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;train_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_index&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;iris_kf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The purpose of performing k-fold cross-validation here is to make the most of a limited dataset.&lt;/p&gt;

&lt;p&gt;What do I mean by this? The iris dataset, for instance, has only 150 rows, which is so small that extracting a test and cross-validation set would leave us with very little to train with.&lt;/p&gt;

&lt;p&gt;By splitting the dataset into a training and test set across 5 different instances here, we try to maximize the use of the available data for training and then test the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Elect candidate models
&lt;/h2&gt;

&lt;p&gt;Now that we’ve successfully split our dataset into 5 folds, we can proceed to elect the candidate models. This is where we look at the kind of task we are solving and the models that can address it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4rUEI3ud--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mba1cu20yxsplglv653d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4rUEI3ud--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mba1cu20yxsplglv653d.png" alt="Iris Flower Classification" width="804" height="720"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The iris dataset poses a classification task. It has four (4) feature columns: &lt;code&gt;sepal length (cm)&lt;/code&gt;, &lt;code&gt;sepal width (cm)&lt;/code&gt;, &lt;code&gt;petal length (cm)&lt;/code&gt;, and &lt;code&gt;petal width (cm)&lt;/code&gt;. All are continuous feature columns.&lt;/p&gt;

&lt;p&gt;By visualizing the dataset, we can tell that the classes are almost linearly separable in the &lt;code&gt;petal width (cm)&lt;/code&gt; and &lt;code&gt;petal length (cm)&lt;/code&gt; features. Well, this and probably more relationships.&lt;/p&gt;

&lt;p&gt;Question: What models best decide these relationships?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I’ll go straight to listing out models that can capture these relationships. For more on the reasons we picked these models, check out the further reading section.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We’ll be electing the &lt;strong&gt;&lt;code&gt;LogisticRegression&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;SVC&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;KNN&lt;/code&gt;&lt;/strong&gt;, and &lt;strong&gt;&lt;code&gt;RandomForestClassifier&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Perform Model Evaluation
&lt;/h2&gt;

&lt;p&gt;Now that we’ve decided on the machine learning (ML) models, we can proceed to evaluate the models with our dataset using cross-validation.&lt;/p&gt;

&lt;p&gt;We would make use of the &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html"&gt;&lt;strong&gt;&lt;code&gt;sklearn.model_selection.cross_val_score&lt;/code&gt;&lt;/strong&gt;&lt;/a&gt; to cross-validate the dataset and get the scores on the model performance across each fold.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="s"&gt;"""
Model performance on the iris dataset
Trying to evaluate best performing models using cross validation.
"""&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_iris&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.svm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SVC&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.neighbors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KNeighborsClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;model_performance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""
    Takes a record of the model performance during cross validation
    returns the record of the model performance along with the
            model performance rating of the stating which model performed
            best and which performed worst
    """&lt;/span&gt;

    &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;'Logistic Regression'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
        &lt;span class="s"&gt;'K-Nearest Neighbor'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
        &lt;span class="s"&gt;'Random Forest Classifier'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
        &lt;span class="s"&gt;'Support Vector Classifier'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;avg_model_performance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;
        &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'accuracy'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'scores'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;
        &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'mean_score'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;avg_model_performance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nb"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Model Performance Rating'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avg_model_performance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;


&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_iris&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;return_X_y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_performance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;KNeighborsClassifier&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;SVC&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Model Performance Rating&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Model Performance Rating'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3m6nlmy_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/98jsz5njepkirzx5bb57.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3m6nlmy_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/98jsz5njepkirzx5bb57.png" alt="Iris Model Performance" width="880" height="148"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Selection
&lt;/h2&gt;

&lt;p&gt;After cross-validating the dataset, we can now conclude that the best-performing models are Logistic Regression and K-Nearest Neighbors, which both have an accuracy of 97.33%.&lt;/p&gt;

&lt;p&gt;This implies that either of them would be efficient for deployment. Based on the needs of the problem, we can now decide between the two: choose Logistic Regression if you need a model-based learning algorithm, or KNN if instance-based learning suits your problem better.&lt;/p&gt;


&lt;p&gt;Performing cross-validation experiments like this on a large dataset would be very computationally expensive.&lt;/p&gt;

&lt;p&gt;Now that we’ve figured out how to address the smaller datasets, how do we address larger ones?&lt;/p&gt;

&lt;h2&gt;
  
  
  How do we define a large dataset?
&lt;/h2&gt;

&lt;p&gt;What do I mean by a large dataset? A dataset of about 10,000 rows upwards is large, while datasets in the range of, say, 2,000 to 10,000 rows are reasonably medium. Of course, this metric isn’t the best.&lt;/p&gt;

&lt;p&gt;If you try processing a large dataset naively, it will take far longer and exhaust computing power; this is a more precise metric.&lt;br&gt;
After determining that your dataset is large, what are the steps for selecting a model for it?&lt;/p&gt;

&lt;p&gt;Well, unlike with smaller datasets, we can’t process this dataset naively. Thus, we have to split it. This is where splitting the dataset into three (3) sets for training and evaluation comes into play.&lt;/p&gt;

&lt;p&gt;Before we proceed though, let’s list the steps required to select a model for larger datasets:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Transform Categorical Columns to Numeric (If any)&lt;/li&gt;
&lt;li&gt;Scale Continuous Columns (if necessary)&lt;/li&gt;
&lt;li&gt;Split the Dataset&lt;/li&gt;
&lt;li&gt;Elect Candidate Model&lt;/li&gt;
&lt;li&gt;Perform Model Evaluation&lt;/li&gt;
&lt;li&gt;Model Selection&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can proceed with these steps if you have a cleaned dataset. The &lt;a href="https://www.kaggle.com/c/house-prices-advanced-regression-techniques"&gt;House Prices — Advanced Regression Techniques&lt;/a&gt; dataset will be used for tutorial purposes as we analyze the steps involved in selecting models for larger datasets.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.kaggle.com/c/house-prices-advanced-regression-techniques"&gt;House Prices&lt;/a&gt; dataset isn’t so large a dataset itself but should explain the concept behind our steps nicely.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The notebook compiling the codes for the dataset and the work we did can be found on my &lt;a href="https://www.kaggle.com/ganiyuolalekan/model-selection-for-larger-dataset"&gt;Kaggle page&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I’ll jump right into splitting the dataset. Below is the code for cleaning the dataset and transforming the columns, in case you want to follow along with the &lt;a href="https://www.kaggle.com/c/house-prices-advanced-regression-techniques"&gt;House Prices&lt;/a&gt; dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="s"&gt;"""
Cleaning and transforming the housing price dataset
House Prices - Advanced Regression Techniques
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
"""&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.impute&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SimpleImputer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.compose&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ColumnTransformer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OrdinalEncoder&lt;/span&gt;


&lt;span class="c1"&gt;# Loading both train and test set into a dataframe
&lt;/span&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"house_prices/train.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index_col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Id'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;test_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"house_prices/test.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index_col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Id'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Merging both train and test set into one data frame
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_dataset&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;#Extracing out target, in which we hope to predict
&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"SalePrice"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;to_numpy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Dropping some dataset columns
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="s"&gt;"Alley"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"FireplaceQu"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"PoolQC"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Fence"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"MiscFeature"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"SalePrice"&lt;/span&gt;
&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Specifying the continuous columns
&lt;/span&gt;&lt;span class="n"&gt;continuous_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Specifying the categorical columns
&lt;/span&gt;&lt;span class="n"&gt;categorical_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;col&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;continuous_col&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Creating the continuous columns data pipeline
&lt;/span&gt;&lt;span class="n"&gt;continuous_data_pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'imputer'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SimpleImputer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"median"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'num_scaler'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Creating the categorical columns data pipeline
&lt;/span&gt;&lt;span class="n"&gt;categorical_data_pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'freq_imputer'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SimpleImputer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'most_frequent'&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'cat_encoder'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OrdinalEncoder&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Creating a data pipeline for the whole dataset
&lt;/span&gt;&lt;span class="n"&gt;housing_price_pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ColumnTransformer&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"continous"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;continuous_data_pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;continuous_col&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"categorical"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;categorical_data_pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;categorical_col&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Transformed instance of the dataset
# Remember, the variable 'target' holds the target values (SalePrice)
&lt;/span&gt;&lt;span class="n"&gt;transformed_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;housing_price_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
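
&lt;p&gt;As a quick sanity check (a minimal sketch, assuming the &lt;code&gt;dataset&lt;/code&gt; and &lt;code&gt;transformed_dataset&lt;/code&gt; variables from the code above), you can confirm the output shape before moving on:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sanity check: after dropping the five sparse columns and the
# SalePrice target, 74 feature columns remain, and the pipeline
# preserves that column count.
print(dataset.shape)              # (1460, 74)
print(transformed_dataset.shape)  # (1460, 74)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;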



&lt;h2&gt;
  
  
  Split the Dataset
&lt;/h2&gt;

&lt;p&gt;The reason we evaluate machine learning (ML) models is to ensure they don’t underfit or overfit.&lt;/p&gt;

&lt;p&gt;We were able to evaluate the Iris dataset (a small dataset) using cross-validation alone, but since this dataset isn’t as small, validating it so naively would be computationally expensive.&lt;/p&gt;

&lt;p&gt;Therefore, we have to split the dataset into a train set and a test set. Given that the entire dataset has a shape of &lt;strong&gt;(1460, 80)&lt;/strong&gt;, and &lt;strong&gt;(1460, 74)&lt;/strong&gt; after cleaning and transformation, we can perform cross-validation on the train set and evaluate our model’s performance on the test set (a cross-validation sketch follows the split below).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="s"&gt;"""
Splitting the merged dataset of the housing price dataset
Merger:
https://gist.github.com/ganiyuolalekan/8e2acab87a0d4c51ff7fcd59a9ad8c4c
House Prices - Advanced Regression Techniques
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
"""&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;

&lt;span class="c1"&gt;# Splitting the dataset
&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;transformed_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
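
&lt;p&gt;The cross-validation on the train set isn’t shown above, so here’s a minimal sketch, assuming the &lt;code&gt;X_train&lt;/code&gt; and &lt;code&gt;y_train&lt;/code&gt; variables from the split:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

# 5-fold cross-validation of one candidate on the train split only,
# scored with scikit-learn's negated mean absolute error.
model = RandomForestRegressor(random_state=42)
scores = cross_val_score(
    model, X_train, y_train,
    scoring="neg_mean_absolute_error", cv=5
)
print("Mean CV MAE:", -scores.mean())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;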



&lt;h2&gt;
  
  
  Elect Candidate Model
&lt;/h2&gt;

&lt;p&gt;Now that we’ve split the dataset into train and test sets, we can proceed to elect models that can solve this task.&lt;/p&gt;

&lt;p&gt;We have to understand the dataset. I talked about it in my notebook &lt;a href="https://www.kaggle.com/ganiyuolalekan/house-prices-prediction-beginner/notebook"&gt;House Prices Prediction (Beginner)&lt;/a&gt; where I gave an &lt;a href="https://www.kaggle.com/ganiyuolalekan/house-prices-prediction-beginner#2.1.-Overview-of-the-data"&gt;overview of the dataset.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We’re dealing with a regression task with lots of categorical features, so models with linear and decision-making abilities would be useful, like the Decision Tree Regressor or the Random Forest Regressor. Let’s go with the Random Forest Regressor, since it’s an ensemble of Decision Trees.&lt;/p&gt;

&lt;p&gt;We should also elect models like the Support Vector Regressor, Linear Regression, and K-Neighbors Regressor, so we have several candidates to compare during evaluation; a sketch of these candidates follows below.&lt;/p&gt;
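
&lt;p&gt;Here’s a minimal sketch of the elected candidates, kept in a dict so we can loop over them later (the variable names are mine, not from the notebook):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.ensemble import RandomForestRegressor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# The elected candidates, all with default parameters.
candidate_models = {
    "Random Forest Regressor": RandomForestRegressor(random_state=42),
    "Support Vector Regressor": SVR(),
    "Linear Regression": LinearRegression(),
    "K-Neighbors Regressor": KNeighborsRegressor(),
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;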

&lt;p&gt;XGBoost will prove to be a vital tool in your ML journey, and I suggest examining its usage in the notebook &lt;a href="https://www.kaggle.com/dansbecker/xgboost"&gt;XGBoost&lt;/a&gt; by Kaggle grandmaster &lt;a href="https://www.kaggle.com/dansbecker"&gt;Dan Becker&lt;/a&gt;. More resources on XGBoost are in the &lt;strong&gt;further reading&lt;/strong&gt; section.&lt;/p&gt;

&lt;h2&gt;
  
  
  Perform Model Evaluation
&lt;/h2&gt;

&lt;p&gt;Now that we’ve split our dataset and elected the models we want to use, it’s time to see how the individual models perform on the training data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vIvR2rD8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ur4gnejb0rrm06ayma0f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vIvR2rD8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ur4gnejb0rrm06ayma0f.png" alt="Housing Price Performance" width="785" height="68"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Beyond doubt, the Random Forest Regressor performed best, outperforming the Linear Regression model by roughly 3x. Since our focus here is on model selection, I avoided cross-validating and fine-tuning the models. The sketch below shows how such a comparison could be reproduced.&lt;/p&gt;
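
&lt;p&gt;The evaluation code itself isn’t shown, so here’s a minimal sketch of how the comparison could be reproduced, assuming the hypothetical &lt;code&gt;candidate_models&lt;/code&gt; dict from the earlier sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.metrics import mean_absolute_error

# Fit each candidate on the train set, then compare MAE on the test set.
for name, model in candidate_models.items():
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: MAE = {mae:,.2f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;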

&lt;blockquote&gt;
&lt;p&gt;In most cases, I would cross-validate and fine-tune each model (using grid search) to find the best score it can produce before making a decision. But the models’ default parameters are decent enough for this task, so let’s keep it simple. (A grid-search sketch follows this note.)&lt;/p&gt;
&lt;/blockquote&gt;
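
&lt;p&gt;For reference, here’s what that grid search could look like; the parameter grid below is purely illustrative, not a tuned choice:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Illustrative grid; widen or narrow it based on your compute budget.
param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20],
}
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=5,
)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_, -grid_search.best_score_)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;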

&lt;h2&gt;
  
  
  Model Selection
&lt;/h2&gt;

&lt;p&gt;After splitting the dataset, electing the candidate models, and performing model evaluation, we can conclude that the Random Forest Regressor, with a mean absolute error (MAE) of 6732.92, is best suited for deployment.&lt;/p&gt;

&lt;p&gt;We didn’t fine-tune the model, though. We could get a much better MAE by fine-tuning the Random Forest Regressor, but the point has been established.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You could try out XGBoost and compare it to see if it performs better. What if you fine-tuned the XGBoost model as well? A sketch follows below.&lt;/p&gt;
&lt;/blockquote&gt;
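
&lt;p&gt;If you want to try that suggestion, here’s a minimal sketch, assuming the &lt;code&gt;xgboost&lt;/code&gt; package is installed (the hyperparameters are assumptions, not tuned values):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error

# Train an XGBoost regressor and compare its test MAE
# against the Random Forest's 6732.92.
xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=42)
xgb_model.fit(X_train, y_train)
xgb_mae = mean_absolute_error(y_test, xgb_model.predict(X_test))
print(f"XGBoost MAE = {xgb_mae:,.2f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;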

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We’ve seen that model selection is a key step in the lengthy series of steps involved in creating a machine learning (ML) model that will be deployed into production.&lt;/p&gt;

&lt;p&gt;We covered the criteria for judging whether a dataset is small or large, and the reasons for cross-validating smaller sets and splitting larger ones.&lt;/p&gt;

&lt;p&gt;We also talked about why we evaluate models and how we elect candidate models before model evaluation.&lt;/p&gt;

&lt;p&gt;I hope this guide proves effective as you apply these steps to your own machine learning tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This article was originally published on &lt;a href="https://gmolalekan.medium.com/steps-involved-in-selecting-a-model-model-selection-bd7aaffbec4f"&gt;Medium&lt;/a&gt; by &lt;a href="https://gmolalekan.medium.com/"&gt;me&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Data Cleaning&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4"&gt;The Ultimate Guide to Data Cleaning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b"&gt;Data Cleaning with Python and Pandas&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Encoding Categorical Columns&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/bycodegarage/encoding-categorical-data-in-machine-learning-def03ccfbf40"&gt;Encoding Categorical data in Machine Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/guide-to-encoding-categorical-features-using-scikit-learn-for-machine-learning-5048997a5c79"&gt;Guide to Encoding Categorical Features Using Scikit-Learn For Machine Learning&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scikit-Learn Models&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Support-vector_machine"&gt;Support Vector Machine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Random_forest"&gt;Random Forest&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm"&gt;K-Nearest Neighbor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Linear_regression"&gt;Linear Regression&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Logistic_regression"&gt;Logistic Regression&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Further Reading On Model Selection&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/a-short-introduction-to-model-selection-bb1bb9c73376"&gt;A “short” introduction to model selection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://machinelearningmastery.com/a-gentle-introduction-to-model-selection-for-machine-learning/"&gt;A Gentle Introduction to Model Selection for Machine Learning&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Associated Notebooks&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/ganiyuolalekan/model-selection-for-small-dataset"&gt;Steps Involved in Selecting a Model For a Small Data-set&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/ganiyuolalekan/model-selection-for-larger-dataset"&gt;Steps Involved in Selecting a Model For a larger Data-set&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Book&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/"&gt;Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>sklearn</category>
      <category>modelselection</category>
      <category>kfold</category>
    </item>
  </channel>
</rss>
