DEV Community: Dancan Ngare

Building a Deep Learning Model to Detect Potato Diseases: My Journey with PlantVillage.

Dancan Ngare — Sun, 27 Jul 2025 08:11:41 +0000

As a data scientist with a passion for solving real-world problems, I recently embarked on a project to detect potato diseases using computer vision. The idea sprouted from the realization that in many parts of Kenya, and globally, farmers often lack quick and affordable access to agricultural expertise. A simple app that identifies potato diseases such as early blight and late blight, or confirms that a leaf is healthy, could make a real difference.

The Dataset: PlantVillage

The first step was choosing the right dataset. The PlantVillage dataset on Kaggle offered exactly what I needed: a categorized collection of potato leaf images labeled as healthy, early blight or late blight. These high-quality images made training a model feasible without needing to manually curate data, a huge relief!

Setting Up the Pipeline

I developed this project using TensorFlow and Keras in Python. I began by loading and preprocessing the dataset. Preprocessing is crucial in any machine learning or deep learning project, especially in computer vision. This is because raw data is rarely in a form that's immediately useful for training a model.
import tensorflow as tf

# Load images from class-named subfolders, resized and batched.
dataset = tf.keras.preprocessing.image_dataset_from_directory(
    "PlantVillage",
    shuffle=True,
    image_size=(256, 256),
    batch_size=32,
)
Once loaded, I split the data into training, validation, and test sets using a custom partitioning function. Holding out validation and test data was critical for checking that the model generalizes to images it has not seen.
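The partitioning function itself is not shown in this post. A minimal sketch of the arithmetic behind such a split, assuming an 80/10/10 ratio (the ratio and the name split_sizes are illustrative assumptions, not the original code):

```python
def split_sizes(n_batches, train_frac=0.8, val_frac=0.1):
    """Return (train, val, test) batch counts; the remainder goes to test."""
    if not 0 < train_frac + val_frac < 1:
        raise ValueError("train_frac + val_frac must be between 0 and 1")
    n_train = int(n_batches * train_frac)
    n_val = int(n_batches * val_frac)
    n_test = n_batches - n_train - n_val
    return n_train, n_val, n_test


# With a batched tf.data.Dataset, the counts map onto take/skip:
#   train_ds = dataset.take(n_train)
#   val_ds   = dataset.skip(n_train).take(n_val)
#   test_ds  = dataset.skip(n_train + n_val)
n_train, n_val, n_test = split_sizes(68)  # 68 is an arbitrary batch count
print(n_train, n_val, n_test)
```

Giving the remainder to the test set means no batch is dropped when the fractions do not divide the dataset evenly.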

Data Augmentation & Preprocessing

To combat overfitting, I applied data augmentation techniques such as flipping, rotation, zooming, contrast adjustments, and random translations. I also normalized the images by scaling pixel values to the range 0 to 1. This step noticeably improved training stability.
from tensorflow.keras import layers

data_augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.2),
    layers.RandomZoom(0.2),
    layers.RandomContrast(0.2),
    layers.RandomTranslation(0.2, 0.2),
])

The Model Architecture

I used a simple but effective Convolutional Neural Network (CNN) built with Keras’ Sequential API. The architecture included several convolutional layers with ReLU activation, max-pooling, dropout to reduce overfitting and a final softmax layer to classify into three categories.
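The exact layer configuration is not listed here, so the following is only a sketch of what such a network could look like. The filter counts, kernel sizes, dense width and dropout rate are assumptions chosen for illustration, not the tuned values from this project:

```python
import tensorflow as tf
from tensorflow.keras import layers

IMAGE_SIZE = 256
N_CLASSES = 3  # healthy, early blight, late blight

# All layer sizes below are illustrative guesses.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(IMAGE_SIZE, IMAGE_SIZE, 3)),
    layers.Rescaling(1.0 / 255),  # scale pixel values to [0, 1]
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.3),  # regularization against overfitting
    layers.Dense(64, activation="relu"),
    layers.Dense(N_CLASSES, activation="softmax"),
])

model.compile(
    optimizer="adam",
    # image_dataset_from_directory yields integer labels by default,
    # which is what the sparse variant of the loss expects.
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```

Placing the Rescaling layer inside the model keeps normalization attached to the network, so the same scaling applies at inference time.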
One challenge I encountered here was tuning the number of filters and layers. At first, the model either underfit or overfit badly. It took a few iterations and the introduction of Dropout and EarlyStopping to stabilize training.
early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
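To make the effect of patience=5 concrete, here is a simplified pure-Python illustration of the stopping rule. It is a sketch of the idea only, not Keras's implementation, and the loss values are made up:

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), both 0-based.

    stop_epoch is the last epoch that runs; best_epoch is the epoch whose
    weights would be kept when the best weights are restored.
    """
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: improvement stalls after epoch 2.
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.60, 0.65, 0.64, 0.5]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=5)
print(stop_epoch, best_epoch)
```

In this made-up run the final value of 0.5 is never reached, because five epochs pass without beating the best loss first. That is the trade-off a patience setting makes.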

Training & Evaluation

I trained the model for up to 50 epochs, with early stopping halting the run once validation performance plateaued. I monitored accuracy and loss on both training and validation sets.
The final evaluation on the test set yielded impressive results, with accuracy comfortably above 90%. Visualizing the confusion matrix confirmed that most misclassifications were between early and late blight, which is expected due to their visual similarity.

Challenges I Faced

Like any worthwhile journey, this project had its fair share of obstacles:

1. Data Imbalance

Initially, I noticed that the healthy class had more images than others. This imbalance skewed model predictions. I overcame this by using data augmentation more aggressively on underrepresented classes and ensuring balanced batch generation.
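One way to decide how aggressively to augment each class is to compute how many times its images would need to be repeated to roughly match the largest class. This is a sketch under assumptions: the counts are made up and the helper oversampling_factors is hypothetical.

```python
import math


def oversampling_factors(class_counts):
    """Map each class to the number of (augmented) passes over its images
    needed to roughly match the largest class."""
    largest = max(class_counts.values())
    return {name: math.ceil(largest / count)
            for name, count in class_counts.items()}


# Made-up counts, for illustration only.
counts = {"healthy": 900, "early_blight": 300, "late_blight": 450}
factors = oversampling_factors(counts)
print(factors)
```

Each repeated pass would go through the augmentation layers, so the extra samples are varied rather than exact copies.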

2. Memory Limitations

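Working with 256x256 images on a standard machine sometimes caused memory issues during training. I eased this by caching and prefetching the datasets with AUTOTUNE, which kept the input pipeline efficient without requiring a GPU upgrade. Note that cache() without a filename keeps the dataset in memory, so this works only when the images fit in RAM.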
train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=tf.data.AUTOTUNE)

3. Overfitting

Despite a good dataset, the model started overfitting after a few epochs. Dropout layers and data augmentation helped, but the real breakthrough came when I implemented early stopping, which gracefully halted training at the optimal point.

4. Evaluation Bias

My initial evaluation method didn’t give the full picture. Adding visualizations like confusion matrices and sample predictions helped me interpret where the model struggled, especially between early and late blight.
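A confusion matrix is simple enough to build by hand, which makes it easier to see what it reports. The labels below are made up for illustration, and the class order is an assumption:

```python
CLASSES = ["early_blight", "late_blight", "healthy"]  # assumed class order


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


# Made-up labels: two blight images are confused with each other.
y_true = [0, 0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 1, 0, 1, 0, 1, 2, 2]
cm = confusion_matrix(y_true, y_pred, len(CLASSES))
for name, row in zip(CLASSES, cm):
    print(f"{name:>12}: {row}")
```

Off-diagonal entries show which pairs of classes get confused, which a single accuracy number hides.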

Takeaways

This project taught me a lot about building robust computer vision pipelines:

• Domain-specific preprocessing (like proper augmentation) is a game-changer.
• Model evaluation is more than just accuracy; it’s about understanding behavior.
• Simplicity wins: a moderately deep CNN, well-regularized and properly tuned, can outperform overly complex architectures.

What's Next?

I’m now working on deploying this model, enabling farmers to take a picture of a leaf and instantly receive a diagnosis. I’m also exploring transfer learning with models like MobileNetV2 for faster inference on edge devices.

Scraping Kenya’s Rental Market: A Real-World Web Scraping Project with Python

Dancan Ngare — Tue, 27 May 2025 19:30:32 +0000

When I first started diving into real-world data extraction, I wanted a project that would not only sharpen my web scraping skills but also deliver something useful, something with practical value to people looking for insights in a specific market. That’s when I turned my attention to Kenya’s real estate market.
Kenya’s housing sector, especially rental properties, is booming. With platforms like Property24 Kenya listing hundreds of new properties every week, there’s a wealth of data waiting to be tapped into. This article is a walk-through of how I built a Python-based web scraper that collects detailed listings of rental houses in Kenya and stores them in a structured CSV file for further analysis.

Why Scrape Property Listings?

Real estate listings provide rich, structured data that’s perfect for analysis. Whether you’re comparing pricing trends, examining location-based inventory or monitoring how property descriptions evolve over time, having access to a clean dataset makes all the difference.
Instead of manually copying property details from individual web pages (which would take forever), I wrote a script that does all the heavy lifting automatically.

Tools & Libraries Used

For this project, I stuck to Python and some of its most reliable libraries for web scraping and data handling:
• requests: To fetch web page contents.
• BeautifulSoup: For parsing the HTML.
• pandas: To structure and clean the data and finally save it as a CSV.

What the Script Does

Loops Through Pages

It starts from the first page of rental listings and steps through the pagination by incrementing the current_page variable. On each page it checks whether any listings were found; if not, the loop breaks and scraping stops.
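The loop described above can be sketched as follows. To keep the example runnable without network access, fetching is passed in as a function. The names scrape_all_pages, fetch_listings and max_pages are illustrative assumptions; only current_page comes from the script.

```python
def scrape_all_pages(fetch_listings, start_page=1, max_pages=500):
    """Collect listings page by page until a page comes back empty.

    fetch_listings(page) is a caller-supplied function that returns a list
    of listing dicts for the given page number.
    """
    results = []
    current_page = start_page
    while current_page < start_page + max_pages:  # hard cap as a safety net
        page_listings = fetch_listings(current_page)
        if not page_listings:
            break  # empty page: we have run past the last page of results
        results.extend(page_listings)
        current_page += 1
    return results


# Stand-in for real pages: two pages with data, then nothing.
fake_pages = {1: [{"Title": "A"}], 2: [{"Title": "B"}, {"Title": "C"}]}
listings = scrape_all_pages(lambda page: fake_pages.get(page, []))
print(len(listings))
```

The max_pages cap is an extra guard so that a site which never returns an empty page cannot keep the loop running forever.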

Parses and Extracts Key Information

Each rental listing is parsed to extract:
• Title
• Location
• Bedrooms
• Bathrooms
• House Size
• Description
• Price
All this information is stored as a dictionary and added to a list.
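A minimal sketch of this extraction step with BeautifulSoup, including a guard for fields that are missing from a listing. The HTML and the CSS class names are hypothetical and are not taken from the real site's markup:

```python
from bs4 import BeautifulSoup


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if it is absent."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node else None


# Hypothetical markup and class names, for illustration only.
html = """
<div class="listing">
  <span class="title">Example listing</span>
  <span class="location">Example area</span>
  <span class="price">Example price</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
rentals = []
for card in soup.select("div.listing"):
    rentals.append({
        "Title": text_or_none(card, ".title"),
        "Location": text_or_none(card, ".location"),
        "Bedrooms": text_or_none(card, ".bedrooms"),  # absent here, so None
        "Price": text_or_none(card, ".price"),
    })
print(rentals)
```

Routing every field through one helper means a missing element yields None instead of raising an AttributeError halfway through a page.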

Cleans and Saves the Data

Finally, the list of dictionaries is converted into a pandas DataFrame. Empty rows (where all values are None) are dropped, and the dataset is saved to a CSV file named kenyarentalsfile1.csv.
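The cleaning step can be sketched with pandas as below. The rows are placeholders standing in for scraped listings, and the CSV is written to an in-memory buffer so the example leaves no file behind:

```python
import io

import pandas as pd

# Placeholder rows standing in for scraped listings.
rentals = [
    {"Title": "Example listing", "Location": "Example area", "Bedrooms": "2",
     "Bathrooms": "1", "House Size": None, "Description": "Example text",
     "Price": "Example price"},
    {"Title": None, "Location": None, "Bedrooms": None,
     "Bathrooms": None, "House Size": None, "Description": None,
     "Price": None},
]

df = pd.DataFrame(rentals)
df = df.dropna(how="all")  # drop only rows where every field is missing

buffer = io.StringIO()  # pass "kenyarentalsfile1.csv" instead to write to disk
df.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Using how="all" keeps partially filled listings, such as the first row with no house size, and discards only rows that carry no information at all.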

Real-Life Use Cases

So, why is this kind of project useful?
• Market Analysis: Realtors and housing investors can use the dataset to understand trends.
• Academic Research: Urban planners and students can analyze spatial housing availability.
• Price Forecasting Models: Data scientists can train regression models on historical rental price trends.
• Business Intelligence Dashboards: You can build a Power BI or Tableau dashboard from the CSV file.

Challenges Faced

No project is without hurdles. Here are a few I encountered:

• Inconsistent HTML Structure: Some listings didn’t include all the expected fields like bedrooms or bathrooms. I solved this by wrapping each field in a conditional check.
• Pagination Handling: Initially, the script kept running endlessly because I didn’t break the loop after reaching an empty page. Adding a check for an empty all_rentals list fixed this.
• Keyboard Interrupts / Timeouts: Sometimes the requests.get() call takes too long or hangs. This can be mitigated using timeouts and try-except blocks (a possible future enhancement).
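The timeout and try-except idea from the last point could look roughly like this. It is a sketch of a possible enhancement, not code from the current script. The function name fetch_html, the retry count and the backoff are assumptions:

```python
import time

import requests


def fetch_html(url, retries=3, timeout=10, backoff=1.0, get=requests.get):
    """Return the page HTML, or None if every attempt fails.

    The get parameter exists only so the function can be exercised
    without network access; by default it is requests.get.
    """
    for attempt in range(1, retries + 1):
        try:
            response = get(url, timeout=timeout)
            response.raise_for_status()  # treat HTTP 4xx/5xx as failures
            return response.text
        except requests.exceptions.RequestException as exc:
            print(f"Attempt {attempt}/{retries} failed: {exc}")
            if attempt < retries:
                time.sleep(backoff * attempt)  # wait a little longer each time
    return None
```

Returning None rather than raising lets the pagination loop decide whether to skip the page or stop altogether.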

Future Enhancements

• Add timeout & retry logic for better reliability
• Integrate logging to replace print statements
• Export to database for scalability
• Add filters like city-specific scraping or price ranges
• Build a dashboard using the cleaned CSV file

Final Thoughts

Building this scraper was a rewarding experience. It not only gave me real-world exposure to web scraping techniques but also taught me how to handle edge cases, optimize data collection, and store clean, structured data for analysis.
If you’re learning Python, this is a great beginner-to-intermediate level project to work on. It touches on many important concepts like iteration, HTML parsing, data cleaning, and file I/O. Plus, it’s highly practical. You will end up with a dataset that can be analyzed and visualized.

Check Out the Project

GitHub Repository
Feel free to improve the code. Contributions are welcome!

Excel: The Steppingstone to a Career in Data Science

Dancan Ngare — Fri, 02 May 2025 05:13:36 +0000

When starting a journey into the world of data science and analysis, it can feel like there are countless tools and technologies to master. The sheer number of options, from Python and R to SQL and machine learning algorithms, can be overwhelming. However, there’s one tool that remains an essential first step in the process: Excel.
Excel is often the first software people encounter when diving into data science, and it serves as an excellent foundation for understanding core concepts and building the skills needed for more advanced data analysis. This article explores why Excel is the ultimate steppingstone to a career in data science and how mastering it can give you a solid edge in your learning journey.

Excel’s Core Features
Before jumping into complex programming languages or machine learning models, Excel introduces you to the basic building blocks of data science. It teaches you how to collect data, arrange it in rows and columns, and then store, manipulate and analyze it, all essential skills in the data science world. Some key Excel features are particularly valuable for beginners.

Logical Functions
One of the most powerful tools in Excel is its logical functions, such as IF, AND and OR. These functions allow you to make decisions based on specific conditions, which is a core skill in data analysis.
For example, you might use the IF function to classify data based on set criteria. If you’re working with sales data, an IF statement can help you categorize sales as either ‘high’ or ‘low’ based on specific thresholds. This simple logical framework lays the groundwork for more complex data analysis and decision-making models.

Date Functions
Date analysis is another fundamental concept in data science. Excel offers a suite of date functions, like DATE, TODAY and MONTH, that are essential when working with time-based data. Learning how to calculate the difference between two dates or extract specific elements of a date (such as the month or year) is crucial for analyzing trends over time.
For instance, if you are working with sales data, using Excel’s date functions allows you to calculate how long it took to complete a sales cycle, analyze seasonality trends, or track monthly performance.

Text Functions
Data analysis often involves cleaning and formatting data and Excel’s text functions are invaluable for this process. Functions like LEFT, RIGHT and CONCATENATE help you manipulate text data — a common task when preparing data for analysis.
A typical scenario might involve splitting a column with full names into separate first and last names or cleaning up inconsistent entries in a customer database. Mastering these text functions gives you the confidence to tackle messy datasets and prepare them for analysis.

How Excel Transforms Raw Data into Meaningful Insights
Once you’ve learned how to organize and format data, the next step is analysis. Excel makes this transition easy, offering powerful tools for turning raw data into actionable insights.

Data Organization & Analysis
Excel’s table and worksheet structure teaches you how to organize data in rows and columns. By learning how to properly structure your data, you can make it easier to analyze and draw meaningful conclusions.
One of Excel’s most important features for data analysis is the pivot table. Pivot tables allow you to quickly summarize large datasets, filter trends and generate dynamic reports without needing advanced programming skills. Whether you're tracking sales performance, analyzing customer demographics, or reviewing survey results, pivot tables can help you slice and dice the data to find patterns and insights.

Real-World Application
While Excel may seem like a basic tool, it is more than capable of handling complex tasks that are critical to data analysis. For example, you can build dashboards to visualize key performance indicators (KPIs), create charts to visualize trends and automate repetitive tasks with Excel’s built-in macros.
Let’s say you are analyzing customer feedback data. You can use Excel to clean up the text, analyze sentiment and even create a dashboard that displays the results. Excel gives you the flexibility to handle both small and large datasets, making it a versatile tool for a wide range of data science tasks.

Debunking the Myth: Is Excel Outdated for Data Science?
There’s a common misconception that Excel is outdated and incapable of supporting more advanced data science work. Many believe that tools like Python and R are essential for any serious data analysis.
The reality is that Excel is still a powerful tool for data science, especially for beginners. It provides a simple, user-friendly interface that allows you to quickly grasp fundamental data analysis concepts. While more advanced tools like Python and R are often used for larger datasets and machine learning tasks, Excel remains a vital tool in the data scientist's toolkit — particularly for tasks that require quick data manipulation, visualization and reporting.
Instead of viewing Excel as outdated, consider it a steppingstone that prepares you for more complex tools. By mastering Excel, you’ll develop a strong foundation that will make learning other tools like Power BI much easier.

Why Excel Is the Perfect Steppingstone to Data Science
Excel is the perfect entry point for beginners because it allows you to learn the foundational skills needed in data science without the steep learning curve of programming languages. Once you’re comfortable with Excel, you’ll have a solid grasp of key concepts like data cleaning, analysis, and visualization. This prepares you to move on to more advanced tools like Python, R or Power BI with ease.
Excel also provides an accessible environment for learning. Unlike Python or R, which require coding knowledge, Excel’s drag-and-drop interface and built-in functions make it a great tool for experimenting and learning by doing. It’s the ideal way to build confidence before stepping into the more complex world of data science.

Start Practicing Excel Today
Excel may not be the most advanced tool in the data science world, but it’s the perfect starting point for anyone looking to break into the field. Whether you’re cleaning up data, analyzing trends or building reports, mastering Excel will give you the skills and confidence to take the next steps in your data science journey.
So, if you’re just starting out, don’t wait: dive into Excel today. The more you practice, the more you’ll discover how powerful and versatile this tool is in the world of data science. With Excel as your foundation, you’ll be well on your way to mastering the next steps in your data science career.

How Data Science and Analytics Are Transforming Industries Today

Dancan Ngare — Fri, 25 Apr 2025 05:10:03 +0000

How do some companies seem to crack the code for success? It’s not happening by chance; the power of data science is changing the game. Whether in health, hospitality, aviation, technology, or finance, data science is changing how well we do things. It does so by making sense of the data being collected: analyzing it and presenting insights that support informed decisions.
Data science is the art of deriving meaningful insights from complex data. It combines statistics, mathematics, and computer science to analyze and interpret vast datasets. By employing advanced algorithms and techniques, data science transforms raw information into actionable knowledge. This interdisciplinary field plays an important role in predicting trends, identifying patterns, and facilitating informed decision-making across diverse industries.

The Building Blocks of Data Science and Analytics

Data science combines statistical methods, machine learning, computer science, and domain expertise to extract meaningful insights from structured and unstructured data. Analytics, a subset of data science, focuses on interpreting data to inform decision-making. Let’s explore how this plays out in a few industries.

Data Science and Analytics in Healthcare

The healthcare sector generates large amounts of clinical data, such as patients’ prescriptions, clinical reports, medicine purchase records, medical insurance data, and laboratory results, which creates a great opportunity to collect and analyze it. This data is pooled together and analyzed using machine-learning algorithms. Analysts examine it and identify patterns to derive insights that support better decision-making and, ultimately, better patient care. This approach helps in understanding trends, improving medical outcomes and life expectancy, detecting diseases at an early stage, and delivering the required treatment at an affordable cost.

Data Science in Finance

Data science is emerging as a transformative force in the finance industry, changing how professionals approach data analysis, risk management, and decision-making. Finance deals with vast amounts of data every day, and data science offers powerful tools and methodologies to extract valuable insights and drive informed strategies.

Risk management is one of the primary applications of data science in finance. Financial institutions leverage advanced analytics and machine learning algorithms to assess and mitigate risks. These tools analyze historical data, market trends, and economic indicators to predict potential risks, enabling proactive decisions.

Data science also contributes to regulatory compliance within the finance industry. Financial institutions must adhere to stringent regulations and reporting requirements. Data science tools enable automated compliance checks that help institutions operate within the regulatory framework and avoid potential penalties.

Data Science in Manufacturing

The manufacturing industry is undergoing a significant transformation with the integration of data science, which is changing how efficiency and productivity are achieved.
Manufacturing has always been a data-rich industry, but traditional methods of data analysis often fall short of providing actionable insights. Data science, with its advanced analytical tools and algorithms, presents new opportunities to use that data to optimize production processes, reduce downtime, and enhance product quality.
Data is collected from various sources, including sensors, production logs, and quality control records. The methodology involves data preprocessing, feature selection, model training, and validation. Techniques such as regression analysis, classification, and clustering are applied to identify patterns, predict outcomes, and uncover hidden relationships within the data.
Applied this way, data science leads to improvements in several key areas, including production efficiency, downtime, and product quality.

Data Science in Agriculture

Agriculture plays an important role in developing economies, so the use of digital farming in rural areas benefits the whole sector. The integration of data science in agriculture is ushering in a new era of farming, where decisions are no longer based solely on tradition and intuition but are increasingly informed by precise, data-driven insights. This shift towards agricultural data analytics touches every aspect of farm management, from soil preparation to harvest planning.

Conclusion

Data science and analytics are transforming industries by enabling data-driven decision-making, optimizing processes, and fostering innovation. From personalized healthcare to manufacturing, their applications are vast and varied. However, realizing their full potential requires addressing challenges like data quality, ethical concerns, and talent shortages. As technology advances and industries adapt, data science remains a catalyst for progress, shaping a future where insights drive efficiency, sustainability, and growth. By embracing this shift responsibly, organizations can unlock opportunities that redefine their industries and benefit society at large.

How Microsoft Excel Powers Data Science and Analytics: Benefits, Tools, and Real-World Use Cases.

Dancan Ngare — Sat, 19 Apr 2025 07:47:31 +0000

Microsoft Excel has long been a cornerstone tool in the world of data science and data analytics, despite the rise of more advanced programming languages and platforms like Python, R, and specialized software such as Tableau or Power BI. Its accessibility, versatility, and robust functionality make it an indispensable asset for professionals and beginners alike. From small-scale data manipulation to complex analytical tasks, Excel remains a go-to solution for data scientists and analysts across industries. This article explores the critical role Excel plays in data science and data analytics, highlighting its strengths, applications, and enduring relevance in a rapidly evolving field.

Accessibility and User-Friendliness

One of Excel’s greatest strengths is its accessibility. With a familiar interface and widespread availability as part of the Microsoft Office suite, Excel is often the first tool that aspiring data professionals encounter. Unlike programming languages that require extensive learning curves, Excel allows users to dive into data manipulation with minimal setup. Its intuitive grid-based structure, combined with a point-and-click interface, lowers the barrier to entry for beginners in data science and analytics.

For small businesses or individuals without access to expensive software, Excel provides a cost-effective solution for data analysis. Its presence on most workplace computers means that employees can immediately begin working with data without the need for specialized training or infrastructure. This democratization of data analysis has made Excel a staple in industries ranging from finance to marketing, where quick insights are often needed without the complexity of advanced tools.

Data Organization and Cleaning

Data science and analytics workflows often begin with data preparation, a phase that is often said to consume up to 80% of a professional’s time. Excel excels in this area, offering a suite of tools for organizing, cleaning, and transforming raw data. Features like sorting, filtering, and removing duplicates allow users to quickly structure datasets for analysis. Text functions such as TRIM, CONCATENATE, and LEFT/RIGHT enable precise manipulation of string data, while conditional formatting highlights anomalies or trends for further investigation.

Data Visualization

Effective data communication is a cornerstone of data science and analytics, and Excel’s visualization capabilities play a significant role. Its charting tools allow users to create a wide range of visuals, from pie charts and bar and line graphs to scatter plots and histograms, with just a few clicks. PivotCharts, paired with PivotTables, enable dynamic exploration of data, allowing analysts to uncover patterns and present findings in a visually appealing manner.

While Excel’s visualizations may not match the sophistication of tools like Tableau, they are sufficient for many business applications. For example, a marketing analyst can use Excel to create a dashboard tracking campaign performance metrics, such as click-through rates and conversions, without needing advanced software. The ability to customize charts with colors, labels, and trendlines further enhances Excel’s utility in conveying insights to stakeholders.

PivotTables for Dynamic Analysis

PivotTables are among Excel’s most powerful features for data analytics. They allow users to summarize, group, and filter large datasets with ease, providing a dynamic way to explore data without altering the original dataset. For example, a sales analyst can use a PivotTable to aggregate revenue by region, product, or time period, identifying top-performing categories in seconds.

The flexibility of PivotTables makes them ideal for ad-hoc analysis, where stakeholders require quick answers to specific questions. By combining PivotTables with slicers and timelines, analysts can create interactive reports that enable non-technical users to explore data independently. This functionality bridges the gap between raw data and actionable insights, reinforcing Excel’s importance in data-driven decision-making.

Automation with VBA and Macros

For repetitive tasks, Excel’s Visual Basic for Applications (VBA) and macro capabilities offer powerful automation. Data scientists and analysts can write scripts to streamline workflows, such as batch-processing multiple files or generating recurring reports. For example, a financial analyst might use VBA to automate the consolidation of monthly budget data from multiple departments, saving hours of manual work.
While VBA requires some programming knowledge, its integration within Excel makes it more accessible than standalone scripting languages. Macros, which record user actions for playback, further simplify automation for non-coders. These features enhance Excel’s efficiency, allowing professionals to focus on analysis rather than repetitive tasks.

Integration with Other Tools

Excel’s compatibility with other data science and analytics tools ensures its relevance in modern workflows. It seamlessly integrates with databases through ODBC connections, allowing analysts to import data from SQL servers or cloud platforms. Excel also supports Power Query, a powerful ETL (Extract, Transform, Load) tool that automates data extraction and transformation from various sources, including APIs and web pages.
For advanced users, Excel serves as a complementary tool alongside Python or R. Data scientists often use Excel to prototype analyses or share results with stakeholders before scaling to more robust platforms. Its ability to export data in widely supported formats like CSV facilitates interoperability, ensuring Excel remains a key component of the data ecosystem.

Limitations and Complementary Role

Despite its strengths, Excel has limitations. It struggles with large datasets: a worksheet holds at most 1,048,576 rows, and performance often degrades well before that limit. It also lacks the computational power of specialized tools for complex machine learning or big data analytics. Additionally, its manual processes can introduce errors, particularly in collaborative environments where version control is critical.
However, these limitations do not diminish Excel’s value; rather, they highlight its role as a complementary tool. Data scientists often use Excel for initial exploration and cleaning before transitioning to Python for advanced modeling. Similarly, analysts rely on Excel for quick insights when time or resources are limited, reserving tools like Tableau for enterprise-level reporting.

Enduring Relevance

Excel’s enduring relevance in data science and analytics stems from its versatility, accessibility, and ability to evolve. Microsoft continues to enhance Excel with features like dynamic arrays, improved Power Query, and integration with cloud-based platforms like Microsoft 365. These updates ensure Excel remains competitive in a landscape increasingly dominated by specialized tools.
For students, small businesses, and professionals in non-technical roles, Excel is often the first step into data science and analytics. Its widespread use in industries like finance, retail, and healthcare underscores its practical value. Even as new tools emerge, Excel’s simplicity and familiarity guarantee its place in the data professional’s toolkit.

Conclusion

Microsoft Excel is far more than a spreadsheet application; it is a foundational tool in data science and data analytics. Its capabilities in data organization, visualization, statistical analysis, and automation make it indispensable for professionals at all levels. While it may not replace advanced platforms like Python or Tableau, Excel’s accessibility, versatility, and integration capabilities ensure its continued importance. As the field of data science evolves, Excel remains a trusted ally, empowering users to transform raw data into meaningful insights with ease and efficiency.