<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nessy Mputhia</title>
    <description>The latest articles on DEV Community by Nessy Mputhia (@nessynelian).</description>
    <link>https://dev.to/nessynelian</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1169869%2F25509259-21c0-4d0d-adea-087bd3cd0073.jpg</url>
      <title>DEV Community: Nessy Mputhia</title>
      <link>https://dev.to/nessynelian</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nessynelian"/>
    <language>en</language>
    <item>
      <title>Data Engineering for Beginners: A step by step guide.</title>
      <dc:creator>Nessy Mputhia</dc:creator>
      <pubDate>Fri, 10 Nov 2023 12:10:02 +0000</pubDate>
      <link>https://dev.to/nessynelian/data-engineering-for-beginners-a-step-by-step-guide-17p6</link>
      <guid>https://dev.to/nessynelian/data-engineering-for-beginners-a-step-by-step-guide-17p6</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5XUpnDx6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mrw7ckkosef8lfmui8ze.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5XUpnDx6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mrw7ckkosef8lfmui8ze.jpeg" alt="Data Engineering" width="800" height="534"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Introduction.&lt;/strong&gt;&lt;br&gt;
Data engineering is the practice of designing and building systems that collect and analyze data from multiple sources and in different formats.&lt;br&gt;
With the large amount of data being stored every day, data engineering is a rapidly growing field. If you are interested in building pipelines that move data between systems, you should consider a path in data engineering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Roles and responsibilities of a data engineer.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identifying ways to improve data quality, reliability and efficiency.&lt;/li&gt;
&lt;li&gt;Preparing data for machine learning operations.&lt;/li&gt;
&lt;li&gt;Ensuring data is kept in the correct format.&lt;/li&gt;
&lt;li&gt;Building data pipelines and data warehousing.&lt;/li&gt;
&lt;li&gt;Deploying machine learning models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What you should know.&lt;/strong&gt;&lt;br&gt;
There is a lot to learn on the way to becoming a data engineer. This article gives you a step-by-step guide to help you get there.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;1. Learn the basics of programming.&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
The best way to kickstart your journey as a data engineer is to learn a programming language, because programming trains you to solve problems in a structured manner. Python is a good language to start with. Begin with the basics of the language and work on simple projects to build your skills.&lt;br&gt;
After learning the basics of Python, focus on pandas, a Python library for data manipulation. With pandas you will be able to load data, handle missing values, manipulate columns and perform many other operations on data.&lt;/p&gt;
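
&lt;p&gt;As a minimal sketch of the pandas tasks mentioned above (loading data, handling missing values and manipulating columns), consider the following. The sample data, the column names and the helper load_and_clean are hypothetical and exist only for illustration; it assumes pandas is installed.&lt;/p&gt;

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a real CSV file.
RAW_CSV = """product,price,quantity
pen,1.5,10
book,,3
bag,20.0,
"""


def load_and_clean(csv_text):
    """Load CSV text, fill missing values and add a derived column."""
    df = pd.read_csv(io.StringIO(csv_text))
    # Fill a missing price with the mean of the known prices.
    df["price"] = df["price"].fillna(df["price"].mean())
    # Treat a missing quantity as zero units (an assumption for this example).
    df["quantity"] = df["quantity"].fillna(0).astype(int)
    # Column manipulation: derive a new column from two existing ones.
    df["total"] = df["price"] * df["quantity"]
    return df


cleaned = load_and_clean(RAW_CSV)
print(cleaned)
```

&lt;p&gt;Whether to fill a missing value with a mean, a zero or to drop the row depends on what the column means, so treat the choices here as placeholders.&lt;/p&gt;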

&lt;p&gt;&lt;em&gt;&lt;strong&gt;2. Learn SQL&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Since you will be dealing with databases as a data engineer, learning database management systems is essential. Programming experience in Python makes it easier to pick up SQL and NoSQL databases. Learn the basics, then advance to writing complex queries.&lt;/p&gt;
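
&lt;p&gt;To make the step from basic to more complex queries concrete, here is a small sketch that runs SQL from Python using the sqlite3 module from the standard library. The orders table and its rows are made up for illustration.&lt;/p&gt;

```python
import sqlite3

# An in-memory database with a hypothetical "orders" table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("east", 120.0), ("west", 80.0), ("east", 30.0), ("north", 50.0)],
)

# A basic query: filter rows with WHERE.
east_orders = conn.execute(
    "SELECT id, amount FROM orders WHERE region = ? ORDER BY id", ("east",)
).fetchall()

# A more advanced query: aggregate with GROUP BY and sort the result.
totals = conn.execute(
    "SELECT region, SUM(amount) AS total FROM orders "
    "GROUP BY region ORDER BY total DESC"
).fetchall()

print(east_orders)
print(totals)
conn.close()
```

&lt;p&gt;The same SELECT, WHERE and GROUP BY building blocks carry over to other relational databases, although each system has its own dialect details.&lt;/p&gt;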

&lt;p&gt;&lt;em&gt;&lt;strong&gt;3. Data Integration and ETL pipelines.&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Understand how ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) processes work. ETL extracts data from a source, transforms it into the required format and loads it into a target location. ELT loads the raw data first and transforms it inside the target system. These processes are used throughout data engineering projects.&lt;/p&gt;
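
&lt;p&gt;The three ETL steps can be sketched with nothing but the Python standard library. The source data, the users table and the function names extract, transform and load are assumptions made for this example, not part of any particular tool.&lt;/p&gt;

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this could be a file or an API.
SOURCE_CSV = "name,signup_date\n alice ,2023-01-05\nBOB,2023-02-10\n"


def extract(csv_text):
    """Extract: read rows from the source into dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: normalize names and keep only the signup year."""
    return [
        (row["name"].strip().title(), int(row["signup_date"][:4]))
        for row in rows
    ]


def load(records, conn):
    """Load: write the transformed records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users (name TEXT, signup_year INTEGER)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute(
    "SELECT name, signup_year FROM users ORDER BY name"
).fetchall()
print(loaded)
```

&lt;p&gt;In an ELT variant, the raw rows would be loaded into the database first and the cleanup would then run inside the database, typically as SQL.&lt;/p&gt;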

&lt;p&gt;&lt;em&gt;&lt;strong&gt;4. Big data tools&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
As a data engineer you are required to know how to work with big data. Big data can be batch data or streaming data. Batch data is accumulated over time and processed in bulk, while streaming data is processed continuously as it arrives. To work with such data you need specialized tools like Apache Spark. Learn Apache Spark and understand how it can be used to run ETL processes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;5. Cloud Computing.&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
It has become easier to work with data now that much of it is stored in the cloud. Cloud computing technologies have made it easier to manage complex processes. Learning a cloud platform such as AWS enables you to work with data in the cloud without much struggle.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;6. Data analysis and machine learning.&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Data analysis is the process of examining data to gain meaningful insights from it. Several data analytics tools can be used for this, Python being one of them. You can also learn data visualization techniques using Python, or explore other visualization tools such as Power BI and Tableau, which help you create reports and build dashboards.&lt;br&gt;
Learn machine learning in Python using libraries such as scikit-learn and TensorFlow.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;7. Familiarize yourself with git and GitHub.&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Understand the importance of version control and collaboration in programming, and learn how to use command line interfaces in data engineering.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;8. Projects&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Work on data engineering projects to improve your skills and learn new ones. You can also contribute to open source projects on GitHub and collaborate with others. The more projects you work on, the more you learn and grow your skills. Keep learning progressively.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>data</category>
    </item>
    <item>
      <title>Exploratory data analysis using data visualization techniques.</title>
      <dc:creator>Nessy Mputhia</dc:creator>
      <pubDate>Mon, 09 Oct 2023 07:10:50 +0000</pubDate>
      <link>https://dev.to/nessynelian/exploratory-data-analysis-using-data-visualization-techniques-1233</link>
      <guid>https://dev.to/nessynelian/exploratory-data-analysis-using-data-visualization-techniques-1233</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--I9-xPNwY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xujco9hictvpd872nmp2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--I9-xPNwY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xujco9hictvpd872nmp2.jpg" alt="Photo from Pinterest" width="736" height="412"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
Exploratory data analysis (EDA) is an approach used to extract meaningful information from data in order to identify patterns and gain a deeper understanding of it.&lt;br&gt;
It mainly uses statistical graphics and other data visualization methods. Various libraries are used for EDA, such as pandas, matplotlib and seaborn.&lt;/p&gt;

&lt;p&gt;In this article we are going to focus on EDA using data visualization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Visualization&lt;/strong&gt; &lt;br&gt;
Data visualization is a method of presenting data in a visual format to understand variables and the relationships between them.&lt;br&gt;
Python provides several libraries for data visualization, including matplotlib, seaborn and plotly.&lt;/p&gt;

&lt;p&gt;To select and design a visualization we consider the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The type of data available (Numerical or categorical).&lt;/li&gt;
&lt;li&gt;The number of variables and the questions you want to answer with your data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Univariate analysis.&lt;/strong&gt;&lt;br&gt;
Univariate analysis is the analysis of one variable at a time. In univariate analysis, numerical and categorical data are plotted differently as they require different types of plots.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Categorical data.&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Categorical data can assume only a limited number of values. Let us look at the types of plots that can be used to visualize categorical data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.&lt;/strong&gt; &lt;strong&gt;&lt;em&gt;Count plots.&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Count plots are used to quickly visualize the frequency of values in each category in the form of a bar graph. Each category is presented as a separate bar. A count plot can be seen as a visual representation of the pandas value_counts function.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.&lt;/strong&gt; &lt;strong&gt;&lt;em&gt;Pie charts.&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
They are similar to count plots but show the percentage of each category in the data. They are preferred when visualizing variables with a small number of categories.&lt;/p&gt;
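
&lt;p&gt;The numbers behind both plots are easy to compute by hand. The sketch below uses only the standard library and a made-up list of colors: the counts are what a count plot draws as bar heights, and the percentage shares are what a pie chart draws as slices.&lt;/p&gt;

```python
from collections import Counter

# Hypothetical categorical column.
colors = ["red", "blue", "red", "green", "blue", "red"]

# Frequency of each category, most common first (count plot bar heights).
counts = Counter(colors).most_common()

# Percentage share of each category (pie chart slices).
total = len(colors)
shares = {category: round(100 * n / total, 1) for category, n in counts}

# Print one text bar per category, like the bars of a count plot.
for category, n in counts:
    print(f"{category:6} {'#' * n} ({n}, {shares[category]}%)")
```

&lt;p&gt;With pandas, calling value_counts on a column returns the same frequencies.&lt;/p&gt;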

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Numerical data&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
This is data that can be quantified. Numerical data can be continuous or discrete.&lt;br&gt;
Continuous data can take any value within a range, while discrete data takes only distinct, countable values.&lt;br&gt;
The analysis of numerical data is vital because it helps in further processing of the data.&lt;br&gt;
Numerical data can be presented visually by the use of:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.&lt;/strong&gt; &lt;strong&gt;&lt;em&gt;Histogram&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
A histogram shows the distribution of a numerical column by grouping the values into bins, each covering a range of values, and plotting how many values fall into each bin. It helps visualize how values are distributed.&lt;/p&gt;
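
&lt;p&gt;The binning behind a histogram can be sketched in a few lines. This version assumes equal-width bins and that the values are not all identical; the histogram function and the ages are hypothetical.&lt;/p&gt;

```python
def histogram(values, bins):
    """Count how many values fall into each of `bins` equal-width bins."""
    low, high = min(values), max(values)
    # Assumes high differs from low, otherwise the width would be zero.
    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        # The maximum value belongs to the last bin, hence the min().
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


# Hypothetical ages.
ages = [21, 22, 25, 31, 35, 38, 39, 40, 58, 61]
bin_counts = histogram(ages, 4)
print(bin_counts)
```

&lt;p&gt;Plotting libraries do this binning for you; the number of bins is the main choice that changes how the distribution looks.&lt;/p&gt;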

&lt;p&gt;&lt;strong&gt;2.&lt;/strong&gt; &lt;strong&gt;&lt;em&gt;Dist plot.&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Dist plots are similar to histograms, with a slight improvement: they also give us a kernel density estimate (KDE), a smooth curve that approximates the distribution.&lt;/p&gt;
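
&lt;p&gt;A kernel density estimate places a small bell curve on every data point and adds them up. The following is a minimal sketch with a Gaussian kernel; the gaussian_kde helper, the sample values and the hand-picked bandwidth are assumptions for illustration, and plotting libraries usually choose the bandwidth automatically.&lt;/p&gt;

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at a point x."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Sum one Gaussian bump per sample, centered on that sample.
        bumps = (math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        return sum(bumps) / norm

    return density


# Hypothetical measurements; the bandwidth of 1.0 is picked by hand.
heights = [4.8, 5.0, 5.1, 5.3, 7.9, 8.0]
density = gaussian_kde(heights, bandwidth=1.0)
for x in (5.0, 6.5, 8.0):
    print(x, round(density(x), 3))
```

&lt;p&gt;A smaller bandwidth follows the data more closely, while a larger one gives a smoother curve.&lt;/p&gt;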

&lt;p&gt;&lt;strong&gt;3.&lt;/strong&gt; &lt;strong&gt;&lt;em&gt;Boxplot&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Boxplots summarize data in five numbers: the minimum, the 25th, 50th and 75th percentiles, and the maximum.&lt;br&gt;
To read the five-number summary, we need to define some terms.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Median – the middle value in a series after sorting, which is also the 50th percentile.&lt;/li&gt;
&lt;li&gt;Percentile – the value below which a given percentage of the data falls. For example, 25% of the values lie below the 25th percentile.&lt;/li&gt;
&lt;li&gt;Minimum and Maximum – the ends of the whiskers. They are commonly the lowest and highest values that lie within 1.5 times the interquartile range (IQR) of the 25th and 75th percentiles; values beyond them are drawn as outliers.&lt;/li&gt;
&lt;/ul&gt;
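
&lt;p&gt;The terms above can be computed with the statistics module from the standard library. This sketch assumes the common 1.5 times IQR rule for the whiskers; the boxplot_stats helper and the data are hypothetical, and quartile conventions differ slightly between libraries.&lt;/p&gt;

```python
import statistics


def boxplot_stats(values):
    """Compute the numbers a boxplot draws, using the 1.5 * IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Clamping a value into the fence range changes it only if it lies outside.
    inside = [v for v in values if min(max(v, lower_fence), upper_fence) == v]
    outliers = [v for v in values if v not in inside]
    return {
        "minimum": min(inside),
        "q1": q1,
        "median": median,
        "q3": q3,
        "maximum": max(inside),
        "outliers": outliers,
    }


# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
summary = boxplot_stats(data)
print(summary)
```

&lt;p&gt;Here the value 100 lies beyond the upper whisker limit, so it is reported as an outlier instead of stretching the whisker.&lt;/p&gt;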

&lt;p&gt;&lt;strong&gt;Bivariate Analysis.&lt;/strong&gt; &lt;br&gt;
Bivariate analysis is the analysis of two variables and is essential for understanding the relationship between them.&lt;br&gt;
Depending on the type of variables, different visualizations can be used for bivariate analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.&lt;/strong&gt; &lt;strong&gt;&lt;em&gt;Numerical and numerical.&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
When both variables are numerical we can use scatter plots. They are a great choice for presenting the relationship between two variables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.&lt;/strong&gt; &lt;strong&gt;&lt;em&gt;Numerical and Categorical.&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
For bivariate analysis of numerical and categorical variables we can use box plots, scatterplots, overlapping histograms (density plots), bar plots and dist plots.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.&lt;/strong&gt; &lt;strong&gt;&lt;em&gt;Categorical and categorical&lt;/em&gt;.&lt;/strong&gt;&lt;br&gt;
Stacked bar plots, cluster maps and heatmaps can be used to visualize two categorical variables.&lt;br&gt;
Heatmaps show how often each pair of categories occurs together, which indicates how the presence of one category relates to the presence of another in the dataset.&lt;br&gt;
Cluster maps add a dendrogram, which groups categories with similar behavior.&lt;/p&gt;
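
&lt;p&gt;A heatmap of two categorical variables is usually drawn from a table of pair counts. The sketch below builds such a table with the standard library; the survey rows and variable names are made up for illustration.&lt;/p&gt;

```python
from collections import Counter

# Hypothetical survey rows: (gender, preferred_device).
rows = [
    ("F", "phone"),
    ("F", "laptop"),
    ("M", "phone"),
    ("M", "phone"),
    ("F", "phone"),
]

# Count how often each pair of categories occurs together.
pair_counts = Counter(rows)
genders = sorted({gender for gender, _ in rows})
devices = sorted({device for _, device in rows})

# One row per gender, one column per device; missing pairs count as 0.
table = [[pair_counts[(g, d)] for d in devices] for g in genders]

print("      " + " ".join(f"{d:7}" for d in devices))
for gender, counts in zip(genders, table):
    print(f"{gender:5} " + " ".join(f"{n:7}" for n in counts))
```

&lt;p&gt;A heatmap colors each cell of this table by its count, so frequent combinations stand out at a glance.&lt;/p&gt;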

&lt;p&gt;&lt;strong&gt;Multivariate analysis.&lt;/strong&gt;&lt;br&gt;
Multivariate analysis is the analysis of multiple variables at a time. Bar plots, scatter plots and box plots can be used.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.&lt;/strong&gt; &lt;strong&gt;&lt;em&gt;Multivariate analysis using scatterplots.&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Scatterplots are effective for visualizing the relationship between two numerical variables. Adding visual cues such as colors, shapes or patterns to the data points lets you incorporate additional information. You can use different cues to represent a third or even fourth variable, making it easier to understand how multiple factors relate to each other. For example, different colors might represent different categories or groups within your data, allowing you to see how they interact with the two numerical variables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.&lt;/strong&gt; &lt;strong&gt;&lt;em&gt;Multivariate analysis using bar plots.&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Bar plots with the hue argument in their syntax are used to present the analysis of more than two variables.&lt;br&gt;
Bar plots are useful for comparing categories or groups in your data, and the "hue" argument allows you to introduce a third, categorical variable. It can help you see how a third factor affects the relationship between two other variables. For instance, if you have data on sales (a numerical variable) for different products (a categorical variable) in different regions (another categorical variable), you can create a bar plot with the "hue" argument to visualize how the sales of each product differ across regions. This adds depth to your analysis.&lt;/p&gt;
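
&lt;p&gt;The sales example can be sketched as the aggregation that a grouped bar plot displays: one bar per product and region. The records below are invented, and the code only prepares the numbers instead of drawing the plot.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical sales records: (product, region, amount).
sales = [
    ("pen", "east", 100),
    ("pen", "west", 60),
    ("book", "east", 40),
    ("book", "west", 90),
    ("pen", "east", 20),
]

# Total sales per (product, region) pair; region plays the role of "hue".
totals = defaultdict(int)
for product, region, amount in sales:
    totals[(product, region)] += amount

# Group the bars by product, with one bar per region inside each group.
for product in sorted({p for p, _, _ in sales}):
    for region in sorted({r for _, r, _ in sales}):
        print(f"{product:5} {region:5} {'#' * (totals[(product, region)] // 10)}")
```

&lt;p&gt;In a plotting library, product would typically go on the x axis, the amount on the y axis and region into the hue argument.&lt;/p&gt;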

</description>
      <category>datascience</category>
      <category>learning</category>
    </item>
    <item>
      <title>Data Science for Beginners: 2023 - 2024 Complete Roadmap</title>
      <dc:creator>Nessy Mputhia</dc:creator>
      <pubDate>Sun, 01 Oct 2023 20:35:51 +0000</pubDate>
      <link>https://dev.to/nessynelian/data-science-for-beginners-2023-2024-complete-roadmap-17fk</link>
      <guid>https://dev.to/nessynelian/data-science-for-beginners-2023-2024-complete-roadmap-17fk</guid>
      <description>&lt;p&gt;Data science involves the use of various techniques, processes and systems to extract valuable insights from data. To analyze complex datasets using data science computer science, statistics, mathematics, domain expertise and data engineering are combined.&lt;/p&gt;

&lt;p&gt;Getting started in data science involves:&lt;br&gt;
&lt;strong&gt;Mathematics fundamentals&lt;/strong&gt;&lt;br&gt;
Start by learning basic mathematics.&lt;br&gt;
This includes linear algebra, calculus and statistics, which are crucial for data analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Programming.&lt;/strong&gt;&lt;br&gt;
Learn the basics of a programming language for data science. This can be either Python or R. In this roadmap I'll focus on Python.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exploratory data analysis&lt;/strong&gt; &lt;br&gt;
After learning and understanding the basics, study libraries like pandas and NumPy, which are used for data analysis and manipulation.&lt;br&gt;
Also explore data analysis and visualization libraries like matplotlib and seaborn, which will help you communicate your findings effectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Machine Learning basics.&lt;/strong&gt;&lt;br&gt;
Learn the fundamentals of machine learning and common algorithms used for building predictive models and recommendations based on data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deep Learning.&lt;/strong&gt;&lt;br&gt;
Understand libraries such as TensorFlow and PyTorch if you're interested in neural networks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature Engineering.&lt;/strong&gt;&lt;br&gt;
Understand how to extract meaningful features from raw data to build a better performing model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model building and Evaluation&lt;/strong&gt;.&lt;br&gt;
Familiarize yourself with machine learning libraries such as scikit-learn in Python.&lt;br&gt;
Build machine learning models and evaluate their performance. Also, understand cross validation techniques to ensure your model generalizes well.&lt;/p&gt;
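
&lt;p&gt;The idea behind k-fold cross validation is to split the data into k parts and let each part serve once as the test set. The sketch below shows only the splitting step with plain Python; the k_fold_indices helper is hypothetical, and in practice the data is usually shuffled first and a library such as scikit-learn handles the splitting.&lt;/p&gt;

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k folds."""
    indices = list(range(n_samples))
    for fold in range(k):
        # Every k-th sample, starting at `fold`, is held out for testing.
        test = indices[fold::k]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


folds = list(k_fold_indices(6, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

&lt;p&gt;A model would be trained on each train split and scored on the matching test split, and the k scores averaged to estimate how well it generalizes.&lt;/p&gt;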

&lt;p&gt;&lt;strong&gt;Model Deployment&lt;/strong&gt;&lt;br&gt;
Learn how to deploy models to production for use in real world applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Projects&lt;/strong&gt;&lt;br&gt;
Work on projects in your areas of interest to practice the newly learned skills. This way you'll learn how to solve different kinds of real world problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Continuous learning&lt;/strong&gt;&lt;br&gt;
Keep up with the latest trends in data science by reading books, blogs and also taking online courses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network&lt;/strong&gt;&lt;br&gt;
Join data science communities, attend events and collaborate with others to learn and grow.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
