<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Samwel Wachira </title>
    <description>The latest articles on DEV Community by Samwel Wachira  (@sw1).</description>
    <link>https://dev.to/sw1</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F903392%2Faaf6185d-8b1d-4fd8-9011-847fc0633dce.png</url>
      <title>DEV Community: Samwel Wachira </title>
      <link>https://dev.to/sw1</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sw1"/>
    <language>en</language>
    <item>
      <title>ULTIMATE GUIDE TO DATA ANALYST</title>
      <dc:creator>Samwel Wachira </dc:creator>
      <pubDate>Sat, 31 Aug 2024 08:21:33 +0000</pubDate>
      <link>https://dev.to/sw1/ultimate-guide-to-data-analyst-110k</link>
      <guid>https://dev.to/sw1/ultimate-guide-to-data-analyst-110k</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Data is the new oil -Igbo proverb&lt;/em&gt; &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Are you fascinated by the power of data to drive decision-making and solve complex problems? You're not alone. In our increasingly data-driven world, the role of a data analyst has never been more crucial. &lt;/p&gt;

&lt;p&gt;From healthcare and finance to marketing and sports, data analysts are the unsung heroes behind the scenes, turning raw data into actionable insights.&lt;/p&gt;

&lt;p&gt;If you want to pursue a career as a data analyst, the following steps will guide you through what you need to learn.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Programming Languages&lt;/li&gt;
&lt;li&gt;Data Visualization Tools&lt;/li&gt;
&lt;li&gt;Machine Learning Basics&lt;/li&gt;
&lt;li&gt;Data Analyst Projects&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this article, we will look at each of these points in detail, giving you everything you need to get started on your journey to becoming a data analyst.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Programming Languages&lt;/strong&gt;&lt;br&gt;
Data analysts usually work with several programming languages, so there is no single right or wrong choice. Essentially, you will need to master SQL for querying and manipulating databases, and then choose between R and Python as your next programming language.&lt;/p&gt;
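&lt;p&gt;To get a feel for the kind of SQL a data analyst writes, here is a minimal sketch using Python's built-in sqlite3 module; the sales table and its values are invented purely for illustration.&lt;/p&gt;

```python
import sqlite3

# In-memory database with a small, invented sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 120.0), ("West", 80.0), ("East", 50.0)],
)

# A typical analyst query: total sales per region, largest first.
rows = conn.execute(
    "SELECT region, SUM(amount) AS total "
    "FROM sales GROUP BY region ORDER BY total DESC"
).fetchall()
print(rows)  # [('East', 170.0), ('West', 80.0)]
```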

&lt;p&gt;By learning R or Python, you will discover libraries such as Pandas and NumPy that help with a wide range of tasks and grow your programming skills. Using these libraries, you will learn how to import, clean, manipulate and visualize data in your preferred language.&lt;/p&gt;
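&lt;p&gt;As a hedged sketch of the import-and-clean step, here is the same idea using only Python's standard library (Pandas does this in one read_csv call); the CSV content and column names are made up.&lt;/p&gt;

```python
import csv
import io

# A tiny, invented CSV with one messy row; in practice you would
# read a file on disk (or use pandas.read_csv).
raw = io.StringIO("name,age\nAda,36\nGrace, \nEdsger,72\n")

rows = list(csv.DictReader(raw))
# Clean: drop rows with a missing age, convert age to int.
clean = [
    {"name": r["name"], "age": int(r["age"])}
    for r in rows
    if r["age"].strip()
]
print(clean)  # [{'name': 'Ada', 'age': 36}, {'name': 'Edsger', 'age': 72}]
```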

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvo79vf3tavhqy4u1d3il.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvo79vf3tavhqy4u1d3il.jpeg" alt="Image description" width="321" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.  Data Visualization tools&lt;/strong&gt;&lt;br&gt;
Consuming large sets of data isn't always straightforward. Sometimes, data sets are so large that it's downright impossible to discern anything useful from them. That's where data visualization comes in.&lt;/p&gt;

&lt;p&gt;Data visualization tools give designers an easier way to create visual representations of large data sets.&lt;/p&gt;

&lt;p&gt;Data visualizations can be used for a variety of purposes: dashboards, annual reports, sales and marketing materials, investor slide decks and anywhere else information needs to be interpreted immediately.&lt;/p&gt;

&lt;p&gt;The most commonly used visualization tools are Excel, Power BI and Tableau.&lt;/p&gt;

&lt;p&gt;Other visualization tools include Infogram, ChartBlocks, Datawrapper, D3.js, Google Charts and FusionCharts.&lt;/p&gt;
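&lt;p&gt;The idea behind all of these tools is the same: map values to visual marks. As a toy illustration (real work would use one of the tools above or a library such as Matplotlib), here is a text-only bar chart; the monthly figures are invented.&lt;/p&gt;

```python
# Invented monthly sales figures, scaled to rows of bars.
sales = {"Jan": 40, "Feb": 25, "Mar": 55}

chart = []
for month, value in sales.items():
    # One '#' per 5 units keeps the bars terminal-friendly.
    chart.append(f"{month} {'#' * (value // 5)}")

print("\n".join(chart))
# Jan ########
# Feb #####
# Mar ###########
```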

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3s7ntl72ufetuhrqe9jy.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3s7ntl72ufetuhrqe9jy.jpeg" alt="Image description" width="268" height="188"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Machine Learning Basics&lt;/strong&gt;&lt;br&gt;
This involves mastering the fundamentals of statistics, covering topics such as probability distributions, hypothesis testing and measures of central tendency.&lt;br&gt;
Key components include feature engineering, data visualization and encoding.&lt;/p&gt;
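&lt;p&gt;The measures of central tendency mentioned above are all available in Python's standard library; the sample values here are made up.&lt;/p&gt;

```python
import statistics

# An invented sample of daily page views.
data = [10, 12, 10, 18, 30, 10, 15]

mean = statistics.mean(data)      # arithmetic average
median = statistics.median(data)  # middle value when sorted
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)
```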

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fafny1oo7fbbki07mbrfk.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fafny1oo7fbbki07mbrfk.jpeg" alt="Image description" width="287" height="176"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Data Analyst Projects&lt;/strong&gt;&lt;br&gt;
Once you have mastered the essential basic skills, you will need to develop them further by working on your own projects.&lt;/p&gt;

&lt;p&gt;In an individual project, everything is your responsibility: selecting the topic, fetching the necessary data, deciding the direction of your research, designing the project structure, forming and checking hypotheses, communicating your findings effectively and laying out the way forward.&lt;/p&gt;

&lt;p&gt;Practicing your skills and solving real-world problems will give you a solid basis for your future work experience. Projects usually take much more time, but they will help you stand out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftff6bxe1987irzq1tk4d.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftff6bxe1987irzq1tk4d.jpeg" alt="Image description" width="290" height="174"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Conclusion&lt;/strong&gt;&lt;br&gt;
By now, you should know how to become a data analyst and what you need to do to make your career goal a reality. &lt;/p&gt;

&lt;p&gt;However, to get in front of potential employers, you will need to have a portfolio of your work. Use your portfolio to make your passion and interests shine through.&lt;/p&gt;

&lt;p&gt;Ideally, demonstrate both your technical and soft skills in your portfolio, and link to it from your resume or CV. As your portfolio grows, you can drop the broad, common starter projects in favour of more distinctive work.&lt;/p&gt;

&lt;p&gt;With all this information in hand, it's time for you to go ahead and start learning today.&lt;/p&gt;

&lt;p&gt;Happy learning, happy coding!&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>datascience</category>
      <category>programming</category>
    </item>
    <item>
      <title>DATA ENGINEERING ROADMAP FOR BEGINNERS.</title>
      <dc:creator>Samwel Wachira </dc:creator>
      <pubDate>Thu, 09 Nov 2023 17:10:35 +0000</pubDate>
      <link>https://dev.to/sw1/data-engineering-roadmap-for-beginners-461a</link>
      <guid>https://dev.to/sw1/data-engineering-roadmap-for-beginners-461a</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma654fwww3srv79phdue.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma654fwww3srv79phdue.jpg" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a growing data professional, you’ll probably find yourself developing an ETL solution at some point. Even if you are a Data Analyst or Data Scientist, the basics of Data Engineering are key to career growth, mainly because the field combines SQL, Python, Machine Learning, DevOps &amp;amp; Data Visualisation skills.&lt;br&gt;
&lt;strong&gt;Simple definitions&lt;/strong&gt;&lt;br&gt;
What is analytics? Analytics is any action that converts data into insights.&lt;br&gt;
What is a data architecture? A data architecture is the structure that enables the storage, transformation, exploitation, and governance of data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Definition of Data Engineering&lt;/strong&gt;&lt;br&gt;
Data engineering is the practice of designing and building systems for collecting, storing and analyzing data at scale. It is a broad field with applications in just about any industry.&lt;br&gt;
    Data engineering is the complex task of making raw data usable to data scientists and other groups within an organization.&lt;br&gt;
    Its pipelines must take data from many disparate sources and collect it into a single warehouse that represents the data uniformly, as a single source of truth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Role of a Data Engineer.&lt;/strong&gt;&lt;br&gt;
Data engineers design and build pipelines that transform and transport data so that, by the time it reaches data scientists or other end users, it is in a highly usable state.&lt;br&gt;
 Data engineers build data pipelines to get data into a place from which the business can make data-driven decisions.&lt;/p&gt;
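&lt;p&gt;Such a pipeline can be sketched in miniature; this is only an illustration, with an in-memory CSV standing in for a real source system and an in-memory SQLite database standing in for the warehouse, and all table and column names invented.&lt;/p&gt;

```python
import csv
import io
import sqlite3

# Extract: raw CSV from an invented upstream system.
raw = io.StringIO("order_id,amount\n1,19.99\n2,5.00\n3,42.50\n")
records = list(csv.DictReader(raw))

# Transform: cast the text fields into proper types.
transformed = [(int(r["order_id"]), float(r["amount"])) for r in records]

# Load: write into a warehouse table (in-memory SQLite as a stand-in).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)", transformed)

# Downstream users can now query the consolidated data.
total = db.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(round(total, 2))  # 67.49
```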

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr6m12rx4gcp5bsof25rq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr6m12rx4gcp5bsof25rq.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Lake&lt;/strong&gt;&lt;br&gt;
A data lake is a place where you can securely store various types of data, at any scale, for processing and analytics.&lt;br&gt;
    Data lakes are typically used to drive data analytics, data science &amp;amp; ML workloads, as well as batch and streaming data pipelines.&lt;br&gt;
    Data lakes accept all types of data, and they are portable: on premises or in the cloud.&lt;br&gt;
    The data lake is the first line of defense in enterprise data. It is the central place that says, in effect: give me whatever data you have, at whatever volume, variety, format, and velocity; I can take it.&lt;br&gt;
    Big data, data science, and analytics supporting data-driven decision-making promise unprecedented levels of insight and efficiency in everything from how we work with data to how we work with customers to the search for a cure for cancer — but data science and analytics depend on having access to historical data.&lt;br&gt;
    In recognition of this, companies are deploying big data lakes to bring all their data together in one place and start saving history, so data scientists and analysts have access to the information they need to enable data-driven decision-making.&lt;/p&gt;

&lt;p&gt;A common concept in data engineering is the data lake.&lt;br&gt;
    A data lake brings together data from across the enterprise into a single location; you might pull data from a relational database or from a spreadsheet and store it, raw, in the lake.&lt;br&gt;
    One option for this single location is a cloud storage bucket, such as an S3 bucket.&lt;br&gt;
    The reason to make raw data available to analysts is so they can perform self-service analytics. Self-service has been an important mega-trend towards the democratization of data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Lakehouse&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🕯 Access to a company’s data (Self Service) gives rise to big challenge involving Governance and Data Security, Data Compliance; who has access privileges to company data.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Considerations&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Does your data lake handle all the data you have?
Can it all fit into a cloud storage bucket? If you have an RDBMS, you need to put the data in Cloud SQL (a managed DB) rather than Cloud Storage.
Can it elastically scale to meet the demand? As your data collected increases, will you run out of disk space (on-premise problem)?
Does it support high-throughput ingestion? What is the network bandwidth?
Do you have edge points of presence? Is there fine-grained access control to objects? Do users need to seek within a file, or is it enough to get a file as a whole?
Cloud Storage is blob storage, so you might need to think about the granularity of what you store.
Can other tools connect easily?

💡 The purpose of a Data Lake, is to make data accessible for analytics.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Data Warehousing&lt;/strong&gt;&lt;br&gt;
A data warehouse is a type of data management system that is designed to enable and support business intelligence (BI) activities, especially analytics.&lt;/p&gt;

&lt;p&gt;A typical data warehouse often includes the following elements:&lt;br&gt;
1. A relational database to store and manage data.&lt;br&gt;
2. An extraction, loading, and transformation (ELT) solution for preparing the data for analysis.&lt;br&gt;
3. Statistical analysis, reporting, and data mining capabilities.&lt;br&gt;
4. Client analysis tools for visualizing and presenting data to business users.&lt;/p&gt;

&lt;p&gt;Other, more sophisticated analytical applications generate actionable information by applying data science and artificial intelligence (AI) algorithms, or use graph and spatial features to enable more kinds of analysis of data at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Lake Architecture&lt;/strong&gt;&lt;br&gt;
Data from external sources is loaded first into a raw or landing zone, where it is filed in folders that reflect its provenance (for instance, time and source) without further processing. This zone, sometimes also called the staging zone, houses raw ingested data.&lt;br&gt;
    Then, as appropriate, the data is copied into the gold zone, where it is cleansed, curated, and aggregated.&lt;br&gt;
    The gold zone frequently mirrors the landing area, but contains cleansed, enriched, and otherwise processed versions of the raw data.&lt;br&gt;
    The gold zone is sometimes also called prod, to indicate that the data it contains is production-ready, or cleansed, to indicate that the data has been run through data quality tools and/or a curation process to clean up (or cleanse) data quality problems.&lt;br&gt;
    Other zones include the work zone, where users run their projects, and the sensitive zone, where data that should be protected is kept in encrypted volumes.&lt;br&gt;
    The sensitive zone is sometimes created to keep files containing data that is particularly important to protect from unauthorized viewers, whether because of regulatory requirements or business needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Lakes vs Data Warehousing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A data lake is a capture of every aspect of your business operation.&lt;/p&gt;

&lt;p&gt;The data is stored in its natural/raw format, usually as object blobs or files.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Retain all data in its native format.
Support all data types and all users.
Adapt to changes easily.
Tends to be application-specific.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;How do you take this flexible &amp;amp; large amount of data and do something with it?&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Typically loaded only after a use case is defined.
Provide faster insights.
Current/Historical data for reporting.
Tends to have a consistent schema shared across applications.
Take the raw data from a Data Lake, then Process → Organise → Transform it, and store it in a Data Warehouse.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Modernizing Data Lakes and Data Warehousing with Google Cloud.&lt;br&gt;
Google Cloud Storage Bucket.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A good option for staging all of your raw data in one place, before building transformation pipelines into your data warehouse.
Businesses use Cloud Storage as backup and archival storage for their businesses.
Storing Data in a Cloud Storage Bucket is durable and performant.
As a Data Engineer, you will often use a Cloud Storage Bucket as part of your Data Lake, to store many different raw data files, such as CSV, JSON or Avro.
You could then load them into, or query them directly from, BigQuery acting as your data warehouse.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Real-Time Analytics on Live Data&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Batch pipelines are not enough, what if you need real-time analytics on data that arrives continuously &amp;amp; endlessly.
In that case, you might receive the data in Pub Sub, transform it using Data Flow &amp;amp; Stream it using Big Query.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Real-Time Streaming&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Stream processing is a data management technique that involves ingesting a continuous data stream to quickly analyze, filter, transform or enhance the data in real-time.

🕯️ Data Pipelines largely perform the cleanup &amp;amp; processing of data. They are responsible for transforming and processing your data at scale, bringing entire systems to life with freshly processed data for analysis.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
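&lt;p&gt;As a toy illustration of stream processing (a real system would use Pub/Sub, Dataflow or similar), the stream can be modelled as a Python generator, with each record analyzed, filtered or enhanced as it arrives rather than in batches; the sensor readings and the alert threshold are invented.&lt;/p&gt;

```python
def sensor_stream():
    """Simulate an endless source; here just three invented readings."""
    for reading in [21.5, 22.0, 95.0]:
        yield reading

def process(stream, threshold=50.0):
    """Enhance each record as it arrives, without waiting for a batch."""
    for value in stream:
        yield {"value": value, "alert": value > threshold}

results = list(process(sensor_stream()))
print(results)
# [{'value': 21.5, 'alert': False}, {'value': 22.0, 'alert': False},
#  {'value': 95.0, 'alert': True}]
```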

&lt;p&gt;&lt;strong&gt;Challenges of Data Engineers&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;As a Data Engineer; you’ll usually encounter a few problems when building data pipelines.
You might find it difficult to access the data that you need. You might find that the data, even after you access it, doesn’t have the quality that’s required by the Analytics or Machine Learning Model.
If you plan to build a model &amp;amp; even if the data quality exists, you might find that the transformations require computational resources that might not be available to you.
Challenges around Query performance; being able to run queries &amp;amp; transformations with the computational resources that you have.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;1. Data Access&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What makes data access difficult?&lt;br&gt;
Data in many businesses is siloed by department, and each department creates its own transactional systems to support its own business processes.&lt;br&gt;
    So you might have operational systems that correspond to your stores, a different operational system maintained by the product warehouses that manage your inventory,&lt;br&gt;
    and a marketing department that manages all the promotions, which you need to run analytic queries against.&lt;br&gt;
    For example:&lt;br&gt;
You need to combine data from the stores, from the promotions &amp;amp; from the inventory levels.&lt;br&gt;
Because these are stored in separate systems, some with restricted access, building an analytics system that uses all three of these data sets to answer an ad-hoc query from above can be very difficult.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Data Accuracy and Quality&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The second challenge is that cleaning, formatting, and getting the data ready for insights requires you to build ETL pipelines.&lt;br&gt;
    ETL pipelines are usually necessary to ensure data accuracy and quality.&lt;br&gt;
    ETL takes the raw data and transforms it into a form with which you can actually carry out the necessary analysis.&lt;br&gt;
    The cleaned and transformed data is typically stored in a Data Warehouse (not a Data Lake).&lt;br&gt;
  A Data Warehouse is a consolidated place to store the data, where all the data is easily joinable and queryable.&lt;/p&gt;

&lt;p&gt;In a Data Lake; data is in a raw format, In a Data Warehouse; the data is stored in a way that makes it efficient to query.&lt;/p&gt;

&lt;p&gt;Because data becomes useful only after you clean it up, you should assume that any raw data you collect from source systems needs to be cleaned and transformed.&lt;br&gt;
Transform it into a format that makes it efficient to query.&lt;br&gt;
ETL the data &amp;amp; store it in the Data Warehouse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Availability of Computational Resources&lt;/strong&gt;&lt;br&gt;
The availability of computational resources can be a challenge if you are on an on-premises system.&lt;br&gt;
Data engineers need to manage server and cluster capacity &amp;amp; make sure that enough capacity exists to carry out ETL jobs.&lt;br&gt;
 The problem is that the compute needed by these ETL jobs is not constant over time. Very often, it varies week to week, depending on factors like holidays and promotional sales.&lt;br&gt;
    This means that when traffic is low, you are wasting money, and when traffic is high, your jobs take far too long.&lt;br&gt;
    Once your data is in a Data Warehouse, you need to optimize the queries your users are running, to make the most efficient use of your compute resources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BigQuery — Google&lt;/strong&gt;&lt;br&gt;
BigQuery is a fully managed enterprise data warehouse that helps you manage and analyze your data with built-in features like machine learning, geospatial analysis, and business intelligence.&lt;br&gt;
 BigQuery’s serverless architecture lets you use SQL queries to answer your organization’s biggest questions with zero infrastructure management.&lt;br&gt;
    BigQuery’s scalable, distributed analysis engine lets you query terabytes in seconds and petabytes in minutes.&lt;/p&gt;

&lt;p&gt;BigQuery is used as a data warehouse.&lt;/p&gt;

&lt;p&gt;Our operational systems, like the relational databases that store online orders, inventory &amp;amp; promotions, are our raw data sources on the left; note that this list isn’t exhaustive.&lt;br&gt;
 You could also have manual source systems, like CSV files. These upstream data sources are gathered together into a single consolidated location, our data lake, which is designed for durability and high availability.&lt;br&gt;
    Once in the Data Lake, the data often needs to be processed via transformations that output it into our data warehouse, ready for use by downstream teams.&lt;/p&gt;

&lt;p&gt;Remember, this roadmap is a general guide, and the specific technologies and tools may evolve over time. Continuous learning and adaptability are key characteristics for a successful data engineer.&lt;/p&gt;

&lt;p&gt;Happy learning!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>database</category>
      <category>luxacademy</category>
      <category>beginners</category>
    </item>
    <item>
      <title>DATA SCIENCE ROADMAP 2023-2024 FOR BEGINNERS.</title>
      <dc:creator>Samwel Wachira </dc:creator>
      <pubDate>Sun, 01 Oct 2023 15:54:42 +0000</pubDate>
      <link>https://dev.to/sw1/data-science-roadmap-2023-2024-for-beginners-47c</link>
      <guid>https://dev.to/sw1/data-science-roadmap-2023-2024-for-beginners-47c</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;A journey  of thousand miles begins with a single step -Chinese Proverb&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Are you interested in becoming a data scientist? Take up this data science roadmap and assess your current level of understanding!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What does a Data Scientist do?&lt;/strong&gt;&lt;br&gt;
Data scientists are responsible for extracting insights and knowledge from data through a combination of statistical analysis, data mining, visualization and machine learning.&lt;br&gt;
They work on complex problems and use data to inform decision-making.&lt;/p&gt;

&lt;p&gt;If you would like to learn data science, here is a general roadmap to guide your journey:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Programming skills&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Gain proficiency in a programming language such as Python or R, the languages most commonly used in data science.&lt;br&gt;
Learn and understand fundamentals such as data types, variables, data structures and algorithms.&lt;br&gt;
Gain a deep understanding of SQL (Structured Query Language) to query and manipulate databases.&lt;/p&gt;
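&lt;p&gt;As a small taste of those fundamentals in Python (all values invented), combining the core data structures with a classic algorithmic building block, counting occurrences:&lt;/p&gt;

```python
# Core data types and structures every data scientist leans on.
scores = [3, 1, 4, 1, 5]            # list: ordered, mutable
point = (2.0, 3.5)                  # tuple: fixed-size record
labels = {"cat", "dog", "cat"}      # set: deduplicates automatically
counts = {}                          # dict: maps keys to values

# Counting occurrences of each score.
for s in scores:
    counts[s] = counts.get(s, 0) + 1

print(sorted(scores))  # [1, 1, 3, 4, 5]
print(counts[1])       # 2
print(len(labels))     # 2
```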

&lt;p&gt;2. &lt;strong&gt;Machine Learning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Learn and understand the fundamental concepts of machine learning algorithms: supervised, unsupervised and reinforcement learning.&lt;/p&gt;
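&lt;p&gt;To make "supervised learning" concrete, here is a minimal 1-nearest-neighbour classifier in pure Python (in practice you would reach for scikit-learn); the training points and labels are invented.&lt;/p&gt;

```python
def nearest_neighbor(train, query):
    """Supervised learning in miniature: predict the label of the
    closest labelled training point (1-NN, squared Euclidean distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best = min(train, key=lambda pair: dist2(pair[0], query))
    return best[1]

# Labelled training data: (features, label) pairs.
train = [((1.0, 1.0), "small"), ((8.0, 9.0), "large"), ((9.0, 8.0), "large")]

print(nearest_neighbor(train, (2.0, 1.5)))  # small
print(nearest_neighbor(train, (7.5, 8.5)))  # large
```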

&lt;p&gt;3. &lt;strong&gt;Mathematics and Statistics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Have a strong foundation in mathematics, such as linear algebra, probability and calculus, as well as statistics concepts such as regression analysis and hypothesis testing.&lt;/p&gt;
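&lt;p&gt;For example, simple linear regression, one of the statistics concepts mentioned above, can be computed by hand using the closed-form least-squares formulas; the data points here are invented.&lt;/p&gt;

```python
# Fit y = a + b*x by ordinary least squares, using
# b = cov(x, y) / var(x) and a = mean(y) - b * mean(x).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.1, 7.9]  # roughly y = 2x

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

print(round(b, 2), round(a, 2))  # slope and intercept, approximately 1.95 and 0.15
```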

&lt;p&gt;4. &lt;strong&gt;Data Visualization and Manipulation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Learn and understand visualization tools such as Tableau and Power BI, as well as visualization libraries, to communicate insights effectively.&lt;/p&gt;

&lt;p&gt;5. &lt;strong&gt;Data Preparation and Gathering&lt;/strong&gt;&lt;br&gt;
Learn and understand ETL (Extract, Transform, Load) techniques for data processing.&lt;br&gt;
Learn and understand data collection from various sources such as web scraping, data pipelines, APIs and databases.&lt;/p&gt;

&lt;p&gt;6. &lt;strong&gt;Model Building and Deployment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After learning machine learning, build machine learning models, validate them using appropriate techniques and deploy them using web frameworks such as Django, Flask or FastAPI.&lt;/p&gt;
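&lt;p&gt;Whichever framework you pick, the core of deployment is the same: wrap a trained model in a function that takes a request payload and returns a prediction. A minimal sketch, where the "model" is a hard-coded linear formula purely for illustration, and handle_request plays the role a Flask or FastAPI route would:&lt;/p&gt;

```python
import json

def predict(features):
    """Stand-in for a trained model: a hard-coded linear formula."""
    return 2.0 * features["x"] + 1.0

def handle_request(body):
    """What a web-framework route would do: parse JSON in, JSON out."""
    payload = json.loads(body)
    return json.dumps({"prediction": predict(payload)})

print(handle_request('{"x": 3.0}'))  # {"prediction": 7.0}
```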

&lt;p&gt;7. &lt;strong&gt;Cloud Platforms and Version Control&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Explore various cloud platforms such as Microsoft Azure, AWS and Google Cloud for scalable data processing.&lt;br&gt;
Lastly, learn and understand how to use a version control system such as Git to collaborate and track changes to your projects.&lt;/p&gt;

&lt;p&gt;Remember that the roadmap is flexible, and you can adapt it based on your goals, interests, and the specific requirements of the data science field you want to enter.&lt;/p&gt;

&lt;p&gt;Continuous learning, networking, collaboration with other data scientists and practical experience are key components of a successful data science career.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
