DEV Community

Joan

Understanding data engineering with DataCamp

Data Processing: converting raw data into meaningful information.

Value of data processing:

  • Remove unwanted data
  • Optimize memory, processing, and network costs
  • Convert data from one type to another
  • Organize data
  • Fit data into a schema/structure
  • Increase productivity

How data engineers process data:

  • Perform data manipulation, cleaning, and tidying tasks, e.g. dealing with missing values
  • Store data in a sanely structured database
  • Create views on top of the database tables for easy access to the data
  • Normalize the data
  • Optimize the performance of the database, e.g. index the data for faster retrieval
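
The steps above can be sketched with Python's built-in sqlite3 module. This is only an illustration, not the course's own code: the employees table, its columns, the sample rows, and the choice of filling missing salaries with a default are all assumptions made for the example.

```python
import sqlite3

# Hypothetical raw records; None marks a missing value.
raw_rows = [
    ("Ann", "Sales", 52000.0),
    ("Bob", "Sales", None),  # missing salary
    ("Cara", "Engineering", 70000.0),
]


def clean(rows, default_salary=0.0):
    """Replace missing salaries with a default (one simple cleaning strategy)."""
    return [(n, d, s if s is not None else default_salary) for n, d, s in rows]


# Store the cleaned data in a structured table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", clean(raw_rows))

# A view gives consumers a ready-made query on top of the table.
conn.execute(
    "CREATE VIEW department_totals AS "
    "SELECT department, SUM(salary) AS total_salary "
    "FROM employees GROUP BY department"
)

# An index speeds up lookups that filter on the indexed column.
conn.execute("CREATE INDEX idx_employees_department ON employees (department)")

totals = dict(conn.execute("SELECT department, total_salary FROM department_totals"))
print(totals)
```

Filling with a default is just one option; depending on the data, dropping the row or imputing a value may be a better fit.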

Tools used in data processing

Scheduling:

  • Can apply to any task listed under data processing.
  • Holds the pieces of the system together and organizes how they work together.
  • Runs tasks in a specific order and resolves all dependencies correctly.
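
Resolving dependencies so that tasks run in a valid order can be illustrated with a topological sort. The sketch below uses graphlib from the Python standard library (3.9+); the task names are hypothetical, and a real scheduler does much more than this (retries, logging, triggers).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "load": {"clean"},
    "build_views": {"load"},
    "report": {"build_views", "load"},
}


def run_order(deps):
    """Return an execution order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

If the dependencies contain a cycle, TopologicalSorter raises graphlib.CycleError, which is the kind of problem a scheduler has to detect before running anything.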

Ways to schedule a task:
Manually: e.g. a manual update of the employee data.
Automatically at a specific time: e.g. update the employee table daily at 6 AM.
Automatically when a specified condition is met, known as sensor scheduling.
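
The difference between a time-based trigger and a sensor can be sketched in a few lines of Python. Everything here is hypothetical (the function names, the 6 AM default, and "a file has arrived" as the sensor condition); it only shows the two kinds of check a scheduler might make.

```python
from datetime import datetime, time


def time_trigger(now, run_at=time(6, 0)):
    """Time-based schedule: fire when the clock reads run_at (to the minute)."""
    return (now.hour, now.minute) == (run_at.hour, run_at.minute)


def sensor_trigger(pending_files):
    """Sensor schedule: fire once a condition holds (here: a file has arrived)."""
    return len(pending_files) > 0


def update_employee_table():
    # Placeholder for the real update job.
    return "employee table updated"


def tick(now, pending_files):
    """One scheduler check; returns the job result, or None if nothing is due."""
    if time_trigger(now) or sensor_trigger(pending_files):
        return update_employee_table()
    return None


print(tick(datetime(2024, 1, 1, 6, 0), []))            # time-based run
print(tick(datetime(2024, 1, 1, 7, 30), ["new.csv"]))  # sensor-based run
```

A manual run corresponds to calling update_employee_table() directly.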

Data Ingestion:
Batches & Streams
Batch processing: groups records and sends them at intervals; often cheaper.
Streaming: sends individual records into the database right away, e.g. a new user signing in.
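
A rough way to see why batching is often cheaper is to count write calls. In this sketch the Database class is a stand-in rather than a real database client, and records are grouped by count instead of by time interval to keep the example short; both are assumptions.

```python
class Database:
    """Stand-in for a real database; counts write calls to make the cost visible."""

    def __init__(self):
        self.rows = []
        self.write_calls = 0

    def write(self, records):
        self.rows.extend(records)
        self.write_calls += 1


def ingest_stream(records, db):
    """Streaming: write each record as soon as it arrives."""
    for record in records:
        db.write([record])


def ingest_batch(records, db, batch_size=3):
    """Batch: group records and write each group in a single call."""
    for start in range(0, len(records), batch_size):
        db.write(records[start:start + batch_size])


signups = [f"user_{i}" for i in range(7)]  # hypothetical sign-up events
stream_db, batch_db = Database(), Database()
ingest_stream(signups, stream_db)
ingest_batch(signups, batch_db)
print(stream_db.write_calls, batch_db.write_calls)
```

Both databases end up with the same rows; streaming makes each record available sooner, at the price of more write calls.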

Tools used in scheduling

Parallel computing/processing
It is the basis of modern data processing tools, necessary mainly because of limits on memory and processing power.
How it works:
Split a task up into several smaller subtasks.
Distribute these subtasks over several computers.
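
The split-and-distribute idea can be sketched on a single machine with Python's concurrent.futures. This is a simplified illustration: the workers are threads in one process, not several computers, and the function names are made up for the example. For CPU-bound work in Python, swapping in ProcessPoolExecutor would be the usual way to use multiple cores.

```python
from concurrent.futures import ThreadPoolExecutor


def split(values, n_chunks):
    """Split a list into at most n_chunks contiguous subtasks of similar size."""
    if not values:
        return []
    size = -(-len(values) // n_chunks)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def sum_of_squares(chunk):
    """The subtask each worker runs on its own chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(values, n_workers=4):
    """Split the task, distribute the chunks to workers, combine the partial results."""
    chunks = split(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)


print(parallel_sum_of_squares(list(range(10))))
```

Each worker only holds its own chunk, while handing out chunks and collecting partial results is the communication overhead.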

Benefits and risks of parallel computing
Pros

  • Extra processing power
  • Reduced memory footprint

Cons

  • Moving data incurs a cost
  • Communication time

Cloud Computing vs On premises computing

Servers on premises:

  1. Incur equipment costs
  2. Need physical space
  3. Incur electricity and maintenance costs
  4. Must have enough capacity for peak moments
  5. Leave processing power unused at quieter times

Servers in the cloud:

  1. Pay as you go
  2. No need for physical space
  3. Use the resources we need, when we need them
  4. The closer the servers are to the user, the lower the latency

Cloud computing for data storage
Pros

  • Database reliability: data is replicated
