<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ramses Alexander Coraspe</title>
    <description>The latest articles on DEV Community by Ramses Alexander Coraspe (@ramsescoraspe).</description>
    <link>https://dev.to/ramsescoraspe</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F707989%2F720cbccd-dbb9-4647-b4e5-51f913dd74f4.jpg</url>
      <title>DEV Community: Ramses Alexander Coraspe</title>
      <link>https://dev.to/ramsescoraspe</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ramsescoraspe"/>
    <language>en</language>
    <item>
      <title>Working with large CSV files in Python from Scratch</title>
      <dc:creator>Ramses Alexander Coraspe</dc:creator>
      <pubDate>Wed, 21 Dec 2022 01:02:14 +0000</pubDate>
      <link>https://dev.to/ramsescoraspe/working-with-large-csv-files-in-python-from-scratch-3eab</link>
      <guid>https://dev.to/ramsescoraspe/working-with-large-csv-files-in-python-from-scratch-3eab</guid>
      <description>&lt;p&gt;check this out:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fft32ncctt93bpr9yvcnv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fft32ncctt93bpr9yvcnv.png" alt="CSV files" width="786" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://coraspe-ramses.medium.com/working-with-large-csv-files-in-python-from-scratch-134587aed5f7" rel="noopener noreferrer"&gt;https://coraspe-ramses.medium.com/working-with-large-csv-files-in-python-from-scratch-134587aed5f7&lt;/a&gt;&lt;/p&gt;
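&lt;p&gt;As a minimal sketch of the idea behind the article (not its exact code), Python's standard &lt;code&gt;csv&lt;/code&gt; module can stream a large file row by row, so memory use stays constant regardless of file size:&lt;/p&gt;

```python
import csv
import io

# An in-memory sample standing in for a large on-disk CSV (hypothetical data).
data = io.StringIO("id,amount\n1,10.5\n2,20.0\n3,4.5\n")

# csv.reader yields one row at a time, so with a real file object
# memory use stays constant no matter how large the file is.
reader = csv.reader(data)
header = next(reader)  # consume the header line
total = 0.0
rows = 0
for row in reader:
    total += float(row[1])
    rows += 1

print(rows, total)  # 3 35.0
```

&lt;p&gt;The same loop works unchanged on a real file opened with &lt;code&gt;open(path, newline="")&lt;/code&gt;.&lt;/p&gt;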

</description>
      <category>tutorial</category>
    </item>
    <item>
      <title>Schema Inference for Large .CSV files</title>
      <dc:creator>Ramses Alexander Coraspe</dc:creator>
      <pubDate>Thu, 21 Jul 2022 03:54:43 +0000</pubDate>
      <link>https://dev.to/ramsescoraspe/csv-schema-inference-4c93</link>
      <guid>https://dev.to/ramsescoraspe/csv-schema-inference-4c93</guid>
      <description>&lt;p&gt;A tool to automatically infer columns data types in .csv files&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Wittline/csv-schema-inference"&gt;csv-schema-inference&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The tests were run on 9 .csv files with 21 columns each, varying in size and number of records; for each file, shuffle time and inference time were averaged over 5 executions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;file__20m.csv: 20 million records&lt;/li&gt;
&lt;li&gt;file__15m.csv: 15 million records&lt;/li&gt;
&lt;li&gt;file__12m.csv: 12 million records&lt;/li&gt;
&lt;li&gt;file__10m.csv: 10 million records&lt;/li&gt;
&lt;li&gt;And so on...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to know more about the shuffling process, you can check this other repository: &lt;a href="https://github.com/Wittline/csv-shuffler"&gt;A tool to automatically shuffle lines in .csv files&lt;/a&gt;. The shuffling process helps us to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increase the probability of finding all the data types present in a single column.&lt;/li&gt;
&lt;li&gt;Avoid iterating over the entire dataset.&lt;/li&gt;
&lt;li&gt;Avoid biases in the data that may be part of its organic behavior, arising from not knowing the nature of its construction.&lt;/li&gt;
&lt;/ul&gt;
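&lt;p&gt;The sampling idea above can be sketched as follows; this is a deliberately simplified illustration of inferring column types from a sample of lines, not the repository's actual implementation:&lt;/p&gt;

```python
import random

def infer_type(value):
    """Return the narrowest type name that can represent value."""
    for caster, name in ((int, "integer"), (float, "float")):
        try:
            caster(value)
            return name
        except ValueError:
            pass
    return "string"

def infer_schema(lines, header, sample_size=100, seed=42):
    """Infer one type per column from a random sample of CSV lines."""
    rng = random.Random(seed)
    sample = rng.sample(lines, min(sample_size, len(lines)))
    columns = header.split(",")
    seen = {col: set() for col in columns}
    for line in sample:
        for col, value in zip(columns, line.split(",")):
            seen[col].add(infer_type(value))
    # A column that ever holds a string must be typed string;
    # a mix of integers and floats widens to float.
    schema = {}
    for col, types in seen.items():
        if "string" in types:
            schema[col] = "string"
        elif "float" in types:
            schema[col] = "float"
        else:
            schema[col] = "integer"
    return schema

lines = ["1,1.5,foo", "2,2.0,bar", "3,9.9,baz"]
print(infer_schema(lines, "id,price,name"))
# {'id': 'integer', 'price': 'float', 'name': 'string'}
```

&lt;p&gt;Sampling shuffled lines trades a small chance of missing a rare type for a large reduction in the number of lines scanned.&lt;/p&gt;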

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Hnn3BFkF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h8ak5xtt8eossqtphm9o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Hnn3BFkF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h8ak5xtt8eossqtphm9o.png" alt="Benchmark" width="396" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>bigdata</category>
      <category>csv</category>
      <category>dataset</category>
    </item>
    <item>
      <title>Shuffle lines in .csv files</title>
      <dc:creator>Ramses Alexander Coraspe</dc:creator>
      <pubDate>Sun, 17 Jul 2022 20:31:22 +0000</pubDate>
      <link>https://dev.to/ramsescoraspe/shuffle-lines-in-csv-files-4m7m</link>
      <guid>https://dev.to/ramsescoraspe/shuffle-lines-in-csv-files-4m7m</guid>
      <description>&lt;p&gt;A tool to automatically Shuffle lines in .csv files&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hXvzMIaD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o4imjak8nebknpbkbz2i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hXvzMIaD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o4imjak8nebknpbkbz2i.png" alt="SHUFFLE CSV" width="880" height="186"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Wittline/csv-shuffler"&gt;https://github.com/Wittline/csv-shuffler&lt;/a&gt;&lt;/p&gt;
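&lt;p&gt;For files that fit in memory, the core idea can be sketched in a few lines (the repository handles far larger files than this naive version):&lt;/p&gt;

```python
import random

def shuffle_csv_lines(lines, seed=None):
    """Shuffle the data rows of a CSV, keeping the header line first.
    In-memory version, so only suitable for files that fit in RAM."""
    header, *rows = lines
    random.Random(seed).shuffle(rows)
    return [header] + rows

lines = ["id,name", "1,a", "2,b", "3,c"]
shuffled = shuffle_csv_lines(lines, seed=7)
print(shuffled[0])  # the header stays in place: id,name
```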

</description>
      <category>bigdata</category>
      <category>dataengineering</category>
      <category>csv</category>
    </item>
    <item>
      <title>Schema Inference for Large CSV files</title>
      <dc:creator>Ramses Alexander Coraspe</dc:creator>
      <pubDate>Sat, 09 Jul 2022 15:55:32 +0000</pubDate>
      <link>https://dev.to/ramsescoraspe/schema-inference-for-large-csv-files-2683</link>
      <guid>https://dev.to/ramsescoraspe/schema-inference-for-large-csv-files-2683</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QDR8JyEr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/83ysxv936yq9m51rygrq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QDR8JyEr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/83ysxv936yq9m51rygrq.png" alt="data pipeline" width="880" height="1294"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Wittline/csv-schema-inference"&gt;https://github.com/Wittline/csv-schema-inference&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Data Engineering Projects for Beginners</title>
      <dc:creator>Ramses Alexander Coraspe</dc:creator>
      <pubDate>Wed, 15 Jun 2022 23:40:58 +0000</pubDate>
      <link>https://dev.to/ramsescoraspe/data-engineering-projects-for-beginners-4d56</link>
      <guid>https://dev.to/ramsescoraspe/data-engineering-projects-for-beginners-4d56</guid>
      <description>&lt;p&gt;Hi everyone,&lt;/p&gt;

&lt;p&gt;I am a little bit obsessed with data engineering, and lately I have been working on several open-source projects on this topic. Here is a list of repositories and the technologies used in each one; if you decide to go deeper into this fun world, these repositories can serve as a guide.&lt;/p&gt;

&lt;p&gt;❤ means "I like this one"&lt;/p&gt;

&lt;h3&gt;
  
  
  ❤ &lt;strong&gt;&lt;a href="https://github.com/Wittline/uber-expenses-tracking"&gt;Tracking your Uber Rides and Uber Eats expenses through a data engineering process&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Technologies and skills&lt;/strong&gt;:
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Python, Docker, Apache Airflow, AWS Redshift, Power BI, data modelling, Task scheduling, ETL and ELT processes, Data warehousing, Cloud&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ❤ &lt;strong&gt;&lt;a href="https://github.com/Wittline/pyDag"&gt;Scheduling Big Data Workloads and Data Pipelines in the Cloud with pyDag&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Technologies and skills&lt;/strong&gt;:
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Python, Docker, Big Data, Cloud, BigQuery, Workflow Engines, GCP, Task scheduler, Google Cloud Platform, Dataproc cluster, GCS, Google Cloud Storage, Redis, DAG, Parallel Processing, Apache Spark&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ❤ &lt;strong&gt;&lt;a href="https://github.com/Wittline/pyspark-on-aws-emr"&gt;Building Big Data Pipelines in the Cloud with AWS EMR&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Technologies and skills&lt;/strong&gt;:
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Python, PySpark, AWS EMR, Task scheduling, IAC, EC2 Instances, Apache Spark, Cloud&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ❤ &lt;strong&gt;&lt;a href="https://github.com/Wittline/wbz"&gt;Building a Lossless Data Compression and Data Decompression Pipeline&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Technologies and skills&lt;/strong&gt;:
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Python, Data compression, BZIP2, Parallel programming&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;a href="https://github.com/Wittline/apache-spark-docker"&gt;Learn how to dockerize an Apache Spark Standalone Cluster&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Technologies and skills&lt;/strong&gt;:
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Python, Jupyter Notebook, Apache Spark, Docker, docker-compose, Hive&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ❤ &lt;strong&gt;&lt;a href="https://github.com/Wittline/docker-livy"&gt;Dockerizing and Consuming an Apache Livy environment&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Technologies and skills&lt;/strong&gt;:
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Python, Big Data, Docker, docker-compose, Apache Livy, Apache Spark, PostgreSQL, PySpark, Jupyter Notebook&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ❤ &lt;strong&gt;&lt;a href="https://github.com/Wittline/data-engineer-challenge"&gt;Design, Development and Deployment of a simple Data Pipeline&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Technologies and skills&lt;/strong&gt;:
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Python, data modelling, Docker, docker-compose, PostgreSQL, data pipeline, FastAPI&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;a href="https://github.com/Wittline/data-engineering-challenge-th"&gt;Dockerizing a Python Script for Faster Web Scraping&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Technologies and skills&lt;/strong&gt;:
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Python, Docker, SQLite, Dockerfile, Web scraping, Data pipeline, FastAPI&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;a href="https://github.com/Wittline/distance-metrics"&gt;Understanding Similarity Measures for Text Analysis&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Technologies and skills&lt;/strong&gt;:
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Python, Machine Learning, Similarity measures, Distance metrics, Text Analysis&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ❤ &lt;strong&gt;&lt;a href="https://github.com/Wittline/recommendation-system"&gt;Learn how to build a content-based Movie Recommender System&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Technologies and skills&lt;/strong&gt;:
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Python, Machine Learning, TF-IDF, Cosine similarity, BM25, BERT, NLP, word2vec, Text Analysis, recsys&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;a href="https://github.com/Wittline/text-analysis-speeches-amlo"&gt;A Text Analysis of Speeches&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Technologies and skills&lt;/strong&gt;:
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Python, Machine Learning, NLP, word2vec, Text Analysis, Sentiment Analysis, PCA, t-SNE, Word Embeddings, Text Preprocessing, Web scraping, Data Visualization, Mexico&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ❤ &lt;strong&gt;&lt;a href="https://github.com/Wittline/Dropout-Students-Prediction"&gt;Dropout Students Prediction&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Technologies and skills&lt;/strong&gt;:
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;R, Genetic algorithm, Neural Networks, K-Means, Clustering, Machine Learning&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  I will be working on more complex projects in the coming months using modern data tech stacks.
&lt;/h2&gt;

</description>
      <category>database</category>
      <category>dataengineering</category>
      <category>python</category>
    </item>
    <item>
      <title>Architecture of an Amazon Redshift cluster</title>
      <dc:creator>Ramses Alexander Coraspe</dc:creator>
      <pubDate>Wed, 15 Jun 2022 17:33:37 +0000</pubDate>
      <link>https://dev.to/ramsescoraspe/architecture-of-an-amazon-redshift-cluster-4lp0</link>
      <guid>https://dev.to/ramsescoraspe/architecture-of-an-amazon-redshift-cluster-4lp0</guid>
      <description>&lt;p&gt;&lt;strong&gt;The image above shows the basic architecture of an Amazon Redshift cluster, it is summarized below:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The total number of nodes in the Redshift cluster equals the number of EC2 instances used in the cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Each slice in a Redshift cluster has at least 1 CPU with dedicated memory and storage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The image above shows a cluster with 4 nodes, each containing 4 slices, so the maximum number of partitions per table is 16.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The leader node coordinates the lower-level nodes, manages external communications, and optimizes queries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The lower-level nodes (compute nodes) each have their own CPU, memory, and disk, depending on the type of EC2 instance selected. This architecture can "&lt;strong&gt;scale out&lt;/strong&gt;" (add more nodes to the cluster) or "&lt;strong&gt;scale up&lt;/strong&gt;" (add more resources to a specific node).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>aws</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>TF-IDF FROM SCRATCH</title>
      <dc:creator>Ramses Alexander Coraspe</dc:creator>
      <pubDate>Sun, 19 Sep 2021 07:48:16 +0000</pubDate>
      <link>https://dev.to/ramsescoraspe/tf-idf-from-scratch-3f3k</link>
      <guid>https://dev.to/ramsescoraspe/tf-idf-from-scratch-3f3k</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wBAo1P2B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bk90u7p5wkvrsrf5tlgx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wBAo1P2B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bk90u7p5wkvrsrf5tlgx.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sXTStqe3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g2gsi2a8qzmud8kkywuo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sXTStqe3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g2gsi2a8qzmud8kkywuo.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Wittline/tf-idf"&gt;https://github.com/Wittline/tf-idf&lt;/a&gt;&lt;/p&gt;
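&lt;p&gt;A from-scratch version can be sketched as follows. This simplified illustration uses raw term counts normalized by document length for TF and log(N/df) for IDF, which is one common variant among several (not necessarily the repository's exact formulas):&lt;/p&gt;

```python
import math
from collections import Counter

def tf_idf(documents):
    """TF-IDF for tokenized documents: tf = count / doc length,
    idf = log(N / document frequency)."""
    n = len(documents)
    df = Counter()  # document frequency: how many docs contain each term
    for doc in documents:
        df.update(set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return scores

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
scores = tf_idf(docs)
# "the" appears in every document, so its idf (and hence its score) is 0.
print(scores[0]["the"])  # 0.0
```

&lt;p&gt;The zero score for "the" is exactly why TF-IDF downweights ubiquitous terms: a word that appears everywhere carries no discriminating information.&lt;/p&gt;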

</description>
      <category>tfidf</category>
      <category>nlp</category>
      <category>featureengineering</category>
    </item>
  </channel>
</rss>
