<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ramses Alexander Coraspe</title>
    <description>The latest articles on DEV Community by Ramses Alexander Coraspe (@ramsescoraspe).</description>
    <link>https://dev.to/ramsescoraspe</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F707989%2F720cbccd-dbb9-4647-b4e5-51f913dd74f4.jpg</url>
      <title>DEV Community: Ramses Alexander Coraspe</title>
      <link>https://dev.to/ramsescoraspe</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ramsescoraspe"/>
    <language>en</language>
    <item>
      <title>Working with large CSV files in Python from Scratch</title>
      <dc:creator>Ramses Alexander Coraspe</dc:creator>
      <pubDate>Wed, 21 Dec 2022 01:02:14 +0000</pubDate>
      <link>https://dev.to/ramsescoraspe/working-with-large-csv-files-in-python-from-scratch-3eab</link>
      <guid>https://dev.to/ramsescoraspe/working-with-large-csv-files-in-python-from-scratch-3eab</guid>
      <description>&lt;p&gt;check this out:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fft32ncctt93bpr9yvcnv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fft32ncctt93bpr9yvcnv.png" alt="CSV files" width="786" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://coraspe-ramses.medium.com/working-with-large-csv-files-in-python-from-scratch-134587aed5f7" rel="noopener noreferrer"&gt;https://coraspe-ramses.medium.com/working-with-large-csv-files-in-python-from-scratch-134587aed5f7&lt;/a&gt;&lt;/p&gt;
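&lt;p&gt;As a minimal sketch of the idea behind the article (not its exact code), Python's standard &lt;code&gt;csv&lt;/code&gt; module can stream a large file row by row, so memory use stays constant regardless of file size:&lt;/p&gt;

```python
import csv
import io

# An in-memory sample standing in for a large on-disk CSV (hypothetical data).
data = io.StringIO("id,amount\n1,10.5\n2,20.0\n3,4.5\n")

# csv.reader yields one row at a time, so with a real file object
# memory use stays constant no matter how large the file is.
reader = csv.reader(data)
header = next(reader)  # consume the header line
total = 0.0
rows = 0
for row in reader:
    total += float(row[1])
    rows += 1

print(rows, total)  # 3 35.0
```

&lt;p&gt;The same loop works unchanged on a real file opened with &lt;code&gt;open(path, newline="")&lt;/code&gt;.&lt;/p&gt;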

</description>
      <category>tutorial</category>
    </item>
    <item>
      <title>Schema Inference for Large .CSV files</title>
      <dc:creator>Ramses Alexander Coraspe</dc:creator>
      <pubDate>Thu, 21 Jul 2022 03:54:43 +0000</pubDate>
      <link>https://dev.to/ramsescoraspe/csv-schema-inference-4c93</link>
      <guid>https://dev.to/ramsescoraspe/csv-schema-inference-4c93</guid>
      <description>&lt;p&gt;A tool to automatically infer columns data types in .csv files&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Wittline/csv-schema-inference"&gt;csv-schema-inference&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The tests were run on 9 .csv files with 21 columns each, varying in size and number of records; for each file, shuffle time and inference time were averaged over 5 executions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;file__20m.csv: 20 million records&lt;/li&gt;
&lt;li&gt;file__15m.csv: 15 million records&lt;/li&gt;
&lt;li&gt;file__12m.csv: 12 million records&lt;/li&gt;
&lt;li&gt;file__10m.csv: 10 million records&lt;/li&gt;
&lt;li&gt;And so on...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to know more about the shuffling process, you can check this other repository: &lt;a href="https://github.com/Wittline/csv-shuffler"&gt;A tool to automatically shuffle lines in .csv files&lt;/a&gt;. The shuffling process helps us to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increase the probability of finding all the data types present in a single column.&lt;/li&gt;
&lt;li&gt;Avoid iterating over the entire dataset.&lt;/li&gt;
&lt;li&gt;Avoid biases in the data that may be part of its organic behavior, arising from not knowing the nature of its construction.&lt;/li&gt;
&lt;/ul&gt;
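&lt;p&gt;The sampling idea above can be sketched as follows; this is a deliberately simplified illustration of inferring column types from a sample of lines, not the repository's actual implementation:&lt;/p&gt;

```python
import random

def infer_type(value):
    """Return the narrowest type name that can represent value."""
    for caster, name in ((int, "integer"), (float, "float")):
        try:
            caster(value)
            return name
        except ValueError:
            pass
    return "string"

def infer_schema(lines, header, sample_size=100, seed=42):
    """Infer one type per column from a random sample of CSV lines."""
    rng = random.Random(seed)
    sample = rng.sample(lines, min(sample_size, len(lines)))
    columns = header.split(",")
    seen = {col: set() for col in columns}
    for line in sample:
        for col, value in zip(columns, line.split(",")):
            seen[col].add(infer_type(value))
    # A column that ever holds a string must be typed string;
    # a mix of integers and floats widens to float.
    schema = {}
    for col, types in seen.items():
        if "string" in types:
            schema[col] = "string"
        elif "float" in types:
            schema[col] = "float"
        else:
            schema[col] = "integer"
    return schema

lines = ["1,1.5,foo", "2,2.0,bar", "3,9.9,baz"]
print(infer_schema(lines, "id,price,name"))
# {'id': 'integer', 'price': 'float', 'name': 'string'}
```

&lt;p&gt;Sampling shuffled lines trades a small chance of missing a rare type for a large reduction in the number of lines scanned.&lt;/p&gt;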

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Hnn3BFkF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h8ak5xtt8eossqtphm9o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Hnn3BFkF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h8ak5xtt8eossqtphm9o.png" alt="Benchmark" width="396" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>bigdata</category>
      <category>csv</category>
      <category>dataset</category>
    </item>
    <item>
      <title>Shuffle lines in .csv files</title>
      <dc:creator>Ramses Alexander Coraspe</dc:creator>
      <pubDate>Sun, 17 Jul 2022 20:31:22 +0000</pubDate>
      <link>https://dev.to/ramsescoraspe/shuffle-lines-in-csv-files-4m7m</link>
      <guid>https://dev.to/ramsescoraspe/shuffle-lines-in-csv-files-4m7m</guid>
      <description>&lt;p&gt;A tool to automatically Shuffle lines in .csv files&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hXvzMIaD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o4imjak8nebknpbkbz2i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hXvzMIaD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o4imjak8nebknpbkbz2i.png" alt="SHUFFLE CSV" width="880" height="186"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Wittline/csv-shuffler"&gt;https://github.com/Wittline/csv-shuffler&lt;/a&gt;&lt;/p&gt;
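&lt;p&gt;For files that fit in memory, the core idea can be sketched in a few lines (the repository handles far larger files than this naive version):&lt;/p&gt;

```python
import random

def shuffle_csv_lines(lines, seed=None):
    """Shuffle the data rows of a CSV, keeping the header line first.
    In-memory version, so only suitable for files that fit in RAM."""
    header, *rows = lines
    random.Random(seed).shuffle(rows)
    return [header] + rows

lines = ["id,name", "1,a", "2,b", "3,c"]
shuffled = shuffle_csv_lines(lines, seed=7)
print(shuffled[0])  # the header stays in place: id,name
```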

</description>
      <category>bigdata</category>
      <category>dataengineering</category>
      <category>csv</category>
    </item>
    <item>
      <title>Schema Inference for Large CSV files</title>
      <dc:creator>Ramses Alexander Coraspe</dc:creator>
      <pubDate>Sat, 09 Jul 2022 15:55:32 +0000</pubDate>
      <link>https://dev.to/ramsescoraspe/schema-inference-for-large-csv-files-2683</link>
      <guid>https://dev.to/ramsescoraspe/schema-inference-for-large-csv-files-2683</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QDR8JyEr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/83ysxv936yq9m51rygrq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QDR8JyEr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/83ysxv936yq9m51rygrq.png" alt="data pipeline" width="880" height="1294"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Wittline/csv-schema-inference"&gt;https://github.com/Wittline/csv-schema-inference&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Data Engineering Projects for Beginners</title>
      <dc:creator>Ramses Alexander Coraspe</dc:creator>
      <pubDate>Wed, 15 Jun 2022 23:40:58 +0000</pubDate>
      <link>https://dev.to/ramsescoraspe/data-engineering-projects-for-beginners-4d56</link>
      <guid>https://dev.to/ramsescoraspe/data-engineering-projects-for-beginners-4d56</guid>
      <description>&lt;p&gt;Hi everyone,&lt;/p&gt;

&lt;p&gt;I am a little bit obsessed with data engineering, and lately I have been working on several open-source projects on this topic. Here is a list of repositories and the technologies used in each one; if you decide to go deeper into this fun world, these repositories can serve as a guide.&lt;/p&gt;

&lt;p&gt;❤ means "I like this one"&lt;/p&gt;

&lt;h3&gt;
  
  
  ❤ &lt;strong&gt;&lt;a href="https://github.com/Wittline/uber-expenses-tracking"&gt;Tracking your Uber Rides and Uber Eats expenses through a data engineering process&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Technologies and skills&lt;/strong&gt;:
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Python, Docker, Apache Airflow, AWS Redshift, Power BI, data modelling, Task scheduling, ETL and ELT processes, Data warehousing, Cloud&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ❤ &lt;strong&gt;&lt;a href="https://github.com/Wittline/pyDag"&gt;Scheduling Big Data Workloads and Data Pipelines in the Cloud with pyDag&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Technologies and skills&lt;/strong&gt;:
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Python, Docker, Big Data, Cloud, BigQuery, Workflow Engines, GCP, Task scheduler, Google Cloud Platform, Dataproc cluster, GCS, Google Cloud Storage, Redis, DAG, Parallel Processing, Apache Spark&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ❤ &lt;strong&gt;&lt;a href="https://github.com/Wittline/pyspark-on-aws-emr"&gt;Building Big Data Pipelines in the Cloud with AWS EMR&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Technologies and skills&lt;/strong&gt;:
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Python, PySpark, AWS EMR, Task scheduling, IAC, EC2 Instances, Apache Spark, Cloud&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ❤ &lt;strong&gt;&lt;a href="https://github.com/Wittline/wbz"&gt;Building a Lossless Data Compression and Data Decompression Pipeline&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Technologies and skills&lt;/strong&gt;:
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Python, Data compression, BZIP2, Parallel programming&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;a href="https://github.com/Wittline/apache-spark-docker"&gt;Learn how to dockerize an Apache Spark Standalone Cluster&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Technologies and skills&lt;/strong&gt;:
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Python, Jupyter Notebook, Apache Spark, Docker, docker-compose, Hive&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ❤ &lt;strong&gt;&lt;a href="https://github.com/Wittline/docker-livy"&gt;Dockerizing and Consuming an Apache Livy environment&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Technologies and skills&lt;/strong&gt;:
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Python, Big Data, Docker, docker-compose, Apache Livy, Apache Spark, PostgreSQL, PySpark, Jupyter Notebook&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ❤ &lt;strong&gt;&lt;a href="https://github.com/Wittline/data-engineer-challenge"&gt;Design, Development and Deployment of a simple Data Pipeline&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Technologies and skills&lt;/strong&gt;:
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Python, data modelling, Docker, docker-compose, PostgreSQL, data pipeline, FastAPI&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;a href="https://github.com/Wittline/data-engineering-challenge-th"&gt;Dockerizing a Python Script for Faster Web Scraping&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Technologies and skills&lt;/strong&gt;:
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Python, Docker, SQLite, Dockerfile, Web scraping, Data pipeline, FastAPI&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;a href="https://github.com/Wittline/distance-metrics"&gt;Understanding Similarity Measures for Text Analysis&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Technologies and skills&lt;/strong&gt;:
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Python, Machine Learning, Similarity measures, Distance metrics, Text Analysis&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ❤ &lt;strong&gt;&lt;a href="https://github.com/Wittline/recommendation-system"&gt;Learn how to build a content-based Movie Recommender System&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Technologies and skills&lt;/strong&gt;:
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Python, Machine Learning, TF-IDF, Cosine similarity, BM25, BERT, NLP, word2vec, Text Analysis, recsys&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;a href="https://github.com/Wittline/text-analysis-speeches-amlo"&gt;A Text Analysis of Speeches&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Technologies and skills&lt;/strong&gt;:
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Python, Machine Learning, NLP, word2vec, Text Analysis, Sentiment Analysis, PCA, t-SNE, Word Embeddings, Text Preprocessing, Web scraping, Data Visualization, Mexico&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ❤ &lt;strong&gt;&lt;a href="https://github.com/Wittline/Dropout-Students-Prediction"&gt;Dropout Students Prediction&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Technologies and skills&lt;/strong&gt;:
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;R, Genetic algorithm, Neural Networks, K-Means, Clustering, Machine Learning&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  I will be working on more complex projects in the coming months using modern data tech stacks.
&lt;/h2&gt;

</description>
      <category>database</category>
      <category>dataengineering</category>
      <category>python</category>
    </item>
    <item>
      <title>Architecture of an Amazon Redshift cluster</title>
      <dc:creator>Ramses Alexander Coraspe</dc:creator>
      <pubDate>Wed, 15 Jun 2022 17:33:37 +0000</pubDate>
      <link>https://dev.to/ramsescoraspe/architecture-of-an-amazon-redshift-cluster-4lp0</link>
      <guid>https://dev.to/ramsescoraspe/architecture-of-an-amazon-redshift-cluster-4lp0</guid>
      <description>&lt;p&gt;&lt;strong&gt;The image above shows the basic architecture of an Amazon Redshift cluster, it is summarized below:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The total number of nodes in the Redshift cluster equals the number of EC2 instances used in the cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Each slice in a Redshift cluster has at least 1 CPU with dedicated memory and storage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The image above shows a cluster with 4 nodes, each containing 4 slices, so the maximum number of partitions per table is 16.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The leader node coordinates the lower-level nodes, manages external communications, and optimizes queries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The lower-level nodes (compute nodes) each have their own CPU, memory, and disk, depending on the type of EC2 instance selected. This architecture can "&lt;strong&gt;scale out&lt;/strong&gt;" (add more nodes to the cluster) or "&lt;strong&gt;scale up&lt;/strong&gt;" (add more resources to a specific node).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>aws</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>TF-IDF FROM SCRATCH</title>
      <dc:creator>Ramses Alexander Coraspe</dc:creator>
      <pubDate>Sun, 19 Sep 2021 07:48:16 +0000</pubDate>
      <link>https://dev.to/ramsescoraspe/tf-idf-from-scratch-3f3k</link>
      <guid>https://dev.to/ramsescoraspe/tf-idf-from-scratch-3f3k</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wBAo1P2B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bk90u7p5wkvrsrf5tlgx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wBAo1P2B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bk90u7p5wkvrsrf5tlgx.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sXTStqe3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g2gsi2a8qzmud8kkywuo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sXTStqe3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g2gsi2a8qzmud8kkywuo.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Wittline/tf-idf"&gt;https://github.com/Wittline/tf-idf&lt;/a&gt;&lt;/p&gt;
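&lt;p&gt;A from-scratch version can be sketched as follows. This simplified illustration uses raw term counts normalized by document length for TF and log(N/df) for IDF, which is one common variant among several (not necessarily the repository's exact formulas):&lt;/p&gt;

```python
import math
from collections import Counter

def tf_idf(documents):
    """TF-IDF for tokenized documents: tf = count / doc length,
    idf = log(N / document frequency)."""
    n = len(documents)
    df = Counter()  # document frequency: how many docs contain each term
    for doc in documents:
        df.update(set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return scores

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
scores = tf_idf(docs)
# "the" appears in every document, so its idf (and hence its score) is 0.
print(scores[0]["the"])  # 0.0
```

&lt;p&gt;The zero score for "the" is exactly why TF-IDF downweights ubiquitous terms: a word that appears everywhere carries no discriminating information.&lt;/p&gt;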

</description>
      <category>tfidf</category>
      <category>nlp</category>
      <category>featureengineering</category>
    </item>
  </channel>
</rss>
