<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Arnaud PELAMA PELAMA TIOGO</title>
    <description>The latest articles on DEV Community by Arnaud PELAMA PELAMA TIOGO (@arnaud2911).</description>
    <link>https://dev.to/arnaud2911</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2072118%2F6ff39619-d151-4b7e-9c2f-7ab1a6928b58.png</url>
      <title>DEV Community: Arnaud PELAMA PELAMA TIOGO</title>
      <link>https://dev.to/arnaud2911</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/arnaud2911"/>
    <language>en</language>
    <item>
      <title>Analyzing Large E-commerce Data: My Journey into Data Science</title>
      <dc:creator>Arnaud PELAMA PELAMA TIOGO</dc:creator>
      <pubDate>Tue, 08 Oct 2024 09:38:28 +0000</pubDate>
      <link>https://dev.to/arnaud2911/analyzing-large-e-commerce-data-my-journey-into-data-science-1mbm</link>
      <guid>https://dev.to/arnaud2911/analyzing-large-e-commerce-data-my-journey-into-data-science-1mbm</guid>
      <description>&lt;p&gt;As part of my journey to becoming a Data Scientist, I recently completed an exciting project focused on analyzing a large e-commerce dataset. The project pushed my data analysis and visualization skills to new levels, and I’d love to share some of the challenges, insights, and growth I experienced during this process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fugtew411it3wbd8fmd0r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fugtew411it3wbd8fmd0r.png" alt="Seaborn Plot of Fictional Sales Over Time" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;This project was a deep dive into &lt;strong&gt;large-scale data analysis&lt;/strong&gt;. I had the opportunity to work with a synthetic e-commerce dataset containing 1 million transactions, which mimicked real-world big data scenarios. Here are some of the key takeaways from the project:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Cleaning&lt;/strong&gt;: Handling missing data and outliers in a dataset of this magnitude was one of the primary challenges. I utilized a combination of imputation techniques (mean, median, and interpolation) and statistical analysis to manage anomalies. This reinforced my understanding of handling real-world messy data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Manipulation and Aggregation&lt;/strong&gt;: I became more proficient at grouping data and performing summary statistics with Pandas. I applied advanced techniques like multi-level aggregations and pivot tables to extract insights from the dataset. This taught me the importance of efficient data manipulation in large-scale projects.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Visualization&lt;/strong&gt;: Creating meaningful visualizations from such a large dataset required a more structured approach. I used Matplotlib and Seaborn extensively to plot sales trends and customer demographics. This helped me improve my skills in visual storytelling through data.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
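The cleaning and aggregation steps above can be sketched with Pandas on a toy frame; the column names here are illustrative, not the project's exact schema:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset standing in for the 1M-row e-commerce data.
df = pd.DataFrame({
    "category": ["Books", "Books", "Toys", "Toys", "Toys"],
    "region":   ["EU", "US", "EU", "US", "US"],
    "price":    [12.0, np.nan, 30.0, 28.0, 300.0],  # one missing value, one outlier
    "quantity": [1, 2, 1, 3, 1],
})

# 1. Cleaning: median imputation, then clip extreme outliers at the 95th percentile.
df["price"] = df["price"].fillna(df["price"].median())
df["price"] = df["price"].clip(upper=df["price"].quantile(0.95))

# 2. Aggregation: a multi-level group-by and a pivot table of total quantity.
summary = df.groupby(["category", "region"])["price"].agg(["mean", "count"])
pivot = df.pivot_table(values="quantity", index="category",
                       columns="region", aggfunc="sum", fill_value=0)
```

The same pattern scales to the full dataset; only the imputation and outlier thresholds need tuning per column.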

&lt;h2&gt;
  
  
  Challenges I Faced
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Scale&lt;/strong&gt;: Working with 1 million rows of data required optimization in both code and tools. Loading the dataset efficiently, ensuring memory management, and executing computationally expensive tasks like group-bys and aggregations challenged me to focus on performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Handling Missing and Anomalous Data&lt;/strong&gt;: Given the size of the dataset, dealing with missing and erroneous data took time and careful planning. Different imputation strategies were needed for different types of anomalies, and this highlighted the importance of tailored data cleaning strategies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Maintaining Code Structure&lt;/strong&gt;: As the project grew, organizing code into reusable functions became vital. Modularizing the code was key to keeping it clean, readable, and scalable.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
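On the data-scale point, one sketch of the kind of memory optimization that helps at a million rows is downcasting numeric dtypes and converting repeated strings to categoricals (the columns here are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical transactions frame; in the real project this came from a large CSV.
df = pd.DataFrame({
    "customer_id": np.arange(100_000),
    "category": np.random.choice(["Books", "Toys", "Games"], size=100_000),
    "price": np.random.uniform(1, 500, size=100_000),
})

before = df.memory_usage(deep=True).sum()

# Downcast numerics and switch repeated strings to the categorical dtype.
df["customer_id"] = pd.to_numeric(df["customer_id"], downcast="unsigned")
df["price"] = pd.to_numeric(df["price"], downcast="float")
df["category"] = df["category"].astype("category")

after = df.memory_usage(deep=True).sum()
```

For files that do not fit in memory at all, `pd.read_csv(..., chunksize=...)` processes the data in pieces instead.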

&lt;h2&gt;
  
  
  Growth Through the Project
&lt;/h2&gt;

&lt;p&gt;This project marked a significant milestone in my growth as a data scientist. Through it, I:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enhanced my ability to &lt;strong&gt;handle large datasets&lt;/strong&gt; efficiently.&lt;/li&gt;
&lt;li&gt;Strengthened my understanding of &lt;strong&gt;data cleaning&lt;/strong&gt; techniques, particularly in handling missing values and outliers in big data.&lt;/li&gt;
&lt;li&gt;Improved my proficiency with &lt;strong&gt;data manipulation tools&lt;/strong&gt; like Pandas for aggregation and summarization.&lt;/li&gt;
&lt;li&gt;Gained experience in &lt;strong&gt;data visualization&lt;/strong&gt; using advanced tools like Seaborn, where I focused on creating impactful plots that tell a story.&lt;/li&gt;
&lt;li&gt;Grew my ability to &lt;strong&gt;generate synthetic datasets&lt;/strong&gt;, which is crucial for testing models and running simulations in data science.&lt;/li&gt;
&lt;/ul&gt;
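As a minimal sketch of the synthetic-data-generation skill mentioned above, here is one way to fabricate an e-commerce table with NumPy and Pandas, scaled down from a million rows; the schema and column names are assumptions for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000  # scaled down from the project's 1 million transactions

# Hypothetical schema; column names are illustrative, not the project's exact ones.
orders = pd.DataFrame({
    "order_id": np.arange(n),
    "order_date": pd.to_datetime("2023-01-01")
                  + pd.to_timedelta(rng.integers(0, 365, n), unit="D"),
    "category": rng.choice(["Books", "Toys", "Electronics"], size=n, p=[0.5, 0.3, 0.2]),
    "amount": rng.lognormal(mean=3.0, sigma=0.5, size=n).round(2),
})

# Inject missing values into 5% of rows to mimic messy real-world data.
orders.loc[orders.sample(frac=0.05, random_state=0).index, "amount"] = np.nan
```

A log-normal amount distribution and deliberately injected gaps make the synthetic data a more realistic test bed for the cleaning pipeline.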

&lt;h2&gt;
  
  
  Versatility in the Project
&lt;/h2&gt;

&lt;p&gt;One of the strengths of this project was the versatility of tasks and concepts covered, from data generation to cleaning, summarization, and visualization. This end-to-end project offered me a holistic view of the data science process, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data cleaning&lt;/strong&gt; (handling missing values, imputing, and outlier detection).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data manipulation&lt;/strong&gt; (filtering, sorting, and summarizing large datasets).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregated analysis&lt;/strong&gt; (group-bys, pivot tables, and multi-level aggregation).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data visualization&lt;/strong&gt; (sales trends, customer segmentation).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Random data simulation&lt;/strong&gt; (for dataset creation and testing).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This range of topics mirrors the versatility required in real-life data science projects, where a data scientist often needs to wear many hats and work across multiple aspects of the data pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Seeking Feedback
&lt;/h2&gt;

&lt;p&gt;I’d love to hear feedback from the community! Whether it’s on how I could optimize my code further, alternative techniques for handling missing data, or even suggestions for the visualizations—every piece of feedback is valuable. Feel free to share your thoughts or ask questions in the comments section below!&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;What’s Next?&lt;/strong&gt;&lt;br&gt;
As I continue my journey, my next steps include incorporating machine learning models into my projects, analyzing real-world datasets, and exploring customer segmentation techniques.&lt;/p&gt;

&lt;p&gt;Thanks for reading! If you have any insights or suggestions for this project, I’m excited to learn from the community.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Analyzing and Simulating Stock Prices: My Journey into Data Science</title>
      <dc:creator>Arnaud PELAMA PELAMA TIOGO</dc:creator>
      <pubDate>Tue, 01 Oct 2024 15:11:50 +0000</pubDate>
      <link>https://dev.to/arnaud2911/analyzing-and-simulating-stock-prices-my-journey-into-data-science-5a08</link>
      <guid>https://dev.to/arnaud2911/analyzing-and-simulating-stock-prices-my-journey-into-data-science-5a08</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjospw3ckmw24beqozvbw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjospw3ckmw24beqozvbw.png" alt="Stock Prices Over Time" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As I continue my journey into data science, I recently completed a capstone project titled &lt;strong&gt;"Stock Price Analysis and Simulation"&lt;/strong&gt;. This project was both challenging and rewarding, pushing me to apply and expand my skills in data cleaning, statistical analysis, visualization, and more. In this blog post, I'd like to reflect on what I learned, the challenges I faced, and how this project has contributed to my growth as an aspiring data scientist.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Overview
&lt;/h2&gt;

&lt;p&gt;The primary goal of this project was to analyze and simulate stock price data over a 20-year period. The dataset included various anomalies and missing values to mimic real-world data issues. The key objectives were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Cleaning&lt;/strong&gt;: Handling missing values and anomalies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Manipulation&lt;/strong&gt;: Calculating daily returns and adding new features.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statistical Analysis&lt;/strong&gt;: Computing mean, median, standard deviation, and identifying significant trends.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Visualization&lt;/strong&gt;: Creating plots to visualize stock prices and returns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Random Simulation&lt;/strong&gt;: Simulating future stock price paths using statistical methods.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can find the full project on my &lt;a href="https://github.com/arnaud2911/My-Journey-Into-Data-Science/tree/main/04_Stock_Price_Analysis" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Data Cleaning and Preprocessing
&lt;/h3&gt;

&lt;p&gt;One of the first challenges I encountered was handling the large dataset that spanned 20 years. The dataset included intentional anomalies such as &lt;code&gt;'missing'&lt;/code&gt;, &lt;code&gt;'error'&lt;/code&gt;, and &lt;code&gt;None&lt;/code&gt; values. I learned how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use Pandas to identify and replace invalid entries with &lt;code&gt;np.nan&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Convert data types appropriately, ensuring numerical columns were indeed numerical.&lt;/li&gt;
&lt;li&gt;Apply interpolation methods to fill in missing values effectively.&lt;/li&gt;
&lt;/ul&gt;
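The three cleaning steps above fit in a few lines of Pandas; this sketch uses a tiny hypothetical slice of the price series containing the intentional anomalies:

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the price column with the intentional anomalies.
raw = pd.Series(["100.5", "missing", "101.2", "error", None, "103.0"])

# Non-numeric entries ('missing', 'error', None) become NaN in one pass,
# and the column ends up with a proper float dtype.
prices = pd.to_numeric(raw, errors="coerce")

# Linear interpolation fills each gap from the neighbouring valid prices.
prices = prices.interpolate(method="linear")
```

`errors="coerce"` handles the invalid-entry replacement and the type conversion in a single step, which matters at 20 years of daily rows.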

&lt;p&gt;&lt;strong&gt;Key Takeaway&lt;/strong&gt;: Real-world data is messy. Developing robust data cleaning skills is essential for any data scientist.&lt;/p&gt;

&lt;h3&gt;
  
  
  Statistical Analysis
&lt;/h3&gt;

&lt;p&gt;Calculating statistical measures like mean, median, and standard deviation was straightforward. However, interpreting these statistics in the context of stock prices required deeper understanding.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identified days with unusually high returns by setting thresholds.&lt;/li&gt;
&lt;li&gt;Gained insights into the stock's volatility over the 20-year period.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Takeaway&lt;/strong&gt;: Statistical analysis is not just about computation but also about deriving meaningful insights from the results.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Visualization
&lt;/h3&gt;

&lt;p&gt;Creating clear and informative visualizations was crucial for interpreting the data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Used Matplotlib and Seaborn to create line plots and histograms.&lt;/li&gt;
&lt;li&gt;Learned how to customize plots for better readability (e.g., adjusting figure sizes, labels, and titles).&lt;/li&gt;
&lt;li&gt;Saved visualizations programmatically to the appropriate directories.&lt;/li&gt;
&lt;/ul&gt;
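A minimal version of the customize-and-save workflow looks like this; the output directory and filename are assumptions, and the `Agg` backend keeps it runnable without a display:

```python
import os

import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical price series standing in for the cleaned 20-year data.
prices = 100 + np.cumsum(np.random.default_rng(0).normal(0, 1, 250))

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(prices, color="steelblue")
ax.set_title("Stock Price Over Time")
ax.set_xlabel("Trading day")
ax.set_ylabel("Price")
fig.tight_layout()

# Save programmatically to a (hypothetical) figures directory.
os.makedirs("figures", exist_ok=True)
fig.savefig("figures/stock_price.png", dpi=150)
plt.close(fig)
```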

&lt;p&gt;&lt;strong&gt;Key Takeaway&lt;/strong&gt;: Effective visualization amplifies the impact of data analysis by making complex data understandable at a glance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Simulating Stock Prices
&lt;/h3&gt;

&lt;p&gt;Implementing a random walk model to simulate future stock prices was both exciting and challenging.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Used historical daily returns to inform the simulation.&lt;/li&gt;
&lt;li&gt;Generated random returns based on the calculated mean and standard deviation.&lt;/li&gt;
&lt;li&gt;Visualized the simulated stock price path over the next 30 days.&lt;/li&gt;
&lt;/ul&gt;
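The random walk described above reduces to a short NumPy sketch: estimate the mean and standard deviation of historical daily returns, draw 30 future returns from that distribution, and compound them from the last observed price. The historical returns here are simulated stand-ins:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical historical daily returns; in the project these came from the data.
historical_returns = rng.normal(0.0005, 0.02, size=5000)
mu, sigma = historical_returns.mean(), historical_returns.std()

# Random walk: draw 30 future daily returns and compound from the last price.
last_price = 150.0
simulated_returns = rng.normal(mu, sigma, size=30)
path = last_price * np.cumprod(1 + simulated_returns)
```

Running this many times (a Monte Carlo ensemble) gives a distribution of outcomes rather than a single path, which is a more honest picture of the uncertainty.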

&lt;p&gt;&lt;strong&gt;Key Takeaway&lt;/strong&gt;: Simulation models can provide valuable forecasts but should be interpreted cautiously due to inherent uncertainties.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code Modularity and Documentation
&lt;/h3&gt;

&lt;p&gt;To enhance the readability and reusability of my code, I:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Broke down the script into modular functions such as &lt;code&gt;load_data()&lt;/code&gt;, &lt;code&gt;clean_data()&lt;/code&gt;, &lt;code&gt;perform_statistical_analysis()&lt;/code&gt;, etc.&lt;/li&gt;
&lt;li&gt;Added docstrings and comments to explain the purpose of each function and code block.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Takeaway&lt;/strong&gt;: Writing clean, well-documented code is essential, especially for collaborative projects and future reference.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges Faced
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Handling Large Datasets
&lt;/h3&gt;

&lt;p&gt;Working with a dataset that spanned 20 years posed performance challenges.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory Management&lt;/strong&gt;: Ensured efficient use of memory by processing data in chunks when necessary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Computation Time&lt;/strong&gt;: Optimized code to reduce execution time, especially during simulations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Dealing with Anomalies
&lt;/h3&gt;

&lt;p&gt;The intentional introduction of anomalies required robust cleaning methods.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Type Conflicts&lt;/strong&gt;: Encountered errors when mixing data types within columns, necessitating careful type conversions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interpolation Limitations&lt;/strong&gt;: Had to choose appropriate interpolation methods that made sense in the context of stock prices.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Simulation Accuracy
&lt;/h3&gt;

&lt;p&gt;Balancing the simplicity of the random walk model with the complexity of real stock movements was tricky.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Acknowledged that the model doesn't account for market factors and external events.&lt;/li&gt;
&lt;li&gt;Considered incorporating more sophisticated models in future iterations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Demonstrated Growth
&lt;/h2&gt;

&lt;p&gt;Looking back, this project was a significant step forward in my data science journey.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Technical Skills&lt;/strong&gt;: Enhanced my proficiency in Pandas, NumPy, Matplotlib, and Seaborn.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Problem-Solving&lt;/strong&gt;: Developed strategies to tackle data-related challenges systematically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project Management&lt;/strong&gt;: Improved my ability to plan, execute, and document a comprehensive data analysis project.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Highlighting Versatility
&lt;/h2&gt;

&lt;p&gt;This project allowed me to explore various aspects of data science:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Engineering&lt;/strong&gt;: From data generation to cleaning and preprocessing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statistical Modeling&lt;/strong&gt;: Applying statistical concepts to real-world data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visualization&lt;/strong&gt;: Communicating findings effectively through visual means.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Programming Best Practices&lt;/strong&gt;: Writing modular, maintainable code with proper documentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Seeking Feedback
&lt;/h2&gt;

&lt;p&gt;I'm eager to learn and grow further. If you have any suggestions, critiques, or insights, I'd love to hear from you!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: Feel free to &lt;a href="https://github.com/arnaud2911/My-Journey-Into-Data-Science/issues" rel="noopener noreferrer"&gt;open an issue&lt;/a&gt; or submit a pull request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comments&lt;/strong&gt;: Leave a comment below with your thoughts or questions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contact&lt;/strong&gt;: You can reach me at &lt;a href="mailto:pelama.arnaud@gmail.com"&gt;pelama.arnaud@gmail.com&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Embarking on the &lt;strong&gt;Stock Price Analysis and Simulation&lt;/strong&gt; project was both challenging and enlightening. It not only solidified my existing knowledge but also pushed me to acquire new skills. I'm excited to continue this journey and take on more complex projects in the future.&lt;/p&gt;

&lt;p&gt;Thank you for taking the time to read about my experience. Your feedback and support are invaluable!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Happy coding!&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Analyzing NYC Schools' SAT Performance: Reflections on this project</title>
      <dc:creator>Arnaud PELAMA PELAMA TIOGO</dc:creator>
      <pubDate>Wed, 18 Sep 2024 15:15:24 +0000</pubDate>
      <link>https://dev.to/arnaud2911/analyzing-nyc-schools-sat-performance-reflections-this-project-2lga</link>
      <guid>https://dev.to/arnaud2911/analyzing-nyc-schools-sat-performance-reflections-this-project-2lga</guid>
      <description>&lt;p&gt;As part of my journey to becoming a Data Scientist, I've recently completed a project titled &lt;strong&gt;NYC Schools Analysis&lt;/strong&gt;. This project involved analyzing SAT performance data of New York City schools to identify top-performing schools and boroughs. In this blog post, I'd like to reflect on what I learned, the challenges I faced, how this project contributed to my growth, and seek feedback from the community.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Data Cleaning and Preprocessing
&lt;/h3&gt;

&lt;p&gt;One of the critical aspects of this project was &lt;strong&gt;data cleaning&lt;/strong&gt;. I learned how to handle missing values, ensure correct data types, and remove duplicate entries. This process was essential to prepare the data for accurate analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Analysis with Pandas and NumPy
&lt;/h3&gt;

&lt;p&gt;Using &lt;strong&gt;pandas&lt;/strong&gt; and &lt;strong&gt;NumPy&lt;/strong&gt;, I performed statistical analyses to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identify schools with the best math results (at least 80% of the maximum possible score).&lt;/li&gt;
&lt;li&gt;Determine the top 10 performing schools based on combined SAT scores.&lt;/li&gt;
&lt;li&gt;Find the borough with the largest standard deviation in combined SAT scores.&lt;/li&gt;
&lt;/ul&gt;
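The three analyses above can be sketched on a toy stand-in for the schools dataset (school names and scores here are illustrative):

```python
import pandas as pd

# Toy stand-in for the NYC schools dataset; names and scores are illustrative.
schools = pd.DataFrame({
    "school": ["A", "B", "C", "D"],
    "borough": ["Manhattan", "Brooklyn", "Manhattan", "Queens"],
    "average_math": [720, 650, 660, 590],
    "average_reading": [680, 600, 640, 560],
    "average_writing": [700, 610, 650, 570],
})

# 1. Best math schools: at least 80% of the 800-point section maximum.
best_math = schools[schools["average_math"].ge(0.8 * 800)]

# 2. Top schools by combined SAT score (top 10 in the full dataset).
score_cols = ["average_math", "average_reading", "average_writing"]
schools["total_SAT"] = schools[score_cols].sum(axis=1)
top_schools = schools.nlargest(2, "total_SAT")

# 3. Borough with the largest standard deviation in combined scores.
borough_std = schools.groupby("borough")["total_SAT"].std()
largest_std_borough = borough_std.idxmax()
```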

&lt;p&gt;This enhanced my understanding of data manipulation and statistical calculations in Python.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Visualization with Matplotlib
&lt;/h3&gt;

&lt;p&gt;Creating visual representations of data was a significant learning point. I used &lt;strong&gt;Matplotlib&lt;/strong&gt; to generate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bar charts showing the top 10 schools.&lt;/li&gt;
&lt;li&gt;Histograms of combined SAT scores.&lt;/li&gt;
&lt;li&gt;Box plots to visualize SAT score distributions by borough.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These visualizations helped in conveying insights more effectively.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenges Faced
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Handling Limited Data
&lt;/h3&gt;

&lt;p&gt;Initially, I worked with a small dataset represented as a Python dictionary. This limited the depth of analysis. To overcome this, I expanded the dataset by adding more schools and varying the scores to simulate a more realistic scenario.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Cleaning Complexities
&lt;/h3&gt;

&lt;p&gt;Ensuring data integrity was challenging. Dealing with missing values and potential data entry errors required meticulous attention. I had to decide whether to impute missing values or exclude certain data points, balancing between data accuracy and completeness.&lt;/p&gt;

&lt;h3&gt;
  
  
  Visualization Nuances
&lt;/h3&gt;

&lt;p&gt;Creating meaningful visualizations was more complex than anticipated. Choosing the right type of chart and customizing it for clarity took several iterations. Aligning the visual style to make the plots both informative and aesthetically pleasing was a valuable exercise.&lt;/p&gt;




&lt;h2&gt;
  
  
  Demonstrating Growth
&lt;/h2&gt;

&lt;p&gt;This project was a significant milestone in my learning journey. Here's how it contributed to my growth:&lt;/p&gt;

&lt;h3&gt;
  
  
  Enhanced Technical Skills
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pandas and NumPy&lt;/strong&gt;: Deepened my ability to manipulate and analyze data using these libraries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Matplotlib&lt;/strong&gt;: Improved my skills in data visualization, which is crucial for data storytelling.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Improved Code Organization
&lt;/h3&gt;

&lt;p&gt;By modularizing the code into functions such as &lt;code&gt;load_data()&lt;/code&gt;, &lt;code&gt;clean_data()&lt;/code&gt;, and &lt;code&gt;visualize_data()&lt;/code&gt;, I learned the importance of code reusability and readability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-World Application Awareness
&lt;/h3&gt;

&lt;p&gt;Working on this project bridged the gap between theoretical knowledge and real-world application. It provided insights into how data science can inform educational outcomes and policy-making.&lt;/p&gt;




&lt;h2&gt;
  
  
  Highlighting Versatility
&lt;/h2&gt;

&lt;p&gt;This project covered a range of topics and challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Cleaning&lt;/strong&gt;: Handling missing values, data types, and duplicates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statistical Analysis&lt;/strong&gt;: Calculating means, standard deviations, and interpreting them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Visualization&lt;/strong&gt;: Creating various charts to represent data insights.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python Programming&lt;/strong&gt;: Writing efficient, modular code with proper documentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By integrating these elements, I developed a more holistic understanding of the data science workflow.&lt;/p&gt;




&lt;h2&gt;
  
  
  Seeking Feedback
&lt;/h2&gt;

&lt;p&gt;I'm eager to improve and learn from the community. Here are a few areas where I'd appreciate your insights:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Visualization Best Practices&lt;/strong&gt;: How can I enhance my charts for better clarity and impact? Are there other libraries or tools you recommend?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Statistical Analysis Depth&lt;/strong&gt;: What additional statistical methods could provide more insights into the data?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real Dataset Integration&lt;/strong&gt;: Suggestions on sourcing real NYC school data and handling potential complexities that come with larger datasets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Code Optimization&lt;/strong&gt;: Any advice on making the code more efficient or readable would be highly valued.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Completing the &lt;strong&gt;NYC Schools Analysis&lt;/strong&gt; project was both challenging and rewarding. It allowed me to apply and expand my skills in data cleaning, analysis, and visualization. I'm excited to continue this journey and tackle more complex projects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feel free to check out the project on my GitHub repository:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/arnaud2911/My-Journey-To-Data-Science/tree/main/03_NYC_Schools_Analysis" rel="noopener noreferrer"&gt;NYC Schools Analysis Project&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Thank you for taking the time to read about my project. I look forward to your feedback and suggestions!&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Connect with Me:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Email&lt;/strong&gt;: &lt;a href="mailto:pelama.arnaud@gmail.com"&gt;pelama.arnaud@gmail.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LinkedIn&lt;/strong&gt;: &lt;a href="https://www.linkedin.com/in/arnaud-pelama-pelama-tiogo/" rel="noopener noreferrer"&gt;Arnaud PELAMA PELAMA TIOGO&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/arnaud2911" rel="noopener noreferrer"&gt;Arnaud PELAMA PELAMA TIOGO&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Processing Customer Reviews with Python: My Journey into Data Science</title>
      <dc:creator>Arnaud PELAMA PELAMA TIOGO</dc:creator>
      <pubDate>Sat, 14 Sep 2024 14:27:34 +0000</pubDate>
      <link>https://dev.to/arnaud2911/processing-customer-reviews-with-python-my-journey-into-data-science-hjj</link>
      <guid>https://dev.to/arnaud2911/processing-customer-reviews-with-python-my-journey-into-data-science-hjj</guid>
      <description>&lt;p&gt;As part of my ongoing journey to become a Data Scientist, I’ve been tackling a variety of projects that challenge me to apply new concepts, think critically, and build practical solutions to real-world problems. One such project is &lt;strong&gt;Customer Review Processing&lt;/strong&gt;, where I learned how to manipulate text data, process customer reviews, and handle common text formatting issues using Python.&lt;/p&gt;

&lt;p&gt;In this blog post, I'll reflect on what I learned, the challenges I faced, how this project helped me grow as a data scientist, and how it highlights the versatility of skills needed for text processing in data science. I would love feedback from the community on how I can further improve this project!&lt;/p&gt;




&lt;h2&gt;
  
  
  📝 Project Overview
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Customer Review Processing&lt;/strong&gt; project was focused on cleaning and processing a set of customer reviews, each containing special characters like newline characters (&lt;code&gt;\n&lt;/code&gt;) and quotation marks (&lt;code&gt;"&lt;/code&gt;). The goal was to flatten these reviews into a single line of text, escape any quotation marks, and concatenate them into a single string, which would be stored in a file for future use.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Tasks:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Create a list of review strings.&lt;/li&gt;
&lt;li&gt;Escape any quotation marks and flatten newline characters.&lt;/li&gt;
&lt;li&gt;Concatenate the reviews into a single string with custom separators.&lt;/li&gt;
&lt;li&gt;Print the concatenated string and save it to a text file.&lt;/li&gt;
&lt;/ol&gt;
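The four tasks above fit in a short script; the sample reviews and the output filename here are hypothetical:

```python
# Hypothetical reviews with the kinds of characters the project handles.
reviews = [
    'Great product!\nWould buy again.',
    'The "deluxe" model arrived late.',
    '',  # empty review: skipped rather than crashing the script
]

def clean_review(text):
    """Flatten newlines and escape quotation marks for safe storage (e.g., CSV)."""
    flat = text.replace("\n", " ")
    return flat.replace('"', '\\"')

# Skip empty reviews, clean the rest, and join with the custom separator.
cleaned = [clean_review(r) for r in reviews if r]
combined = " || ".join(cleaned)

with open("reviews_processed.txt", "w", encoding="utf-8") as f:
    f.write(combined)
```

`str.join()` builds the combined string in one pass, which is both cleaner and faster than repeated `+=` concatenation in a loop.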




&lt;h2&gt;
  
  
  💡 What I Learned
&lt;/h2&gt;

&lt;p&gt;This project gave me valuable insights into &lt;strong&gt;string manipulation&lt;/strong&gt; and &lt;strong&gt;text processing&lt;/strong&gt; in Python. Here’s what I learned:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Working with Special Characters&lt;/strong&gt;: I gained a deeper understanding of handling special characters like newline (&lt;code&gt;\n&lt;/code&gt;) and quotation marks (&lt;code&gt;"&lt;/code&gt;) in strings. This was particularly important for preparing text data for further analysis or safe storage in formats like CSV.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;File Handling&lt;/strong&gt;: I learned how to write processed text data to files efficiently. Understanding how to open, write, and close files in Python was crucial to ensure the data was saved correctly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Using List Operations&lt;/strong&gt;: The project involved iterating through a list of reviews and applying transformations to each string. I improved my skills in working with lists and Python’s built-in methods like &lt;code&gt;.replace()&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  🚧 Challenges I Faced
&lt;/h2&gt;

&lt;p&gt;While working on this project, I encountered a few challenges:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Handling Special Cases&lt;/strong&gt;: I initially overlooked certain special cases, like handling empty reviews or reviews that didn’t contain any quotation marks. This required some error handling to ensure the script didn’t crash when encountering unexpected input.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;String Concatenation&lt;/strong&gt;: Finding an efficient way to concatenate multiple strings with custom separators (e.g., &lt;code&gt;" || "&lt;/code&gt;) was a bit tricky at first. Python’s &lt;code&gt;join()&lt;/code&gt; method turned out to be an elegant solution, but it took some trial and error to implement it effectively.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Maintaining Readability&lt;/strong&gt;: As the code became more complex, I realized the importance of keeping the code readable and well-organized. Refactoring the code into functions not only made it more modular but also easier to understand and maintain.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  📈 Demonstrating Growth
&lt;/h2&gt;

&lt;p&gt;This project was a great opportunity for me to apply skills I had learned previously while developing new ones. Here’s how I’ve grown:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Modular Code&lt;/strong&gt;: I learned the importance of breaking down the logic into reusable functions, which improved the structure and readability of the code. This skill is crucial as I take on larger projects that require more organization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Better Error Handling&lt;/strong&gt;: In this project, I began incorporating basic error handling to catch edge cases, something I had not done in earlier projects. This has made my code more robust and prepared me for more complex scenarios.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Text Data Processing&lt;/strong&gt;: This project allowed me to dive deeper into text processing. Understanding how to clean and prepare text data is a vital skill in data science, especially when dealing with unstructured data like customer reviews.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  🎯 Versatility of Skills Applied
&lt;/h2&gt;

&lt;p&gt;This project spanned several important areas of data science and software development:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;String Manipulation&lt;/strong&gt;: Flattening text and escaping special characters are common tasks when working with text data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File Handling&lt;/strong&gt;: Saving processed data for future use is an essential step in data pipelines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Problem-Solving&lt;/strong&gt;: I had to think critically about how to approach string operations and ensure that the output was correct and usable in different contexts (e.g., CSV storage).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Structuring&lt;/strong&gt;: Organizing code into functions and keeping it modular made the script easier to maintain and extend in the future.&lt;/li&gt;
&lt;/ul&gt;
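&lt;p&gt;The file-handling and escaping points above can be sketched with the standard-library &lt;code&gt;csv&lt;/code&gt; module, which quotes commas and double quotes for you so the output stays parseable. The &lt;code&gt;save_reviews&lt;/code&gt; function and filename are illustrative, not the project's actual code:&lt;/p&gt;

```python
import csv

def save_reviews(reviews, path):
    """Write one cleaned review per row; csv.writer handles
    quoting, so commas inside a review do not break the file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["review"])  # header row
        for review in reviews:
            writer.writerow([review])

save_reviews(['Loved it, "five stars"', "Arrived late"], "cleaned_reviews.csv")
```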




&lt;h2&gt;🔄 What’s Next?&lt;/h2&gt;

&lt;p&gt;While I’m satisfied with what I’ve accomplished in this project, there’s always room for improvement. Here are a few ideas I have for extending and improving the project:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Read from a CSV file&lt;/strong&gt;: Instead of hardcoding reviews in the script, I could extend the project to read reviews from an external CSV file, process them, and write the cleaned reviews to another file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Add More Robust Error Handling&lt;/strong&gt;: Currently, the script assumes valid input. Adding more comprehensive error handling would make the project more robust when dealing with real-world data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Natural Language Processing (NLP)&lt;/strong&gt;: I plan to take this project further by exploring basic NLP techniques, such as sentiment analysis, on the processed reviews to extract more insights from the text data.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;🙏 Feedback and Suggestions&lt;/h2&gt;

&lt;p&gt;I’m always looking for ways to improve! If you have any suggestions on how I can enhance this project or if you spot areas that could be optimized, please leave a comment or reach out to me. Whether it’s better string handling techniques or ways to improve file processing, I’m eager to learn from the community.&lt;/p&gt;

&lt;p&gt;Thank you for taking the time to read about my project, and I look forward to your feedback!&lt;/p&gt;




&lt;h3&gt;Connect With Me:&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/arnaud2911" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/in/arnaud-pelama-pelama-tiogo/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Building a Traffic Grid Analysis with Python: Reflecting on Growth and Tackling Challenges</title>
      <dc:creator>Arnaud PELAMA PELAMA TIOGO</dc:creator>
      <pubDate>Sat, 14 Sep 2024 11:48:30 +0000</pubDate>
      <link>https://dev.to/arnaud2911/building-a-traffic-grid-analysis-with-python-reflecting-on-growth-and-tackling-challenges-48n6</link>
      <guid>https://dev.to/arnaud2911/building-a-traffic-grid-analysis-with-python-reflecting-on-growth-and-tackling-challenges-48n6</guid>
      <description>&lt;p&gt;As I continue my journey towards becoming a Data Scientist, I've been working on various projects to hone my skills, document my learning, and push my limits. One of the more challenging and rewarding projects I've recently completed is the &lt;strong&gt;Traffic Grid Analysis&lt;/strong&gt;. This project involved simulating traffic flow in a city grid and performing multiple analyses such as calculating total vehicles, finding intersections with the highest traffic, and transposing the grid for alternate views of the data.&lt;/p&gt;

&lt;p&gt;In this post, I’ll reflect on what I’ve learned, the challenges I faced, how this project has contributed to my growth, and my versatility in tackling various topics. I'll also ask for feedback from the Dev.to community to continue improving.&lt;/p&gt;




&lt;h2&gt;What I Learned: Skills and Concepts&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Traffic Grid Analysis&lt;/strong&gt; project gave me the opportunity to solidify and extend my understanding of core Python concepts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;2D Lists &amp;amp; Nested Loops&lt;/strong&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Before this project, I had a basic understanding of lists. However, the use of &lt;strong&gt;2D lists&lt;/strong&gt; to represent a grid, combined with &lt;strong&gt;nested loops&lt;/strong&gt; for traversing the grid, helped me think more critically about structured data.&lt;/li&gt;
&lt;li&gt;Writing the logic for traversing rows and columns while ensuring data integrity was a practical and invaluable experience.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Function Decomposition&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I learned the importance of breaking down code into smaller, reusable functions. Instead of writing one large chunk of code, I developed &lt;strong&gt;modular functions&lt;/strong&gt; like &lt;code&gt;initialize_traffic_grid&lt;/code&gt;, &lt;code&gt;calculate_total_vehicles&lt;/code&gt;, and &lt;code&gt;find_max_traffic_intersections&lt;/code&gt;. This not only improved readability but also allowed for easier debugging and scalability.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Analysis&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This project was my first real foray into simulating and analyzing traffic data—something I can envision applying to real-world use cases in smart cities, transportation planning, or logistics.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Error Handling and Edge Cases&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;During testing, I encountered unexpected results when the grid was small or when the range of vehicles was set too low. This taught me to &lt;strong&gt;anticipate edge cases&lt;/strong&gt; and handle them appropriately, such as adjusting grid size or vehicle counts dynamically.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
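&lt;p&gt;The modular functions named above could look roughly like this. The function names come from the post; the bodies and signatures are my reconstruction, not the project's actual implementation:&lt;/p&gt;

```python
import random

def initialize_traffic_grid(rows, cols, max_vehicles=50):
    """Build a rows-by-cols grid of random vehicle counts."""
    return [[random.randint(0, max_vehicles) for _ in range(cols)]
            for _ in range(rows)]

def calculate_total_vehicles(grid):
    """Sum vehicle counts across every intersection."""
    return sum(sum(row) for row in grid)

def find_max_traffic_intersections(grid):
    """Return (row, col) coordinates of every intersection tied for the peak count."""
    peak = max(max(row) for row in grid)
    return [(r, c) for r, row in enumerate(grid)
            for c, count in enumerate(row) if count == peak]

grid = initialize_traffic_grid(4, 4)
print(calculate_total_vehicles(grid))
print(find_max_traffic_intersections(grid))
```

&lt;p&gt;Keeping each step in its own function is what makes the later extensions (user-defined grid sizes, visualization) straightforward to bolt on.&lt;/p&gt;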




&lt;h2&gt;Challenges I Faced: Overcoming Obstacles&lt;/h2&gt;

&lt;p&gt;No project is without its challenges, and the &lt;strong&gt;Traffic Grid Analysis&lt;/strong&gt; presented several interesting hurdles:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Grid Manipulation&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I initially struggled with &lt;strong&gt;transposing the grid&lt;/strong&gt;. Understanding the mechanics of matrix manipulation and ensuring that the rows became columns took a few iterations to get right. Eventually, I learned to combine the built-in &lt;code&gt;zip&lt;/code&gt; function with the &lt;code&gt;*&lt;/code&gt; unpacking operator to transpose the grid in a Pythonic way.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Complexity with Nested Loops&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;While I was comfortable with simple loops, handling &lt;strong&gt;nested loops&lt;/strong&gt; to traverse and analyze a 2D grid took careful thought. Debugging logic inside loops—especially when searching for maximum traffic at intersections—required a clear understanding of indices and flow control.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Optimization&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;As the grid size increased, I noticed the script's performance could degrade. Though it was a learning project, it made me aware of how important &lt;strong&gt;efficiency&lt;/strong&gt; is, especially when working with large datasets. In the future, I plan to explore performance optimization techniques, like vectorization or parallelization.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
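&lt;p&gt;The &lt;code&gt;zip&lt;/code&gt;-based transpose mentioned above fits in one line; the &lt;code&gt;transpose_grid&lt;/code&gt; name is mine:&lt;/p&gt;

```python
def transpose_grid(grid):
    """Rows become columns: zip(*grid) pairs the i-th element of every row."""
    return [list(col) for col in zip(*grid)]

grid = [[1, 2, 3],
        [4, 5, 6]]
print(transpose_grid(grid))  # [[1, 4], [2, 5], [3, 6]]
```

&lt;p&gt;The &lt;code&gt;*&lt;/code&gt; unpacks the rows as separate arguments to &lt;code&gt;zip&lt;/code&gt;, which then walks them in lockstep, yielding one column at a time.&lt;/p&gt;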




&lt;h2&gt;Demonstrating Growth: A Step Forward&lt;/h2&gt;

&lt;p&gt;Reflecting on this project, I realize how much I’ve grown as a Python developer:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Confidence with Data Structures&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;My understanding of how to organize, manipulate, and analyze structured data has significantly improved. The project made me comfortable with &lt;strong&gt;multi-dimensional data&lt;/strong&gt;, which is a skill I can now apply to other domains, such as image processing or large-scale simulations.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Improved Code Modularity&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The discipline of writing &lt;strong&gt;modular functions&lt;/strong&gt; has made my code more maintainable and reusable. This is a habit I will carry forward to larger, more complex projects.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Handling Larger Problem Spaces&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Before this project, I tackled smaller, simpler coding problems. With the &lt;strong&gt;Traffic Grid Analysis&lt;/strong&gt;, I worked on a problem that required planning, decomposition, and multi-step analysis. I now feel more confident in approaching larger projects that demand a clear structure and design.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;Highlighting Versatility: Tackling a Range of Topics&lt;/h2&gt;

&lt;p&gt;This project wasn’t just about traffic simulation—it covered a broad range of topics, each of which brought unique challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Simulation&lt;/strong&gt;: Creating randomized traffic data was a fun exercise in simulating real-world conditions. I used Python’s &lt;code&gt;random&lt;/code&gt; module to generate vehicle counts across the grid.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Traversal and Analysis&lt;/strong&gt;: The nested loop structures allowed me to iterate over the grid and gather insights such as &lt;strong&gt;total vehicle count&lt;/strong&gt; and &lt;strong&gt;intersections with the highest traffic&lt;/strong&gt;, skills transferable to matrix manipulation or data mining tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Grid Manipulation&lt;/strong&gt;: Transposing the grid introduced me to the idea of transforming data for better insight—essential in fields like machine learning, where data preprocessing is crucial.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: Though simple, this project demonstrated how problems can scale with larger datasets. It helped me realize that as I work with more data in the future, I’ll need to think about &lt;strong&gt;efficiency and performance optimization&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Seeking Feedback: I Want to Hear from You&lt;/h2&gt;

&lt;p&gt;As someone who is actively learning and evolving, I’m eager to hear feedback and suggestions from the Dev.to community. Here are a few questions I’d love to get your input on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Optimization&lt;/strong&gt;: What are some performance optimization techniques I could explore for working with larger 2D grids or datasets in general?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real-World Application&lt;/strong&gt;: How could this simple traffic simulation project be enhanced or applied to more complex real-world problems? For example, could this be scaled up to handle city-wide traffic data in a smart city project?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Coding Best Practices&lt;/strong&gt;: Are there any Pythonic ways to further improve my code? Whether it’s more efficient use of data structures or cleaner syntax, I’m always looking for ways to write better code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Visualization Tools&lt;/strong&gt;: I’m considering integrating &lt;strong&gt;data visualization&lt;/strong&gt; in future iterations of this project to visualize traffic density. What are some good libraries or approaches for visualizing 2D grid data in Python?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;Next Steps: Continuing the Journey&lt;/h2&gt;

&lt;p&gt;Moving forward, I plan to revisit this project with the community’s feedback and enhance it by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Adding User Inputs&lt;/strong&gt;: Allowing users to define the grid size and the range of vehicle counts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incorporating Visualization&lt;/strong&gt;: Visualizing the traffic data with Python libraries like &lt;code&gt;matplotlib&lt;/code&gt; or &lt;code&gt;seaborn&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exploring Real Traffic Data&lt;/strong&gt;: Extending the project to handle real-world traffic datasets, possibly integrating with external APIs.&lt;/li&gt;
&lt;/ul&gt;
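&lt;p&gt;For the visualization step, a heatmap is a natural fit for 2D grid data. Here is a minimal &lt;code&gt;matplotlib&lt;/code&gt; sketch; the colormap, labels, and output filename are placeholders of my choosing:&lt;/p&gt;

```python
import random
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display window needed
import matplotlib.pyplot as plt

# A fake 10x10 grid of vehicle counts standing in for the project's data
grid = [[random.randint(0, 50) for _ in range(10)] for _ in range(10)]

fig, ax = plt.subplots()
image = ax.imshow(grid, cmap="YlOrRd")  # warmer cells mean heavier traffic
fig.colorbar(image, ax=ax, label="Vehicles")
ax.set_title("Traffic density per intersection")
ax.set_xlabel("Column")
ax.set_ylabel("Row")
fig.savefig("traffic_density.png")
```

&lt;p&gt;&lt;code&gt;seaborn.heatmap&lt;/code&gt; would produce a similar picture with annotated cells out of the box, so either library works for this kind of grid.&lt;/p&gt;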

&lt;p&gt;I’m excited to keep learning, building, and sharing as I continue on this journey to becoming a proficient Data Scientist. Thank you for following along with my progress!&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Connect with Me:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Feel free to connect with me on GitHub, where I’m documenting all my projects as part of &lt;strong&gt;"My Journey To Data Science"&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/arnaud2911/My-Journey-Into-Data-Science" rel="noopener noreferrer"&gt;GitHub: My Journey To Data Science&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Thanks for reading, and I look forward to hearing your thoughts and suggestions! 😊&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
