<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Adipta Martulandi</title>
    <description>The latest articles on DEV Community by Adipta Martulandi (@adipamartulandi).</description>
    <link>https://dev.to/adipamartulandi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2537030%2F9a016ffe-1a6e-4ce3-ba45-cccb65bacd81.jpg</url>
      <title>DEV Community: Adipta Martulandi</title>
      <link>https://dev.to/adipamartulandi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/adipamartulandi"/>
    <language>en</language>
    <item>
      <title>Text Preprocessing for NLP: A Step-by-Step Guide to Clean Raw Text Data</title>
      <dc:creator>Adipta Martulandi</dc:creator>
      <pubDate>Wed, 05 Feb 2025 01:21:01 +0000</pubDate>
      <link>https://dev.to/adipamartulandi/text-preprocessing-for-nlp-a-step-by-step-guide-to-clean-raw-text-data-253n</link>
      <guid>https://dev.to/adipamartulandi/text-preprocessing-for-nlp-a-step-by-step-guide-to-clean-raw-text-data-253n</guid>
      <description>&lt;p&gt;Natural Language Processing (NLP) is at the heart of many groundbreaking applications, from chatbots and virtual assistants to sentiment analysis and machine translation. However, before any NLP model can perform effectively, the raw text data must undergo preprocessing. This crucial step ensures the text is clean, standardized, and ready for analysis, enabling models to extract meaningful insights and make accurate predictions.&lt;/p&gt;

&lt;p&gt;Building an NLP project involves several key stages, from collecting raw text data to deploying a fully functional model. Each stage plays a role in ensuring that the system is accurate, efficient, and reliable. The picture above depicts the steps of a typical NLP pipeline.&lt;/p&gt;

&lt;p&gt;In this section, we will explore the essential steps of text preprocessing — from tokenization to language detection — and demonstrate how to implement them using Python. Whether you’re a beginner or an experienced data scientist, this hands-on approach will help you understand how to transform unstructured text into a format suitable for NLP applications. Let’s get started! 🚀&lt;/p&gt;
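
&lt;p&gt;As a minimal sketch of the kind of cleaning steps covered (lowercasing, punctuation stripping, tokenization, stopword removal), using only the Python standard library; the stopword list below is a tiny illustrative subset, and real pipelines typically rely on NLTK or spaCy:&lt;/p&gt;

```python
import re

# Tiny illustrative stopword list; real pipelines use NLTK/spaCy lists.
STOPWORDS = {"the", "is", "a", "an", "and", "to", "of"}

def preprocess(text):
    """Lowercase, strip non-letters, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and whitespace only
    tokens = text.split()                  # simple whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("NLP is at the heart of many groundbreaking applications!"))
# ['nlp', 'at', 'heart', 'many', 'groundbreaking', 'applications']
```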

&lt;p&gt;Full Article &lt;a href="https://pub.towardsai.net/text-preprocessing-for-nlp-a-step-by-step-guide-to-clean-raw-text-data-2bb8918a4e2c" rel="noopener noreferrer"&gt;Here&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Let's Build a Simple RAG Application Using LangChain</title>
      <dc:creator>Adipta Martulandi</dc:creator>
      <pubDate>Tue, 28 Jan 2025 07:33:40 +0000</pubDate>
      <link>https://dev.to/adipamartulandi/lets-build-simple-rag-application-using-langchain-803</link>
      <guid>https://dev.to/adipamartulandi/lets-build-simple-rag-application-using-langchain-803</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbgds7bc421ay9v41xpns.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbgds7bc421ay9v41xpns.png" alt="Image description" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Large Language Models (LLMs) have revolutionized the way we interact with technology. Their ability to generate human-like text, answer questions, and process language has unlocked new possibilities across various industries. However, despite their impressive capabilities, LLMs have inherent limitations that can impact their effectiveness in real-world applications.&lt;/p&gt;

&lt;p&gt;One major drawback is their inability to access up-to-date or real-time information. Most LLMs are trained on static datasets that reflect the state of knowledge at a specific point in time. This means they cannot provide accurate responses about recent events, emerging trends, or newly published data. For instance, if an LLM was last trained in 2022, it would not be aware of events, advancements, or updates that occurred afterward.&lt;/p&gt;

&lt;p&gt;This limitation becomes critical when building applications that rely on current or domain-specific knowledge, such as financial forecasts, research insights, or live market data. In such cases, relying solely on pre-trained LLMs can lead to incomplete or outdated answers, undermining the application’s utility.&lt;/p&gt;

&lt;p&gt;Full Article &lt;a href="https://medium.com/towards-artificial-intelligence/lets-build-simple-rag-application-ef95c448c062" rel="noopener noreferrer"&gt;Here&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>Prompt Engineering Techniques to Improve LLM Performance</title>
      <dc:creator>Adipta Martulandi</dc:creator>
      <pubDate>Fri, 17 Jan 2025 02:04:06 +0000</pubDate>
      <link>https://dev.to/adipamartulandi/prompt-engineering-techniques-to-improve-llms-performance-3d4f</link>
      <guid>https://dev.to/adipamartulandi/prompt-engineering-techniques-to-improve-llms-performance-3d4f</guid>
      <description>&lt;p&gt;Large Language Models (LLMs) like GPT are powerful tools for generating human-like text, but they are not without their challenges. One common issue is hallucination, where the model produces incorrect, misleading, or nonsense outputs. Understanding why hallucinations occur and how to address them is critical to effectively using these models.&lt;/p&gt;

&lt;p&gt;Hallucinations in LLMs refer to instances where the model generates text that is factually inaccurate or contextually irrelevant.&lt;/p&gt;

&lt;p&gt;Prompt engineering is the art and science of designing effective prompts to interact with large language models (LLMs) like GPT. At its core, it involves crafting instructions or queries in a way that guides these models to produce desired, accurate, and meaningful outputs. As LLMs grow increasingly sophisticated, prompt engineering has emerged as a critical skill for unlocking their full potential and reducing hallucinations.&lt;/p&gt;
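
&lt;p&gt;One widely used technique is few-shot prompting: showing the model a handful of labeled examples before the actual query. Here is a minimal, model-agnostic sketch in plain Python; the reviews and labels below are purely illustrative:&lt;/p&gt;

```python
# Few-shot prompting: prepend labeled examples so the model infers the task
# format and answers the final query in the same pattern.
EXAMPLES = [
    ("The battery dies within an hour.", "negative"),
    ("Setup was effortless and fast.", "positive"),
]

def build_few_shot_prompt(query):
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for text, label in EXAMPLES:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The model is expected to complete the final "Sentiment:" line.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)

print(build_few_shot_prompt("The screen is gorgeous."))
```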

&lt;p&gt;Full Article &lt;a href="https://medium.com/gitconnected/prompt-engineering-techniques-to-improve-llms-performance-2f06cbbc78f5" rel="noopener noreferrer"&gt;Here&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>ai</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Use This Technique to Reduce LLM Costs by Over 50%</title>
      <dc:creator>Adipta Martulandi</dc:creator>
      <pubDate>Mon, 06 Jan 2025 10:26:28 +0000</pubDate>
      <link>https://dev.to/adipamartulandi/use-this-technique-to-reduce-llms-cost-by-over-50-l30</link>
      <guid>https://dev.to/adipamartulandi/use-this-technique-to-reduce-llms-cost-by-over-50-l30</guid>
      <description>&lt;p&gt;Large Language Models (LLMs) have revolutionized the way we interact with and utilize artificial intelligence. From generating text to answering complex questions, their versatility is unmatched. However, this power comes at a significant cost — API usage, measured in tokens, can quickly escalate, making these solutions prohibitively expensive for many individuals and organizations.&lt;/p&gt;

&lt;p&gt;Reducing token usage while maintaining output quality is a crucial challenge for making LLMs more accessible and affordable. This is where prompt compression comes into play. By strategically shortening input prompts, we can drastically cut costs without compromising the quality or fidelity of the model’s responses.&lt;/p&gt;

&lt;p&gt;In this article, we’ll explore LLMLingua-2, a novel method for efficient and faithful task-agnostic prompt compression. Developed by researchers at Microsoft, LLMLingua-2 leverages data distillation to learn compression targets, offering a robust approach to minimize token usage while preserving performance across various tasks.&lt;/p&gt;
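
&lt;p&gt;To build intuition for why compression cuts costs, here is a deliberately naive sketch that drops filler words and measures the token savings. LLMLingua-2 itself uses a trained token classifier, not a stopword list; this toy version only illustrates the cost arithmetic:&lt;/p&gt;

```python
# Naive prompt compression: drop common filler words and measure savings.
# Purely illustrative; LLMLingua-2 learns which tokens to keep instead.
FILLERS = {"the", "a", "an", "of", "to", "and", "is", "are", "that", "in"}

def compress(prompt):
    kept = [w for w in prompt.split() if w.lower() not in FILLERS]
    return " ".join(kept)

prompt = "Summarize the main findings of the report that is attached in the email."
short = compress(prompt)
ratio = 1 - len(short.split()) / len(prompt.split())  # fraction of words saved
print(short)
print(f"tokens saved: {ratio:.0%}")
```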

&lt;p&gt;Full Article &lt;a href="https://medium.com/gitconnected/use-this-technique-to-reduce-llms-cost-by-over-50-588f8796d545" rel="noopener noreferrer"&gt;Here&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>ai</category>
      <category>datascience</category>
    </item>
    <item>
      <title>A Practical Guide to Identifying and Mitigating Hallucinated Outputs in Language Models</title>
      <dc:creator>Adipta Martulandi</dc:creator>
      <pubDate>Fri, 27 Dec 2024 07:59:03 +0000</pubDate>
      <link>https://dev.to/adipamartulandi/a-practical-guide-to-identifying-and-mitigating-hallucinations-outputs-in-language-models-58e2</link>
      <guid>https://dev.to/adipamartulandi/a-practical-guide-to-identifying-and-mitigating-hallucinations-outputs-in-language-models-58e2</guid>
      <description>&lt;p&gt;Language models have revolutionized how we interact with AI, enabling powerful applications in content creation, customer support, education, and more. However, these models are not without their challenges. One of the most critical issues is hallucination — when a language model confidently produces false or misleading information. These hallucinations can undermine trust, mislead users, and even cause significant harm in sensitive applications.&lt;/p&gt;

&lt;p&gt;In this guide, we’ll dive into the concept of hallucination in language models, exploring why it happens and its potential impact. More importantly, we’ll introduce DeepEval, a robust library in Python designed to evaluate and manage the outputs of language models effectively. You’ll learn how to use DeepEval’s metrics to detect, analyze, and mitigate hallucination in real-world scenarios.&lt;/p&gt;
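
&lt;p&gt;To see the core idea before reaching for DeepEval, here is a deliberately naive groundedness check that flags claims with little word overlap against a reference context. DeepEval's actual hallucination metric is LLM-based; this sketch only illustrates the principle of checking outputs against source material:&lt;/p&gt;

```python
# Naive groundedness check: a claim counts as supported if enough of its
# words appear in the reference context. Real metrics (e.g. DeepEval's)
# use an LLM judge; this toy version only conveys the idea.
def is_grounded(claim, context, threshold=0.5):
    words = set(claim.lower().split())
    ctx = set(context.lower().split())
    support = len(words.intersection(ctx)) / max(len(words), 1)
    return support >= threshold

context = "the eiffel tower is in paris and opened in 1889"
print(is_grounded("the eiffel tower is in paris", context))        # True
print(is_grounded("the tower was moved to london in 1950", context))  # False
```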

&lt;p&gt;Whether you’re a developer, researcher, or enthusiast working with language models, this guide will equip you with the tools and strategies to ensure your AI outputs remain accurate, reliable, and trustworthy. Let’s get started!&lt;/p&gt;

&lt;p&gt;Full post &lt;a href="https://medium.com/gitconnected/detecting-llm-hallucinations-bd787fc1a9d9" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>datascience</category>
      <category>ai</category>
    </item>
    <item>
      <title>Everything You Need to Know About LLM Observability and LangSmith</title>
      <dc:creator>Adipta Martulandi</dc:creator>
      <pubDate>Mon, 16 Dec 2024 12:26:28 +0000</pubDate>
      <link>https://dev.to/adipamartulandi/everything-you-need-to-know-about-llms-observability-and-langsmith-4476</link>
      <guid>https://dev.to/adipamartulandi/everything-you-need-to-know-about-llms-observability-and-langsmith-4476</guid>
      <description>&lt;p&gt;In the era of AI-driven applications, Large Language Models (LLMs) have become needs in solving complex problems, from generating natural language to assisting decision-making processes. However, the increasing complexity and unpredictability of these models make it challenging to monitor and understand their behavior effectively. This is where observability becomes crucial in LLM applications.&lt;/p&gt;

&lt;p&gt;Observability is the practice of understanding a system’s internal state by analyzing its outputs and metrics. For LLM applications, it ensures that the models are functioning as intended, provides insights into errors or biases, shows cost consumption, and helps optimize performance for real-world scenarios.&lt;/p&gt;

&lt;p&gt;As the reliance on LLMs grows, so does the need for robust tools to observe and debug their operations. Enter LangSmith, a powerful product from LangChain designed specifically to enhance the observability of LLM-based applications. LangSmith provides developers with the tools to monitor, evaluate, and analyze their LLM pipelines, ensuring reliability and performance throughout the lifecycle of their AI solutions.&lt;/p&gt;

&lt;p&gt;This article explores the importance of observability in LLM applications and how LangSmith empowers developers to gain better control over their AI workflows, paving the way for building more trustworthy and efficient LLM-powered systems.&lt;/p&gt;

&lt;p&gt;Full Article &lt;a href="https://pub.towardsai.net/everything-you-need-to-know-about-llms-observability-and-langsmith-517543539371" rel="noopener noreferrer"&gt;Here&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>datascience</category>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>Predicting Instagram Influencer Engagement with Machine Learning in Python</title>
      <dc:creator>Adipta Martulandi</dc:creator>
      <pubDate>Thu, 12 Dec 2024 16:05:09 +0000</pubDate>
      <link>https://dev.to/adipamartulandi/predicting-instagram-influencers-engagement-with-machine-learning-in-python-53bk</link>
      <guid>https://dev.to/adipamartulandi/predicting-instagram-influencers-engagement-with-machine-learning-in-python-53bk</guid>
      <description>&lt;p&gt;In the last few weeks I tried to make a Data Science Mini Project related to Machine Learning. After thinking for a long time, I finally decided to make a Machine Learning Model that can predict whether an Instagram Influencers Engagement will growing or declining in the following month. This Mini Project is an end-to-end project and this article will be divided into 4 section which are:&lt;/p&gt;

&lt;p&gt;Retrieving data on Instagram influencers using Selenium and Beautiful Soup.&lt;br&gt;
Preprocessing the data, from cleansing through feature engineering and feature selection, until it is ready for the machine learning model.&lt;br&gt;
Modeling with machine learning algorithms (Linear Regression, Random Forest, XGBoost), including hyperparameter tuning.&lt;br&gt;
Interpreting the prediction output of the machine learning model.&lt;/p&gt;

&lt;p&gt;Full Article &lt;a href="https://medium.com/datadriveninvestor/predicting-instagram-influencers-engagement-with-machine-learning-in-python-726c68bda29b" rel="noopener noreferrer"&gt;Here&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>ai</category>
      <category>datascience</category>
    </item>
    <item>
      <title>A Gentle Introduction to Naive Bayes Classifier</title>
      <dc:creator>Adipta Martulandi</dc:creator>
      <pubDate>Wed, 11 Dec 2024 00:51:25 +0000</pubDate>
      <link>https://dev.to/adipamartulandi/a-gentle-introduction-to-naive-bayes-classifier-3fj9</link>
      <guid>https://dev.to/adipamartulandi/a-gentle-introduction-to-naive-bayes-classifier-3fj9</guid>
      <description>&lt;p&gt;Naive Bayes classifier is a classification algorithm in machine learning and is included in supervised learning. This algorithm is quite popular to be used in Natural Language Processing or NLP. This algorithm is based on the Bayes Theorem created by Thomas Bayes. Therefore, we must first understand the Bayes Theorem before using the Naive Bayes Classifier.&lt;/p&gt;

&lt;p&gt;The essence of Bayes' theorem is conditional probability: the probability that one event will happen given that another event has already occurred. Using conditional probability, we can find the probability that an event will occur given knowledge of a previous event.&lt;/p&gt;
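
&lt;p&gt;The theorem is easiest to see with a small worked example; the probabilities below are made up purely for illustration:&lt;/p&gt;

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B).
# Toy spam-filter example with illustrative numbers.
p_spam = 0.2             # P(spam): prior probability a message is spam
p_word_given_spam = 0.6  # P("free" | spam)
p_word_given_ham = 0.05  # P("free" | not spam)

# Law of total probability: overall chance of seeing the word "free".
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior: probability the message is spam given it contains "free".
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # 0.75
```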

&lt;p&gt;Full article &lt;a href="https://medium.com/datadriveninvestor/a-gentle-introduction-to-naive-bayes-classifier-9d7c4256c999" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>datascience</category>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>K-Nearest Neighbors in Python + Hyperparameter Tuning</title>
      <dc:creator>Adipta Martulandi</dc:creator>
      <pubDate>Sun, 08 Dec 2024 09:48:03 +0000</pubDate>
      <link>https://dev.to/adipamartulandi/k-nearest-neighbors-in-python-hyperparameters-tuning-1lh5</link>
      <guid>https://dev.to/adipamartulandi/k-nearest-neighbors-in-python-hyperparameters-tuning-1lh5</guid>
      <description>&lt;p&gt;“The k-nearest neighbors algorithm (KNN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression”-Wikipedia&lt;/p&gt;

&lt;p&gt;KNN can therefore be used for both classification and regression problems, although in practice it is mostly used for classification. Applications of KNN include handwriting recognition, satellite image recognition, and ECG pattern recognition. The algorithm is very simple yet widely used by data scientists.&lt;/p&gt;
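
&lt;p&gt;A minimal standard-library sketch of the classification case (Euclidean distance, majority vote among the k nearest points); in practice you would reach for scikit-learn's KNeighborsClassifier. The points and labels below are illustrative:&lt;/p&gt;

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training points.
    train: list of ((x, y), label) pairs; distance is Euclidean."""
    by_distance = sorted(train, key=lambda p: math.dist(p[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

points = [((0, 0), "blue"), ((1, 0), "blue"), ((0, 1), "blue"),
          ((5, 5), "red"), ((6, 5), "red"), ((5, 6), "red")]
print(knn_predict(points, (1, 1)))  # blue: its 3 nearest neighbors are blue
```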

&lt;p&gt;&lt;a href="https://medium.com/datadriveninvestor/k-nearest-neighbors-in-python-hyperparameters-tuning-716734bc557f" rel="noopener noreferrer"&gt;Read More&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>LangChain: A Framework for Building LLM Applications</title>
      <dc:creator>Adipta Martulandi</dc:creator>
      <pubDate>Sat, 07 Dec 2024 08:19:15 +0000</pubDate>
      <link>https://dev.to/adipamartulandi/langchain-framework-for-building-llms-applications-3jba</link>
      <guid>https://dev.to/adipamartulandi/langchain-framework-for-building-llms-applications-3jba</guid>
      <description>&lt;p&gt;Since the emergence of ChatGPT, Artificial Intelligence, particularly in the field of Natural Language Processing (NLP) and Large Language Models (LLMs), has seen an unprecedented surge in popularity. The ability of models like ChatGPT to understand and generate natural text has unlocked new possibilities for AI-powered applications, ranging from virtual assistants to more intelligent recommendation systems.&lt;/p&gt;

&lt;p&gt;Amid this wave of enthusiasm, various frameworks have been developed to simplify the process of building applications powered by LLMs. These frameworks are designed to address challenges such as data management, model processing, and integration. One of the most popular among them is LangChain.&lt;/p&gt;

&lt;p&gt;Full story &lt;a href="https://medium.com/towards-artificial-intelligence/langchain-framework-for-building-llms-applications-bbbabf773dae" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
