DEV Community: Dorcas Bwire

[Boost]

Dorcas Bwire — Tue, 15 Apr 2025 09:47:56 +0000

Youdiowei Eteimorde

Aug 12 '23

Understanding LangChain's RecursiveCharacterTextSplitter

#python #ai #chatgpt #langchain


Regression with CART Trees

Dorcas Bwire — Mon, 07 Apr 2025 13:48:53 +0000

Introduction
Classification and Regression Trees (CART) is a non-parametric decision tree learning technique that produces either classification or regression trees, depending on whether the target variable is categorical or continuous. Here, the focus is on regression, where the goal is to predict a continuous output variable.

Mode of Operation:

The CART algorithm builds a binary tree in which each non-leaf node splits its portion of the dataset into exactly two subsets, and the splitting is repeated on each subset. Each internal node, starting with the root, represents a single input variable (x) and a split point on that variable. Essentially, the dataset is partitioned into a number of branches according to the splitting criterion, which could be entropy or Gini impurity for classification, or variance for regression. Splitting continues until the terminal (leaf) nodes of the tree are reached.
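As a rough illustration of the variance criterion, the sketch below scores a single candidate split by how much it reduces the variance of the target. The toy data, the hand-picked `split_point` and the helper name `weighted_variance` are assumptions made for this example, not part of any library.

```python
from statistics import pvariance


def weighted_variance(left, right):
    """Size-weighted average of the variances of the two child nodes."""
    n = len(left) + len(right)
    return (len(left) * pvariance(left) + len(right) * pvariance(right)) / n


# Hypothetical toy data: one input variable x and a continuous target y.
x = [1, 2, 3, 10, 11, 12]
y = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]

split_point = 6.5  # candidate threshold on x, picked by hand for illustration

# A binary split sends each observation to exactly one of two subsets.
left = [yi for xi, yi in zip(x, y) if xi <= split_point]
right = [yi for xi, yi in zip(x, y) if xi > split_point]

parent_variance = pvariance(y)
child_variance = weighted_variance(left, right)
variance_reduction = parent_variance - child_variance

print(f"parent variance:    {parent_variance:.4f}")
print(f"child variance:     {child_variance:.4f}")
print(f"variance reduction: {variance_reduction:.4f}")
```

A split that separates low targets from high targets leaves little variance inside each child, so the reduction is large. A poor split would leave the child variance close to the parent variance.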

Process of Building the Trees

Feature Selection
Each feature of the data is evaluated to identify the one that best splits the data. The input variable and the specific split point are chosen by a greedy algorithm that minimizes a cost function such as the mean squared error.
Binary Splitting
Once the best feature is selected, a binary split divides the data into two child nodes.
Recursive Tree Building
The splitting is repeated on each child node until a stopping criterion is met, such as the minimum number of samples in a node or the maximum tree depth.
Tree Pruning
Once the full tree is built, pruning begins. It entails examining sections of the tree to identify branches that can be removed without a significant loss in prediction accuracy. The simplest pruning approach works through each leaf node in the tree and evaluates the effect of removing it using a hold-out test set. A leaf node is removed only if its removal results in a drop in the overall cost function on the entire test set.
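The first three steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the dictionary-based tree, the toy data and the default stopping values are all invented for the example, the cost is the sum of squared errors, and pruning is left out.

```python
def sse(values):
    """Sum of squared errors around the mean: the cost being minimized."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, targets):
    """Greedy search over every feature and threshold for the lowest cost."""
    best = None  # (cost, feature index, threshold)
    for feature in range(len(rows[0])):
        distinct = sorted({row[feature] for row in rows})
        # Candidate thresholds are midpoints between consecutive values.
        for low, high in zip(distinct, distinct[1:]):
            threshold = (low + high) / 2
            left = [y for r, y in zip(rows, targets) if r[feature] <= threshold]
            right = [y for r, y in zip(rows, targets) if r[feature] > threshold]
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, feature, threshold)
    return best  # None when no feature has two distinct values


def build_tree(rows, targets, depth=0, max_depth=2, min_samples=2):
    """Split recursively until a stopping criterion is met."""
    split = None
    if depth < max_depth and len(rows) >= min_samples:
        split = best_split(rows, targets)
    if split is None:
        return {"value": sum(targets) / len(targets)}  # leaf predicts the mean
    _, feature, threshold = split
    left = [i for i, r in enumerate(rows) if r[feature] <= threshold]
    right = [i for i, r in enumerate(rows) if r[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left], [targets[i] for i in left],
                           depth + 1, max_depth, min_samples),
        "right": build_tree([rows[i] for i in right], [targets[i] for i in right],
                            depth + 1, max_depth, min_samples),
    }


def predict(tree, row):
    """Follow the splits from the root down to a leaf."""
    while "value" not in tree:
        go_left = row[tree["feature"]] <= tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree["value"]


# Hypothetical data: feature 0 drives the target, feature 1 is noise.
rows = [[1, 7], [2, 3], [3, 9], [10, 4], [11, 8], [12, 2]]
targets = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]

tree = build_tree(rows, targets, max_depth=1)
print(tree)
print(predict(tree, [2, 3]), predict(tree, [11, 8]))
```

With `max_depth=1` the tree is a single split, which makes it easy to see that the greedy search picks the informative feature rather than the noisy one.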

Application of the CART Algorithm
The CART algorithm has diverse applications, thanks to its ability to handle both classification and regression problems, coupled with the transparent nature of decision trees. It provides valuable insights and predictions across different domains.

In the healthcare sector, the importance of timely and accurate diagnosis cannot be overstated. The CART algorithm can predict the likelihood of a patient having a particular disease based on symptoms and test results. It can also help determine the risk of patients developing complications after an operation, based on factors like age, surgery type and pre-existing conditions. In finance, the CART algorithm can predict the creditworthiness of customers based on variables like debt ratios, employment status and income.

Chi-Square Tests and Degrees of Freedom

Dorcas Bwire — Fri, 07 Mar 2025 19:47:32 +0000

The Chi-Square test, or χ² test, indicates whether a relationship exists between two categorical variables. To illustrate, this analysis follows concert organizers who use chi-square tests to determine whether the genre of music (Afro or Jazz) affects audience attendance. Essentially, the test checks whether the observed data fit the values that would be expected if there were no association whatsoever between music genre and attendance.

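To compute the chi-square test statistic, the following formula is used:

χ² = Σ (O − E)² / E

where O is the observed value, E is the expected value, and the sum runs over all categories (or all cells of the contingency table).
At a significance level of 0.05, if the p-value ≤ 0.05 we reject the null hypothesis, and if the p-value > 0.05 we fail to reject the null hypothesis.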
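A small worked example may help here. The sketch below sums (O − E)² / E over two categories and then applies the 0.05 rule. The attendance counts are hypothetical. The p-value shortcut used is valid only when there is one degree of freedom (two categories), where the statistic behaves like a squared standard normal variable.

```python
import math


def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts: 100 concert-goers choosing between Afro and Jazz.
observed = [60, 40]
expected = [50, 50]  # what "no preference" would look like

chi2 = chi_square_statistic(observed, expected)

# Valid for ONE degree of freedom only: P(Z^2 > chi2) = erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

reject_null = p_value <= 0.05
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.4f}, reject H0: {reject_null}")
```

For tables with more categories, the p-value would normally come from a chi-square distribution table or a statistics library rather than this shortcut.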
The steps to conducting the chi-square test include:

Define the hypothesis, both null and alternative hypothesis
Gather and organize the data
Calculate the expected frequencies
Compute the chi-square test
Draw the conclusion

Degrees of freedom indicate the number of independent values that are free to vary in an analysis without breaking any constraints when estimating a parameter. In chi-square tests, the degrees of freedom are calculated in one of three ways, depending on the test:

a). Goodness of Fit Test
This test checks whether the observed distribution of a single categorical variable matches the expected distribution. In this context, we analyze the frequency distribution of how often audiences choose Afro versus Jazz concerts.
df = k-1 where:
k = number of categories

b). Test of Independence
The test assesses the relationship between two categorical variables, such as music genre (Afro/Jazz) and the attendance level (high/low).
df = (r-1) x (c-1) where:
r is the number of rows,
c is the number of columns in the contingency table

c). Test for Homogeneity
This test compares the distribution of a categorical variable across different populations. In this context, we would compare how the preference between the two music genres (Afro and Jazz) varies across the different cities where the concerts are held.

Let's assume there are three cities and two music genres; then df = (3-1) × (2-1) = 2.
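The three-city case can be worked through end to end. The sketch below derives the expected frequencies from the row and column totals, computes the statistic and applies the (r-1) × (c-1) rule. The counts and function names are made up for illustration, and the closed-form p-value `exp(-chi2 / 2)` holds only for exactly two degrees of freedom.

```python
import math


def expected_frequencies(table):
    """Expected count per cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def chi_square_table(table):
    """Return the chi-square statistic and degrees of freedom for a table."""
    expected = expected_frequencies(table)
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Hypothetical attendance counts: rows are three cities,
# columns are the two genres (Afro, Jazz).
table = [
    [30, 20],
    [25, 25],
    [20, 30],
]

chi2, df = chi_square_table(table)

# Valid for TWO degrees of freedom only: P(X > chi2) = exp(-chi2 / 2).
p_value = math.exp(-chi2 / 2)

print(f"chi-square = {chi2:.2f}, df = {df}, p-value = {p_value:.4f}")
```

Here the p-value is above 0.05, so with these made-up counts we would fail to reject the hypothesis that genre preference is the same across the three cities.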

It is worth noting that the shape of the chi-square distribution changes as the df increases. This is because the statistic sums the squared differences between the observed and expected frequencies, so its distribution depends on the number of independent comparisons made.
Notably, the density is not always monotonically decreasing: with 1 or 2 degrees of freedom it falls steadily from the start, while with more degrees of freedom it rises to a peak before tailing off, and the peak shifts to the right as the df grows. In the concert planning analogy, the more elements being juggled, such as venues, audience preferences and genres, the more complex the picture becomes and the wider the range of potential outcomes.
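To see how the shape depends on the df, the standard chi-square density can be evaluated at a few points. This is only a sketch: the function name `chi_square_pdf` and the evaluation points are choices made for this example.

```python
import math


def chi_square_pdf(x, df):
    """Density of the chi-square distribution with df degrees of freedom, x > 0."""
    half = df / 2
    return x ** (half - 1) * math.exp(-x / 2) / (2 ** half * math.gamma(half))


points = [0.5, 1, 2, 3, 4, 6, 8]
for df in (1, 2, 5):
    densities = ", ".join(f"{chi_square_pdf(x, df):.3f}" for x in points)
    print(f"df={df}: {densities}")
```

For df = 1 and df = 2 the printed values fall from left to right, while for df = 5 they rise to a maximum around x = 3 and then decline.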

Hypothesis Testing

Dorcas Bwire — Mon, 24 Feb 2025 07:24:51 +0000

Simply explained, a hypothesis is an educated guess. Hypotheses play a pivotal role in decision-making in a data-driven age. Hypothesis testing is a structured approach for determining whether the findings of a study provide sufficient evidence to support a specific theory about a larger population. A hypothesis test assesses how unusual the result is, and whether it can reasonably be put down to chance variation or is too extreme to be regarded as chance variation.

Primarily, hypothesis testing seeks to determine whether the null hypothesis can be rejected. If it is rejected, the result supports the alternative hypothesis. If it is not rejected, the data do not provide enough evidence for the alternative hypothesis, which is different from proving the null hypothesis true. A threshold value is therefore set in advance to decide whether the null hypothesis is rejected and whether the result is statistically significant.

Process of Hypothesis Testing.

The hypothesis testing process is classified into different phases:

1. Restate the research question as a null hypothesis and an alternative hypothesis about the population.

The null hypothesis states that there is no effect or difference; it is the hypothesis one attempts to reject with the test.
The alternative hypothesis is the claim being tested, expressed as a correlation or statistical relationship between variables.
2. Determine the significance level, often denoted by alpha (α). It is the probability of rejecting the null hypothesis when it is true. The p-value is the probability that, assuming the null hypothesis is correct, you would still observe results at least as extreme as the results of your hypothesis test. The smaller the p-value, the stronger the evidence against the null hypothesis and the greater the statistical significance of the results.
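To make the roles of α and the p-value concrete, here is a minimal sketch that assumes a two-sided one-sample z-test with a known population standard deviation. The text above does not name a specific test, so the z-test, the numbers and the function name are all assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def two_sided_p_value(sample_mean, null_mean, sigma, n):
    """z statistic and two-sided p-value for a one-sample z-test."""
    z = (sample_mean - null_mean) / (sigma / math.sqrt(n))
    # Probability of a result at least this extreme in either direction,
    # assuming the null hypothesis is true.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


alpha = 0.05  # significance level, fixed before looking at the data

# Hypothetical study: H0 says the population mean is 50; a sample of
# 100 observations has mean 52, with known standard deviation 10.
z, p_value = two_sided_p_value(sample_mean=52, null_mean=50, sigma=10, n=100)

decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"z = {z:.2f}, p-value = {p_value:.4f}, decision: {decision}")
```

Choosing a stricter α, such as 0.01, would flip the decision for this same sample, which is why the level should be set before the test is run.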

One-sided vs. Two-sided Testing
Sampling

Why Hypothesis Testing?

It helps in estimating the sampling error and factoring it into the test results, which facilitates effective decision-making.