<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Marcos Leal</title>
    <description>The latest articles on DEV Community by Marcos Leal (@marcossilva).</description>
    <link>https://dev.to/marcossilva</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F878851%2F59c9fc3d-8cbd-48b1-93d7-fadab1f4f734.jpeg</url>
      <title>DEV Community: Marcos Leal</title>
      <link>https://dev.to/marcossilva</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/marcossilva"/>
    <language>en</language>
    <item>
      <title>How I approximated a function f(x) using an image as input</title>
      <dc:creator>Marcos Leal</dc:creator>
      <pubDate>Wed, 26 Jul 2023 22:04:17 +0000</pubDate>
      <link>https://dev.to/marcossilva/how-i-approximated-a-function-fx-using-an-image-as-input-41e4</link>
      <guid>https://dev.to/marcossilva/how-i-approximated-a-function-fx-using-an-image-as-input-41e4</guid>
      <description>&lt;p&gt;Recently I was reading &lt;a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4462042/"&gt;this scientific paper&lt;/a&gt; and came across this graph:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oDy1r8iK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dhaql0999rrje0oj17pe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oDy1r8iK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dhaql0999rrje0oj17pe.png" alt="Image description" width="765" height="588"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At first glance it looked very much like a gamma distribution. Using &lt;a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gamma.html"&gt;scipy's gamma function&lt;/a&gt; I could create a similar curve with two parameters: the shape and the scale. If the data in the graph were truly gamma-distributed these would suffice, but since the curve is not a normalized density I also needed a third parameter to account for the y-axis range.&lt;/p&gt;

&lt;p&gt;Since the data itself is not provided with the graph, I approximated it using &lt;a href="https://plotdigitizer.com/app"&gt;PlotDigitizer&lt;/a&gt;, a web tool that lets you upload a graph and, given the axis scales, estimate the points on it.&lt;/p&gt;
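&lt;p&gt;The digitized points can be exported and loaded into the &lt;code&gt;df&lt;/code&gt; dataframe used in the snippets below. A minimal sketch (the values here are illustrative, not the actual digitized data):&lt;/p&gt;

```python
import pandas as pd
from io import StringIO

# a few digitized (x, y) points as exported from the tool
# (illustrative values only, not the paper's data)
csv = StringIO("""x,y
10,120
60,310
120,280
300,110
600,20
""")
df = pd.read_csv(csv)
```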

&lt;p&gt;With this data I was able to fit my desired function using &lt;a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html"&gt;curve_fit&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from scipy.optimize import curve_fit
import numpy as np
from scipy.stats import gamma

def func(x, a, b, c):
    return gamma.pdf(x, a, scale=b)*c

popt, pcov = curve_fit(func, df['x'], df['y'])

popt, pcov

&amp;gt;&amp;gt; (array([2.38845754e+00, 4.14949111e+01, 1.65402740e+04]),
&amp;gt;&amp;gt; array([[ 3.66739537e-02, -9.02425608e-01, -7.11863037e+01],
        [-9.02425608e-01,  2.56723468e+01,  2.87599647e+03],
        [-7.11863037e+01,  2.87599647e+03,  9.93779351e+05]]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then plot it using the plotly library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import plotly.graph_objects as go
import numpy as np
from scipy.stats import gamma

x = np.linspace(0, 600, 100)
y = gamma.pdf(x, a=popt[0], scale=popt[1])*popt[2]

fig = go.Figure(data=go.Scatter(x=x, y=y))

# Add data from the dataframe df
fig.add_trace(go.Scatter(x=df['x'], y=df['y'], mode='markers', name='original'))

# Edit the layout to add title, and change x and y axis labels
fig.update_layout(title='GHB Concentration in Blood', xaxis_title='Time (min)', yaxis_title='Concentration (mg/L)')


# Edit the layout to make the width of the graph 500px
fig.update_layout(width=500)
fig.show()
fig.write_image("fig1.png")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cNlmAVb3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4b4l9q4ltxkurshlze1i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cNlmAVb3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4b4l9q4ltxkurshlze1i.png" alt="Image description" width="500" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this example I wanted to approximate my data with a gamma function, but in reality it could be any function with any set of parameters. curve_fit uses non-linear least squares to fit a function f. If I didn't have a function in mind, I could use other machine learning techniques to obtain generic approximations and get a better sense of which functional forms to try.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>beginners</category>
      <category>python</category>
    </item>
    <item>
      <title>Introduction to Recommendations Systems</title>
      <dc:creator>Marcos Leal</dc:creator>
      <pubDate>Mon, 24 Jul 2023 13:31:45 +0000</pubDate>
      <link>https://dev.to/marcossilva/introduction-to-recommendations-systems-3404</link>
      <guid>https://dev.to/marcossilva/introduction-to-recommendations-systems-3404</guid>
      <description>&lt;p&gt;In today's fast-paced digital world, online shopping has become an integral part of our lives. With countless options available at our fingertips, the challenge for e-commerce businesses lies in making the shopping experience more enjoyable, efficient, and personalized for each individual customer. This is where the marvel of recommender systems comes into play. In this blog post, I'll share my recent work on personalization in e-commerce and delve into the fascinating world of recommender systems, exploring their advantages and the positive impact they have on both customers and businesses alike.&lt;/p&gt;

&lt;p&gt;As e-commerce platforms have expanded, recommender systems play a pivotal role in offering customers the best possible match for their interests. Some systems have even demonstrated an uncanny ability to anticipate customer needs, exemplified by the famous &lt;a href="https://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/?sh=7b2d7f446668"&gt;Target story&lt;/a&gt; of mailing baby products discounts to a customer before she knew she was pregnant.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9AeyTGlw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xgpjt6gcoqbox388roej.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9AeyTGlw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xgpjt6gcoqbox388roej.png" width="638" height="359"&gt;&lt;/a&gt;&lt;br&gt;History of Recommender Systems (image from &lt;a href="https://medium.com/geekculture/overview-of-recommender-systems-and-implementations-cae13088369"&gt;Overview of Recommender Systems And Implementations&lt;/a&gt;)
  &lt;/p&gt;

&lt;p&gt;Recommender systems form a family of algorithms aimed at maximizing user-item compatibility. By analyzing previous user-item interactions, these systems make personalized recommendations, guiding customers to the most relevant products based on their preferences and restrictions. Some of the oldest systems relied on simple interactions and small catalogs, but nowadays companies with a plethora of available data have built complex systems that perform well at scale every day.&lt;/p&gt;

&lt;p&gt;Leading platforms like Netflix and Spotify have integrated cutting-edge recommender systems into their services. These platforms excel at offering personalized suggestions, guiding users through their vast catalogs to discover content that resonates with their tastes. &lt;/p&gt;

&lt;p&gt;Recommender systems primarily rely on user-item interaction data, expressed explicitly (e.g., movie ratings) or implicitly (e.g., listening time to songs). Analyzing this data allows the systems to learn user preferences and make accurate recommendations.&lt;/p&gt;
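&lt;p&gt;As a toy illustration (made-up ratings, not data from any real system), explicit feedback can be arranged in a user-item matrix, and users can then be compared with a similarity measure such as cosine similarity:&lt;/p&gt;

```python
import numpy as np

# toy explicit-feedback matrix: rows = users, columns = items, 0 = no rating
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
])

def cosine(u, v):
    # cosine similarity between two rating vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# users 0 and 1 rate similarly; users 0 and 2 do not
sim_01 = cosine(ratings[0], ratings[1])
sim_02 = cosine(ratings[0], ratings[2])
```

Items liked by the most similar users become recommendation candidates; real systems build on the same interaction data with far more sophisticated models.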

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--x3w64mLf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/enh3bg3ns3ucc8cougwh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--x3w64mLf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/enh3bg3ns3ucc8cougwh.png" alt="Lady reaching for products in market aisle" width="512" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Recommender systems offer businesses deeper insights into customer needs and behaviors, allowing them to make informed decisions about product development and marketing strategies. By enhancing the overall shopping experience, businesses can improve customer retention and boost sales.&lt;/p&gt;

&lt;p&gt;Smaller companies may face challenges with limited data or lack of expertise in building recommender systems. To overcome these hurdles, they can explore plug-and-play solutions like Amazon Personalize, which provide a simple and scalable approach without requiring in-depth algorithm knowledge.&lt;/p&gt;

&lt;p&gt;The foundation of a successful recommender system lies in the initial dataset. Access to good representative data is crucial. While numerous algorithms and systems are available, they are only as effective as the data they are fed.&lt;/p&gt;

&lt;p&gt;In the next posts, we'll dive into algorithms, implementations, and challenges of recommender systems.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>aws</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Accelerating Training using most important data points</title>
      <dc:creator>Marcos Leal</dc:creator>
      <pubDate>Wed, 22 Jun 2022 14:35:16 +0000</pubDate>
      <link>https://dev.to/marcossilva/accelerating-training-using-most-important-data-points-5b2j</link>
      <guid>https://dev.to/marcossilva/accelerating-training-using-most-important-data-points-5b2j</guid>
      <description>&lt;p&gt;Reading the article &lt;a href="https://arxiv.org/pdf/2206.07137.pdf"&gt;Prioritized Training on Points that are learnable, Worth Learning, and Not Yet Learnt&lt;/a&gt; which was accepted this week in ICML 2022 they propose the Reducible Holdout Loss Selection (RHO-LOSS)&lt;br&gt;
They assume that training time is a bottleneck&lt;br&gt;
but data is abundant and possibly has outliers which is frequent scenario in web-scraped data. Since the data is so abundant it's also frequent to achieve SOTA in less than half of one epoch. To obtain this data selection they use the following algorithm:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zNWe3PhO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/35rjtlrpsbnikvlr1m8z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zNWe3PhO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/35rjtlrpsbnikvlr1m8z.png" alt="Image description" width="660" height="587"&gt;&lt;/a&gt;   &lt;/p&gt;

&lt;p&gt;Given the training set and a holdout set, two models are trained: the main one, which we optimize with gradient descent, and a second, simpler one trained only on the holdout set. Because the holdout set follows the same distribution as the training set, this second model's loss can be used to filter the examples in the next batch for the main model. First, using the smaller trained model, we pre-compute the irreducible loss (its loss on every training point). Then we draw a large random batch from the training set and compute the loss under the main model. The RHO-LOSS of each sample is its main-model loss minus its irreducible loss; we sort the samples by it and forward only the top-n samples, as a small batch, to the main model.&lt;/p&gt;
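&lt;p&gt;The selection step itself fits in a few lines. A sketch with toy loss values (not the paper's implementation):&lt;/p&gt;

```python
import numpy as np

def rho_loss_select(main_losses, irreducible_losses, n_select):
    # reducible holdout loss: main-model loss minus irreducible (holdout) loss
    rho = main_losses - irreducible_losses
    # indices of the n_select points with the highest RHO-LOSS
    return np.argsort(rho)[::-1][:n_select]

# toy losses for 6 candidate points:
# point 1 is easy (redundant), point 4 has high loss under both
# models (likely noise/outlier), points 0 and 3 are learnable
main = np.array([2.0, 0.1, 1.5, 0.9, 2.5, 0.4])
irr  = np.array([0.2, 0.1, 1.4, 0.3, 2.6, 0.1])
selected = rho_loss_select(main, irr, 2)
```

Redundant points score low because their main-model loss is low, and noisy or outlier points score low because their irreducible loss is high, which matches the three effects described below.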

&lt;p&gt;This approach has 3 main impacts on the data selection: &lt;br&gt;
1) &lt;strong&gt;Redundant points&lt;/strong&gt;: It filters out examples that would be too easy (i.e., low loss) for the main model. Given that the model already performs well on those, there's no real reason to lose training time on them again. And even if the model "unlearns" them, it's possible to recover this in the next batch of samples.&lt;br&gt;
2) &lt;strong&gt;Noisy points&lt;/strong&gt;: Other works focus on selecting data with high training loss, but those points might be ambiguous or incorrect, especially under the assumption that the data was web-scraped and quality is not assured. These points have high irreducible loss and low reducible loss, putting them near the bottom of the ordered list.&lt;br&gt;
3) &lt;strong&gt;Less relevant points&lt;/strong&gt;: Another pitfall of selecting only high-loss data is that such points might be outliers and shouldn't be prioritized. The holdout set is drawn from the same distribution as the true data and, being smaller, is expected to contain fewer outliers. Since both models perform badly on these outlier points, their RHO-LOSS tends to be small and they are deprioritized.&lt;/p&gt;

&lt;p&gt;The paper points out that this approach works on both large and small datasets, provided the small ones are at least doubled. In the experiments they use a large batch 10x larger than the small batch output (the samples that will be forwarded to the main model).&lt;/p&gt;

&lt;p&gt;The same IL model can be used to optimize multiple larger models at once, and it can even be trained without a separate holdout set: train two IL models, each on one half of the data, and let each one select from the half it was not trained on. Applying this approach across multiple models makes it possible to speed up a hyperparameter sweep.&lt;/p&gt;

</description>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
