DEV Community: bright inventions

Spring Tests with TestContainers

bright inventions — Wed, 15 May 2024 14:20:27 +0000

In the world of software development, making sure our apps are up to scratch before they go live is crucial. But here's the catch: testing them in a way that mirrors what happens in production may not be so straightforward. That's where TestContainers comes into the picture. It's a handy library that lets us bring in real databases, web browsers, and more, all within Docker containers managed through code, to make our tests as close to a real-life environment as possible.

From simulation to a real environment

Back in the day, we'd often rely on simulated services or in-memory databases for testing, which was okay but could be better. They just couldn't fully mimic the complexities of real-life scenarios. This mismatch could lead to apps breaking down in the real world even though they passed all tests with flying colors. For example, we couldn't test persistence to ensure that our data met all DB constraints. TestContainers helps us dodge this bullet by letting us test with the actual tools and services our app will interact with, but in a safe, controlled environment.

Enhanced testing

In this article, we're diving into how to integrate TestContainers into integration tests for Spring, a powerful framework widely used for enterprise-level development in Java/Kotlin applications. Spring's inherent complexity, combined with the need for consistent and reliable testing, makes the integration of TestContainers particularly beneficial.

We are also going to use the java-test-fixtures plugin to create a reusable Spring annotation, which will be used to set up a Postgres test container for our domain module's tests.

What is TestContainers?

TestContainers is an open-source set of libraries that supports JUnit tests, providing lightweight, throwaway instances of common databases, Selenium web browsers, or anything else that can run in a Docker container. It simplifies the process of creating unit and integration tests by providing a programmable environment that is both controlled and isolated. This is particularly useful for testing database interactions, message queues, web applications, and other services that are typically complex to set up and manage for testing purposes.

You can read more about TestContainers in the official documentation.

What is TestFixtures?

A test fixture is a concept from automated testing: a set of preconditions or inputs used to test a piece of software consistently.

The java-test-fixtures plugin is a feature of Gradle, a popular build automation tool. It adds a dedicated testFixtures source set to a project. Code and resources placed there can be used by the project's own tests and shared with the tests of other modules. This plugin is particularly useful in Java and other JVM-based projects.

Example

As an example, we reused the code we prepared for the article How to integrate a Spring Boot app with Grafana using OpenTelemetry standards.

Project structure

We have modified the project structure as presented below:

spring-observability-bootstrap
├── appointment
│   ├── main
│   └── test
├── database
│   ├── main
│   ├── test
│   └── testFixtures
└── src
    ├── main
    └── test

We extracted the :database module, so the database configuration is separated from the business logic. From now on, if we want to use the database in a new module, we can just add a dependency on the :database module:

implementation(project(":database"))

All dependencies required to configure the Postgres database were moved to the :database module.

We also created a new :appointment module, containing the business logic responsible for the appointment management feature. This is the module we are going to add our integration tests to.

Problem

Most Spring + TestContainers tutorials show you how to integrate TestContainers with JUnit, but in most cases you also have a framework that runs your tests, like Spring in our case. For Spring, these tutorials instruct you to create an abstract test class and have all your database test classes extend it to run test containers, which is not the best practice according to the “Composition over inheritance” principle.

What most of these tutorials are showing you is the way to:

Start Spring context
Start TestContainer
Inject TestContainer configuration into Spring context.

This may be problematic for a couple of reasons:

Spring may require running services before starting the Spring Context
- Some Spring Beans, like Liquibase or Flyway, need a DataSource before being instantiated, so we would like to have the database running before the Spring context starts.
Reusable containers
- Starting a new Docker container takes time. If you create a container field in your test classes and annotate it with @Container, as the JUnit integration suggests, then you are starting a new container for each test class.
Non-compliance with the “Composition over inheritance” principle
- Some tutorials suggest sharing the container object between classes through an abstract base class. But what if we want to start 2 different test containers for one test class? For example, Postgres as the database and Redis as the cache? Do we need to create another abstract class extending PostgresTestContainerTest called PostgresAndRedisTestContainerTest? And if we need only Redis, do we create a third one just for Redis? It’s not a good approach.
Reusable Spring Context
- Even if you optimize your tests to share containers through base classes, this does not mean that these tests will share a Spring Context. Starting a new Spring Context is also time-consuming for bigger projects, so it may be a good idea to configure your tests (or at least groups of tests) to share a Spring Context.

Solution

Luckily, we came up with a solution that may solve all of these problems!

What we can do instead is pre-configure the Spring Context to set up TestContainers during the Spring Context initialization phase. We are going to use the @ContextConfiguration annotation. It requires passing an initializer that implements ApplicationContextInitializer. Our PostgresTestContainersInitializer looks like this:

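import org.springframework.boot.test.util.TestPropertyValues
import org.springframework.context.ApplicationContextInitializer
import org.springframework.context.ConfigurableApplicationContext
import org.testcontainers.containers.PostgreSQLContainer

class PostgresTestContainersInitializer :
    ApplicationContextInitializer<ConfigurableApplicationContext> {
    override fun initialize(applicationContext: ConfigurableApplicationContext) {
        // Create and start the database before any Spring bean is instantiated
        val postgresSqlContainer = PostgreSQLContainer<Nothing>("postgres:15.4")

        postgresSqlContainer.start()

        // Register the container as a bean; the intent is to shut it down on context close
        applicationContext.beanFactory.registerSingleton("postgresSqlContainer", postgresSqlContainer)

        // Point the Spring datasource at the running container
        TestPropertyValues.of(
            mapOf(
                "spring.datasource.url" to postgresSqlContainer.jdbcUrl,
                "spring.datasource.username" to postgresSqlContainer.username,
                "spring.datasource.password" to postgresSqlContainer.password,
            )
        ).applyTo(applicationContext)
    }
}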

The overridden initialize() method does 3 things:

Create PostgreSQL Container:
- A PostgreSQLContainer object named postgresSqlContainer is created using the Docker image postgres:15.4.
- The start() method is called on the postgresSqlContainer object to start the container.
Register Container in ApplicationContext:
- The PostgreSQL container is registered as a singleton bean in the Spring application context. This allows the container to be accessed within the Spring application.
- The intent is for the container to shut down when the context closes.
Set Database Properties:
- TestPropertyValues is used to set the properties related to the database: the database URL (jdbcUrl), username, and password. These values are retrieved from the postgresSqlContainer object.
- The applyTo() method applies these properties to the applicationContext. This ensures that the Spring application can connect to the PostgreSQL database running in the Docker container.

Then we can annotate our Spring test classes with the annotation:
@ContextConfiguration(initializers = [PostgresTestContainersInitializer::class])

If we want to keep it pretty, we can create our custom annotation over @ContextConfiguration:

@ContextConfiguration(initializers = [PostgresTestContainersInitializer::class])
annotation class PostgresTestContainer

And use it like this:

@SpringBootTest
@PostgresTestContainer
internal class AppointmentServiceTest {
...
}

That’s it!

Now you only need this one PostgresTestContainer annotation to run a Postgres TestContainer for your Spring test.

You can access the full code in our example repository.

By Maciej Nawrocki (Senior Backend Developer) and Adam Waniak (Backend Developer) @ bright inventions.

Debugging production CDK Node.js app with AWS Fargate

bright inventions — Thu, 02 May 2024 12:11:12 +0000

Recently my colleague wrote a blog post on how to create a cheap Node.js Fargate service. Imagine that after some time of happy running, you notice that something is not releasing memory or the task suddenly exits with an error. You analyze the logs and metrics, but the issue seems to be deeper, and you have to get your hands dirty. For such cases, a great option is to debug the Node.js service with the inspector. In this tutorial I will show you how to use it with a CDK-deployed Node.js app on AWS Fargate using AWS ECS Exec and AWS SSM port forwarding.

Node.js debugging

Debugging with the Node.js inspector, enabled by the --inspect flag, can reveal problems with the event loop or the location of the memory leak you are looking for. It has personally helped me many times. If you are interested in the details and in how to look for problems, you can watch a great video of it in action with one of the core contributors of Node.js.

Remote Node.js debugging

Ok, so you are now armed with cool knowledge about Node.js debugging. You say, “Great! Let’s run the inspector and check it out.” Not so fast: your service is running in a remote environment. It means that you somehow have to expose the remote debugger to your local inspector environment. Sometimes this is not necessary, as you might be able to spot the problem when running the process locally. But what if the problem only appears when some particular thing happens on the remote? Traditionally, you could just expose the inspected port via SSH local forwarding. But what if you are running Fargate and you are not able to SSH to the underlying machine? Let’s find out!

Checking if your task is eligible for ECS exec

Ok, so your task is running. What shall you do next? You will use a combination of AWS ECS Exec and AWS SSM port forwarding to forward the debugger port to your local machine. AWS has an official GitHub repo with a script that checks whether your task is ready for ECS Exec. Using the infrastructure described before, you should configure the AWS CLI and execute

./check-ecs-exec.sh cluster-id task-id

in our case

./check-ecs-exec.sh BrightCheapEcsFargateStack-ClusterEB0386A7-PQVGdDDGFxS4 70d6a6e5606b4cf5ad413821326bd765

As per the output, the task is missing some of the things required for ECS Exec:

  Exec Enabled for Task  | NO

Task Role Permissions    
     ssmmessages:CreateControlChannel: implicitDeny
     ssmmessages:CreateDataChannel: implicitDeny
     ssmmessages:OpenControlChannel: implicitDeny
     ssmmessages:OpenDataChannel: implicitDeny

The README of the project provides directions on how to fix the potential issues you might have. They might be connected to your IAM user, ECS task role permissions, or configuration. For the repo we are using, only two things need to be added. The first is

enableExecuteCommand: true

in the FargateService definition in CDK. The second is changing the container command to expose the debugger:

command: ['npx', '--node-options=--inspect', 'http-server']

Upon CDK deployment you will see that the required ssmmessages permissions are added automatically.
When you rerun the script, you will see that all checks are green or yellow. That means that we can connect to our task using ECS Exec!

Connecting to Fargate ECS task

To do it, you first need to know the cluster-name, task-id and runtime-id of the task.
You can get those by running

aws ecs describe-tasks \
    --cluster cluster-id \
    --tasks task-id

in our case

aws ecs describe-tasks \
    --cluster BrightCheapEcsFargateStack-ClusterEB0386A7-PQVGdDDGFxS4 \
    --tasks 70d6a6e5606b4cf5ad413821326bd765

The runtime ID in the above case is 70d6a6e5606b4cf5ad413821326bd765-2750272591, so we can run the following:

aws ssm start-session \
    --target ecs:BrightCheapEcsFargateStack-ClusterEB0386A7-PQVGdDDGFxS4_70d6a6e5606b4cf5ad413821326bd765_70d6a6e5606b4cf5ad413821326bd765-2750272591 \
    --document-name AWS-StartPortForwardingSession \
    --parameters '{"portNumber":["9229"], "localPortNumber":["9229"]}'

Here, the target is a string of the form ecs:<cluster-name>_<task-id>_<container-runtime-id>. Port 9229 is the default port for the Node.js inspector.
If all is OK, you will get the following response:

Starting session with SessionId: rafal.hofman@brightinventions.pl-06148b47c2f094b19
Port 9229 opened for sessionId rafal.hofman@brightinventions.pl-06148b47c2f094b19.
Waiting for connections...

Connection accepted for the session [rafal.hofman@brightinventions.pl-06148b47c2f094b19]
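
The target string and the --parameters JSON used above are easy to get wrong when typed by hand. As a minimal sketch, they could be assembled like this in Python. The helper names build_ssm_target and build_port_forward_parameters are hypothetical and not part of any AWS tooling; the sketch only formats strings and does not call AWS.

```python
import json


def build_ssm_target(cluster_name: str, task_id: str, runtime_id: str) -> str:
    """Return the SSM target: ecs:<cluster-name>_<task-id>_<container-runtime-id>."""
    return f"ecs:{cluster_name}_{task_id}_{runtime_id}"


def build_port_forward_parameters(remote_port: int = 9229, local_port: int = 9229) -> str:
    """Return the JSON string passed to --parameters for AWS-StartPortForwardingSession."""
    return json.dumps({
        "portNumber": [str(remote_port)],        # port inside the container
        "localPortNumber": [str(local_port)],    # port opened on your machine
    })


# Values taken from the example task in this post
target = build_ssm_target(
    "BrightCheapEcsFargateStack-ClusterEB0386A7-PQVGdDDGFxS4",
    "70d6a6e5606b4cf5ad413821326bd765",
    "70d6a6e5606b4cf5ad413821326bd765-2750272591",
)
print(target)
print(build_port_forward_parameters())
```

The two printed values can be pasted into the --target and --parameters arguments of aws ssm start-session.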

Running local inspector with remote ECS target

After a successful connection, you can open the inspector in the Chrome browser (chrome://inspect). There you can see the remote target you just enabled, forwarded to your local port 9229.

Upon connection, you can see logs from the Node.js process and can go ahead with debugging.

As you can see, the process is pretty straightforward. You do not have to expose your --inspect process port publicly, but can safely use AWS SSM port forwarding.

Importantly, the task can have only a private IP, and you can still access it! If needed, you can also use ECS Exec to open a /bin/bash shell in the container. Remember to remove enableExecuteCommand when you are done. Happy coding & debugging!

By Rafał Hofman, Fullstack developer @ bright inventions.

Retrieval Augmented Generation (RAG) in Machine Learning Explained

bright inventions — Thu, 18 Apr 2024 10:03:20 +0000

Imagine that your company has access to a powerful AI tool that can process vast amounts of data and extract significant conclusions, identify key information, and effectively summarize it. Such capabilities could significantly enhance the efficiency of your employees' work, allowing them to focus on the most valuable aspects of their job, rather than on time-consuming data processing. In this context, Retrieval Augmented Generation (RAG) opens new perspectives. RAG allows for the integration of AI models with specific, internal data of your company, enabling not only processing but also intelligent interpretation and utilization of this knowledge. In this article, we will explore how to accomplish this.

Retrieval Augmented Generation (RAG) definition

RAG is a technique that allows expanding the knowledge of the pre-trained language model with real-time information retrieval from a large database of documents.

The basic prompt schema for querying a machine learning model looks like this:

In this situation, we ask the machine learning model about the capital of Poland. This is general knowledge, and our model has no problems with the answer.

Going deeper with Retrieval Augmented Generation in machine learning

Fancy going deeper than this simple example? Let's say we would like to have a machine learning model that can answer questions about the plot of our original, never-published 300-page book titled 'My Story,' the only source of which is a .pdf file on our private laptop. Therefore, there is no chance that the model came into contact with this book during training, nor is there any chance it could find any information about it elsewhere.

If we asked the machine learning model about this story, it could not answer. This is how it would look:

In such a situation, Retrieval Augmented Generation (RAG) comes to the rescue. We can simply expand the knowledge of the machine learning model by adding contextual information to the prompt.

In theory, it would look as follows:

In theory, it would work. The model receives our query along with the entire book, so it now knows the story and can answer our query. However, there is a practical problem with this solution.

The number of tokens that we can use with one prompt is limited. For example, for GPT-4, this limit is 8,192 tokens; even for GPT-4 Turbo, it is 128,000 tokens.

Let's assume that one page of our book has an average of 500 words. 300 pages times 500 words equals 150,000 words in the entire book. We should remember that the number of used tokens consists of the prompt query, prompt context, and the machine learning model's answer.

Even if we optimistically count one token per word, this amounts to 150,000 tokens for the context alone, and English text usually needs more tokens than words. By adding the prompt query and the machine learning model's answer, the total will be even higher. Even if sending such a prompt were possible, it would simply be a waste of resources and money. We don’t need the entire context of the book to answer our queries.
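
To make the arithmetic concrete, here is a rough sketch of the token budget in Python. The page and word counts and the two limits come from the text above. The ratio of 0.75 words per token is an assumed rule of thumb, and the 1,000 tokens reserved for the query and the answer is an arbitrary illustrative value; a real count depends on the tokenizer and the text.

```python
PAGES = 300
WORDS_PER_PAGE = 500
GPT4_LIMIT = 8_192
GPT4_TURBO_LIMIT = 128_000

# Assumption: about 0.75 English words per token (rule of thumb only)
WORDS_PER_TOKEN = 0.75


def estimate_tokens(words: int, words_per_token: float = WORDS_PER_TOKEN) -> int:
    """Estimate the token count for a given number of words."""
    return round(words / words_per_token)


def fits(context_tokens: int, limit: int, reserved: int = 1_000) -> bool:
    """Check whether the context plus a reserve for query and answer fits the limit."""
    return context_tokens + reserved <= limit


book_words = PAGES * WORDS_PER_PAGE
book_tokens = estimate_tokens(book_words)
optimistic_tokens = estimate_tokens(book_words, words_per_token=1.0)

print(f"words in the book: {book_words}")
print(f"estimated tokens: {book_tokens}")
print(f"fits GPT-4: {fits(book_tokens, GPT4_LIMIT)}")
print(f"fits GPT-4 Turbo: {fits(book_tokens, GPT4_TURBO_LIMIT)}")
print(f"fits GPT-4 Turbo at one token per word: {fits(optimistic_tokens, GPT4_TURBO_LIMIT)}")
```

Under either assumption the whole book exceeds both limits, which is why only the relevant parts should be sent.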

It seems obvious that we need to divide our book into chunks, and for the context of the prompt, attach only those chunks that are relevant to our question. Dividing the text into chunks is a simple task, but how do we determine which parts are necessary to get the answer to our query?

Here, the technique of representing text as numerical vectors, known as embeddings, comes to the rescue. There is another blog post where you can learn more details about how embeddings work.

For now, it's enough to understand that embedding is a technique that converts text into numerical vectors, which retain the meaning of the converted sentence. Depending on the sentence's meaning, these vectors are positioned at specific locations in the vector space. So, now we know that before running our prompt, we have to first prepare the data (the book in our case) by dividing it into chunks, converting them into numeric vectors with the embedding technique, and saving them in a vector database.

This process looks like this:

Great! We have prepared our data so that we can easily find exactly the parts of the book that are useful for our query.
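
The preparation and lookup steps described above can be sketched in a few lines of Python. This is a toy illustration only. A real application would use a trained embedding model and a vector database; here a simple word-count vector (toy_embed) and a plain list stand in for them. The sample text and all function names are made up for the example.

```python
import math
import re
from collections import Counter


def split_into_chunks(text: str, words_per_chunk: int) -> list[str]:
    """Divide the text into chunks of at most words_per_chunk words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(text: str, words_per_chunk: int) -> list[tuple[str, Counter]]:
    """Stand-in for a vector database: a list of (chunk, vector) pairs."""
    return [(chunk, toy_embed(chunk)) for chunk in split_into_chunks(text, words_per_chunk)]


def retrieve(index: list[tuple[str, Counter]], query: str, top_k: int = 1) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    query_vector = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine_similarity(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


book = (
    "Anna grew up in a small village by the sea. "
    "Years later she moved to Warsaw to study medicine. "
    "Her brother Tom stayed behind and became a fisherman."
)
index = build_index(book, words_per_chunk=10)
best_chunks = retrieve(index, "Where did Anna study medicine?")
print(best_chunks)
```

Only the retrieved chunks, not the whole book, would then be attached to the prompt as context.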

With this knowledge and the data prepared, let's start the process of obtaining answers from the machine learning model once again. The image below describes all the steps undertaken during this process.

By Paweł Polak, Fullstack Developer @ bright inventions

Understanding Embeddings: A Short Guide with an Example

bright inventions — Mon, 15 Apr 2024 06:17:08 +0000

Embeddings are an invisible, yet important part of many technologies we encounter. From internet search engines, through recommendation systems and advertisement personalization, to advanced analyses of images, videos, and technologies for speech and sound recognition, embeddings play a key role everywhere. In this article, we will explain how embeddings work and how they facilitate and enrich our daily experiences with technology.

As I mentioned above, vector embeddings are a popular technique to represent information in a format (typically as a vector of numerical values) that can be easily processed by algorithms, especially deep learning models. This ‘information’ can be text, pictures, video, and audio.

For example, the conversion of the word 'dog' into a numerical vector representation could look like this:

What is the embedding dimension?

A crucial factor in determining the quality and effectiveness of the embedding is the embedding dimension. Generally, the term 'dimensionality of word embedding' refers to the total count of dimensions used to define a word's vector representation. This number is usually established during the development of the word embedding and indicates how many distinct features are included in the vector representation of the word.

For text embeddings, these vectors are constructed in a way that captures the semantic meaning of the text. This ensures that words or sentences conveying similar meanings are close to each other in the embedded space, often referred to as a vector space.

What does that mean? Here’s a simple example

Let's say we have a space with only two dimensions - [x, y], where x represents sex and y represents activity.

Now, for example, when we ask the question 'Who is walking?', the search looks for vectors whose y dimension corresponds to the 'Walk' activity.

We can observe that 'Walk' is associated with a woman, a man, a boy, and a girl. This means that all of them are walking. And thus, we get our answer: A woman, a man, a boy, and a girl are walking.

Remember, this is a very simplified example with only two dimensions of meaning. In reality, the more such dimensions there are, the better our embedding captures the meanings of the stored words, phrases, and sentences.
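
The two-dimensional example can also be written down as a small Python sketch. All coordinates below are invented for illustration: x stands for sex and y for activity, with walking placed around y = 0.2 and swimming around y = 0.9. Real embeddings are produced by a trained model and have many more dimensions.

```python
# Hypothetical 2D embeddings: (x = sex, y = activity)
EMBEDDINGS = {
    "A woman is walking": (0.1, 0.2),
    "A man is walking": (0.9, 0.2),
    "A girl is walking": (0.2, 0.25),
    "A boy is walking": (0.8, 0.25),
    "A woman is swimming": (0.1, 0.9),
    "A man is swimming": (0.9, 0.9),
}

# The question "Who is walking?" says nothing about sex,
# so in this toy model it constrains only the activity dimension.
WALK_ACTIVITY = 0.2


def find_by_activity(activity: float, tolerance: float = 0.1) -> list[str]:
    """Return the sentences whose y coordinate lies close to the given activity value."""
    return [
        sentence
        for sentence, (_, y) in EMBEDDINGS.items()
        if abs(y - activity) <= tolerance
    ]


walkers = find_by_activity(WALK_ACTIVITY)
print(walkers)
```

The search returns the four walking sentences and skips the swimming ones, mirroring the answer given above.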

Ready for more embedding tutorials?

Delve deeper into various applications of embedding in AI:

Retrieval Augmented Generation (RAG) in Machine Learning Explained

By Paweł Polak, Fullstack Developer @ bright inventions.

First Steps with AWS Bedrock

bright inventions — Thu, 14 Mar 2024 14:02:11 +0000

AI is taking over the world. At Bright Inventions, we've already helped several clients with generative AI. In this blog post, we'll see how to use aws-cdk to create a simple API that responds to prompts.

Request Bedrock model access

If you haven't used Bedrock before, the first step is to request model access.

You can do so in the AWS Console, on the Bedrock > Model access page:

For Claude and Claude Instant models, describe your use case briefly.

Define your AWS Lambda function with aws-cdk

Declaring an AWS Lambda function with aws-cdk is straightforward. Our function needs to invoke Bedrock models, hence
appropriate IAM permissions are necessary. For simplicity, we'll use AWS Lambda Function URLs with authentication type
NONE.
You should use that only for evaluation purposes.

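import * as path from 'path';
import * as cdk from 'aws-cdk-lib';
import { CfnOutput, Duration } from 'aws-cdk-lib';
import { PolicyStatement } from 'aws-cdk-lib/aws-iam';
import { Architecture, FunctionUrlAuthType } from 'aws-cdk-lib/aws-lambda';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';
import { Construct } from 'constructs';

export class BrightBedrockSimpleStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const lambdaPrompt = new NodejsFunction(this, 'lambda', {
      architecture: Architecture.ARM_64,
      timeout: Duration.seconds(30),
      entry: path.join(process.cwd(), 'lib', 'bedrock-client', 'simple-api.lambda.ts'),
    })

    // Allow the function to invoke Bedrock models
    lambdaPrompt.addToRolePolicy(new PolicyStatement({
      actions: ['bedrock:InvokeModel'],
      resources: ['*']
    }))

    // Public URL without authentication: for evaluation purposes only
    const functionUrl = lambdaPrompt.addFunctionUrl({
      authType: FunctionUrlAuthType.NONE
    });

    new CfnOutput(this, 'function-url', {
      value: functionUrl.url
    });
  }
}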

Invoking Bedrock amazon.titan-text-express-v1

Bedrock provides multiple models. The models differ not only in terms of their capabilities but also in their API. For
starters, let's use Titan Text Express.

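import { BedrockRuntimeClient, InvokeModelCommand } from '@aws-sdk/client-bedrock-runtime';
import { APIGatewayProxyHandlerV2 } from 'aws-lambda';

const bedrock = new BedrockRuntimeClient({});

export const handler: APIGatewayProxyHandlerV2 = async (event, context) => {
  // The request body is used as the prompt
  const body = event.body!

  const response = await bedrock.send(new InvokeModelCommand({
    body: JSON.stringify({
      inputText: body,
    }),
    contentType: 'application/json',
    modelId: "amazon.titan-text-express-v1"
  }));

  // Pass the raw model response back to the caller
  const modelResponseJson = response.body.transformToString();

  return {
    statusCode: response.$metadata.httpStatusCode ?? 500,
    headers: {
      'Content-Type': response.contentType
    },
    body: modelResponseJson
  }
}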

After we deploy the code, we can invoke our function and ask basic questions, e.g.

curl -s -X POST --location "https://${YOUR_LAMBDA_ID}.lambda-url.eu-central-1.on.aws" \
    -H "Content-Type: text/plain" \
    -d 'Which country has the highest GDP?' | jq
{
  "inputTextTokenCount": 7,
  "results": [
    {
      "tokenCount": 34,
      "outputText": "\nThe country that has the highest GDP is the United States. Its total GDP is $23.07 trillion in terms of purchasing power parity (PPP).",
      "completionReason": "FINISH"
    }
  ]
}

Titan Text Express configuration

We can control and tweak some aspects of how the model responds to our prompts. For example, for Titan Text Express
we can configure:

temperature: Float value to control randomness in the response (0 to 1, default 0). Lower values decrease randomness.
topP: Float value to control the diversity of options (0 to 1, default 1). Lower values ignore less probable options.
maxTokenCount: Integer specifying the maximum number of tokens in the generated response (0 to 8,000, default 512).
stopSequences: Array of strings indicating where the model should stop generating text. Use the pipe character (|) to separate different sequences (up to 20 characters).

Let's modify our lambda to allow controlling the parameters.

const response = await bedrock.send(new InvokeModelCommand({
  body: JSON.stringify({
    inputText: body,
    textGenerationConfig: {
      temperature: parseFloat(event.queryStringParameters?.temperature ?? '') || undefined,
      topP: parseFloat(event.queryStringParameters?.topP ?? '') || undefined,
      maxTokenCount: parseInt(event.queryStringParameters?.maxTokenCount ?? '') || undefined,
    }
  }),
  contentType: 'application/json',
  modelId: "amazon.titan-text-express-v1"
}));
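
One detail of the snippet above is worth knowing: the `|| undefined` fallback treats 0 the same as a missing value, so a request with temperature=0 ends up using the model default. The following Python sketch shows one way to parse optional numeric parameters while keeping an explicit zero. It is an illustration under assumptions, not the code from the example repository; the function names are hypothetical, and the sketch only builds the configuration dictionary.

```python
import math
from typing import Optional


def parse_optional_float(raw: Optional[str]) -> Optional[float]:
    """Return a finite float, or None when the value is missing or not a number."""
    if raw is None or raw.strip() == "":
        return None
    try:
        value = float(raw)
    except ValueError:
        return None
    return value if math.isfinite(value) else None


def parse_optional_int(raw: Optional[str]) -> Optional[int]:
    """Return an int, or None when the value is missing or not an integer."""
    if raw is None or raw.strip() == "":
        return None
    try:
        return int(raw)
    except ValueError:
        return None


def build_text_generation_config(query: dict) -> dict:
    """Build textGenerationConfig, leaving out parameters that were not supplied."""
    candidates = {
        "temperature": parse_optional_float(query.get("temperature")),
        "topP": parse_optional_float(query.get("topP")),
        "maxTokenCount": parse_optional_int(query.get("maxTokenCount")),
    }
    # Dropping None mirrors how JSON.stringify omits undefined properties
    return {name: value for name, value in candidates.items() if value is not None}


print(build_text_generation_config({"temperature": "0", "topP": "0.1"}))
print(build_text_generation_config({"temperature": "abc", "maxTokenCount": "256"}))
```

The first call keeps the explicit zero temperature, and the second drops the invalid value while keeping maxTokenCount.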

After deploying, we can control the model parameters through query string parameters:

curl -X POST --location "https://${YOUR_LAMBDA_ID}.lambda-url.eu-central-1.on.aws?temperature=0.9" \
    -H "Content-Type: text/plain" \
    -d 'What is the country that has the most freedom of speech in the world?'

{
  "inputTextTokenCount": 15,
  "results": [
    {
      "tokenCount": 28,
      "outputText": "\n\"The United States has the most freedom of speech in the world, according to the 2022 Freedom House Index.\" ",
      "completionReason": "FINISH"
    }
  ]
}

The results will be more elaborate if we change topP.

curl -X POST --location "https://${YOUR_LAMBDA_ID}.lambda-url.eu-central-1.on.aws?temperature=0.9&topP=0.1" \
    -H "Content-Type: text/plain" \
    -d 'What is the country that has the most freedom of speech in the world?'

{
  "inputTextTokenCount": 15,
  "results": [
    {
      "tokenCount": 51,
      "outputText": "\n\"The United States is considered the country that has the most freedom of speech in the world. This freedom is protected by the First Amendment of the U.S. Constitution, which guarantees the right to freedom of expression, assembly, and religion.\" ",
      "completionReason": "FINISH"
    }
  ]
}

Summary

As you see, it is straightforward to get started with AWS Bedrock. The full example of this blog post is available in
GitHub repo.

By Piotr Mionskowski, Head of Technology & Partner @ Bright Inventions

Cheapest ECS Fargate Service with HTTPS

bright inventions — Mon, 26 Feb 2024 14:56:14 +0000

There are plenty of ways to run a Docker image in AWS: custom EC2 images, Elastic Beanstalk, ECS (Classic and Fargate) and
finally EKS. We have a lot of articles and guidelines for production best practices. However, not all workloads require the same levels of resiliency, security or robustness. Sometimes, all we want is an easy and economical way to run a webserver. In this article, you'll find how to run a web service in ECS Fargate cheaply using aws-cdk.

Network

For most custom workloads in AWS, we need a VPC. Creating one with aws-cdk takes a few lines of code. However, to
reduce costs, we need to make sure that we have no NAT Gateways:

const vpc = new Vpc(this, 'Vpc', {
  natGateways: 0 // $30 a month
})

Each NAT Gateway instance costs around $30 a month. When we run multiple services in a single VPC, the cost will be
minuscule in comparison. However, when we run a single service, that's a major portion of the total.

Fargate service

Without NAT Gateways, for our workloads to be able to talk to the internet, they have to be in a public subnet
and have a public IP address.

To save additional money, we'll use the FARGATE_SPOT capacity provider. This offers a variable rate, with up to a 70% discount vs
on-demand instances.

const task = new FargateTaskDefinition(this, 'task')

const service = new FargateService(this, 'Service', {
  cluster: new Cluster(this, 'Cluster', { vpc }),
  assignPublicIp: true,
  vpcSubnets: {
    subnetType: SubnetType.PUBLIC
  },
  taskDefinition: task,
  capacityProviderStrategies: [{
    capacityProvider: 'FARGATE_SPOT', // 70% discount
    weight: 1
  }]
});

Example workload

A web service usually accepts HTTP traffic on a specified port. Let's use http-server for demonstration purposes:

const backend = task.addContainer('backend', {
  image: ContainerImage.fromRegistry("node:20-alpine"),
  command: ['npx', 'http-server'],
  workingDirectory: '/srv',
  portMappings: [{
    containerPort: 8080
  }]
})

Note that our container name in the task is backend. We also expose port 8080 to other containers running in the
task.

HTTPS is a must these days

We should never expose any web service without HTTPS. In a typical setup, we would either use an AWS Load Balancer or
AWS API Gateway. The cost of ALB is roughly $16 a month. API gateway would be cheaper. However, it would add complexity
to our solution.

Let's use Caddy which can act as reverse-proxy with automatic HTTPS coverage.

const hostedZone = HostedZone.fromLookup(this, 'tutorial.bright.dev', {
  domainName: 'tutorial.bright.dev'
});

const baseUrl = new URL(`https://cheap-ecs-fargate.${hostedZone.zoneName}`);

task.addContainer('caddy', {
  image: ContainerImage.fromRegistry('caddy:2-alpine'),
  command: [
    'caddy', 'reverse-proxy', '--from', baseUrl.hostname, '--to', '127.0.0.1:8080'
  ],
  portMappings: [{
    containerPort: 80
  }, {
    containerPort: 443
  }],
}).addContainerDependencies({
  container: backend,
  condition: ContainerDependencyCondition.START
})

As you can see, we configure caddy container to reverse-proxy traffic to our web service.
The caddy container will also listen on both HTTP and HTTPS ports.

Expose services publicly

Finally, for our service to be reachable from the internet, we need to configure DNS to point to the public IP of our
task instance.

However, given that we use a spot instance, which can and will often be replaced by AWS, we should automatically
update our DNS entry whenever the task is restarted. Thankfully, there's a
construct @raykrueger/cdk-fargate-public-dns available that will do it for us:

service.connections.allowFromAnyIpv4(Port.tcp(80), 'Http')
service.connections.allowFromAnyIpv4(Port.tcp(443), 'Https')

new PublicIPSupport(this, 'PublicIPSupport', {
  cluster,
  service,
  dnsConfig: {
    domainName: baseUrl.hostname,
    hostzedZone: hostedZone.hostedZoneId
  }
})

Summary

The above setup is not suited for the majority of production use. However, for non-production isolated use, it can
significantly reduce your monthly bill:

The full example can be found
in GitHub repository.
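
As a rough back-of-the-envelope check, the savings can be added up from the figures quoted in this post: about $30 a month per NAT Gateway, about $16 a month for an ALB, and up to a 70% discount for Fargate Spot. The on-demand task price of $10 a month below is a made-up placeholder, not a real AWS price; substitute your own numbers.

```python
NAT_GATEWAY_MONTHLY_USD = 30   # approximate, from the Network section
ALB_MONTHLY_USD = 16           # approximate, from the HTTPS section
MAX_SPOT_DISCOUNT = 0.70       # "up to" 70%, so this is the best case


def avoided_fixed_costs(nat_gateways: int = 1, load_balancers: int = 1) -> int:
    """Monthly cost avoided by skipping NAT Gateways and load balancers."""
    return nat_gateways * NAT_GATEWAY_MONTHLY_USD + load_balancers * ALB_MONTHLY_USD


def best_case_spot_price(on_demand_monthly_usd: float) -> float:
    """Monthly task price if the full Spot discount applied."""
    return round(on_demand_monthly_usd * (1 - MAX_SPOT_DISCOUNT), 2)


HYPOTHETICAL_ON_DEMAND_USD = 10.0  # placeholder value for illustration

print(f"avoided fixed costs: ${avoided_fixed_costs()} a month")
print(f"best-case Spot price: ${best_case_spot_price(HYPOTHETICAL_ON_DEMAND_USD)} a month")
```

With one NAT Gateway and one ALB avoided, the fixed savings alone come to roughly $46 a month.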

By Piotr Mionskowski, Head of Technology & Partner at Bright Inventions.

Build LLM application with RAG (LangChain v0.1.0)

bright inventions — Fri, 02 Feb 2024 11:20:42 +0000

Let’s build a simple LLM application in Python using the LangChain library as well as RAG and embedding techniques. Follow our step-by-step tutorial published after the new release of LangChain 0.1.0 in January 2024.

In previous blog posts, we have described how the embeddings work and what the RAG technique is. If you need to catch up with some basics, read the articles. Are you ready? Now it’s time to turn theory into practice!

How to build an LLM application from scratch

We will build a simple LLM application in Python using the LangChain library. LangChain is a popular library that makes building such applications very easy.

Our RAG application will expand an LLM's knowledge using private data. In this case, it will be a PDF file containing some text.

It's also possible to achieve a similar goal by using OpenAI agents and expanding their knowledge base with specific files by uploading them to OpenAI's servers for a designated agent. However, this method entails storing our confidential data with OpenAI's servers, which may not always align with our privacy preferences. My colleague – Rafał Hofman – wrote a great article about data privacy in OpenAI services.

As the file for expanding knowledge, we will use an article about 'ReAct', titled 'ReAct: Synergizing Reasoning and Acting in Language Models'. This article discusses a research project that integrates decision-making and reasoning skills in large language models.

1. Prerequisites

At the very beginning, we must install all the modules that our application will use. Let’s run this command in the terminal in the project directory:

pip install langchain-community==0.0.11 pypdf==3.17.4 langchain==0.1.0 python-dotenv==1.0.0 langchain-openai==0.0.2.post1 faiss-cpu==1.7.4 tiktoken==0.5.2 langchainhub==0.1.14

Let's create a ‘data’ directory and place the PDF file in it. We must also create a main.py file in the project directory, where we will store the whole code of our application.

In the main.py file, we will create a main() function which will hold the logic. The file will look like this:

def main():
  print("Hello World!")

if __name__ == "__main__": 
  main()

Great! Let's move on to the implementation of logic now.

2. Load the PDF file into the application

We will use a document loader provided by LangChain called PyPDFLoader.

from langchain_community.document_loaders import PyPDFLoader

pdf_path = "./data/2210.03629.pdf"

def main():
  loader = PyPDFLoader(file_path=pdf_path)
  documents = loader.load()
  print(documents) 

if __name__ == "__main__": 
  main()

First, we should create an instance of the PyPDFLoader object where we pass the path to our file. The next step is to simply call the load function on this object and save the loaded file in the documents variable. It will be an array consisting of Document objects, where each of these objects is a representation of one page of our file.

The print() function should output an array similar to this:

[Document(page_content='[...]', metadata={'source': pdf_path, 'page': 0}), Document(page_content='[...]', metadata={'source': pdf_path, 'page': 1}), ...]

3. Splitting document into smaller chunks

We don’t want to send a whole document as a context with our query to the LLM. Why? The reasons are described in more detail in the article about RAG. To split the document, we will use a class provided by LangChain called CharacterTextSplitter, which we can import from the langchain library:

from langchain.text_splitter import CharacterTextSplitter

Then we can create an instance of it and call the split_documents() function, passing our loaded documents as a parameter.

def main():
  loader = PyPDFLoader(file_path=pdf_path) 
  documents = loader.load() 
  text_splitter = CharacterTextSplitter( chunk_size=1000, chunk_overlap=50, separator="\n" ) 
  docs = text_splitter.split_documents(documents)

Let's briefly describe what's going on here.

First, we are creating a CharacterTextSplitter object, which takes several parameters:

chunk_size - defines the maximum size of a single chunk. With the default length function (len), it is measured in characters, not tokens.
chunk_overlap - defines the size of overlap between chunks. This helps to preserve the meaning of the split text by ensuring that chunks are not split in a way that would distort their meaning.
separator - defines the separator that will be used to delineate our chunks.

In the docs variable, we will get an array of Document objects - the same as from the load() function of the PyPDFLoader class. But this time, this array will contain more elements because we have split them.
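To make the roles of chunk_size and chunk_overlap more concrete, here is a minimal pure-Python sketch of a sliding-window splitter. It is a simplification and not LangChain's actual algorithm: CharacterTextSplitter first splits on the separator and then merges the pieces, while this version ignores separators entirely. The function name split_with_overlap is hypothetical.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. Hypothetical helper
    for illustration only; it does not look at separators.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window moves forward
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    for chunk in split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3):
        print(chunk)
```

Each printed chunk starts with the last three characters of the previous one, which is the overlap that helps a sentence cut at a chunk boundary stay readable in at least one chunk.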

4. Prepare environment variables and the API key

The next step will be converting these chunks into numeric vectors and storing them in a vector database. This process is called embedding, and there is also a blog post about it, so we won't go into detail about it now.

For the embeddings process, we need an external embeddings model. We will use OpenAI embeddings for this purpose. To do that, we have to generate an OpenAI API key.
But before that, we have to create a .env file where we will store this key.

Now, we need to create an account on the platform.openai.com/docs/overview page. Afterward, we should generate an API key on the platform.openai.com/api-keys page by creating a new secret key.

Copy the secret key and paste it into the .env file like this:

OPENAI_API_KEY=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

This key will be deleted before the publication of this post, so you will not be able to use it.

Okay, let’s load environment variables into our project by importing the load_dotenv function:

from dotenv import load_dotenv

And call it at the very beginning of the main function:

def main(): 
    load_dotenv()
    loader = PyPDFLoader(file_path=pdf_path) 
    documents = loader.load() 
    text_splitter = CharacterTextSplitter( chunk_size=1000, chunk_overlap=50, separator="\n" ) 
    docs = text_splitter.split_documents(documents)

5. Implementing the embedding process

First, we have to import the OpenAIEmbeddings class:

from langchain_openai import OpenAIEmbeddings

Then we should create an instance of this class. Let’s assign it to the 'embeddings' variable like this:

embeddings = OpenAIEmbeddings()

6. Setting up local vector database - FAISS

Awesome! We have loaded and prepared our file, and we have also created an object instance for the embeddings model. We are now ready to transform our chunks into numeric vectors and save them in a vector database. We will keep all our data locally using the FAISS vector database. Facebook AI Similarity Search (Faiss) is a tool designed by Facebook AI for effective similarity search and clustering of dense vectors.

First, we need to import the FAISS class:

from langchain_community.vectorstores.faiss import FAISS

And implement the process of converting and saving embeddings:

def main(): 
    load_dotenv() 
    loader = PyPDFLoader(file_path=pdf_path) 
    documents = loader.load() 
    text_splitter = CharacterTextSplitter( chunk_size=1000, chunk_overlap=50, separator="\n" ) 
    docs = text_splitter.split_documents(documents) 
    embeddings = OpenAIEmbeddings() 
    vectorstore = FAISS.from_documents(docs, embeddings)    
    vectorstore.save_local("vector_db")

We have added two lines to our code. The first line takes our split chunks (docs) and the embeddings model to convert the chunks from text to numeric vectors. After that, we are saving the converted data locally in the 'vector_db' directory.
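To illustrate what a vector store does once the chunks are converted, here is a toy pure-Python sketch of a similarity search. The three-dimensional vectors are made up for the example (real embedding vectors have hundreds or thousands of dimensions), the names TOY_STORE and nearest_chunks are hypothetical, and Euclidean distance is used as one common choice of metric. This is not how FAISS is implemented internally.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Made-up "embeddings": chunk text -> vector. Real vectors come from the embeddings model.
TOY_STORE: Dict[str, List[float]] = {
    "ReAct combines reasoning and acting": [0.9, 0.1, 0.0],
    "Chain-of-thought prompting": [0.7, 0.3, 0.1],
    "Appendix: hyperparameters": [0.0, 0.2, 0.9],
}


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_chunks(
    query_vector: Sequence[float], store: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the k chunks whose vectors lie closest to the query vector."""
    scored = [(text, l2_distance(query_vector, vec)) for text, vec in store.items()]
    scored.sort(key=lambda pair: pair[1])  # smaller distance = more similar
    return scored[:k]


if __name__ == "__main__":
    # Pretend this vector is the embedding of the question "What is ReAct?"
    for text, distance in nearest_chunks([0.8, 0.2, 0.0], TOY_STORE):
        print(f"{distance:.3f}  {text}")
```

A vector database performs the same kind of lookup, only with indexes that keep it fast for a very large number of vectors.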

7. Creating a prompt

To prepare a prompt, we will use the LangChain hub. We will pull a prompt called 'langchain-ai/retrieval-qa-chat' from there. This prompt is specially designed for our case, allowing us to ask the model about things from the provided context. Under the hood, the prompt looks like this:

Answer any use questions based solely on the context below:
<context> 
{context}
</context>

You can check it here - https://smith.langchain.com/ in the hub section, but you will have to create an account for that.

Let’s import a hub from the 'langchain' library:

from langchain import hub

Then, simply use the 'pull()' function to retrieve this prompt from the hub and store it in a variable:

retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

8. Setting up a large language model

Great. The next thing we'll need is a large language model - in our case, it will be one of the OpenAI models. Again, we need an OpenAI key, but we have already set it up along with the embeddings, so we don't need to do it again.

Let's go ahead and import the model:

from langchain_openai import ChatOpenAI, OpenAIEmbeddings

And assign it to a variable in our main function:

llm = ChatOpenAI()

9. Retrieve context data from the database

Okay, we have finished preparing the vector database, embeddings, and LLM (large language model). Now, we need to connect everything using chains. We will need two types of chains provided by 'langchain' for that.

The first one is 'create_stuff_documents_chain', which we need to import from the 'langchain' library:

from langchain.chains.combine_documents import create_stuff_documents_chain

Next, pass our large language model (LLM) and prompt to it.

combine_docs_chain = create_stuff_documents_chain(llm, retrieval_qa_chat_prompt)

This function returns an LCEL Runnable object, which requires a context parameter. Running it will look like this:
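Conceptually, "stuffing" means that all the passed documents are joined together and placed into the {context} placeholder of the prompt. The sketch below imitates that step in plain Python, without LangChain. The template text is the one quoted earlier in this post; the function name stuff_documents and the blank-line separator are assumptions made for illustration.

```python
from typing import List

# Prompt text as quoted in this post; {context} is the placeholder to fill.
PROMPT_TEMPLATE = (
    "Answer any use questions based solely on the context below:\n"
    "<context>\n{context}\n</context>"
)


def stuff_documents(
    chunks: List[str], template: str = PROMPT_TEMPLATE, separator: str = "\n\n"
) -> str:
    """Join all chunks and insert them into the template's {context} slot."""
    return template.format(context=separator.join(chunks))


if __name__ == "__main__":
    print(stuff_documents(["First chunk of the PDF.", "Second chunk of the PDF."]))
```

Because every chunk ends up in a single prompt, the prompt grows with the number of documents passed in.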

combine_docs_chain.invoke({"context": docs, "input": "What is REACT in machine learning meaning?"})

10. Retrieve only the relevant data as a context

Generally, it will work, but in this situation, we will pass all chunks - the entire document - as the context. In our case, where the file has 33 pages, this context is too large, and we will probably encounter an error like this:

openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 4097 tokens. However, your messages resulted in 33846 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}

To fix that, we need to pass only the information related to our query as the context. We will achieve this by combining this chain with another one, which will retrieve only the chunks important to us from the database and automatically add them as context to the prompt.

Let's import that chain from the 'langchain' library:

from langchain.chains import create_retrieval_chain

First, we need to prepare our database as a retriever, which will enable semantic search for the chunks that are relevant to our query.

retriever = FAISS.load_local("vector_db", embeddings).as_retriever()

So, we load the directory where we stored the chunks converted to vectors and pass the embeddings object along with it, so that our query can be converted to a vector in the same way. In the end, we return the database as a retriever.

Now, we can combine our chains:

retrieval_chain = create_retrieval_chain(retriever, combine_docs_chain)

Under the hood, it will retrieve relevant chunks from the database and add them to our prompt as context. All we have to do now is invoke this chain with our query as an input parameter:

response = retrieval_chain.invoke({"input": "What is REACT in machine learning meaning?"})

As a response, we will receive a dictionary with three keys:

input - our query;
context - an array of documents (chunks) that we have passed as context to the prompt;
answer - the answer to our query generated by the large language model (LLM).

Let’s print out the "answer" property:

print(response["answer"])

Our printed answer looks as follows:

In the context provided, ReAct refers to an approach or methodology used in machine learning. It stands for "Reasoning + Acting" and aims to integrate decision-making and reasoning capabilities into a large language model. ReAct allows the model to interact with external sources, such as knowledge bases or environments, to gather additional information and improve its task-solving abilities. It has been applied to various language and decision-making tasks, demonstrating effectiveness over state-of-the-art baselines and improved interpretability and trustworthiness.

Looks pretty nice :)

11. You’ve made it! Our LLM app is ready

We have extended the knowledge base of the LLM model with data from a .pdf file. The model is now able to answer our questions based on the context that we have provided in the prompt.

Final code:

from dotenv import load_dotenv
from langchain import hub
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores.faiss import FAISS

pdf_path = "./data/2210.03629.pdf"


def main():
    load_dotenv()

    loader = PyPDFLoader(file_path=pdf_path)
    documents = loader.load()

    text_splitter = CharacterTextSplitter(
        chunk_size=1000, chunk_overlap=50, separator="\n"
    )
    docs = text_splitter.split_documents(documents)

    embeddings = OpenAIEmbeddings()

    vectorstore = FAISS.from_documents(docs, embeddings)
    vectorstore.save_local("vector_db")

    retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

    llm = ChatOpenAI()

    combine_docs_chain = create_stuff_documents_chain(llm, retrieval_qa_chat_prompt)

    retriever = FAISS.load_local("vector_db", embeddings).as_retriever()
    retrieval_chain = create_retrieval_chain(retriever, combine_docs_chain)

    response = retrieval_chain.invoke(
        {"input": "What is REACT in machine learning meaning?"}
    )

    print(response["answer"])


if __name__ == "__main__":
    main()

By Paweł Polak, Fullstack Developer @ Bright Inventions

Google Sign In with Cognito and Nest.js

bright inventions — Thu, 01 Feb 2024 11:12:29 +0000

If you want to implement Google sign-in, also called Google federation, and combine it with AWS Cognito, this blog
post is for you.

We'll use aws-cdk combined with Nest.js to achieve that.

Setup

At Bright Inventions, we often keep infrastructure code next to application code.
Thus let's start with creating:

Nest.js backend project

nest new backend

aws-cdk infrastructure project

mkdir infrastructure
(cd infrastructure && npx cdk@2 init --language=typescript)

Cognito UserPool

Cognito UserPool represents our users' directory. You can think of it as the repository of user accounts.

export class CognitoGoogleAuthNestJs extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const userPool = new UserPool(this, 'users', {
      selfSignUpEnabled: true,
      signInAliases: { email: true }
    });

    const userPoolDomain = userPool.addDomain('backend', {
      cognitoDomain: {
        domainPrefix: "tutorial-bright"
      }
    });

    new CfnOutput(this, 'user-pool-domain-uri', {
      value: userPoolDomain.baseUrl()
    })
  }
}

Setup Google API credentials

We need to enable Cognito to talk with Google APIs.
In our Google Cloud Platform project, let's enable OAuth consent screen:

Next, configure the application name and the domain of the OAuth handling endpoints. AWS Cognito provides us with
those endpoints:

Remember that until your application is published, you can only use it with test users.
You'll be able to add test users while configuring OAuth consent screen.

Next, create an OAuth client ID using the web application type:

Finally, you need to configure Authorized redirect URIs to point to
AWS Cognito IDP response endpoint
that will be in the form of: https://${domainPrefix}.auth.${region}.amazoncognito.com/oauth2/idpresponse

Download the OAuth client credentials JSON file. We should not store the client credentials in our source code.
Let's use AWS Secrets Manager for that:

 aws secretsmanager create-secret \
  --name cognito-google-oauth-credentials \
  --secret-string "$(jq '.web' < ~/Downloads/client_secret.apps.googleusercontent.com.json)"

The jq '.web' ... command extracts the nested web attribute content. This is required as you can't reference nested values in
AWS Secrets Manager.

Configure OAuth clients

We need to instruct Cognito to be able to communicate with Google as the Identity Provider:

const clientCredentials = Secret.fromSecretNameV2(this, 'google-client-credentials', 'cognito-google-oauth-credentials')

const identityProviderGoogle = new UserPoolIdentityProviderGoogle(this, "Google", {
  userPool,
  clientId: clientCredentials.secretValueFromJson("client_id").unsafeUnwrap(),
  clientSecret: clientCredentials.secretValueFromJson("client_secret").unsafeUnwrap(),

  // The email scope is required, otherwise Google will not return the email address
  scopes: ["email"],
  attributeMapping: {
    email: ProviderAttribute.GOOGLE_EMAIL,
  },
});
// Keep a reference to the provider: the user pool client defined later depends on it
userPool.registerIdentityProvider(identityProviderGoogle);

Finally, we need to tell Cognito where to send the end user after authentication. The callback URL will be the URL
of our Nest.js backend.

const hostedZone = HostedZone.fromLookup(this, 'tutorial.bright.dev', {
  domainName: 'tutorial.bright.dev'
});

const baseNestJsUrl = new URL(`https://nestj-google-cognito.${hostedZone.zoneName}`);

const callbackUrl = new URL("/auth/callback", baseNestJsUrl)

const userPoolClient = userPool.addClient('nest.js', {
  generateSecret: true,
  supportedIdentityProviders: [UserPoolClientIdentityProvider.GOOGLE],
  oAuth: {
    callbackUrls: [callbackUrl.toString()],
  },
});
// workaround for https://github.com/aws/aws-cdk/issues/15692
userPoolClient.node.addDependency(identityProviderGoogle)

Handle Cognito sign-in callback in Nest.js

After Cognito federates with Google OpenId Connect Provider, it passes the control to our application.
In essence, it is the Authorization code grant with PKCE.
Our application will receive a code that it has to exchange for Access Token, Id Token and Refresh Token using Cognito
APIs.

If your backend should automatically redirect unauthenticated API clients to OAuth authorize endpoint, then
use passport-oauth2.
To make our example more transparent, we invoke the token endpoint manually:

@Controller()
export class AuthController {
  constructor(private readonly configService: ConfigService<OAuthClientEnvConfiguration>) {
  }

  @Get("/auth/callback")
  async signIn(@Query('code') authorizationCode: string) {
    const clientId = this.configService.getOrThrow('OAUTH_CLIENT_ID')
    const clientSecret = this.configService.getOrThrow('OAUTH_CLIENT_SECRET')
    const authorizationEncoded = Buffer.from(`${clientId}:${clientSecret}`).toString("base64");

    const authParams = new URLSearchParams(Object.entries({
      client_id: clientId,
      code: authorizationCode,
      grant_type: "authorization_code",
      redirect_uri: this.configService.getOrThrow('OAUTH_CALLBACK_URL'),
    }));

    const tokenUrl = `${this.configService.getOrThrow('OAUTH_AUTHORIZATION_SERVER_URL')}/oauth2/token?` + authParams;

    const tokenData = await (await fetch(tokenUrl, {
      method: 'POST',
      headers: {
        Authorization: `Basic ${authorizationEncoded}`,
        "Content-Type": "application/x-www-form-urlencoded",
      },
    })).json();
    // tokenData has id_token, access_token and refresh_token
  }
}

At the end of the sign-in flow, our application has the IdToken, AccessToken and RefreshToken.
What we do at this stage depends on our needs. For example, we can:

start a cookie-based session
return AccessToken to frontend
init user account configuration that does not fit into AWS Cognito

Combine Cognito with Passport Nest.js

Passport is often used in Node.js backends to deal with authentication. The passport-oauth2 extension provides an easy way to integrate with standard OAuth flows. Here's how to use it in Nest.js:

@Controller()
export class AuthController {
  constructor(private readonly configService: ConfigService<OAuthClientEnvConfiguration>) {
  }

  @UseGuards(AuthGuard('oauth'))
  @Get("/auth/callback")
  async signInPassport(@Req() req: Express.AuthenticatedRequest) {
    // req.user has id_token, access_token and refresh_token 
  }

}

// register in AppModule
@Injectable()
export class NestPassportOAuthStrategy extends PassportStrategy(OAuth2Strategy) {
  constructor(configService: ConfigService<OAuthClientEnvConfiguration>) {
    super({
      clientID: configService.getOrThrow('OAUTH_CLIENT_ID'),
      clientSecret: configService.getOrThrow('OAUTH_CLIENT_SECRET'),
      authorizationURL: `${configService.getOrThrow('OAUTH_AUTHORIZATION_SERVER_URL')}/oauth2/authorize`,
      tokenURL: `${configService.getOrThrow('OAUTH_AUTHORIZATION_SERVER_URL')}/oauth2/token`,
      callbackURL: configService.getOrThrow('OAUTH_CALLBACK_URL')
    } as OAuth2Strategy.StrategyOptions, (accessToken, refreshToken, results, profile, verified) => {
      console.log('verified', { accessToken, refreshToken, results, profile, verified })
    });
  }
}

Provide users with login URL

With AWS Cognito we can use hosted pages. However, we often need to have full control over the UI of our
application.

In such a case, we can craft a special URL that will trigger the sign in flow. Here's how to create the URL that will
trigger login with Google flow:

const baseAuthUrl = this.configService.getOrThrow('OAUTH_AUTHORIZATION_SERVER_URL')
const clientId = this.configService.getOrThrow('OAUTH_CLIENT_ID')
const loginViaGoogleUrl = `${baseAuthUrl}/oauth2/authorize?${new URLSearchParams(Object.entries({
  client_id: clientId,
  identity_provider: 'Google',
  response_type: 'code',
  redirect_uri: this.configService.getOrThrow('OAUTH_CALLBACK_URL')
}))}`

The URL will look as follows:

https://{cognitoDomainPrefix}.auth.{awsRegion}.amazoncognito.com/oauth2/authorize?client_id={cognitoClientId}&identity_provider=Google&response_type=code&redirect_uri={yourApplicationCallbackUrl}

Please bear in mind that the client_id parameter is the one retrieved from userPoolClient and not from the Google project
API credentials.

ECS Task Definition

I'll spare you the details on how to run the Nest.js application in ECS. That's a topic for a separate blog post.
However, there are a couple of important configuration options that you need to provide for the above snippets to work:

 const backend = task.addContainer('backend', {
  image: ContainerImage.fromDockerImageAsset(new DockerImageAsset(this, 'backend-image', {
    directory: path.join(process.cwd(), '..', 'backend')
  })),
  environment: {
    PORT: '3000',
    OAUTH_CLIENT_ID: userPoolClient.userPoolClientId,
    OAUTH_CLIENT_SECRET: userPoolClient.userPoolClientSecret.unsafeUnwrap(),
    OAUTH_AUTHORIZATION_SERVER_URL: userPoolDomain.baseUrl(),
    OAUTH_CALLBACK_URL: callbackUrl.toString(),
  },
  portMappings: [{ containerPort: 3000 }],
  logging: LogDriver.awsLogs({
    streamPrefix: "backend",
    logGroup: logGroup
  })
});

Summary

The full code of the above setup is available in GitHub.
In our example, AWS Cognito performs the OpenID Connect exchange with Google. Our Nest.js application code only receives
information from Cognito. We can easily integrate new identity providers, e.g. Facebook, and our backend application code would still work.

By Piotr Mionskowski, Head of Technology & Partner @ Bright Inventions

The Best Authentication Methods for Your App (Decision Tree)

bright inventions — Tue, 23 Jan 2024 09:26:32 +0000

Download our free ebook with an authentication method decision tree. We've taken into account user experience, regulatory requirements, privacy concerns, and security when selecting an authentication method for an application.

Download the free ebook with a decision tree

The decision tree included in our free ebook divides decision factors into 6 questions:

Here are examples of the questions:

Are there specific regulatory or compliance requirements?
What is the sensitivity of the application or data?
Does the application support multiple platforms and devices?

Dive into the first question with this sample:

According to your answers, we prepared the proper recommendations. Download the free ebook to have the whole picture and streamline your sign-in process.

Data Deduplication in Python with RecordLinkage

bright inventions — Tue, 09 Jan 2024 14:32:39 +0000

Supervised Duplicate Detection with RecordLinkage and Pandas: A Febrl Dataset Tutorial

Introduction

Duplicate detection is a critical process in data preprocessing, especially when dealing with large datasets. Duplicate records can skew analyses and impact the accuracy of machine learning models. In this tutorial, we explore data deduplication using Python's RecordLinkage package, paired with Pandas for data manipulation. This approach is particularly valuable in contexts like customer database management, where duplicate entries can result in inefficient marketing and customer service strategies.

Setting Up the Environment with Miniconda

Ensure your environment is correctly set up using Miniconda:

Install Miniconda: Download and install Miniconda from here.
Create and Activate a New Conda Environment:

   conda create --name deduplication python=3.8
   conda activate deduplication

Install Required Packages:

   conda install -c conda-forge recordlinkage pandas

Step 1: Loading the Febrl Dataset

Utilize the RecordLinkage package to load the Febrl dataset, a synthetic dataset typical of what you might find in a customer database. This dataset contains duplicates and is structured with comprehensive personal details:

import pandas as pd
import recordlinkage

df_a, df_b = recordlinkage.datasets.load_febrl4()

Exploring the Dataset

Examine the dataset to understand its structure, which includes names, addresses, and other personal information:

print(df_a.head())

Example Output:

| rec_id       | given_name | surname  | street_number | address_1        | address_2       | suburb         | postcode | state | date_of_birth | soc_sec_id |
|--------------|------------|----------|---------------|------------------|-----------------|----------------|----------|-------|---------------|------------|
| rec-1070-org | michaela   | neumann  | 8             | stanley street   | miami           | winston hills  | 4223     | nsw   | 19151111      | 5304218    |
| rec-1016-org | courtney   | painter  | 12            | pinkerton circuit| bega flats      | richlands      | 4560     | vic   | 19161214      | 4066625    |
| rec-4405-org | charles    | green    | 38            | salkauskas crescent | kela          | dapto          | 4566     | nsw   | 19480930      | 4365168    |
| rec-1288-org | vanessa    | parr     | 905           | macquoid place   | broadbridge manor | south grafton | 2135     | sa    | 19951119      | 9239102    |
| rec-3585-org | mikayla    | malloney | 37            | randwick road    | avalind         | hoppers crossing| 4552     | vic   | 19860208      | 7207688    |

This Markdown table showcases the first few entries from the Febrl dataset df_a. It includes various fields like given_name, surname, address_1, and date_of_birth, providing a detailed view of the data structure used in the deduplication process.

Step 2: Data Preprocessing

Data preprocessing is a critical step in ensuring the quality of your deduplication efforts. The objective here is to clean and standardize your data, making it suitable for comparison. This involves addressing missing values, normalizing data formats, and potentially converting data types for consistency.

In the code snippet, we replace all missing values with empty strings in both datasets (df_a and df_b). This uniform approach to missing data ensures that comparisons are not skewed by null values.

df_a.fillna('', inplace=True)
df_b.fillna('', inplace=True)

Step 3: Indexing

Indexing is the process of creating candidate links between records, which might refer to the same entity. This step is crucial as it sets the stage for how records will be compared.

While indexer.full() creates a comprehensive index by comparing every record in one dataset (df_a) with every record in another (df_b), this method can be computationally expensive, especially for large datasets. An efficient alternative is to use a blocking method, such as indexer.block("given_name"). This approach significantly reduces the number of comparisons, thus speeding up the process.

Understanding Block Indexing:

Block indexing works by grouping records based on a specific attribute and only comparing records within the same group. In our example, we use:

indexer = recordlinkage.Index()
indexer.block("given_name")
candidate_links = indexer.index(df_a, df_b)

Here’s how it works:

Blocking by Given Name: By invoking indexer.block("given_name"), the RecordLinkage indexer will group records from both df_a and df_b based on the given_name attribute. Essentially, it creates blocks of records where the given names are the same.
Reduced Comparisons: Comparisons are only made between records within these blocks. For instance, a record with the given name 'John' in df_a will only be compared to records with the given name 'John' in df_b.
Efficiency: This focused approach significantly reduces the total number of comparisons needed. It's particularly effective in datasets where a high proportion of records can be excluded from comparison based on a single attribute.

When to Use Block Indexing:

Large Datasets: Ideal for large datasets where full indexing might be impractical due to computational constraints.
High-Quality Key Attribute: Most effective when there’s a reliable key attribute (like 'given_name') that can accurately group potential matches.

Trade-Offs:

Risk of Missing Matches: If the key attribute used for blocking has inconsistencies (like typos in names), potential matches might be missed.
Choosing the Right Attribute: The effectiveness of blocking depends on choosing an attribute that can effectively discriminate between matches and non-matches.

In summary, using indexer.block("given_name") offers an efficient way to perform indexing in duplicate detection tasks, especially when dealing with large datasets or seeking to optimize computational resources. It's a strategic choice in scenarios where the selected blocking attribute is reliable and consistent across the dataset.
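The effect of blocking on the number of candidate pairs can be sketched in plain Python, without the RecordLinkage package. The record ids and names below are made up for the example, and the helper names full_pairs and block_pairs are hypothetical. Note how the misspelled 'jon' is never paired with 'john', which is exactly the trade-off described above.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def full_pairs(records_a: Dict[str, str], records_b: Dict[str, str]) -> List[Pair]:
    """Full index: pair every record id in A with every record id in B."""
    return list(product(records_a, records_b))


def block_pairs(records_a: Dict[str, str], records_b: Dict[str, str]) -> List[Pair]:
    """Block index: pair only records that share the same blocking key.

    Both arguments map a record id to its blocking key value (here: given name).
    """
    blocks = defaultdict(list)
    for rec_id, key in records_b.items():
        blocks[key].append(rec_id)
    return [
        (id_a, id_b)
        for id_a, key in records_a.items()
        for id_b in blocks.get(key, [])
    ]


if __name__ == "__main__":
    # Made-up toy data: record id -> given_name
    names_a = {"a1": "john", "a2": "mary", "a3": "anna"}
    names_b = {"b1": "john", "b2": "jon", "b3": "mary", "b4": "john"}

    print(len(full_pairs(names_a, names_b)), "pairs with full indexing")
    print(len(block_pairs(names_a, names_b)), "pairs with blocking:", block_pairs(names_a, names_b))
```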

Step 4: Comparing Records

Comparing records is the heart of the deduplication process. This step involves applying various algorithms to measure similarities between record pairs.

These comparisons yield a set of features indicating the level of similarity between each pair of records.

compare_cl = recordlinkage.Compare()
compare_cl.exact('given_name', 'given_name', label='given_name')
compare_cl.string('surname', 'surname', method='jarowinkler', threshold=0.85, label='surname')
compare_cl.exact("date_of_birth", "date_of_birth", label="date_of_birth")
compare_cl.exact("suburb", "suburb", label="suburb")
compare_cl.exact("state", "state", label="state")
compare_cl.string("address_1", "address_1", threshold=0.85, label="address_1")

features = compare_cl.compute(candidate_links, df_a, df_b)

Example Output:

| rec_id_1     | rec_id_2     | given_name | surname |
|--------------|--------------|------------|---------|
| rec-1070-org | rec-282-org  | 1          | 0       |
| rec-1070-org | rec-1685-org | 1          | 0       |
| rec-1070-org | rec-1056-org | 1          | 0       |
| rec-1070-org | rec-1216-org | 1          | 0       |
| rec-1070-org | rec-1508-org | 1          | 0       |
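To show where the 0 and 1 values in such a feature table come from, here is a small pure-Python sketch that compares a single pair of records. It uses difflib.SequenceMatcher from the standard library as a stand-in for the Jaro-Winkler method, so the similarity scores will differ from those computed by RecordLinkage. The records and the helper names exact, fuzzy and compare_records are made up for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict


def exact(a: str, b: str) -> int:
    """1 if both values are identical, otherwise 0."""
    return int(a == b)


def fuzzy(a: str, b: str, threshold: float = 0.85) -> int:
    """1 if the similarity ratio reaches the threshold, otherwise 0.

    SequenceMatcher is only a stand-in here; it is not Jaro-Winkler.
    """
    return int(SequenceMatcher(None, a, b).ratio() >= threshold)


def compare_records(rec_a: Dict[str, str], rec_b: Dict[str, str]) -> Dict[str, int]:
    """Build one row of binary comparison features for a pair of records."""
    return {
        "given_name": exact(rec_a["given_name"], rec_b["given_name"]),
        "surname": fuzzy(rec_a["surname"], rec_b["surname"]),
        "state": exact(rec_a["state"], rec_b["state"]),
    }


if __name__ == "__main__":
    # Made-up pair: the surname differs by one letter, everything else is equal
    original = {"given_name": "mikayla", "surname": "malloney", "state": "vic"}
    duplicate = {"given_name": "mikayla", "surname": "maloney", "state": "vic"}
    print(compare_records(original, duplicate))
```

The fuzzy comparison tolerates the small typo in the surname, while an exact comparison on the same field would have scored 0.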

Step 5: Classifying Matches

The classification step involves analyzing the comparison features to distinguish between matches and non-matches. A common approach is to set a threshold for the sum of comparison scores. Pairs scoring above this threshold are considered matches.

matches = features[features.sum(axis=1) > 3]
print(matches)

Example Output:

| rec_id_1     | rec_id_2     | given_name | surname |
|--------------|--------------|------------|---------|
| rec-1070-org | rec-5114-org | 1          | 1       |
| rec-1070-org | rec-3403-org | 1          | 1       |
| rec-1016-org | rec-1936-org | 1          | 1       |
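The threshold rule can also be written without Pandas. The sketch below assumes each candidate pair has a row of binary comparison features and keeps the pairs whose feature sum is greater than the threshold. The pair ids, the feature values and the function name classify_matches are made up for the example.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def classify_matches(features: Dict[Pair, Dict[str, int]], threshold: int = 3) -> List[Pair]:
    """Return the pairs whose summed comparison scores exceed the threshold."""
    return [pair for pair, scores in features.items() if sum(scores.values()) > threshold]


if __name__ == "__main__":
    # Made-up feature rows: 1 = the field agrees, 0 = it does not
    toy_features = {
        ("a1", "b1"): {"given_name": 1, "surname": 1, "date_of_birth": 1, "suburb": 1, "state": 1, "address_1": 0},
        ("a1", "b4"): {"given_name": 1, "surname": 0, "date_of_birth": 0, "suburb": 0, "state": 1, "address_1": 0},
        ("a2", "b3"): {"given_name": 1, "surname": 1, "date_of_birth": 1, "suburb": 0, "state": 0, "address_1": 0},
    }
    print(classify_matches(toy_features))
```

Only the first pair is kept: it agrees on five fields, while the third pair agrees on exactly three and therefore does not exceed the threshold.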

Complete code

import pandas as pd
import recordlinkage
from recordlinkage.datasets import load_febrl4

# Load the Febrl dataset
df_a, df_b = load_febrl4()

# Data Preprocessing
df_a.fillna('', inplace=True)
df_b.fillna('', inplace=True)

# Indexing - Create candidate links between records
indexer = recordlinkage.Index()
indexer.block("given_name")
candidate_links = indexer.index(df_a, df_b)

# Comparing Records
compare_cl = recordlinkage.Compare()
compare_cl.exact('given_name', 'given_name', label='given_name')
compare_cl.string('surname', 'surname', method='jarowinkler', threshold=0.85, label='surname')
compare_cl.exact("date_of_birth", "date_of_birth", label="date_of_birth")
compare_cl.exact("suburb", "suburb", label="suburb")
compare_cl.exact("state", "state", label="state")
compare_cl.string("address_1", "address_1", threshold=0.85, label="address_1")

# Compute the comparison vectors for all candidate pairs
features = compare_cl.compute(candidate_links, df_a, df_b)

# Classifying Matches - keep pairs whose summed scores exceed the threshold
matches = features[features.sum(axis=1) > 3]

# Display the first few entries of the dataset and matched records
print("First few entries of df_a:")
print(df_a.head())
print("\nMatched Records:")
print(matches.head())
print("\nNumber of matched records:")
print(len(matches))

Summary

Supervised duplicate detection using the RecordLinkage package in Python is a powerful method for identifying duplicate records in datasets. This approach is particularly valuable in scenarios where maintaining a single, accurate record for each entity is crucial, such as in customer databases, medical records, and other similar applications. By following the steps of preprocessing, indexing, comparing, classifying, and analyzing, we can effectively identify and handle duplicate entries, leading to cleaner, more reliable datasets for further analysis or model training.

By Patryk Szlagowski, Senior Backend Developer @ Bright Inventions

Use WorkManager Mindfully and Don’t Make These Mistakes

bright inventions — Thu, 04 Jan 2024 12:59:17 +0000

WorkManager is a powerful tool, but with great power comes great responsibility. Is it always completely safe to use? In this article, we will discuss a few potentially dangerous situations related to WorkManager. We will focus on the inconsistency of Workers, which can be edited or removed over time.

Custom WorkerFactory and @AssistedInject

Let's face it, nowadays injecting dependencies into a Worker class is common and nearly inevitable. Our background work often requires sending a request through, for example, a Retrofit service, saving something in the database through a DAO, or simply separating logic from the Worker. It would be great if we could inject these dependencies directly into the Worker. Without dependency injection, Workers would not be so powerful. By default, WorkManager creates Workers on its own. It expects the Worker to have a constructor with two parameters (Context and WorkerParameters). So how do we provide our dependencies there?

How to inject dependencies using Dagger 2

One of the most common practices is to create a custom WorkerFactory and use @AssistedInject. Once our assisted factories are in place, we can create Workers ourselves inside the WorkerFactory. Here is a sample:

class CustomWorkerFactory @Inject constructor(
    private val workerFactories: Map<Class<out ListenableWorker>, @JvmSuppressWildcards MyWorkerAssistedFactory>
) : WorkerFactory() {

    override fun createWorker(
        appContext: Context,
        workerClassName: String,
        workerParameters: WorkerParameters
    ): ListenableWorker? {
        return workerFactories.entries
            .find {
                Class.forName(workerClassName).isAssignableFrom(it.key)
            }
            ?.value
            ?.create(appContext, workerParameters)
    }
}

It assumes that we are able to inject our custom assisted factories, which let us create Workers using only Context and WorkerParameters. To learn how to provide such factories, see the Assisted Injection documentation.

This code works fine, but there is one issue with it. If you rename, move or delete the Worker’s class while a work request scheduled for the old class is still pending, Class.forName(workerClassName) is going to throw a ClassNotFoundException. That’s because WorkManager stores class names in its local database and doesn’t track class modifications. Once WorkManager saves a particular class name, it stays in the database until the associated request is completed.

Here is a sample scenario showing how this situation can happen:

Create SyncDataWorker class and install the app
Turn off Wi-Fi and cellular data
Schedule a work request with the constraint of having a network available
Rename the SyncDataWorker class to SyncWorker and install the app
Turn on Wi-Fi or cellular data and wait for the Worker to start work
If work is not scheduled, please use the following ADB command to debug WorkManager and see if the WorkRequest is enqueued:

adb shell am broadcast -a "androidx.work.diagnostics.REQUEST_DIAGNOSTICS" -p "your.package.name"

The app should crash soon with the ClassNotFoundException

In order to fix this issue we can simply wrap the Class.forName(workerClassName) invocation with a try-catch statement:

class SafeWorkerFactory @Inject constructor(
    private val workerFactories: Map<Class<out ListenableWorker>, @JvmSuppressWildcards MyWorkerAssistedFactory>
) : WorkerFactory() {

    override fun createWorker(
        appContext: Context,
        workerClassName: String,
        workerParameters: WorkerParameters
    ): ListenableWorker? = try {
        workerFactories.entries
            .find {
                Class.forName(workerClassName).isAssignableFrom(it.key)
            }
            ?.value
            ?.create(appContext, workerParameters)
    } catch (e: ClassNotFoundException) {
        println("Class not found thrown!!!")
        e.printStackTrace()
        null
    }
}

Now, if there is a stale class name in WorkManager’s storage, we won’t encounter a crash caused by a ClassNotFoundException. Of course, the request won’t execute, but we’ll talk about that later in this post.

How to inject dependencies using Hilt

Hilt made it all easier for you. In order to use it for the WorkManager configurations you need to add the following dependency to your project:

implementation("androidx.hilt:hilt-work:<newest_version>")

It provides a ready-made, safe HiltWorkerFactory. This factory also uses Class.forName to get the Worker class by its name, but it’s wrapped with a try-catch statement already. The factory can be injected out of the box once you add the Hilt dependency - you don’t need to provide it on your own.

It works together with the @HiltWorker annotation, which you should add to your Worker class. It looks more or less like this:

@HiltAndroidApp
class SafeWorkManagerApp : Application(), Configuration.Provider {

    @Inject
    lateinit var hiltWorkerFactory: HiltWorkerFactory

    override fun getWorkManagerConfiguration(): Configuration {
        return Configuration.Builder()
            .setWorkerFactory(hiltWorkerFactory)
            .build()
    }
}

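@HiltWorker
class SyncDataWorker @AssistedInject constructor(
    // Context and WorkerParameters are supplied by WorkManager at runtime,
    // so they are the parameters marked with @Assisted
    @Assisted context: Context,
    @Assisted workerParameters: WorkerParameters,
    // Regular dependencies are provided by Hilt and need no annotation
    someOtherDependencyProvidedByHilt: SomeOtherDependencyProvidedByHilt
) : Worker(context, workerParameters) {
    override fun doWork(): Result {
        // define work and return Result
        return Result.success()
    }
}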

Having this code, you’re ready to go. You can use WorkManager and enqueue work requests with assisted injection.

Any other dangers?

Well, we are covered in terms of catching ClassNotFoundException, but is it completely safe? Well... It depends!

Imagine an OfflinePaymentWorker that is supposed to synchronise offline payments with your backend. You enqueued a work request for this Worker and it hasn’t completed yet. If you then, for example, rename the Worker from OfflinePaymentWorker to SyncOfflinePaymentsWorker and install the app, the outstanding work requests will never run, because our safe factories return null instead of an actual Worker. You could lose critical data about the payments.

That’s why you always have to be mindful of the Worker changes you introduce. Just keep in mind that WorkManager can still hold incomplete work requests in its storage, and modifying your Worker class might make them impossible to execute.

What to do to prevent losing your data?

Well, there are many approaches you can take. The most obvious one is to keep the old Worker and adjust only its logic - don’t delete it, move it or change its name. The downside is that once you introduce a Worker that syncs critical data, it’s probably going to stay with you forever, because you can never be sure that every task in the field has been executed.

There are other approaches as well. Here is the last one that I am going to present. Instead of relying on WorkManager to store your data in a work request, you could store the critical data in your own storage, such as SharedPreferences or an SQLite database. In other words, instead of doing this:

fun enqueueWork() {
    val request = OneTimeWorkRequestBuilder<SyncDataWorker>()
        .setInputData(
            workDataOf(
                "data1" to "value1",
                "data2" to "value2",
            )
        ).build()

    WorkManager.getInstance(this)
        .beginUniqueWork(..., request)
        .enqueue()
}


class SyncDataWorker(
    appContext: Context,
    workerParams: WorkerParameters
) : Worker(appContext, workerParams) {

    override fun doWork(): Result {
        val data1 = inputData.getString("data1") ?: return Result.failure()
        val data2 = inputData.getString("data2") ?: return Result.failure()


        // some logic to synchronise data
        return Result.success()
    }
}

you could store the data1 and data2 values in an SQLite database as a single row representing a piece of work that has to be executed, and then create a Worker that synchronises all of the remaining data from the database:

fun enqueueWork() {
    val syncDataRequest = OneTimeWorkRequestBuilder<SyncDataWorker>()
        .build()

    WorkManager.getInstance(this)
        .beginUniqueWork(..., syncDataRequest)
        .enqueue()
}

@HiltWorker
class SyncDataWorker @AssistedInject constructor(
    // Supplied by WorkManager at runtime
    @Assisted appContext: Context,
    @Assisted workerParams: WorkerParameters,
    // Provided by Hilt; kept as a property so doWork() can use it
    private val syncDataDao: SyncDataDao
) : Worker(appContext, workerParams) {

    override fun doWork(): Result {
        val allDataToSynchronise = syncDataDao.getAll()

        allDataToSynchronise.forEach {
            // some logic to synchronise data
        }

        return Result.success()
    }
}

This way you won’t lose critical data if you modify or remove your Worker class. You would still have it in your database and you would be able to synchronise it in one way or another.
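
The idea itself is not specific to Android. As a language-agnostic illustration, here is a minimal sketch in Python using the standard sqlite3 module instead of Kotlin and an Android DAO. The table name pending_sync, the functions open_store, enqueue and sync_pending, and the upload callback are all hypothetical. Each pending item lives in its own row and is deleted only after it has been uploaded, so it survives no matter which worker class eventually processes it.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the app-owned store that holds work still waiting to be synchronised."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pending_sync ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "data1 TEXT NOT NULL, data2 TEXT NOT NULL)"
    )
    return conn


def enqueue(conn, data1, data2):
    """Persist the payload first; the background job itself carries no data."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO pending_sync (data1, data2) VALUES (?, ?)", (data1, data2)
        )


def sync_pending(conn, upload):
    """Upload every stored row and delete it only after upload() returns True."""
    synced = 0
    rows = conn.execute(
        "SELECT id, data1, data2 FROM pending_sync ORDER BY id"
    ).fetchall()
    for row_id, data1, data2 in rows:
        if not upload(data1, data2):
            break  # keep this row and the remaining ones for the next attempt
        with conn:
            conn.execute("DELETE FROM pending_sync WHERE id = ?", (row_id,))
        synced += 1
    return synced


conn = open_store()
enqueue(conn, "value1", "value2")
enqueue(conn, "value3", "value4")

sent = []  # stands in for the backend in this sketch
synced_count = sync_pending(conn, lambda a, b: sent.append((a, b)) is None)
print(synced_count, sent)
```

Because a row is removed only after a successful upload, a failed or skipped run leaves the data in place for whatever code picks it up next.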

Summary

We have to be mindful of our Workers and make sure that modifying or removing them is not going to cause some issues for our business.

What else do you do to keep WorkManager work safe? Share in the comments!

By Szymon Miloch, Android & Web Developer @ Bright Inventions

The Role of Performance Budgets in Modern Frontend Development

bright inventions — Fri, 15 Dec 2023 06:50:45 +0000

Performance is a vital component of a good user experience, and we have learned it affects business metrics. In other words, an application that doesn't perform well will cost you greatly. How can one ensure that performance will remain at acceptable levels? To achieve a goal, you must first define it. That's when performance budgets come into play.

Budgets to the rescue

A performance budget is a limit that all developers agree not to exceed in any circumstances. Basically, you can treat it like a monthly financial budget. If you want to make something stand out, then you will probably have to let something else go. It's fluid - depending on the business requirements for this month, you can decide to adjust it. For example, you can reduce the number of images in exchange for additional JavaScript being shipped. Budgeting is not only about the size of images, scripts, and other resources. This principle may also be applied to metrics like FCP (First Contentful Paint), TTI (Time To Interactive), or scores reported by tools like Lighthouse.

Having budgets defined for your application may spark a discussion about performance and get everyone on your team on the same page. They make designers limit high-resolution images and fonts until they are absolutely necessary. On the other hand, software engineers may easily evaluate the performance of different libraries and frameworks and compare them based on their influence on budgets.

Choosing metrics

Quantity-based metrics

Rules based on this type of metric are the easiest to establish and enforce. They are based on raw values like the weight of JavaScript files or the number of HTTP requests, fonts, or images. However, they may not reflect the user experience correctly.

Milestone timings

In order to keep the user experience at an acceptable level, it may be better to focus on time-based metrics like Time to Interactive or First Contentful Paint. You can also define your metrics depending on what is the most important action from the perspective of your users. It's also possible to combine multiple milestones together to even better describe the path of a user in the application.

Rule-based metrics

These metrics use performance scores calculated by tools like Lighthouse, which you can use as guidelines. What's even better, such tools provide hints on how to make your application perform better.

Defining a budget

There is no way to provide a universal set of rules that will make sense for every application. However, there are some good defaults to start with:

under 5 seconds Time to Interactive,
under 170 KB of critical-path resources.

The best thing you can do is to analyze your competition and see how they perform. Then in the worst-case scenario, you will match them and provide a similar experience to your users. On the other hand, you may enforce lower limits and outperform them - it's up to you and your team.

It's worth mentioning that budgets should be unambiguous. There is no use in a rule saying: "our home page must load and get interactive in less than 5 seconds on a slow device." What's a slow device? A three-year-old high-end device, or maybe a $100 smartphone released 5 years ago? I recommend doing some brief research and naming an exact model instead.

There may be a different set of budgets enforced for different kinds of pages in your application. It's usually crucial for your home page to load as quickly as possible, but users may wait a little more for other screens.
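
One way to keep budgets unambiguous is to write them down as data. The sketch below is illustrative only: the page names, the device label, the limits for the settings page, and the find_violations helper are assumptions, while the home page limits reuse the defaults mentioned above (under 5 seconds Time to Interactive, under 170 KB of critical-path resources).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    device: str               # exact reference device, not "a slow phone"
    tti_limit_seconds: float  # Time to Interactive must stay under this value
    critical_limit_kb: float  # critical-path resources must stay under this value


# Hypothetical budgets; only the home page numbers come from the defaults above.
BUDGETS = {
    "home": Budget("Example Phone X (hypothetical)", 5.0, 170.0),
    "settings": Budget("Example Phone X (hypothetical)", 8.0, 250.0),
}


def find_violations(page, tti_seconds, critical_kb):
    """Return a readable message for every limit the measurement fails to stay under."""
    budget = BUDGETS[page]
    violations = []
    if tti_seconds >= budget.tti_limit_seconds:
        violations.append(
            f"{page}: TTI {tti_seconds}s is not under "
            f"{budget.tti_limit_seconds}s on {budget.device}"
        )
    if critical_kb >= budget.critical_limit_kb:
        violations.append(
            f"{page}: {critical_kb} KB of critical-path resources is not under "
            f"{budget.critical_limit_kb} KB"
        )
    return violations


# The same measurement breaks the home page budget but fits the settings page one.
print(find_violations("home", tti_seconds=5.6, critical_kb=150.0))
print(find_violations("settings", tti_seconds=5.6, critical_kb=150.0))
```

Keeping the device name next to the numbers means nobody has to guess what "slow" meant when the budget was agreed.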

It's not the easiest task to define a reasonable budget. Check out performancebudget.io, which will serve as a visual aid with presets for different network speeds.

Making sure your team stays within budgets

There are many tools to choose from when it comes to enforcing budgeting in your application. It all depends on the time you want to spend on research and configuration.

The most basic one is bundlesize, which will check if your bundle stays within reasonable boundaries. This way, engineers in your team won't be able to merge any pull requests that contain additional imports of expensive libraries.
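
The idea behind such a size check can be sketched in a few lines of Python. This is not how bundlesize is implemented; the function names are made up, and the path and limit in the usage note are placeholders. The sketch compresses each file with gzip, since the compressed size is closer to what travels over the network, and reports every file that exceeds the limit.

```python
import gzip
from pathlib import Path


def gzipped_size(path):
    """Return the size in bytes of the file's contents after gzip compression."""
    return len(gzip.compress(Path(path).read_bytes()))


def over_budget(paths, max_bytes):
    """Return (path, compressed_size) for every file larger than max_bytes."""
    sizes = [(str(path), gzipped_size(path)) for path in paths]
    return [(path, size) for path, size in sizes if size > max_bytes]
```

A CI step could call something like over_budget(["dist/main.js"], 170 * 1024) and fail the build when the returned list is not empty.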

If you want to make sure that your builds stay green in the Lighthouse audits, then you should familiarize yourself with lighthouse-ci. It makes it possible to run audits in your CI pipeline and define rules that should never be broken. To name one, you can say that your application is meant to score over 90 points in every audit, and then your CI will fail when any score drops below the threshold. What's even better is that it's possible to limit asset size or make assertions on your custom metrics. It's a versatile tool, a must-have for every web developer.

It's worth mentioning that Webpack is also capable of enforcing asset size limits. In its default configuration, this bundler will display a warning in the console if some scripts or images are too large. However, you can reconfigure it to throw an error instead. Consult its documentation to learn how to enable this feature.

Discussing budgets with decision-makers

We have all been there. You are working hard to ensure your application loads quickly and is ready to use as soon as possible, but then comes that one meeting where you learn you will have to completely redesign the home page and put tons of images and other visual elements on it. You are aware that it will have a huge impact on the load time, so you try to minimize the losses, but nobody wants to listen to you.

It's a fact that there is a constant struggle going on between stakeholders and engineers. We often disagree or simply don't understand each other. Non-engineering members of your team are often unaware of the performance consequences of their decisions. That's not their job. It's up to us to explain and present it to them in the clearest way possible.

With budgets in place, you can say that bringing this additional carousel of images will make us miss the 5-second deadline for page load. That's something easily understandable for everyone. Moreover, having those limits in place allows you to move the discussion back in time to the design stage. This will save you a lot of time, which you will be able to spend on something else.

We have to go over the budget!

Congratulations - you have enforced a strict budget in your application, and it has already prevented several changes that would degrade performance by accident. However, as products tend to grow over time, you have been adding more and more features to yours, and now you cannot do it anymore because your budget is exhausted. What should you do?

You have to compromise. You can either:

get back to previously added features and optimize them,
decide to remove some feature to make room for a new one (or postpone interactivity with it),
completely abandon your idea and don't ship another feature.

As with a financial budget, when you go over the limit, you have to reduce spending on leisure and move funds to bills instead. The same principle applies here. That's why it's so important to have engineers, designers, and stakeholders all on the same page. We all have to cooperate to answer that question and provide the best possible experience.

In conclusion, performance budgets are an invaluable tool for ensuring that your application consistently delivers a top-notch user experience. By setting clear limits and guidelines for metrics, such as Time to Interactive and resource sizes, you can keep your team aligned and focused on optimizing performance from the design stage itself. These budgets also facilitate productive discussions with stakeholders, helping them understand the trade-offs between features and performance. However, it's essential to remain flexible and be ready to compromise when you inevitably reach the limits of your budget. Remember that it's a collaborative effort involving engineers, designers, and decision-makers to provide the best possible user experience and maintain a healthy performance balance in your application.

By Szymon Chmal, Senior Frontend Developer @ Bright Inventions