<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: guoliwu</title>
    <description>The latest articles on DEV Community by guoliwu (@guoliwu).</description>
    <link>https://dev.to/guoliwu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1023584%2F81a65458-e417-4a5b-9726-19e14e05dd20.png</url>
      <title>DEV Community: guoliwu</title>
      <link>https://dev.to/guoliwu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/guoliwu"/>
    <language>en</language>
    <item>
      <title>How to Manage Jina Resources with Namespace</title>
      <dc:creator>guoliwu</dc:creator>
      <pubDate>Fri, 24 Feb 2023 07:52:56 +0000</pubDate>
      <link>https://dev.to/guoliwu/how-to-manage-jina-resources-with-namespace-1hhb</link>
      <guid>https://dev.to/guoliwu/how-to-manage-jina-resources-with-namespace-1hhb</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh1y89ranb69zmiv7sl0y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh1y89ranb69zmiv7sl0y.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Over the past year, we’ve been rapidly expanding Jina AI Cloud, starting with our &lt;a href="https://cloud.jina.ai/" rel="noopener noreferrer"&gt;Executor Hub&lt;/a&gt;, and now encompassing &lt;a href="https://docarray.jina.ai/fundamentals/cloud-support/data/" rel="noopener noreferrer"&gt;DocumentArray storage&lt;/a&gt;, &lt;a href="https://docs.jina.ai/fundamentals/jcloud/" rel="noopener noreferrer"&gt;hosted Flows&lt;/a&gt; and &lt;a href="https://now.jina.ai/" rel="noopener noreferrer"&gt;cloud apps&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Jina AI Cloud: &lt;a href="https://cloud.jina.ai/" rel="noopener noreferrer"&gt;https://cloud.jina.ai/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’s a lot of stuff to manage! We’re introducing &lt;strong&gt;user namespaces&lt;/strong&gt; to make things easier for all our users.&lt;/p&gt;

&lt;h2&gt;
  
  
  Namespace
&lt;/h2&gt;

&lt;p&gt;Previously, if Alice and Bob both wanted to push a DocumentArray called &lt;code&gt;fashion_mnist&lt;/code&gt;, whoever pushed first would get the name. That means Bob might have had to settle for &lt;code&gt;fashion_mnist2&lt;/code&gt; or similar. With more people using Jina AI Cloud, naming conflicts could become commonplace.&lt;/p&gt;

&lt;p&gt;With user namespaces, both Alice and Bob can have their own &lt;code&gt;fashion_mnist&lt;/code&gt; DocumentArrays (or Executors with the same name) with no fear of naming conflicts.&lt;/p&gt;

&lt;p&gt;The new namespaces apply to two important resources in the Jina AI ecosystem: &lt;strong&gt;DocumentArray&lt;/strong&gt; and &lt;strong&gt;Executor&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Schema
&lt;/h3&gt;

&lt;p&gt;Moving forwards, names for DocumentArrays and Hub Executors will follow the new schema, &lt;code&gt;namespace/resource&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4uo6lrkq10r6sc77yxpa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4uo6lrkq10r6sc77yxpa.png" alt=" " width="800" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡Note that Executors now use &lt;code&gt;jinaai://&lt;/code&gt; as the prefix (not &lt;code&gt;jinahub://&lt;/code&gt;) and no longer need secrets (as you are already logged in).&lt;/p&gt;
&lt;/blockquote&gt;
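&lt;p&gt;To make the schema concrete, here is a short illustrative sketch in plain Python (the helper below is hypothetical and not part of any Jina library): a name is now two parts, the user namespace and the resource name, joined by a slash.&lt;/p&gt;

```python
# Hypothetical helper (not part of the Jina API), illustrating the
# new 'namespace/resource' naming scheme.
def split_resource_name(name: str) -> tuple:
    """Split a namespaced resource name into (namespace, resource)."""
    namespace, sep, resource = name.partition('/')
    if not sep or not namespace or not resource:
        raise ValueError(f"expected 'namespace/resource', got {name!r}")
    return namespace, resource

# Alice and Bob can now each own a resource called 'fashion_mnist':
print(split_resource_name('alice/fashion_mnist'))  # ('alice', 'fashion_mnist')
print(split_resource_name('bob/fashion_mnist'))    # ('bob', 'fashion_mnist')
```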

&lt;h3&gt;
  
  
  Access scope
&lt;/h3&gt;

&lt;p&gt;For both DocumentArrays and Executors, the user has the following access scopes:&lt;/p&gt;

&lt;h3&gt;
  
  
  Does this break anything?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;There are no breaking changes for existing Executors on Executor Hub.&lt;/li&gt;
&lt;li&gt;You can still pull old DocumentArrays by their original name, but you can’t update them. Newly-pushed DocumentArrays must follow the &lt;code&gt;username/da-name&lt;/code&gt; scheme.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Managing your resources under the namespace
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Manage DocumentArrays
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Create a Jina AI account at &lt;a href="http://cloud.jina.ai/" rel="noopener noreferrer"&gt;cloud.jina.ai&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fibfmw3zlglcs9wxctvad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fibfmw3zlglcs9wxctvad.png" width="445" height="743"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://jina.ai/news/docarray-0-19-1-update/" rel="noopener noreferrer"&gt;Upgrade &lt;code&gt;docarray&lt;/code&gt; to &lt;code&gt;&amp;gt;=0.19.1&lt;/code&gt; version&lt;/a&gt; with &lt;code&gt;pip install -U docarray&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Log in via &lt;code&gt;docarray.login()&lt;/code&gt; and start pushing and pulling DocumentArrays:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;docarray&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DocumentArray&lt;/span&gt;

&lt;span class="n"&gt;docarray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;login&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# log in as 'alice'
&lt;/span&gt;
&lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DocumentArray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;alice/fashion_mnist&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;alice/fashion_mnist_updated&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;Manage your pushed DocumentArrays on Jina AI Cloud in the "Storage" tab:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdox1fn1paihuzsa7ruse.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdox1fn1paihuzsa7ruse.png" alt="Untitled" width="800" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Manage Executors
&lt;/h3&gt;

&lt;p&gt;Upgrade &lt;code&gt;jina&lt;/code&gt; to the latest version with &lt;code&gt;pip install -U jina&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;1. Create a Jina AI account at &lt;a href="http://cloud.jina.ai/" rel="noopener noreferrer"&gt;cloud.jina.ai&lt;/a&gt;.&lt;br&gt;
2. Use an existing Executor in your Flow (in Python):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;jina&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Flow&lt;/span&gt;

&lt;span class="n"&gt;flow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Flow&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uses&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;jinaai+docker://alice/MyExecutor&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# note jinaai
&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3. Or push your own Executor to Executor Hub (from the CLI):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;jina auth login
jina hub push MyExecutor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;4. Manage your Executors in the "Executors" tab on Executor Hub:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe712o6zpewcd5objhinr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe712o6zpewcd5objhinr.png" width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>discuss</category>
    </item>
    <item>
      <title>This Week(s) in DocArray</title>
      <dc:creator>guoliwu</dc:creator>
      <pubDate>Thu, 23 Feb 2023 11:10:52 +0000</pubDate>
      <link>https://dev.to/guoliwu/this-weeks-in-docarray-485n</link>
      <guid>https://dev.to/guoliwu/this-weeks-in-docarray-485n</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1e09izeu9zx9zq4q5cbj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1e09izeu9zx9zq4q5cbj.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’s already been a month since the &lt;a href="https://github.com/docarray/docarray/releases/tag/2023.01.18.alpha" rel="noopener noreferrer"&gt;last alpha release&lt;/a&gt; of DocArray v2. Since then, a lot has happened: we’ve merged features that we’re really proud of, and we keep crying tears of joy and misery while coercing Python into doing what we want. If you want to learn about interesting Python edge cases or follow the progress of DocArray v2 development, then you’re in the right place!&lt;/p&gt;

&lt;p&gt;For those who don’t know, DocArray is a library for &lt;strong&gt;representing, sending, and storing multi-modal data&lt;/strong&gt;, with a focus on applications in &lt;strong&gt;ML&lt;/strong&gt; and &lt;strong&gt;Neural Search&lt;/strong&gt;. The project just moved to the Linux Foundation’s LF AI &amp;amp; Data, and to celebrate its first birthday we decided to rewrite it from scratch, mainly because of a design shift and a desire to solidify the codebase from the ground up.&lt;/p&gt;

&lt;h2&gt;
  
  
  MultiModalDataset
&lt;/h2&gt;

&lt;p&gt;As part of our goal to make DocArray the go-to library for representing, sending, and storing multi-modal data, we‘ve added a &lt;code&gt;MultiModalDataset&lt;/code&gt; class to easily convert DocumentArrays into PyTorch-Dataset-compliant datasets that can be used with the PyTorch DataLoader.&lt;/p&gt;

&lt;p&gt;All you need is a DocumentArray and a dictionary of preprocessing functions and you’re up and running!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;docarray&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseDocument&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DocumentArray&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;docarray.data&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MultiModalDataset&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;docarray.documents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Text&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torch.utils.data&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataLoader&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Thesis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseDocument&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Text&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Student&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseDocument&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;thesis&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Thesis&lt;/span&gt;

&lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DocumentArray&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Student&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_students&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MultiModalDataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Student&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MultiModalDataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Student&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;
    &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;preprocessing&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;thesis.title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;embed_title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;thesis&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;normalize_embedding&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataLoader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DataLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collate_fn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MultiModalDataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Student&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;collate_fn&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Use your loader just like any other dataloader for awesome DL training
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you’re interested in using DocArray for training, check out our &lt;a href="https://github.com/docarray/docarray/blob/feat-rewrite-v2/docs/tutorials/multimodal_training_and_serving.md" rel="noopener noreferrer"&gt;example notebook&lt;/a&gt;, or take a peek at &lt;a href="https://github.com/docarray/docarray/pull/1049" rel="noopener noreferrer"&gt;implementation details of MultiModalDataset&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  TensorFlow support
&lt;/h2&gt;

&lt;p&gt;After recently adding PyTorch support, we’ve now gone on to add TensorFlow support to DocArray v2. As with PyTorch, we planned to subclass the &lt;code&gt;tensorflow.Tensor&lt;/code&gt; class with our &lt;code&gt;TensorFlowTensor&lt;/code&gt; class. That way, DocArray could run operations on it while we could still hand a &lt;code&gt;TensorFlowTensor&lt;/code&gt; instance over to ML models or TensorFlow functions, with TensorFlow recognizing it as one of its own rather than being confused by its class. Since we had already implemented this for PyTorch, this should be easy, right?&lt;/p&gt;

&lt;p&gt;But stop, not so fast. At first glance, TensorFlow tensors seem to be of class &lt;code&gt;tf.Tensor&lt;/code&gt;, right?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;

&lt;span class="n"&gt;tensor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
&lt;span class="n"&gt;tensor&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;0.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When trying to subclass &lt;code&gt;tf.Tensor&lt;/code&gt; though, we notice that this does not seem to work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Union&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cast&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;docarray.typing.tensor.abstract_tensor&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AbstractTensor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;parse_obj_as&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TensorFlowTensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AbstractTensor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nd"&gt;@classmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TensorFlowTensor&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__class__&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cls&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TensorFlowTensor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Expected a tf.Tensor, got &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;our_tensor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_obj_as&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TensorFlowTensor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,)))&lt;/span&gt;  &lt;span class="c1"&gt;# will fail
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Parsing a &lt;code&gt;tf.Tensor&lt;/code&gt; as &lt;code&gt;TensorFlowTensor&lt;/code&gt; will fail:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pydantic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error_wrappers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ValidationError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;validation&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ParsingModel&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TensorFlowTensor&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;__root__&lt;/span&gt;
  &lt;span class="n"&gt;__class__&lt;/span&gt; &lt;span class="n"&gt;assignment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TensorFlowTensor&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="nb"&gt;object&lt;/span&gt; &lt;span class="n"&gt;layout&lt;/span&gt; &lt;span class="n"&gt;differs&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tensorflow.python.framework.ops.EagerTensor&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;type_error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But wait, here they talk about an &lt;code&gt;EagerTensor&lt;/code&gt;, not &lt;code&gt;tf.Tensor&lt;/code&gt;. This is because TensorFlow supports both eager execution and graph execution. It defaults to eager execution, where operations are evaluated immediately; in graph execution, a computational graph is constructed for later evaluation.&lt;/p&gt;

&lt;p&gt;So maybe we just need to extend TensorFlow’s &lt;code&gt;EagerTensor&lt;/code&gt; then!&lt;/p&gt;

&lt;p&gt;This, however, doesn’t work either, because the class &lt;code&gt;EagerTensor&lt;/code&gt; is created on the fly, which is why trying to extend this class will fail with:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;TypeError: type 'tensorflow.python.framework.ops.EagerTensor' is not an acceptable base type&lt;/code&gt;.&lt;/p&gt;
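&lt;p&gt;You don’t need TensorFlow to see this kind of failure: some built-in Python types refuse subclassing for the same low-level reason (their type object does not permit subtyping). As an analogue, &lt;code&gt;bool&lt;/code&gt; raises the very same error:&lt;/p&gt;

```python
# Plain-Python analogue of the EagerTensor failure: 'bool', like
# TensorFlow's on-the-fly EagerTensor class, is not an acceptable base type.
try:
    class MyBool(bool):  # fails at class-creation time
        pass
except TypeError as e:
    print(e)  # type 'bool' is not an acceptable base type
```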

&lt;p&gt;With all that being said, we’ve decided to go with the following solution for now:&lt;/p&gt;

&lt;p&gt;Instead of extending TensorFlow’s tensor, we store a &lt;code&gt;tf.Tensor&lt;/code&gt; instance as an attribute of our &lt;code&gt;TensorFlowTensor&lt;/code&gt; class. Therefore if you want to perform operations on the tensor data or hand it over to your ML model, you have to explicitly access the &lt;code&gt;.tensor&lt;/code&gt; attribute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;docarray.typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TensorFlowTensor&lt;/span&gt;

&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TensorFlowTensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;224&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;224&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

&lt;span class="c1"&gt;# tensorflow functions
&lt;/span&gt;&lt;span class="n"&gt;broadcasted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;broadcast_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;224&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;224&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;broadcasted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;broadcast_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unwrap&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;224&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;224&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;broadcasted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;broadcast_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;224&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;224&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# this will fail
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the future, we plan to take a closer look and find a solution that lets us handle &lt;code&gt;TensorFlowTensor&lt;/code&gt;s just like our &lt;code&gt;TorchTensor&lt;/code&gt;s. In particular, we plan to investigate whether TensorFlow has an equivalent to Torch’s &lt;code&gt;__torch_function__()&lt;/code&gt;, which we told you about in the &lt;a href="https://jina.ai/news/this-week-in-docarray-1" rel="noopener noreferrer"&gt;previous blog post&lt;/a&gt;. With such an equivalent and some tricks here and there, we hope to enable smooth usage of our &lt;code&gt;TensorFlowTensor&lt;/code&gt; class and make it feel like a subclass of TensorFlow’s tensor, without it actually being one.&lt;/p&gt;
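&lt;p&gt;The interim &lt;code&gt;.tensor&lt;/code&gt; design can be approximated in plain Python: a thin wrapper stores the underlying object and forwards attribute lookups to it. The sketch below only illustrates that pattern (no TensorFlow involved) and is not DocArray’s actual implementation:&lt;/p&gt;

```python
# Sketch of the wrapper pattern (not DocArray's actual code): keep the
# wrapped object in an attribute and delegate unknown lookups to it.
class TensorWrapper:
    def __init__(self, tensor):
        self.tensor = tensor  # explicit access, like TensorFlowTensor.tensor

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails:
        # forward the lookup to the wrapped object.
        return getattr(self.tensor, name)

# Any object can stand in for a tensor; here, a list:
t = TensorWrapper([0.0] * 5)
print(t.tensor)      # [0.0, 0.0, 0.0, 0.0, 0.0]
print(t.count(0.0))  # delegated to the wrapped list: 5
```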

&lt;h2&gt;
  
  
  Nested class and multiprocessing
&lt;/h2&gt;

&lt;p&gt;As part of our goal to make DocArray the go-to library for representing, sending, and storing multi-modal data, it’s important that DocumentArrays support multiprocessing, i.e. processing across multiple CPU cores.&lt;/p&gt;

&lt;p&gt;In particular, we recently implemented a &lt;code&gt;MultiModalDataset&lt;/code&gt; class to easily convert a DocumentArray into a dataset that can be used in the PyTorch DataLoader. The PyTorch DataLoader wraps the Python multiprocessing module to implement preprocessing with multiple CPUs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the well-known issues with multiprocessing is that it doesn’t support classes that are declared inside a function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_class&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;B&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;

&lt;span class="n"&gt;MyClass&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_class&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;MyClass&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;multiprocessing&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fork&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nc"&gt;Pool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Traceback &lt;span class="o"&gt;(&lt;/span&gt;most recent call last&lt;span class="o"&gt;)&lt;/span&gt;:
  File &lt;span class="s2"&gt;"/Users/jackmin/Jina/docarray/meow.py"&lt;/span&gt;, line 13, &lt;span class="k"&gt;in&lt;/span&gt; &amp;lt;module&amp;gt;
    print&lt;span class="o"&gt;(&lt;/span&gt;p.map&lt;span class="o"&gt;(&lt;/span&gt;foo, range&lt;span class="o"&gt;(&lt;/span&gt;2&lt;span class="o"&gt;)))&lt;/span&gt;
  File &lt;span class="s2"&gt;"/Users/jackmin/miniconda3/envs/docarray/lib/python3.10/multiprocessing/pool.py"&lt;/span&gt;, line 367, &lt;span class="k"&gt;in &lt;/span&gt;map
    &lt;span class="k"&gt;return &lt;/span&gt;self._map_async&lt;span class="o"&gt;(&lt;/span&gt;func, iterable, mapstar, chunksize&lt;span class="o"&gt;)&lt;/span&gt;.get&lt;span class="o"&gt;()&lt;/span&gt;
  File &lt;span class="s2"&gt;"/Users/jackmin/miniconda3/envs/docarray/lib/python3.10/multiprocessing/pool.py"&lt;/span&gt;, line 774, &lt;span class="k"&gt;in &lt;/span&gt;get
    raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: &lt;span class="s1"&gt;'[&amp;lt;__main__.get_class.&amp;lt;locals&amp;gt;.B object at 0x10152e950&amp;gt;]'&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt; Reason: &lt;span class="s1"&gt;'AttributeError("Can'&lt;/span&gt;t pickle &lt;span class="nb"&gt;local &lt;/span&gt;object &lt;span class="s1"&gt;'get_class.&amp;lt;locals&amp;gt;.B'&lt;/span&gt;&lt;span class="s2"&gt;")'
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pickling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is because multiprocessing uses pickle to share objects with workers. Pickling saves only an object’s qualified class name, and unpickling re-imports the class by that name. For this to work, the class needs a globally reachable qualified name; classes defined inside functions are local and thus cannot be pickled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_class&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;B&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;

&lt;span class="n"&gt;MyClass&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_class&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pickle&lt;/span&gt;

&lt;span class="n"&gt;pickle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;MyClass&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;meow.pkl&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Traceback &lt;span class="o"&gt;(&lt;/span&gt;most recent call last&lt;span class="o"&gt;)&lt;/span&gt;:
  File &lt;span class="s2"&gt;"/Users/jackmin/Jina/docarray/meow.py"&lt;/span&gt;, line 10, &lt;span class="k"&gt;in&lt;/span&gt; &amp;lt;module&amp;gt;
    pickle.dump&lt;span class="o"&gt;(&lt;/span&gt;MyClass&lt;span class="o"&gt;()&lt;/span&gt;, open&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"meow.pkl"&lt;/span&gt;, &lt;span class="s2"&gt;"wb"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
AttributeError: Can&lt;span class="s1"&gt;'t pickle local object '&lt;/span&gt;get_class.&amp;lt;locals&amp;gt;.B&lt;span class="s1"&gt;'
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
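&lt;p&gt;Conversely, for a module-level class the pickle stream really does contain little more than the qualified name plus the instance state; no class body is stored. A minimal stdlib-only sketch (the &lt;code&gt;Point&lt;/code&gt; class here is purely illustrative):&lt;/p&gt;

```python
import pickle


class Point:
    """A module-level class: it has a globally reachable qualified name."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


data = pickle.dumps(Point(1, 2))

# The stream stores the class *name*, not its definition...
assert b'Point' in data

# ...plus the instance state, which is restored by re-importing the class:
restored = pickle.loads(data)
print(restored.x, restored.y)  # 1 2
```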




&lt;p&gt;In order to get around this, we need to make the declared class global:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_class&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;

    &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;B&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;

&lt;span class="n"&gt;MyClass&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_class&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pickle&lt;/span&gt;

&lt;span class="n"&gt;pickle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;MyClass&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;meow.pkl&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can now load the pickles in a separate process as long as the process has a declaration of our class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_class&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;

    &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;B&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;

&lt;span class="n"&gt;MyClass&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_class&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pickle&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pickle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;meow.pkl&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It doesn’t really matter how it ends up in the global scope. We can even do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;B&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pickle&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pickle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;meow.pkl&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
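&lt;p&gt;Because the lookup happens purely by name at load time, the class found during unpickling can even be a &lt;em&gt;different object&lt;/em&gt; from the one that existed when the object was dumped. A minimal single-process sketch:&lt;/p&gt;

```python
import pickle


class B:
    VERSION = 1


blob = pickle.dumps(B())  # stores the name 'B' in this module, plus empty state


class B:  # re-declare: the global name B now points to a brand-new class
    VERSION = 2


obj = pickle.loads(blob)
# The name lookup finds the *new* class, so the attribute comes from it:
print(obj.VERSION)  # 2
```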



&lt;p&gt;&lt;strong&gt;The fix?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;OK, so pickle just wants the class to be global. Simple enough, right? Let’s just plop &lt;code&gt;global&lt;/code&gt; in front of our declaration and be done with it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_class&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;

    &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;B&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;

&lt;span class="n"&gt;MyClass&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_class&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;MyClass&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;multiprocessing&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fork&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nc"&gt;Pool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yay, this runs fine! But what if our function returns a different class depending on the input arguments? After all, why else would we want to return a class from a function?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_class&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;

    &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;B&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;VERSION&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;

&lt;span class="n"&gt;C1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_class&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;C2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_class&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_version&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;VERSION&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;multiprocessing&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fork&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nc"&gt;Pool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_version&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;C1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;C2&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&amp;lt;class &lt;span class="s1"&gt;'__main__.B'&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
Traceback &lt;span class="o"&gt;(&lt;/span&gt;most recent call last&lt;span class="o"&gt;)&lt;/span&gt;:
  File &lt;span class="s2"&gt;"/Users/jackmin/Jina/docarray/meow.py"&lt;/span&gt;, line 19, &lt;span class="k"&gt;in&lt;/span&gt; &amp;lt;module&amp;gt;
    print&lt;span class="o"&gt;(&lt;/span&gt;p.map&lt;span class="o"&gt;(&lt;/span&gt;get_version, &lt;span class="o"&gt;[&lt;/span&gt;C1, C2]&lt;span class="o"&gt;))&lt;/span&gt;
  File &lt;span class="s2"&gt;"/Users/jackmin/miniconda3/envs/docarray/lib/python3.10/multiprocessing/pool.py"&lt;/span&gt;, line 367, &lt;span class="k"&gt;in &lt;/span&gt;map
    &lt;span class="k"&gt;return &lt;/span&gt;self._map_async&lt;span class="o"&gt;(&lt;/span&gt;func, iterable, mapstar, chunksize&lt;span class="o"&gt;)&lt;/span&gt;.get&lt;span class="o"&gt;()&lt;/span&gt;
  File &lt;span class="s2"&gt;"/Users/jackmin/miniconda3/envs/docarray/lib/python3.10/multiprocessing/pool.py"&lt;/span&gt;, line 774, &lt;span class="k"&gt;in &lt;/span&gt;get
    raise self._value
  File &lt;span class="s2"&gt;"/Users/jackmin/miniconda3/envs/docarray/lib/python3.10/multiprocessing/pool.py"&lt;/span&gt;, line 540, &lt;span class="k"&gt;in &lt;/span&gt;_handle_tasks
    put&lt;span class="o"&gt;(&lt;/span&gt;task&lt;span class="o"&gt;)&lt;/span&gt;
  File &lt;span class="s2"&gt;"/Users/jackmin/miniconda3/envs/docarray/lib/python3.10/multiprocessing/connection.py"&lt;/span&gt;, line 211, &lt;span class="k"&gt;in &lt;/span&gt;send
    self._send_bytes&lt;span class="o"&gt;(&lt;/span&gt;_ForkingPickler.dumps&lt;span class="o"&gt;(&lt;/span&gt;obj&lt;span class="o"&gt;))&lt;/span&gt;
  File &lt;span class="s2"&gt;"/Users/jackmin/miniconda3/envs/docarray/lib/python3.10/multiprocessing/reduction.py"&lt;/span&gt;, line 51, &lt;span class="k"&gt;in &lt;/span&gt;dumps
    cls&lt;span class="o"&gt;(&lt;/span&gt;buf, protocol&lt;span class="o"&gt;)&lt;/span&gt;.dump&lt;span class="o"&gt;(&lt;/span&gt;obj&lt;span class="o"&gt;)&lt;/span&gt;
_pickle.PicklingError: Can&lt;span class="s1"&gt;'t pickle &amp;lt;class '&lt;/span&gt;__main__.B&lt;span class="s1"&gt;'&amp;gt;: it'&lt;/span&gt;s not the same object as __main__.B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;Can't pickle &amp;lt;class '__main__.B'&amp;gt;: it's not the same object as __main__.B&lt;/code&gt;. What does that mean?&lt;/p&gt;
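&lt;p&gt;We can reproduce the error without multiprocessing at all: when pickling a class, pickle checks that looking up the class’s qualified name yields the very same class object, and the second &lt;code&gt;get_class()&lt;/code&gt; call re-binds the global name. A single-process sketch:&lt;/p&gt;

```python
import pickle


def get_class(version: int):
    global B

    class B:
        VERSION: int = version

    return B


C1 = get_class(1)
C2 = get_class(2)  # this call re-binds the global name B to a new class

try:
    # C1's qualified name now resolves to C2, a different object:
    pickle.dumps(C1)
except pickle.PicklingError as e:
    print(e)  # "... it's not the same object as ...B"

# Pickling C2 works fine, since the global name B still points to it:
assert pickle.loads(pickle.dumps(C2)) is C2
```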

&lt;p&gt;&lt;strong&gt;Double declaration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Well, our little trick has some caveats. The &lt;code&gt;global&lt;/code&gt; declaration hoists the class into the top-level scope, so every call to &lt;code&gt;get_class()&lt;/code&gt; re-binds the same name &lt;code&gt;B&lt;/code&gt;. This means we’re essentially doing this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;B&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;VERSION&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="n"&gt;C1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;B&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;VERSION&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

&lt;span class="n"&gt;C2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_version&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;VERSION&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;multiprocessing&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fork&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nc"&gt;Pool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_version&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;C1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;C2&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we run this code, we get the exact same error we got before:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&amp;lt;class &lt;span class="s1"&gt;'__main__.B'&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
Traceback &lt;span class="o"&gt;(&lt;/span&gt;most recent call last&lt;span class="o"&gt;)&lt;/span&gt;:
  File &lt;span class="s2"&gt;"/Users/jackmin/Jina/docarray/wow.py"&lt;/span&gt;, line 15, &lt;span class="k"&gt;in&lt;/span&gt; &amp;lt;module&amp;gt;
    print&lt;span class="o"&gt;(&lt;/span&gt;p.map&lt;span class="o"&gt;(&lt;/span&gt;get_version, &lt;span class="o"&gt;[&lt;/span&gt;C1, C2]&lt;span class="o"&gt;))&lt;/span&gt;
  File &lt;span class="s2"&gt;"/Users/jackmin/miniconda3/envs/docarray/lib/python3.10/multiprocessing/pool.py"&lt;/span&gt;, line 367, &lt;span class="k"&gt;in &lt;/span&gt;map
    &lt;span class="k"&gt;return &lt;/span&gt;self._map_async&lt;span class="o"&gt;(&lt;/span&gt;func, iterable, mapstar, chunksize&lt;span class="o"&gt;)&lt;/span&gt;.get&lt;span class="o"&gt;()&lt;/span&gt;
  File &lt;span class="s2"&gt;"/Users/jackmin/miniconda3/envs/docarray/lib/python3.10/multiprocessing/pool.py"&lt;/span&gt;, line 774, &lt;span class="k"&gt;in &lt;/span&gt;get
    raise self._value
  File &lt;span class="s2"&gt;"/Users/jackmin/miniconda3/envs/docarray/lib/python3.10/multiprocessing/pool.py"&lt;/span&gt;, line 540, &lt;span class="k"&gt;in &lt;/span&gt;_handle_tasks
    put&lt;span class="o"&gt;(&lt;/span&gt;task&lt;span class="o"&gt;)&lt;/span&gt;
  File &lt;span class="s2"&gt;"/Users/jackmin/miniconda3/envs/docarray/lib/python3.10/multiprocessing/connection.py"&lt;/span&gt;, line 211, &lt;span class="k"&gt;in &lt;/span&gt;send
    self._send_bytes&lt;span class="o"&gt;(&lt;/span&gt;_ForkingPickler.dumps&lt;span class="o"&gt;(&lt;/span&gt;obj&lt;span class="o"&gt;))&lt;/span&gt;
  File &lt;span class="s2"&gt;"/Users/jackmin/miniconda3/envs/docarray/lib/python3.10/multiprocessing/reduction.py"&lt;/span&gt;, line 51, &lt;span class="k"&gt;in &lt;/span&gt;dumps
    cls&lt;span class="o"&gt;(&lt;/span&gt;buf, protocol&lt;span class="o"&gt;)&lt;/span&gt;.dump&lt;span class="o"&gt;(&lt;/span&gt;obj&lt;span class="o"&gt;)&lt;/span&gt;
_pickle.PicklingError: Can&lt;span class="s1"&gt;'t pickle &amp;lt;class '&lt;/span&gt;__main__.B&lt;span class="s1"&gt;'&amp;gt;: it'&lt;/span&gt;s not the same object as __main__.B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What happened here? By declaring the class twice, we’ve overwritten the first &lt;code&gt;Class B&lt;/code&gt; with a second &lt;code&gt;Class B&lt;/code&gt; in the global scope. Pickle notices this when it tries to serialize &lt;code&gt;C1&lt;/code&gt;: the &lt;code&gt;Class B&lt;/code&gt; that &lt;code&gt;C1&lt;/code&gt; refers to is no longer the top-level one, so it raises an exception.&lt;/p&gt;
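&lt;p&gt;The same identity check can be triggered without multiprocessing at all; a minimal sketch using &lt;code&gt;pickle&lt;/code&gt; directly:&lt;/p&gt;

```python
import pickle

class B:
    VERSION: int = 1

C1 = B  # keep a reference to the first definition

class B:  # redefines B under the same qualified name
    VERSION: int = 2

# pickle serializes classes by reference: it looks up "B" in this module
# and checks that the result is the very object being pickled. The lookup
# now finds the second class, so pickling C1 fails.
try:
    pickle.dumps(C1)
except pickle.PicklingError as e:
    print('PicklingError:', e)
```

&lt;p&gt;Pickling &lt;code&gt;B&lt;/code&gt; itself (the current definition) still works, since the lookup resolves back to the same object.&lt;/p&gt;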

&lt;p&gt;&lt;strong&gt;Qualified names must be unique&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The issue here is that both &lt;code&gt;Class B&lt;/code&gt;s have the same qualified name. Thus, both definitions are fighting over who gets to be the one the global dictionary knows about.&lt;/p&gt;

&lt;p&gt;We can resolve this conflict and allow our two classes to live together peacefully by moving them to different qualified names and thus, different keys in the global scope:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_class&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;

    &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;B&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;VERSION&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt;

    &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__qualname__&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__qualname__&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;globals&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;B&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;

&lt;span class="n"&gt;C1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_class&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;C2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_class&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_version&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Class Name:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Class Qualified Name:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__qualname__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Type repr&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;VERSION&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;multiprocessing&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;mp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fork&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nc"&gt;Pool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_version&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;C1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;C2&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Class Name: B
Class Qualified Name: B1
Type repr &amp;lt;class &lt;span class="s1"&gt;'__main__.B1'&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
Class Name: B
Class Qualified Name: B2
Type repr &amp;lt;class &lt;span class="s1"&gt;'__main__.B2'&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;1, 2]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that although the two classes have different qualified names, they can still share the same name with no issues. Printing the type does, however, show the qualified name.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation example&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you’d like to see how we used this pattern to implement DocumentArrays that work with multiprocessing, check out &lt;a href="https://github.com/docarray/docarray/pull/1049" rel="noopener noreferrer"&gt;this PR&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Support Protobuf 3 and 4
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://protobuf.dev/" rel="noopener noreferrer"&gt;Protobuf&lt;/a&gt; introduced a &lt;a href="https://github.com/tensorflow/tensorflow/issues/56077" rel="noopener noreferrer"&gt;breaking change&lt;/a&gt; in their 4.21 release. This has had a big impact on the Python ecosystem, and a lot of libraries have not yet been updated to use version 4.x. Perhaps the biggest pain for the ML ecosystem is TensorFlow’s lack of support for Protobuf, as it’s a widely used library and many packages, including DocArray, depend on it.&lt;/p&gt;

&lt;p&gt;At the same time, DocArray can be used without TensorFlow; it’s just one of several available backends. To better support all users, we’ve decided to support both versions of Protobuf.&lt;/p&gt;

&lt;p&gt;This is actually easier than it may sound: we simply generated two Python files with &lt;code&gt;protoc&lt;/code&gt;, one for each of the Protobuf versions we want to support (3.x and 4.x).&lt;/p&gt;

&lt;p&gt;So, depending on the Protobuf version you have installed, we load either the first or the second generated file. It’s as straightforward as that. &lt;a href="https://github.com/docarray/docarray/pull/1078" rel="noopener noreferrer"&gt;Here&lt;/a&gt; is the PR for the curious.&lt;/p&gt;
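&lt;p&gt;A rough sketch of the idea (the module names below are made up for illustration, not DocArray’s actual ones): inspect the installed Protobuf version and pick the matching pre-generated file:&lt;/p&gt;

```python
def select_proto_module(protobuf_version: str) -> str:
    """Pick which pre-generated *_pb2 module to import based on the
    installed Protobuf version. Module names here are hypothetical."""
    major = int(protobuf_version.split('.')[0])
    return 'docarray_pb2' if major >= 4 else 'docarray_pb2_legacy'

# In real code the version string would come from google.protobuf.__version__
print(select_proto_module('4.21.12'))  # docarray_pb2
print(select_proto_module('3.20.3'))   # docarray_pb2_legacy
```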

&lt;h2&gt;
  
  
  Join the conversation
&lt;/h2&gt;

&lt;p&gt;Want to keep up to date or just have a chat with us? Join our Discord and say hi!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://discord.gg/WaMp6PVPgR" rel="noopener noreferrer"&gt;Join Discord&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Author
&lt;/h2&gt;

&lt;p&gt;Sami Jaghouar, Alex C-G, Charlotte Gerhaher, Jack Min Ong&lt;/p&gt;

&lt;h2&gt;
  
  
  Original Link
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://jina.ai/news/this-week-in-docarray-2/https://jina.ai/news/this-week-in-docarray-2/" rel="noopener noreferrer"&gt;https://jina.ai/news/this-week-in-docarray-2/https://jina.ai/news/this-week-in-docarray-2/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>tutorial</category>
      <category>beginners</category>
      <category>automation</category>
    </item>
    <item>
      <title>Fine-tuning with Low Budget and High Expectations</title>
      <dc:creator>guoliwu</dc:creator>
      <pubDate>Wed, 22 Feb 2023 06:42:41 +0000</pubDate>
      <link>https://dev.to/guoliwu/fine-tuning-with-low-budget-and-high-expectations-35oc</link>
      <guid>https://dev.to/guoliwu/fine-tuning-with-low-budget-and-high-expectations-35oc</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hu6da32grkmj3uzz3ep.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hu6da32grkmj3uzz3ep.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fine-tuning is a transfer learning technique developed as part of the Deep Learning revolution in artificial intelligence. Instead of learning a new task from scratch, fine-tuning takes a pre-trained model, trained on a related task, and then further trains it for the new task. Alternately, it can mean taking a model pre-trained for an open domain task, and further training it for a domain-specific one.&lt;/p&gt;

&lt;p&gt;Compared to training from scratch, fine-tuning is a much more cost-efficient solution whenever it is feasible. It requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;less labeled data,&lt;/strong&gt; as there is no need to learn everything all over again. All the training is devoted to acquiring domain-specific knowledge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;less time to train,&lt;/strong&gt; since the number of variables is much smaller and most layers in the deep neural network freeze during fine-tuning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Leveraging and transferring pre-existing training to new problems is one of the major practical developments of the Deep Learning revolution. It is highly effective, economical, and environmentally friendly. This is especially true for small businesses and individuals that hope to take advantage of new AI technologies.&lt;/p&gt;

&lt;p&gt;Or at least that's what all the deep learning tweets will tell you.&lt;/p&gt;

&lt;p&gt;But if you think about it, or try to use fine-tuning in a real world use-case, you will quickly find out that the promise comes with a lot of caveats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Exactly &lt;em&gt;how much data&lt;/em&gt; do you need to get a good result?&lt;/strong&gt; One labeled data point? Ten? One thousand? Ten thousand?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exactly &lt;em&gt;how much time&lt;/em&gt; do you need to get good results?&lt;/strong&gt; One minute of fine-tuning? An hour? A day? A week?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not trivial questions, even for large enterprises, but they are especially critical for SMEs and individuals who have limited resources to invest in AI. Domain-specific data is neither free nor error-free and requires costly human labor to generate. Top-of-the-line GPU pipelines are frighteningly expensive to buy and maintain, with most enterprises renting time on a cloud service. An unplanned AWS bill in the thousands of euros is unwelcome at the best of times.&lt;/p&gt;

&lt;p&gt;This article will give you a &lt;strong&gt;quantitative answer&lt;/strong&gt; to these questions, using the &lt;a href="https://rebrand.ly/jina-ai-finetune" rel="noopener noreferrer"&gt;Jina AI Finetuner&lt;/a&gt;. This tool is designed to improve the performance of pre-trained models and make them production-ready without expensive hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  Experiment design
&lt;/h2&gt;

&lt;p&gt;We designed two experiments to quantitatively study how &lt;strong&gt;labeled data&lt;/strong&gt; and &lt;strong&gt;training time&lt;/strong&gt; affect fine-tuning performance. For each experiment, we construct three multimodal search tasks by fine-tuning three deep neural networks. We chose seven datasets, two of which are non-domain-specific public datasets, to ensure the generality of our experiment.&lt;/p&gt;

&lt;p&gt;We measure the performance of fine-tuned models by evaluating their ability to perform search tasks, as measured by &lt;a href="https://en.wikipedia.org/wiki/Mean_reciprocal_rank" rel="noopener noreferrer"&gt;Mean Reciprocal Rank&lt;/a&gt; (mRR), Recall, and &lt;a href="https://stats.stackexchange.com/questions/127041/mean-average-precision-vs-mean-reciprocal-rank" rel="noopener noreferrer"&gt;Mean Average Precision&lt;/a&gt; (mAP). These metrics are calculated using the top 20 results of each search in the validation subset held out from each dataset.&lt;/p&gt;
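&lt;p&gt;For reference, a minimal sketch of how Mean Reciprocal Rank over the top-k results is computed (the document ids here are toy values, not from our datasets):&lt;/p&gt;

```python
def reciprocal_rank(ranked_ids, relevant_ids, k=20):
    """1/rank of the first relevant hit within the top-k results, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(rankings, relevances, k=20):
    """Average reciprocal rank over all queries."""
    scores = [reciprocal_rank(r, rel, k) for r, rel in zip(rankings, relevances)]
    return sum(scores) / len(scores)

# Toy example: first query's first relevant hit is at rank 2, second at rank 1
print(mean_reciprocal_rank([['a', 'b'], ['c', 'd']], [{'b'}, {'c'}]))  # 0.75
```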

&lt;p&gt;The table below summarizes the tasks, models and datasets used in our experiments, as well as their performance metrics without any fine-tuning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmpqomv9s2t93497pulwu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmpqomv9s2t93497pulwu.png" alt=" " width="733" height="610"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We already knew, even before performing any experiments, that all else being equal, more &lt;strong&gt;labeled data&lt;/strong&gt; and more &lt;strong&gt;training time&lt;/strong&gt; positively influence performance. But it’s not enough to say that. We need to know: how much is enough?&lt;/p&gt;

&lt;p&gt;The overarching question of our experiment is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can we estimate the minimum domain- and task-specific labeled data and training time to deliver an adequate performance?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  How much labeled data is needed for good fine-tuning?
&lt;/h2&gt;

&lt;p&gt;We gradually increase the amount of labeled data fed to Finetuner from 100 items to 100,000 and see how this affects performance on the metrics described in the previous section.&lt;/p&gt;

&lt;p&gt;We further calculate the &lt;em&gt;return on investment&lt;/em&gt; (ROI), by dividing the relative improvement (a proxy for net profit) by the amount of labeled data (a proxy for investment cost). &lt;strong&gt;This is useful because it indicates the point at which adding more data is producing diminishing returns.&lt;/strong&gt;&lt;/p&gt;
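&lt;p&gt;Concretely, the ROI here is just a ratio; a sketch with purely illustrative numbers (not results from our experiments):&lt;/p&gt;

```python
def roi(relative_improvement: float, cost: float) -> float:
    """ROI as used in this article: relative improvement over the
    pre-trained model (a proxy for net profit) divided by the cost
    (here, the number of labeled data items)."""
    return relative_improvement / cost

# Illustrative only: a 30% relative improvement from 1,000 labeled items
print(roi(0.30, 1000))  # 0.0003 improvement per labeled item
```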

&lt;p&gt;In the figures below, the X-axis represents the amount of labeled data, and the Y-axis represents the relative improvement over the pre-trained model. The higher, the better.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx07euorg3iyx2p7zp4do.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx07euorg3iyx2p7zp4do.png" alt=" " width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanuxhspyt7bs3pddgquc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanuxhspyt7bs3pddgquc.png" alt=" " width="800" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F376k4pc5q2p301k4o3wj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F376k4pc5q2p301k4o3wj.png" alt=" " width="800" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These results are promising but &lt;em&gt;not&lt;/em&gt; particularly surprising. Performance improves with more labeled data on nearly all tasks and all datasets, more for some tasks and datasets than for others. However, the only conclusion we can draw from these figures is that the Finetuner works as advertised. So far so good.&lt;/p&gt;

&lt;p&gt;What is more interesting is the ROI curve. In the figures below, the X-axis represents the amount of labeled data, and the Y-axis represents the ROI per labeled data item. The higher, the better. In particular, &lt;code&gt;ROI=0&lt;/code&gt; means adding new labeled data at that point no longer contributes to any improvement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4j1o06dgorgwiivx0qh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4j1o06dgorgwiivx0qh.png" alt=" " width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexnag2ywmn5f4j6rq20q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexnag2ywmn5f4j6rq20q.png" alt=" " width="800" height="243"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk39staqwgdv1f893anhp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk39staqwgdv1f893anhp.png" alt=" " width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Surprisingly, the ROI per unit of new labeled data starts to drop almost immediately. We expected it to decrease eventually, but not this quickly.&lt;/p&gt;

&lt;h2&gt;
  
  
  How much time is needed for fine-tuning?
&lt;/h2&gt;

&lt;p&gt;To measure the value of added training time, we fixed the amount of new labeled data at 1,000 items, then gradually increased the number of training epochs from 1 to 10. At each increase, we measured the improvement over the pre-trained model and calculated the ROI. For these experiments, the ROI is calculated by dividing the relative improvement by the elapsed time in seconds. This means that when &lt;code&gt;ROI=0&lt;/code&gt;, adding training time no longer improves performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff9kcmbbil9d2ueq5v5lj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff9kcmbbil9d2ueq5v5lj.png" alt=" " width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fme89lbjkuwck70e2pvgs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fme89lbjkuwck70e2pvgs.png" alt=" " width="800" height="236"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcf0n0ihkrfpmut7bemwe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcf0n0ihkrfpmut7bemwe.png" alt=" " width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We knew in advance that adding more time does not guarantee any improvement at all. It can, in fact, reduce performance due to the &lt;em&gt;overfitting problem&lt;/em&gt;. Some models (e.g. CLIP) are more prone to overfitting than others. In principle, if we keep training with the same 1000 data points over and over, we are guaranteed to overfit the data and the overall performance will drop.&lt;/p&gt;

&lt;p&gt;Let's look at the ROI curves.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq71svumqo0j4qx40ew8o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq71svumqo0j4qx40ew8o.png" alt=" " width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxi9w0mchpclq0gahidqp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxi9w0mchpclq0gahidqp.png" alt=" " width="800" height="246"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd2qby7lx8632c5dr9fbq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd2qby7lx8632c5dr9fbq.png" alt=" " width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The ROI drops immediately after the first epoch of fine-tuning. Unlike in the last experiment, where ROI approached zero but stayed positive as we added labeled data, here the ROI on added time can go negative due to the overfitting problem!&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;What does this mean for users looking to maximize gains and minimize costs?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Many state-of-the-art deep neural networks are capable of &lt;em&gt;few-shot&lt;/em&gt; learning. They are quick learners and can make large improvements with only a few hundred items of labeled data and only a few minutes of training time. You might have thought that deep neural network training requires millions of data items and a week of runtime, but we have shown in this article how that stereotype does not hold up to reality.&lt;/li&gt;
&lt;li&gt;Because they can learn so much, so fast, from so little data, ROI drops quickly as you put more time and data into fine-tuning. In the experiments above, ROI shrinks by 70% from its highest value after 500 labeled data items or 600 added seconds of GPU training time. Further investment beyond a few hundred items of training data and very minimal training time may not pay off as well as you would like.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All in all, fine-tuning isn’t even in the same economic league as training from scratch: it is far, far cheaper, especially with the help of &lt;a href="https://rebrand.ly/jina-ai-finetune" rel="noopener noreferrer"&gt;Finetuner&lt;/a&gt;. So the next time you receive a marketing email from the sales department of a GPU vendor or a company offering crowdsourced data acquisition, you know how to bargain with them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Author
&lt;/h2&gt;

&lt;p&gt;Han Xiao, Bo Wang, Scott Martens&lt;/p&gt;

&lt;h2&gt;
  
  
  Original Link
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://jina.ai/news/fine-tuning-with-low-budget-and-high-expectations/" rel="noopener noreferrer"&gt;https://jina.ai/news/fine-tuning-with-low-budget-and-high-expectations/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://rebrand.ly/jina-ai-finetune" rel="noopener noreferrer"&gt;https://rebrand.ly/jina-ai-finetune&lt;/a&gt;&lt;/p&gt;

</description>
      <category>firstpost</category>
      <category>posts</category>
      <category>introduction</category>
    </item>
    <item>
      <title>A Guide to Using OpenTelemetry in Jina for Monitoring and Tracing Applications</title>
      <dc:creator>guoliwu</dc:creator>
      <pubDate>Fri, 17 Feb 2023 05:45:17 +0000</pubDate>
      <link>https://dev.to/guoliwu/a-guide-to-using-opentelemetry-in-jina-for-monitoring-and-tracing-applications-15a4</link>
      <guid>https://dev.to/guoliwu/a-guide-to-using-opentelemetry-in-jina-for-monitoring-and-tracing-applications-15a4</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xzUy8YXU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/11/Jina-AI-Website-Banners-Templates--44--1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xzUy8YXU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/11/Jina-AI-Website-Banners-Templates--44--1.png" alt="" width="880" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As software (and the cloud) eats the world, more companies are using more microservice architectures, containerization, multi-cloud deployments and continuous deployment patterns. That means more points of failure. Failures, along with tight Service Level Objectives, stress out operations/SRE/DevOps teams, in turn increasing friction with development teams that want to deploy new features and launch new A/B tests as soon as possible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/1Be4g2yeiJ1QfqaKvz/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/1Be4g2yeiJ1QfqaKvz/giphy.gif" alt="Jina developers when we can’t deploy new features" width="387" height="292"&gt;&lt;/a&gt;Jina developers when we can’t deploy new features&lt;/p&gt;

&lt;p&gt;CI/CD patterns have evolved a lot in the cloud era, helping teams to push improvements, fixes or new features to production almost instantaneously, giving users access to the latest goodies. One thing that enables this is the wide range of tools that both generate and collect information from running applications in real-time or near real-time.&lt;/p&gt;

&lt;p&gt;This information is in the form of signals indicating an application's health, performance and conformity. You can observe and analyze signal anomalies to catch misbehaving applications or features that need to be patched or disabled until further analysis. If you do it properly, you can detect anomalies more quickly, meaning happier customers, prevention of major outages and fewer security leaks.&lt;/p&gt;
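The kind of anomaly detection described here can be sketched with a rolling statistic: flag any observation that strays too far from the recent mean. Below is a minimal, stdlib-only illustration; the window size, threshold, and latency numbers are invented for the example, not taken from the article:

```python
from collections import deque
from statistics import mean, stdev

def detect_anomalies(latencies_ms, window=5, threshold=3.0):
    """Flag observations more than `threshold` standard deviations
    above the rolling mean of the previous `window` samples."""
    recent = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(latencies_ms):
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            if sigma > 0 and (value - mu) / sigma > threshold:
                anomalies.append((i, value))
        recent.append(value)
    return anomalies

# A latency series with one obvious spike
series = [100, 102, 98, 101, 99, 100, 103, 5000, 97, 101]
print(detect_anomalies(series))  # the 5000 ms request is flagged
```

Real monitoring stacks use far more robust detectors, but the principle is the same: turn raw signals into a judgment about whether the application is misbehaving.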

&lt;p&gt;&lt;a href="https://i.giphy.com/media/J4JSpIwM6y3Q6xnHgg/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/J4JSpIwM6y3Q6xnHgg/giphy.gif" alt="A disembodied hand coming from a screen like the girl from the Ring is another way to make customers happy" width="320" height="244"&gt;&lt;/a&gt;A disembodied hand coming from a screen like the girl from the Ring is another way to make customers happy&lt;/p&gt;


&lt;p&gt;Back in the old days, signalling methods started as humble error/exception logging. They've now evolved to the latest OpenTelemetry standard.&lt;/p&gt;

&lt;p&gt;In this post we’ll explore the &lt;strong&gt;new tracing and monitoring features introduced in &lt;a href="https://jina.ai/news/jina-3-12-update/"&gt;Jina 3.12&lt;/a&gt;&lt;/strong&gt;, and use Sentry to track what’s happening when indexing or searching.&lt;/p&gt;

&lt;h2&gt;
  
  
  What problems can monitoring and tracing solve?
&lt;/h2&gt;

&lt;p&gt;We're going to use a fictional person to help explore the monitoring and tracing landscape. Meet Pamela NoLastName: she started a website called &lt;code&gt;Pamazon&lt;/code&gt; to help people buy products online. Let's walk through the evolution of &lt;code&gt;Pamazon&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UFlo0rEg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/11/image-17.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UFlo0rEg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/11/image-17.png" alt="Pamazon front page" width="880" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pamazon 1.0: Logs as &lt;code&gt;stdout&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;As Pamela's site grows, she needs monitoring to generate, capture and analyze signals and improve site reliability. She starts with a pretty simple system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Application logs are stored on each local machine and rotated periodically to avoid using up all the available disk space. If Pamela wants a longer retention period, she has to export these logs.&lt;/li&gt;
&lt;li&gt;If a customer or tester notices an anomaly like the search engine constantly timing out, they create a support ticket to complain.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OQ7cyyes--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/11/image-18.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OQ7cyyes--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/11/image-18.png" alt="" width="880" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pamela checks the logs and matches the time and possible error that the customer saw.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a cumbersome process, and root cause analysis doesn't target the customer’s actual experience. Pamela needs a better way to do things.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pamazon 2.0: Structured and persistent logs
&lt;/h3&gt;

&lt;p&gt;Luckily for Pamela, application logging has evolved a lot with logging formats, structured logging (JSON) and better error messages. There are tools integrating code performance style measurements into running applications to measure near real-time performance of pieces of code at various layers (networking, disk, CPU) of the application stack. Pamela can capture, visualize and store these signals for a longer time. Big data processing frameworks let her aggregate and crunch data from different application landscapes (languages, architectures, device platforms) and deployment environments.&lt;/p&gt;
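The structured (JSON) logging upgrade Pamela adopts can be sketched with Python's standard `logging` module alone. The field names below are illustrative, not a standard schema:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""
    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "timestamp": self.formatTime(record),
        }
        # Attach structured extras passed via `extra=...`
        if hasattr(record, "order_id"):
            payload["order_id"] = record.order_id
        return json.dumps(payload)

logger = logging.getLogger("pamazon.checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment failed", extra={"order_id": "A-1042"})
```

Because every record is one machine-readable object, downstream big-data tooling can filter and aggregate logs from heterogeneous services without per-format parsers.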

&lt;h3&gt;
  
  
  Pamazon 3.0: Tracing errors with logs
&lt;/h3&gt;

&lt;p&gt;Pamazon is becoming a big success. It has a cloud/hybrid deployment environment, dealing with millions of global users over multiple device types. This means Pamela needs a more nuanced approach for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generating valuable signals.&lt;/li&gt;
&lt;li&gt;Targeting likely root causes of issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Take the users Alice and Bob for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alice is in the USA using up her data connection to shop for fancy earrings on the Pamazon Android app.&lt;/li&gt;
&lt;li&gt;Bob is shopping for a black jacket from a Chilean research base on an Ubuntu PC.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7QQr5Y__--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/11/image-22.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7QQr5Y__--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/11/image-22.png" alt="" width="880" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Will Alice and Bob have similar experiences? All answers point to no. Delivering localized search results to many different devices (not to mention different languages) worldwide is a complex and daunting endeavor. &lt;strong&gt;It requires precise ways to generate signals and target likely root causes of issues.&lt;/strong&gt; More users and device types mean more and more ways for things to go wrong, and more root causes of issues.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RnpvhW_Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/11/image-32.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RnpvhW_Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/11/image-32.png" alt="Bob is starting to get angry at the constant errors. You won’t like him when he’s angry. Mostly because he mopes about it on Facebook ALL the time. Get out of my feed Bob!" width="880" height="503"&gt;&lt;/a&gt;Bob is starting to get angry at the constant errors. You won’t like him when he’s angry. Mostly because he mopes about it on Facebook ALL the time. Get out of my feed Bob!&lt;/p&gt;

&lt;p&gt;In short, Pamela needs to up her game again.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is OpenTelemetry?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://opentelemetry.io/"&gt;OpenTelemetry&lt;/a&gt; is a project incubated by &lt;a href="https://www.cncf.io/"&gt;CNCF&lt;/a&gt;. It brings distributed tracing and metrics. Pamela has brought us on to implement this telemetry into her Jina Flow. Before getting started we have to learn a few new terms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/Telemetry"&gt;&lt;strong&gt;Telemetry&lt;/strong&gt;&lt;/a&gt; is on-site collection of measurements or other data at remote points and automatically sending them to receiving equipment for monitoring. Simply put, you can consider a log message or raw observation value of the error count during a request as a measurement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instrumentation&lt;/strong&gt; is the process of using instruments that record/collect raw observations that are transformed into signals and transmitted for monitoring purposes.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/Tracing_(software)"&gt;&lt;strong&gt;Tracing&lt;/strong&gt;&lt;/a&gt; involves a specialized use of logging to record information about a program's execution.&lt;/li&gt;
&lt;/ul&gt;
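To make these terms concrete, here is a toy, stdlib-only sketch of instrumentation, not Jina's or OpenTelemetry's actual API: a decorator that records a raw measurement (duration and error count) for each call, which a telemetry pipeline would then transmit for monitoring:

```python
import functools
import time

measurements = []  # in a real system these would be exported, not kept in memory

def instrument(fn):
    """Record duration and success/failure for every call of `fn`."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = fn(*args, **kwargs)
            error = 0
        except Exception:
            error = 1
            raise
        finally:
            measurements.append({
                "operation": fn.__name__,
                "duration_s": time.perf_counter() - start,
                "errors": error,
            })
        return result
    return wrapper

@instrument
def search(query):
    return f"results for {query!r}"

search("black jacket")
print(measurements[-1]["operation"])  # search
```

The raw observations here are the measurements; the decorator is the instrument; shipping the `measurements` list somewhere for analysis would be the telemetry.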

&lt;p&gt;Bear these in mind, as Pamela has roped us into doing this.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to use OpenTelemetry integration in Jina?
&lt;/h2&gt;

&lt;p&gt;Jina &amp;gt;=3.12 comes with built-in OpenTelemetry integrations and features. Let's see how it works for Pamela. We now outline how to build a text-to-image search system using the following components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/docarray/docarray"&gt;DocArray&lt;/a&gt; to manipulate data and interact with the storage backend using &lt;a href="https://docarray.jina.ai/advanced/document-store/"&gt;document store&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.jina.ai/fundamentals/flow/#flow"&gt;Jina Flow&lt;/a&gt; to orchestrate microservices for data loading, encoding and storage.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.jina.ai/fundamentals/executor/#executor"&gt;Jina Executor&lt;/a&gt; as the base to implement microservices.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/open-telemetry/opentelemetry-collector-contrib"&gt;OpenTelemetry Collector Contrib&lt;/a&gt; to collect traces from the microservices.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/getsentry/self-hosted"&gt;Sentry&lt;/a&gt; to visualise operations reported by the microservices.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t2vNE5CV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/11/image-24.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t2vNE5CV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/11/image-24.png" alt="" width="880" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡In this post we’re just building out a backend, and not touching on a frontend. To build your own low-code backend+frontend neural search solution, check out &lt;a href="https://now.jina.ai/"&gt;Jina NOW&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Preparing the data
&lt;/h3&gt;

&lt;p&gt;We derived the dataset by pre-processing the &lt;a href="https://paperswithcode.com/dataset/deepfashion"&gt;deepfashion&lt;/a&gt; dataset using &lt;a href="https://github.com/jina-ai/finetuner"&gt;Finetuner&lt;/a&gt;. The image label generated by Finetuner is extracted and formatted to produce the &lt;code&gt;text&lt;/code&gt; attribute of each product.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1UJGQM1n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/11/image-25.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1UJGQM1n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/11/image-25.png" alt="" width="880" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Building a Flow with tracing enabled
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;💡You'll need Jina version ≥3.12.0 to use OpenTelemetry features.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We use a Jina Flow as a pipeline to connect our microservices (a.k.a. Executors) together. Since we don't care too much about the nitty-gritty details, we won't dive into &lt;em&gt;all&lt;/em&gt; the code, but just give a high-level overview. After all, the telemetry is the thing. Let's define the following in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;jina&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Flow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DocumentArray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Document&lt;/span&gt;

&lt;span class="n"&gt;flow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Flow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'localhost'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8080&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tracing&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;traces_exporter_host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'localhost'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;traces_exporter_port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4317&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'clip_encoder'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# encode images/text into vectors
&lt;/span&gt;        &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'localhost'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;51000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;timeout_ready&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;uses_with&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'name'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'ViT-B-32::openai'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;external&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'qdrant_indexer'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;uses&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'jinahub+docker://QdrantIndexer'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# store vectors and metadata on disk
&lt;/span&gt;        &lt;span class="n"&gt;uses_with&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s"&gt;'collection_name'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'collection_name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;'distance'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'cosine'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;'n_dim'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;💡This Flow uses pre-built Executors from Jina's &lt;a href="https://cloud.jina.ai/"&gt;Executor Hub&lt;/a&gt;, saving you time in writing code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1PU_By_v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/11/flow-2.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1PU_By_v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/11/flow-2.svg" alt="flow.plot('foo.svg') will give you this nice SVG image" width="724" height="120"&gt;&lt;/a&gt;flow.plot('foo.svg') will give you this nice SVG image&lt;/p&gt;

&lt;p&gt;The Flow is composed of:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;a href="https://docs.jina.ai/fundamentals/gateway/#gateway"&gt;Gateway&lt;/a&gt; that manages request flows to the underlying microservices. The tracing arguments are provided to the Flow, enabling OpenTelemetry tracing features for each deployed microservice.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://rebrand.ly/clip-as-service"&gt;CLIP-as-service&lt;/a&gt; Executor that uses the default cpu torch runtime with the &lt;code&gt;ViT-L-14-336::openai&lt;/code&gt; model for encoding text and/or image data. The &lt;code&gt;clip_encoder&lt;/code&gt; service is run using an independent Flow with the required tracing arguments.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cloud.jina.ai/executor/j3u4uwje?random=4939"&gt;QdrantIndexer&lt;/a&gt; is the backend for storing and searching the dataset using Docarray. You'll need to provide the appropriate Qdrant &lt;code&gt;host&lt;/code&gt; and &lt;code&gt;port&lt;/code&gt; parameters if the database isn't running on localhost and the default port. The vector dimension is configured using the &lt;code&gt;uses_with&lt;/code&gt; argument.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;💡For more information on building a Flow read our &lt;a href="https://docs.jina.ai/fundamentals/flow/"&gt;docs&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Indexing and searching
&lt;/h3&gt;

&lt;p&gt;Now that the components are in place, let's start adding data to our database, and then we can search for our favorite products using text. We can easily index our product images using the following code snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;jina&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DocumentArray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="n"&gt;da&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DocumentArray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'deepfashion-text-preprocessed'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;show_progress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# connect to the Flow
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"grpc://0.0.0.0:8080"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# use only 100 Documents for demo
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"/index"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A sample search request looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;from&lt;/span&gt; &lt;span class="n"&gt;jina&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;Client&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;DocumentArray&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Document&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="nl"&gt;grpc:&lt;/span&gt;&lt;span class="c1"&gt;//0.0.0.0:8080')&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;post&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;DocumentArray&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="n"&gt;jacket&lt;/span&gt; &lt;span class="n"&gt;mens&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="o"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;💡For more information on Jina Client, check our &lt;a href="https://docs.jina.ai/fundamentals/client/client/"&gt;docs&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let’s break down the index and search processes described above:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;clip_encoder&lt;/code&gt; generates an embedding for the &lt;code&gt;text&lt;/code&gt; attribute of each product. The &lt;code&gt;Flow(...).add(...).add(...)&lt;/code&gt; definition creates a sequential topology by default. Requests to the Flow first pass through the &lt;code&gt;clip_encoder&lt;/code&gt; service and then the results are passed to the &lt;code&gt;qdrant_indexer&lt;/code&gt; service.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;qdrant_indexer&lt;/code&gt; implements the &lt;code&gt;/index&lt;/code&gt; endpoint for index operations and the &lt;code&gt;/search&lt;/code&gt; endpoint for search operations. These operations work on embeddings, without regard to a product's other attributes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Enabling tracing on Executors
&lt;/h3&gt;

&lt;p&gt;Now is a good time to learn a few OpenTelemetry terms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/Tracing_(software)"&gt;Tracing&lt;/a&gt; involves a specialized use of logging to record information about a program's execution.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;trace&lt;/strong&gt; represents a series, parallel set, or combination of operations that were involved in producing the end result.&lt;/li&gt;
&lt;li&gt;Every trace is made up of one or more spans. Each span represents one operation in a trace, like &lt;code&gt;process_docs&lt;/code&gt;, &lt;code&gt;sanitize_text&lt;/code&gt;, or &lt;code&gt;embed_text&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
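As a toy illustration of how spans nest into a single trace (this mimics the shape of tracing APIs, but is not OpenTelemetry's real interface), consider:

```python
import contextlib
import time

trace = []  # collected spans for one request

@contextlib.contextmanager
def span(name, _depth=[0]):  # mutable default doubles as a simple nesting counter
    start = time.perf_counter()
    _depth[0] += 1
    try:
        yield
    finally:
        _depth[0] -= 1
        trace.append({
            "name": name,
            "depth": _depth[0],
            "duration_s": time.perf_counter() - start,
        })

# One trace for a document-processing request
with span("process_docs"):
    with span("sanitize_text"):
        pass
    with span("embed_text"):
        pass

print([s["name"] for s in trace])  # inner spans finish before the outer one
```

Each `with` block is one operation; the whole nested structure, outer span plus its children, is the trace for that request.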

&lt;p&gt;The Hub Executors in the above example have instrumentation integrated by default. Let’s look at a simple example of providing the span with useful tags for the &lt;code&gt;/index&lt;/code&gt; operation. The below code snippet from the &lt;code&gt;QdrantIndexer&lt;/code&gt; creates two spans:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Jina automatically creates the &lt;code&gt;/index&lt;/code&gt; span as part of the &lt;code&gt;@requests&lt;/code&gt; decorator. This is one of Jina's automatic instrumentation features, providing value out of the box.&lt;/li&gt;
&lt;li&gt;You can track more fine-grained operations such as the &lt;code&gt;qdrant_index&lt;/code&gt; span, which records the number of documents received in the request that must be indexed in Qdrant. Suspiciously quick indexing operations could be due to an empty request. On the flip side, very slow requests could be caused by too many large documents in the request. You can add more information to the span tags, such as the target &lt;a href="https://qdrant.tech/documentation/collections/"&gt;Qdrant Collection&lt;/a&gt; and other deployment-related information.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'/index'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DocumentArray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tracing_context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""Index new documents
    :param docs: the Documents to index
    """&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s"&gt;'qdrant_index'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tracing_context&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'len_docs'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that the &lt;code&gt;service.name&lt;/code&gt; attribute can be helpful if a single Qdrant cluster is used by different Flows to store different Documents.&lt;/p&gt;

&lt;h3&gt;
  
  
  Collecting and analyzing traces
&lt;/h3&gt;

&lt;p&gt;Right now, Jina only supports the export mechanism for pushing telemetry signals to external systems. It uses the &lt;a href="https://github.com/open-telemetry/opentelemetry-collector-contrib"&gt;OpenTelemetry Collector Contrib&lt;/a&gt; as the unified component for collecting telemetry signals before exporting them to downstream components that transform the data for visualization and analysis. The collector setup is very basic and functions only as the uniform intermediary for collecting and exporting data.&lt;/p&gt;
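As a rough sketch, a minimal collector configuration for this setup might look like the following: an OTLP gRPC receiver on port 4317 (matching the `traces_exporter_port` used earlier) feeding the `sentry` exporter that ships with the contrib distribution. The DSN is a placeholder you'd replace with your own Sentry project's value:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  sentry:
    dsn: https://<key>@<organization>.ingest.sentry.io/<project>

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [sentry]
```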

&lt;p&gt;We use the self-hosted &lt;a href="https://github.com/getsentry/self-hosted"&gt;Sentry&lt;/a&gt; application landscape to set up the actual APM or SPM. We'll only explore a small set of features supported by Sentry to preserve the focus of this post. Refer to the &lt;a href="https://sentry.io/features/distributed-tracing/"&gt;documentation&lt;/a&gt; for more details.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡Application Performance Monitoring (APM), or System Performance Monitoring (SPM), is the monitoring of the performance and availability of applications. Service Level Indicators (SLIs) are used to detect and diagnose complex application performance problems and maintain the expected Service Level Objectives (SLOs).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  How to use Sentry to visualize and analyze the collected data?
&lt;/h2&gt;

&lt;p&gt;Sentry has many features and custom definitions to translate telemetry signals into business terms. We'll just focus on the &lt;a href="https://docs.sentry.io/product/performance/"&gt;Performance&lt;/a&gt;, &lt;a href="https://docs.sentry.io/product/performance/transaction-summary/"&gt;Transaction Summary&lt;/a&gt;, &lt;a href="https://docs.sentry.io/product/sentry-basics/tracing/trace-view/"&gt;Trace&lt;/a&gt; and Dashboard views, using them to monitor the Flow and all Executors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance view
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;performance view&lt;/strong&gt; gives the overall view of metric signals received by Sentry:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T9DlVVu8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/11/image-27.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T9DlVVu8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/11/image-27.png" alt="Click on the image to read the explanatory labels." width="880" height="480"&gt;&lt;/a&gt;Click on the image to read the explanatory labels.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary view
&lt;/h3&gt;

&lt;p&gt;You can click a span to view the transaction summary page. In our case, we clicked the &lt;code&gt;/index&lt;/code&gt; span to bring up the &lt;code&gt;index&lt;/code&gt; &lt;strong&gt;summary:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--E1GcNTdc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/11/image-28.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--E1GcNTdc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/11/image-28.png" alt="Click on the image to read the explanatory labels." width="880" height="480"&gt;&lt;/a&gt;Click on the image to read the explanatory labels.&lt;/p&gt;

&lt;p&gt;In the above screenshot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The TRACE ID column shows the span’s parent trace.&lt;/li&gt;
&lt;li&gt;The &lt;em&gt;green section&lt;/em&gt; lists the spans that may be slow. By clicking the trace ID we can see the full trace of the operation.&lt;/li&gt;
&lt;li&gt;The &lt;em&gt;red section&lt;/em&gt; helps diagnose and drill down to the root cause of abnormalities, errors or issues produced during this operation. The duration and status of each span are used to detect suspects and display the tags that belong to the suspect spans. We can add useful tags as span attributes to make suspect spans and tags easier to detect.&lt;/li&gt;
&lt;li&gt;The &lt;em&gt;blue section&lt;/em&gt; shows the top tags in the selected time frame. This is a more general overview to gain quick insights into different attributes that you could use to drill down to the root cause of abnormalities.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Trace view
&lt;/h3&gt;

&lt;p&gt;You can click a span’s event ID in the summary view to see its &lt;strong&gt;&lt;em&gt;trace view&lt;/em&gt;&lt;/strong&gt;. Let’s look at the trace view of a single &lt;code&gt;/search&lt;/code&gt; operation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VDMx3qC5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/11/image-29.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VDMx3qC5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/11/image-29.png" alt="" width="880" height="233"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the figure above, the trace &lt;code&gt;f1adb01bcb9fd18f59a5b38745f07e39&lt;/code&gt; shows the end-to-end request flow with spans that were involved in generating the response. Span names are prefixed with &lt;code&gt;rpc&lt;/code&gt; or &lt;code&gt;default&lt;/code&gt; tags. These prefixes are determined by Sentry from the span tag attributes.&lt;/p&gt;

&lt;p&gt;In this screenshot, Jina automatically creates the &lt;code&gt;rpc&lt;/code&gt; spans, which represent the internal request flow. The &lt;code&gt;rpc&lt;/code&gt; spans fill in the gaps and provide a complete view on top of the user-added spans &lt;code&gt;encode&lt;/code&gt;, &lt;code&gt;inference&lt;/code&gt;, &lt;code&gt;txt_minibatch_encoding&lt;/code&gt;, &lt;code&gt;/search&lt;/code&gt; and &lt;code&gt;qdrant_search&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Before drilling down into the user-created spans, let’s look at the trace view in more depth. It is essentially a directed acyclic graph that connects the dots from the beginning of the request to its end (the bars run left to right). All of these operations are sequential, as shown by the top-down rendering of the colored bars. Each operation's duration is displayed, which is most useful when there are parallel operations or slow spans. On the left is a tree-style (top-down) representation of the high-level spans and the spans nested under each operation. There are three top-level spans because of the topology generated by the Flow.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The client request reaches the Gateway, and the Gateway triggers a call to the &lt;code&gt;clip_encoder&lt;/code&gt; Executor.&lt;/li&gt;
&lt;li&gt;The Executor creates an embedding of the search text and forwards it to the search endpoint of the &lt;code&gt;qdrant_indexer&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The search endpoint retrieves products for the search query.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;💡You can find more details and features such as displaying spans with errors in Sentry’s &lt;a href="https://docs.sentry.io/product/sentry-basics/tracing/trace-view/"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can click any span for more information. Let's click the &lt;code&gt;clip_encoder&lt;/code&gt;'s &lt;code&gt;inference&lt;/code&gt; span, which generates the embedding for the &lt;code&gt;text&lt;/code&gt; attribute of the Documents in the request:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L15xw1IV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/11/image-30.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L15xw1IV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/11/image-30.png" alt="" width="880" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;has_img_da&lt;/code&gt; and &lt;code&gt;has_txt_da&lt;/code&gt; span attributes show whether the Documents contain image and/or text data.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;minibatch_size&lt;/code&gt; shows the batch size used to generate embeddings using the configured thread pool.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since the demo dataset of 100 Documents provides only text data, we see that &lt;code&gt;has_txt_da&lt;/code&gt; is set to &lt;code&gt;True&lt;/code&gt; and that the next span contains only the &lt;code&gt;txt_minibatch_encoding&lt;/code&gt; span, in which the text embeddings are actually generated using the thread pool.&lt;/p&gt;
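&lt;p&gt;As a rough illustration of what these attribute values capture, one could compute them from the request batch like this (plain dicts stand in for Documents; the helper is hypothetical, not Jina's code):&lt;/p&gt;

```python
# Hypothetical sketch: derive the has_img_da / has_txt_da / minibatch_size
# span attributes from a batch of Documents, modeled here as plain dicts.

def span_attributes(docs, minibatch_size=32):
    return {
        "has_img_da": any(doc.get("image") is not None for doc in docs),
        "has_txt_da": any(doc.get("text") is not None for doc in docs),
        "minibatch_size": minibatch_size,
    }

# The demo dataset carries only text, so has_txt_da is True and has_img_da is False.
docs = [{"text": "red sneakers"}, {"text": "denim jacket"}]
print(span_attributes(docs))
```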

&lt;h3&gt;
  
  
  Customize your Sentry dashboard
&lt;/h3&gt;

&lt;p&gt;Lastly, let's look at a sample performance monitoring dashboard for our example Flow. The various telemetry signals are combined into a single view of the key indicators, from which we can spot discrepancies or abnormal behavior.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DkrCEpkC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/11/image-31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DkrCEpkC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/11/image-31.png" alt="Click on the image to read the explanatory labels." width="880" height="480"&gt;&lt;/a&gt;Click on the image to read the explanatory labels.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡Refer to the &lt;a href="https://docs.sentry.io/product/dashboards/"&gt;documentation&lt;/a&gt; for more details on how to customize the dashboard and the error analysis graphs provided by Sentry.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By using OpenTelemetry, we've helped Pamela build a reliable product search system for her shopping site. Pamela can now get reports and/or alerts for errors as they happen, and tracing issues becomes much easier, leading to a faster turnaround time for fixing them.&lt;/p&gt;

&lt;p&gt;She now has more time to focus on improving the website, increasing the assortment or adding new search features for her customers. This means happier customers all round, more business, and improved profits for Pamela.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/wlCWTqb3bvZkkt3Ins/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/wlCWTqb3bvZkkt3Ins/giphy.gif" alt="Yes, this really is Pamela flashing her OpenTelemetry-funded riches around. She’s tacky like that." width="480" height="270"&gt;&lt;/a&gt;Yes, this really is Pamela flashing her OpenTelemetry-funded riches around. She’s tacky like that.&lt;/p&gt;

&lt;p&gt;If you’re interested in making your Jina applications more observable and reliable with OpenTelemetry (and replicating Pamela's success), read our &lt;a href="https://docs.jina.ai/cloud-nativeness/opentelemetry/"&gt;docs&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>microservices</category>
      <category>cloudskills</category>
      <category>opentelemetry</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>This week(s) in DocArray</title>
      <dc:creator>guoliwu</dc:creator>
      <pubDate>Thu, 16 Feb 2023 08:55:25 +0000</pubDate>
      <link>https://dev.to/guoliwu/this-weeks-in-docarray-3dj9</link>
      <guid>https://dev.to/guoliwu/this-weeks-in-docarray-3dj9</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk86febag7yrsrm6b178v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk86febag7yrsrm6b178v.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡Thanks to the &lt;a href="https://www.docarray.org/" rel="noopener noreferrer"&gt;DocArray team&lt;/a&gt; for this guest blog post!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It's already been two weeks since the &lt;a href="https://github.com/docarray/docarray/releases/tag/2023.01.18.alpha" rel="noopener noreferrer"&gt;last alpha release&lt;/a&gt; of DocArray v2. A lot has happened since then: we've merged features we're really proud of, and we've cried tears of joy and misery trying to coerce Python into doing what we want. If you want to learn about interesting Python edge cases, or follow the development of DocArray v2, you’ve come to the right place!&lt;/p&gt;

&lt;p&gt;For those who don’t know, DocArray is a library for &lt;strong&gt;representing, sending, and storing multi-modal data&lt;/strong&gt;, with a focus on applications in &lt;strong&gt;ML&lt;/strong&gt; and &lt;strong&gt;Neural Search.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;👉 DocArray link: &lt;a href="https://rebrand.ly/devTo-docarray" rel="noopener noreferrer"&gt;https://rebrand.ly/devTo-docarray&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The project just moved to the &lt;a href="https://lfaidata.foundation/" rel="noopener noreferrer"&gt;Linux Foundation AI and Data&lt;/a&gt;, and to celebrate its first birthday we decided to rewrite it from scratch, mainly because of a design shift and a desire to solidify the codebase from the ground up. Also because it can’t eat cake and we had to give it &lt;em&gt;something&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So, what's been happening in the past two weeks?&lt;/p&gt;

&lt;h2&gt;
  
  
  Less verbose API
&lt;/h2&gt;

&lt;p&gt;One of DocArray's goals is to give our users powerful abstractions to represent nested data. To do this in v2 we allow nesting of &lt;code&gt;BaseDocument&lt;/code&gt;. (Well, this is actually just a feature of &lt;a href="https://docs.pydantic.dev/" rel="noopener noreferrer"&gt;pydantic&lt;/a&gt;, and one of the reasons its design seduced us into using it as a backend.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;docarray&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseDocument&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;docarray.documents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Text&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyBanner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseDocument&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Text&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyPoster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseDocument&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MyBanner&lt;/span&gt;
    &lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MyBanner&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a powerful design pattern, but the API is a bit too verbose when using our predefined Document class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;banner_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MyBanner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hello&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;myimage.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;banner_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MyBanner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bye bye&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;myimage2.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;poster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MyPoster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;banner_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;banner_2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The new API looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;banner_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MyBanner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hello&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;myimage.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;banner_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MyBanner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bye bye&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;myimage2.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;poster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MyPoster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;banner_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;banner_2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's waaay less verbose. Under the hood, we override pydantic's predefined document validator to allow this smart casting. But we didn't make it fully automatic: if you create your own Document, you still need to use the verbose API, because this casting isn't always obvious. For instance, look at this Document:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyDoc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseDocument&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
   &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
   &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MyDoc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hello&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# won't work
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case, where should &lt;code&gt;'hello'&lt;/code&gt; be assigned: title or description? There's no obvious way to decide, so we'd rather let the user define it, at least until we find a better way.&lt;/p&gt;

&lt;p&gt;We're thinking about either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Going by field order and making the first string field the “main” one. But this goes against one of the core values of this rewrite: “we don’t do things implicitly”.&lt;/li&gt;
&lt;li&gt;Allowing the user to mark a "main" field somehow, either with a Field object or a function.&lt;/li&gt;
&lt;/ul&gt;
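&lt;p&gt;To sketch what the second option might look like, here is a toy, plain-Python version of a "main" field marker. This is not DocArray's API, just an illustration of the design being discussed:&lt;/p&gt;

```python
# Toy illustration (NOT DocArray's API): a class attribute marks the "main"
# field, and a lone positional argument is routed to it.

class MainFieldDoc:
    main_field = "title"

    def __init__(self, *args, **kwargs):
        if args and not kwargs:
            # A single positional argument goes to the marked main field.
            kwargs = {self.main_field: args[0]}
        for name, value in kwargs.items():
            setattr(self, name, value)

class MyDoc(MainFieldDoc):
    main_field = "title"

doc = MyDoc("hello")  # unambiguous: 'hello' goes to the marked main field
print(doc.title)
```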

&lt;p&gt;From the outside, it looks like a minor problem. But we believe the real devil is in the details, so we spent countless hours arguing over such a simple API. Man, that's time we won't get back. 💁‍♂️&lt;/p&gt;

&lt;p&gt;Curious? Check out this PR:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;👉 DocArray PR: &lt;a href="https://rebrand.ly/docarray-PR" rel="noopener noreferrer"&gt;https://rebrand.ly/docarray-PR&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;__torch_function__&lt;/code&gt; , or: How to give PyTorch a little bit more confidence
&lt;/h2&gt;

&lt;p&gt;We had a lot of fun wrapping our heads around the &lt;code&gt;__torch_function__&lt;/code&gt; concept.&lt;/p&gt;

&lt;p&gt;Our &lt;code&gt;TorchTensor&lt;/code&gt; class is a subclass of &lt;code&gt;torch.Tensor&lt;/code&gt; that injects some useful functionality (mainly the ability to express its shape at the type level: &lt;code&gt;TorchTensor[3, 224, 224]&lt;/code&gt;, and protobuf serialization), and PyTorch comes with a whole machinery around subclassing, dynamic dispatch and all that jazz.&lt;/p&gt;

&lt;p&gt;One part of this machinery is &lt;code&gt;__torch_function__&lt;/code&gt;, a magic method that allows all kinds of objects to be treated like Torch Tensors. Want instances of your class to be processed by functions like &lt;code&gt;torch.stack([your_instance, another_instance])&lt;/code&gt;, or to be added directly to a &lt;code&gt;torch.Tensor&lt;/code&gt;? No problem: just implement &lt;code&gt;__torch_function__&lt;/code&gt; in your class, handle it there, and off you go! There's no need to even subclass &lt;code&gt;torch.Tensor&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyClass&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;others&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_others&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;others&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="nd"&gt;@classmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__torch_function__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;func&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stack&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;func&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# we know how to handle these!
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;combine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# ... but are clueless about the rest
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;NotImplemented&lt;/span&gt;

    &lt;span class="nd"&gt;@classmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;combine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;others&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;others&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;others&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nc"&gt;MyClass&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nc"&gt;MyClass&lt;/span&gt;&lt;span class="p"&gt;()]))&lt;/span&gt;
&lt;span class="c1"&gt;# outputs:
# &amp;lt;__main__.MyClass object at 0x7fd290c55190&amp;gt;
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nc"&gt;MyClass&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="c1"&gt;# outputs:
# &amp;lt;__main__.MyClass object at 0x7f363e2ed0d0&amp;gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, the example above isn’t a very useful one, but you get the idea: &lt;code&gt;__torch_function__&lt;/code&gt; lets you create objects that &lt;em&gt;behave&lt;/em&gt; like Torch Tensors without &lt;em&gt;being&lt;/em&gt; Torch Tensors.&lt;/p&gt;

&lt;p&gt;But hold on. Instances of &lt;code&gt;TorchTensor&lt;/code&gt; &lt;em&gt;are&lt;/em&gt; Torch Tensors, since they directly inherit from &lt;code&gt;torch.Tensor&lt;/code&gt;! So all the functionality is already there, we inherit &lt;code&gt;__torch_function__&lt;/code&gt; from &lt;code&gt;torch.Tensor&lt;/code&gt;, and we don’t need to care about any of this, right?&lt;/p&gt;

&lt;p&gt;Well, not quite.&lt;/p&gt;

&lt;p&gt;The thing is, we don’t just have one subclass of &lt;code&gt;torch.Tensor&lt;/code&gt;; we have many: &lt;code&gt;TorchTensor&lt;/code&gt; is the obvious one, but there's also &lt;code&gt;TorchTensor[3, 224, 224]&lt;/code&gt;, &lt;code&gt;TorchTensor[128]&lt;/code&gt; and &lt;code&gt;TorchTensor['batch', 'c', 'w', 'h']&lt;/code&gt;, etc. All of these are separate classes!&lt;/p&gt;

&lt;p&gt;To be a bit more precise, all the parameterized classes (the ones with &lt;code&gt;[...]&lt;/code&gt; at the end) are direct subclasses of &lt;code&gt;TorchTensor&lt;/code&gt; and are &lt;strong&gt;siblings of one another&lt;/strong&gt; (this becomes important later on).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;                                    torch.Tensor
                                         ^
                                         |
       ---------------------------&amp;gt; TorchTensor &amp;lt;------
      ^                   ^                            ^
      |                   |           ....             |
TorchTensor[128] TorchTensor[1, 128]  ....   TorchTensor['batch', 'c', 'w', 'h']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So where's the problem?&lt;/p&gt;

&lt;p&gt;The problem essentially lies in the &lt;code&gt;types&lt;/code&gt; argument to &lt;code&gt;__torch_function__&lt;/code&gt;. It contains the types of all the arguments that were passed to the original PyTorch function call. In the &lt;code&gt;stack&lt;/code&gt; example above, this would just be the tuple &lt;code&gt;(MyClass, MyClass)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is meant purely as a convenience for the implementer of &lt;code&gt;__torch_function__&lt;/code&gt;: it lets them quickly decide, based on the types, whether they can handle a given input.&lt;/p&gt;

&lt;p&gt;Let’s take a look at how the default PyTorch (&lt;code&gt;torch.Tensor&lt;/code&gt;) implementation of &lt;code&gt;__torch_function__&lt;/code&gt; makes that decision:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@classmethod&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__torch_function__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# ... some stuff here
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;issubclass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;NotImplemented&lt;/span&gt;
    &lt;span class="c1"&gt;# ... more stuff here
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Can you already guess where things go wrong?&lt;/p&gt;

&lt;p&gt;Let me give you a hint by showing a failure case:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TorchTensor&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;TorchTensor&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When this call is handled in &lt;code&gt;__torch_function__&lt;/code&gt; , as inherited from &lt;code&gt;torch.Tensor&lt;/code&gt;, &lt;code&gt;cls&lt;/code&gt; will be &lt;code&gt;TorchTensor[128]&lt;/code&gt; and &lt;code&gt;types&lt;/code&gt; will contain &lt;code&gt;TorchTensor[1, 128]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That makes sense: those are the two classes involved in this addition.&lt;/p&gt;

&lt;p&gt;But what will PyTorch do?&lt;/p&gt;

&lt;p&gt;It will throw up its hands and give up!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;TypeError: unsupported operand &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;s&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; +: &lt;span class="s1"&gt;'TorchTensor[128]'&lt;/span&gt; and &lt;span class="s1"&gt;'TorchTensor[1, 128]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;TorchTensor[128]&lt;/code&gt; is &lt;em&gt;not&lt;/em&gt; a subclass of &lt;code&gt;TorchTensor[1, 128]&lt;/code&gt;; they're siblings! So the subclass check above will fail and PyTorch will announce that it has &lt;em&gt;absolutely no clue&lt;/em&gt; about how to combine instances of these two classes.&lt;/p&gt;
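&lt;p&gt;To see why the check fails, here's a minimal plain-Python sketch (no torch involved; the class names are illustrative stand-ins, not DocArray's actual classes): two subclasses of a common base are siblings, so neither passes an &lt;code&gt;issubclass&lt;/code&gt; check against the other.&lt;/p&gt;

```python
# Plain-Python stand-ins for the parametrized tensor classes; the names
# are hypothetical, chosen only to mirror the situation in the text.
class Tensor:
    pass

class Tensor128(Tensor):       # stands in for TorchTensor[128]
    pass

class Tensor1x128(Tensor):     # stands in for TorchTensor[1, 128]
    pass

# This mirrors the subclass check in __torch_function__ above:
cls = Tensor128
types = (Tensor128, Tensor1x128)
print(all(issubclass(cls, t) for t in types))  # False: siblings, not subclasses
```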

&lt;p&gt;But c'mon PyTorch! Both these classes inherit from &lt;code&gt;torch.Tensor&lt;/code&gt;! Believe in yourself, you &lt;em&gt;do&lt;/em&gt; know how to deal with them! Just treat them like normal tensors!&lt;/p&gt;

&lt;p&gt;And that’s already the solution to the entire problem: We need to give PyTorch a little confidence boost, by telling it to treat our custom classes just like the &lt;code&gt;torch.Tensor&lt;/code&gt; class it already knows and loves.&lt;/p&gt;

&lt;p&gt;So how do we give it this metaphorical pep talk? It’s actually quite simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@classmethod&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__torch_function__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# this tells torch to treat all of our custom tensors just like
&lt;/span&gt;    &lt;span class="c1"&gt;# torch.Tensor's. Otherwise, torch will complain that it doesn't
&lt;/span&gt;    &lt;span class="c1"&gt;# know how to handle our custom tensor type.
&lt;/span&gt;    &lt;span class="n"&gt;docarray_torch_tensors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TorchTensor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;__subclasses__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;types_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;docarray_torch_tensors&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__torch_function__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;types_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the implementation of &lt;code&gt;__torch_function__&lt;/code&gt; that currently powers &lt;code&gt;TorchTensor&lt;/code&gt;. It does just one thing: for any class that's a subclass of &lt;code&gt;TorchTensor&lt;/code&gt;, it changes the &lt;code&gt;types&lt;/code&gt; argument before passing it along to the default implementation of &lt;code&gt;__torch_function__&lt;/code&gt;. It replaces all such types with &lt;code&gt;torch.Tensor&lt;/code&gt;, telling PyTorch that it's got this!&lt;/p&gt;
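&lt;p&gt;The substitution trick itself is independent of torch. Here's a hedged, torch-free sketch of the same idea (hypothetical names, not the real implementation): map every custom sibling type back to the common base before the dispatcher runs its subclass check.&lt;/p&gt;

```python
class Tensor:
    pass

class Tensor128(Tensor):       # illustrative stand-in for TorchTensor[128]
    pass

class Tensor1x128(Tensor):     # illustrative stand-in for TorchTensor[1, 128]
    pass

def normalize_types(types, custom_types, base=Tensor):
    # Replace every custom subclass with the common base class, mirroring
    # what the TorchTensor implementation does with torch.Tensor.
    return tuple(base if t in custom_types else t for t in types)

types_ = normalize_types((Tensor128, Tensor1x128),
                         custom_types={Tensor128, Tensor1x128})
print(all(issubclass(Tensor128, t) for t in types_))  # True: check now passes
```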

&lt;p&gt;Et voilà, it works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TorchTensor&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;TorchTensor&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# outputs:
# TorchTensor[128]([0.0454, 1.3724, ..., 1.3329, 0.9239,])
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This PR demonstrates how we coached PyTorch into having a little more self-esteem and being its truest, best self:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://github.com/docarray/docarray/pull/1037/files" rel="noopener noreferrer"&gt;https://github.com/docarray/docarray/pull/1037/files&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Early support for DocArray v2 in Jina
&lt;/h2&gt;

&lt;p&gt;Well, it's not exactly a new feature, but we've been working on early support for DocArray v2 in &lt;a href="https://github.com/jina-ai/jina/" rel="noopener noreferrer"&gt;Jina&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;DocArray’s relation to Jina is similar to pydantic’s relation to &lt;a href="https://fastapi.tiangolo.com/" rel="noopener noreferrer"&gt;FastAPI&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FastAPI is an HTTP framework that uses pydantic models to define the API schema.&lt;/li&gt;
&lt;li&gt;Jina is a gRPC/HTTP framework that uses DocArray Documents to define the API schema.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are other conceptual differences, of course, but this analogy is a useful lens for understanding the new changes in Jina. DocArray is in fact built on top of pydantic, adding a layer of multi-modal machine learning on top.&lt;/p&gt;

&lt;p&gt;Here's an example of the new interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;jina&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Executor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;docarray&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseDocument&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DocumentArray&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;docarray.documents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;docarray.typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AnyTensor&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;InputDoc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseDocument&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OutputDoc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseDocument&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AnyTensor&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyExec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Executor&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nd"&gt;@requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/bar&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DocumentArray&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;InputDoc&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DocumentArray&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;OutputDoc&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;docs_return&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DocumentArray&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;OutputDoc&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;
            &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;OutputDoc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;docs_return&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The main difference is that an Executor doesn't have to modify Documents in place; it can return a different Document type. In the example above, a toy encoder takes an image as input and returns an embedding. Similar to FastAPI, we infer the input and output schema of the Executor by inspecting the type hints of the method. You can also pass this information as an argument if you don’t want to rely on type hints.&lt;/p&gt;
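&lt;p&gt;To illustrate the idea (a stdlib-only sketch, not Jina's actual inference logic), the input and output document types can be recovered from a method's annotations with the &lt;code&gt;typing&lt;/code&gt; module; here &lt;code&gt;list[...]&lt;/code&gt; stands in for &lt;code&gt;DocumentArray[...]&lt;/code&gt;:&lt;/p&gt;

```python
import typing

class InputDoc:
    pass

class OutputDoc:
    pass

# A hypothetical endpoint whose annotations carry the schema.
def bar(docs: list[InputDoc]) -> list[OutputDoc]:
    return [OutputDoc() for _ in docs]

hints = typing.get_type_hints(bar)
input_doc_type = typing.get_args(hints['docs'])[0]    # inner type of the input
output_doc_type = typing.get_args(hints['return'])[0]  # inner type of the output
print(input_doc_type.__name__, output_doc_type.__name__)  # InputDoc OutputDoc
```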

&lt;blockquote&gt;
&lt;p&gt;💡&lt;a href="https://feat-docarray-v2--jina-docs.netlify.app/concepts/executor/docarray-v2/" rel="noopener noreferrer"&gt;Check the v2 docs for more information&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here's the PR:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://rebrand.ly/docarrayV2-PR" rel="noopener noreferrer"&gt;https://rebrand.ly/docarrayV2-PR&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pretty printing
&lt;/h2&gt;

&lt;p&gt;We ported back the pretty printing from DocArray v1 to v2 and tidied it up a bit to reflect the new v2 schema! Under the hood, we're relying on the awesome &lt;a href="https://github.com/Textualize/rich" rel="noopener noreferrer"&gt;rich&lt;/a&gt; library for everything related to UI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcfacyuffet4bt37awj3x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcfacyuffet4bt37awj3x.png" width="800" height="508"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fac1p8ufwk54x9pymzxn7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fac1p8ufwk54x9pymzxn7.png" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Check the PR for more info! &lt;br&gt;
👉&lt;a href="https://rebrand.ly/docarrayV2-Pretty-printing" rel="noopener noreferrer"&gt;https://rebrand.ly/docarrayV2-Pretty-printing&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Document Stores
&lt;/h2&gt;

&lt;p&gt;We’re currently completely rethinking &lt;a href="https://docs.docarray.org/advanced/document-store/" rel="noopener noreferrer"&gt;Document Stores&lt;/a&gt;. The main points are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every Document Store will have a &lt;strong&gt;schema&lt;/strong&gt; assigned, just like a DocumentArray, but with more (backend-dependent) options and configurations.&lt;/li&gt;
&lt;li&gt;First-class support for &lt;strong&gt;hybrid search and multi-vector search.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Support for &lt;strong&gt;search on nested Documents.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are curious about the &lt;strong&gt;full (preliminary) design&lt;/strong&gt; you can check it out in detail &lt;a href="https://lightning-scent-57a.notion.site/Document-Stores-v2-design-doc-f11d6fe6ecee43f49ef88e0f1bf80b7f" rel="noopener noreferrer"&gt;here&lt;/a&gt;. But here's a small taster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# define schema
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyDoc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseDocument&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ImageUrl&lt;/span&gt;
    &lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TorchTensor&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;da&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DocumentArray&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MyDoc&lt;/span&gt;&lt;span class="p"&gt;](...)&lt;/span&gt;  &lt;span class="c1"&gt;# data to index
&lt;/span&gt;
&lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DocumentStore&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MyDoc&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MyFavDB&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;

&lt;span class="c1"&gt;# index data
&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;da&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# search through query builder
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query_builder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# build complex (composite) query
&lt;/span&gt;    &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;(...),&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;(...),&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price &amp;lt; 200&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;jeans&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
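&lt;p&gt;One way to picture the query builder above is as a context manager that simply records clauses for the store to translate into a backend query later. Here's a hedged, stdlib-only sketch of that pattern (a hypothetical class, not the actual DocumentStore API):&lt;/p&gt;

```python
class QueryBuilder:
    """Collects query clauses; a store backend would translate them later."""

    def __init__(self):
        self.clauses = []

    def find(self, vector, field, weight=1.0):
        # Vector search clause against one field, optionally weighted.
        self.clauses.append(('find', field, weight))

    def text_search(self, text, field):
        # Full-text search clause against one field.
        self.clauses.append(('text_search', field, text))

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False  # don't swallow exceptions

with QueryBuilder() as q:
    q.find([0.1, 0.2], field='image', weight=0.3)
    q.text_search('jeans', field='title')

print([clause[0] for clause in q.clauses])  # ['find', 'text_search']
```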



&lt;p&gt;Beyond the first designs that are just now finding their way into actual code, we're happy to share that we're &lt;strong&gt;closely collaborating with &lt;a href="https://weaviate.io/" rel="noopener noreferrer"&gt;Weaviate&lt;/a&gt;&lt;/strong&gt; to make our Document Stores as good as they can be!&lt;/p&gt;

&lt;p&gt;So far they’ve provided a lot of valuable input for our designs, and we’re looking forward to the collaboration during actual implementation.&lt;/p&gt;

&lt;p&gt;Lastly, a word about &lt;strong&gt;Document Store launch plans&lt;/strong&gt;: Our current plan is to launch this reincarnation of Document Stores with &lt;strong&gt;three supported backends: Weaviate, &lt;a href="https://www.elastic.co/" rel="noopener noreferrer"&gt;Elasticsearch&lt;/a&gt;&lt;/strong&gt;, and one &lt;strong&gt;on-device vector search&lt;/strong&gt; library (which one? That's still TBD).&lt;/p&gt;

&lt;p&gt;Unfortunately our capacity doesn't allow for more on launch day, but if you (yes, &lt;em&gt;you&lt;/em&gt;!) want to &lt;strong&gt;help us&lt;/strong&gt; accelerate development for one of the other vector databases, we would absolutely love that and accelerate our timelines accordingly. If you feel intrigued, &lt;strong&gt;&lt;a href="https://discord.gg/WaMp6PVPgR" rel="noopener noreferrer"&gt;reach out to us on Discord&lt;/a&gt;&lt;/strong&gt;!&lt;/p&gt;

&lt;h2&gt;
  
  
  Author
&lt;/h2&gt;

&lt;p&gt;Johannes Messner, Alex C-G, Sami Jaghouar&lt;/p&gt;

&lt;h2&gt;
  
  
  Original Link
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://jina.ai/news/this-week-in-docarray-1/" rel="noopener noreferrer"&gt;https://jina.ai/news/this-week-in-docarray-1/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>gratitude</category>
    </item>
    <item>
      <title>How to Personalize Stable Diffusion for ALL the Things</title>
      <dc:creator>guoliwu</dc:creator>
      <pubDate>Wed, 15 Feb 2023 06:33:41 +0000</pubDate>
      <link>https://dev.to/guoliwu/how-to-personalize-stable-diffusion-for-all-the-things-25m3</link>
      <guid>https://dev.to/guoliwu/how-to-personalize-stable-diffusion-for-all-the-things-25m3</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08kojxzub8cvce2phzfb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08kojxzub8cvce2phzfb.png" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9o24c3hjkn917lp2mrw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9o24c3hjkn917lp2mrw.png" alt=" " width="800" height="266"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Jina AI is really into generative AI. It started out with &lt;a href="https://github.com/jina-ai/dalle-flow" rel="noopener noreferrer"&gt;DALL·E Flow&lt;/a&gt;, swiftly followed by &lt;a href="https://colab.research.google.com/github/jina-ai/discoart/blob/main/discoart.ipynb" rel="noopener noreferrer"&gt;DiscoArt&lt;/a&gt;. And then… &lt;strong&gt;🦗🦗&lt;/strong&gt; &lt;em&gt;&amp;lt;cricket sounds&amp;gt;&lt;/em&gt; &lt;strong&gt;🦗🦗&lt;/strong&gt;. At least for a while…&lt;/p&gt;

&lt;p&gt;That while has ended. We’re Back In the Game, baby. Big time, with our new BIG metamodel. You might be wondering: what’s so meta about it? Before, you needed multiple Stable Diffusion models; now, with BIG, you have one model for everything.&lt;/p&gt;

&lt;p&gt;BIG metamodel: &lt;a href="https://rebrand.ly/Big-MetaModel" rel="noopener noreferrer"&gt;https://rebrand.ly/Big-MetaModel&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;BIG stands for Back In (the) Game. If we ever get out of the game and get Back At the Game Again, we’ll have to go with BAGA. I hope we can afford the baseball caps.&lt;/p&gt;

&lt;p&gt;In short, BIG lets you fine-tune Stable Diffusion to the next level, letting you create images of multiple subjects and in any style you want. That means you can take a picture of you and a picture of your pooch and combine them into a composite image in the style of Picasso, Pixar or pop art.&lt;/p&gt;

&lt;p&gt;We created BIG by taking the &lt;a href="https://doi.org/10.48550/arXiv.2208.12242" rel="noopener noreferrer"&gt;DreamBooth paper&lt;/a&gt;, which allows fine-tuning with one subject, and leveling it up into a metamodel that can learn &lt;em&gt;multiple&lt;/em&gt; new objects without using up all your compute. In this blog post we’ll go over how we did that, and how well it works.&lt;/p&gt;

&lt;p&gt;But first, let’s take a quick look at how we got here, by starting off with Stable Diffusion itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stable Diffusion: Fine-tune to your favorite artist (but forget everyone else)
&lt;/h2&gt;

&lt;p&gt;In the beginning there was Stable Diffusion and it was good. “Create a Banksy picture” you would say, and verily a Banksy would be created. “Create an artwork in the style of Picasso” you would exclaim. And verily an image of a woman with too many angles would be created.&lt;/p&gt;

&lt;p&gt;“Generate an image in the style of &lt;a href="https://www.leonloewentraut.de/" rel="noopener noreferrer"&gt;Leon Löwentraut&lt;/a&gt;” you proclaim. And Stable Diffusion did say “uh, what? lol, I’ll give it my best.” And verily it was rubbish.&lt;/p&gt;

&lt;p&gt;Luckily, this can be fixed by fine-tuning (yeah, we’re dropping the Biblical speak). If you feed Stable Diffusion a Leon Löwentraut image it can learn his style (using, for example, &lt;a href="https://huggingface.co/docs/diffusers/training/text2image" rel="noopener noreferrer"&gt;text-to-image fine-tuning for Stable Diffusion&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzevp6npx56oto2um5j4c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzevp6npx56oto2um5j4c.png" alt="Left: Generated image by Stable Diffusionwith prompt  raw `a banksy painting` endraw , before fine-tuning; Right: Generated images for prompt  raw `a banksy painting` endraw , after fine-tuning to Löwentraut." width="800" height="388"&gt;&lt;/a&gt;Left: Generated image by Stable Diffusionwith prompt &lt;code&gt;a banksy painting&lt;/code&gt;, before fine-tuning; Right: Generated images for prompt &lt;code&gt;a banksy painting&lt;/code&gt;, after fine-tuning to Löwentraut.&lt;/p&gt;

&lt;p&gt;The only problem is it gets amnesia for everything else it’s learned before. So if you then try to create the style of Banksy or Picasso on your newly fine-tuned model they all turn out pretty Löwentrautian:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdvxsjfwyzfueokcc03p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdvxsjfwyzfueokcc03p.png" alt="Left: An actual painting from Leon Löwentraut, Right: Generated image for Leon Löwentraut." width="800" height="393"&gt;&lt;/a&gt;Left: An actual painting from Leon Löwentraut, Right: Generated image for Leon Löwentraut.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsctavyrx8zxkhuhdeg8i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsctavyrx8zxkhuhdeg8i.png" alt="Left: Generated image by Stable Diffusion with prompt  raw `a picasso painting` endraw , before fine-tuning; Right: Generated image for prompt  raw `a picasso painting` endraw , after fine-tuning to Löwentraut" width="800" height="396"&gt;&lt;/a&gt;Left: Generated image by Stable Diffusion with prompt &lt;code&gt;a picasso painting&lt;/code&gt;, before fine-tuning; Right: Generated image for prompt &lt;code&gt;a picasso painting&lt;/code&gt;, after fine-tuning to Löwentraut.&lt;/p&gt;

&lt;h2&gt;
  
  
  DreamBooth: Fine-tune to your favorite artist (and remember!)
&lt;/h2&gt;

&lt;p&gt;DreamBooth fixes that. At least to a point. You want to train it for your dog? Piece of cake. Fine-tune it for your favorite artist? A walk in the park. And Mona Lisa would still look like it came from Leonardo and Starry Night from Van Gogh.&lt;/p&gt;

&lt;p&gt;It does this by extending Stable Diffusion’s fine-tuning loss with a prior preservation loss, which trains the model to keep generating diverse images for the broader category of the new style (e.g. paintings) or object (e.g. dogs). The prior preservation loss is the mean squared error, computed in the latent space, between the images the model now generates for the category and the images the pre-trained model generated for it.&lt;/p&gt;
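&lt;p&gt;In pseudocode terms (a simplified sketch under our reading of the paper, not DreamBooth's actual training code), the objective combines a reconstruction term on the subject images with a weighted prior-preservation term, both mean squared errors over latents:&lt;/p&gt;

```python
def mse(a, b):
    # Mean squared error between two flat latent vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def dreambooth_loss(pred_subject, target_subject,
                    pred_prior, target_prior, prior_weight=1.0):
    # Reconstruction loss on the new subject, plus a prior-preservation
    # term that anchors the model to its pre-training outputs for the category.
    return (mse(pred_subject, target_subject)
            + prior_weight * mse(pred_prior, target_prior))

# A perfectly reconstructed subject still pays for drifting priors:
print(dreambooth_loss([1.0, 2.0], [1.0, 2.0], [0.0, 0.0], [2.0, 2.0]))  # 4.0
```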

&lt;p&gt;This fine-tuning involves two prompts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a &lt;code&gt;[CATEGORY]&lt;/code&gt;: The prompt for the prior preservation loss is the category of the style or object in question, like &lt;code&gt;dog&lt;/code&gt; or &lt;code&gt;painting&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;a &lt;code&gt;[RARE_IDENTIFIER] [CATEGORY]&lt;/code&gt;: The prompt for fine-tuning to a new object or style, generally a string that corresponds to a token the model is unfamiliar with. This is a unique reference to the object you want Stable Diffusion to learn. Example strings would be &lt;code&gt;sks&lt;/code&gt; or &lt;code&gt;btb&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
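&lt;p&gt;Putting the two prompts together (the identifiers below are just the examples from the text):&lt;/p&gt;

```python
category = 'dog'          # the [CATEGORY], used for prior preservation
rare_identifier = 'sks'   # the [RARE_IDENTIFIER], a token the model doesn't know

instance_prompt = f'a {rare_identifier} {category}'  # learns the new subject
class_prompt = f'a {category}'                       # keeps the category diverse

print(instance_prompt)  # a sks dog
print(class_prompt)     # a dog
```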

&lt;p&gt;So, to fine-tune Stable Diffusion to create images of your dog, you would:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Take 5-8 high quality images of your dog.&lt;/li&gt;
&lt;li&gt;Fine-tune the model to recreate these images for the prompt &lt;code&gt;a sks dog&lt;/code&gt; and at the same time still create diverse images for the prompt &lt;code&gt;a dog&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Creating diverse images is helped along the way by generating images for the prompt &lt;code&gt;a dog&lt;/code&gt; and using them as training images.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7o2gtxbb68qe8vwein3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7o2gtxbb68qe8vwein3.png" alt=" " width="800" height="627"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Unstable Confusion: The amnesia creeps back in
&lt;/h2&gt;

&lt;p&gt;So far, so good! But what if you first use DreamBooth to fine-tune Stable Diffusion on your dog, then train on Leon Löwentraut, &lt;em&gt;then&lt;/em&gt; ask it to create a picture of your dog in his style? Or train for &lt;code&gt;artist_1&lt;/code&gt;, then train for &lt;code&gt;artist_2&lt;/code&gt;, then try to create a new image by &lt;code&gt;artist_1&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;That shouldn’t be too hard, right?&lt;/p&gt;

&lt;p&gt;Too hard for DreamBooth unfortunately.&lt;/p&gt;

&lt;p&gt;DreamBooth falls over on this because it has a selective short-term memory. Using it to teach Stable Diffusion something new (let’s say the style of Löwentraut) works great. And then you can create images of all kinds of places and objects (already known to Stable Diffusion) in his style. But then if you decide to train it in the style of another artist it’ll forget everything it learned about Löwentraut’s style.&lt;/p&gt;

&lt;p&gt;That’s why we created BIG: To let you train on multiple objects and styles &lt;em&gt;without&lt;/em&gt; the amnesia. But more on that in a later section.&lt;/p&gt;

&lt;p&gt;To see DreamBooth’s amnesia in action, let’s use it to fine-tune a model for two different artists:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Leon Löwentraut (using the &lt;code&gt;RARE_IDENTIFIER&lt;/code&gt; of &lt;code&gt;lnl&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.instagram.com/vexx/" rel="noopener noreferrer"&gt;Vexx&lt;/a&gt; (using the &lt;code&gt;RARE_IDENTIFIER&lt;/code&gt; of &lt;code&gt;qzq&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsrdv06liselez8umumns.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsrdv06liselez8umumns.png" alt="Left: Another example of an actual Leon Löwentraut painting; Right: An image from Vexx.." width="800" height="408"&gt;&lt;/a&gt;Left: Another example of an actual Leon Löwentraut painting; Right: An image from Vexx.&lt;/p&gt;

&lt;p&gt;To generate a painting in one of the above styles, we’d use a prompt like &lt;code&gt;a lnl painting&lt;/code&gt; or &lt;code&gt;a qzq painting&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Using DreamBooth and the prompt &lt;code&gt;a lnl painting&lt;/code&gt; to fine-tune a model to fit the art style of Leon Löwentraut works great.  For this we used four training images and trained for 400 steps with a learning rate of &lt;code&gt;1e-6&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The left and center images show the model before fine-tuning. Note how it doesn’t know either &lt;code&gt;lnl&lt;/code&gt; or &lt;code&gt;loewentr&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhrttiqgp9fy1q6fojowa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhrttiqgp9fy1q6fojowa.png" alt="Left: Generated image by Stable Diffusion with prompt  raw `a qzq painting` endraw ; Center: Generated image by Stable Diffusion with prompt  raw `a vexx painting` endraw ; Right: Generated image for  raw `a qzq painting` endraw  after fine-tuning for Vexx." width="800" height="265"&gt;&lt;/a&gt;Left: Generated image by Stable Diffusion with prompt &lt;code&gt;a qzq painting&lt;/code&gt;; Center: Generated image by Stable Diffusion with prompt &lt;code&gt;a vexx painting&lt;/code&gt;; Right: Generated image for &lt;code&gt;a qzq painting&lt;/code&gt; after fine-tuning for Vexx.&lt;/p&gt;

&lt;p&gt;But can the model still produce images in the styles of Picasso, Matisse and Banksy? Yes!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13jjrztm2bim7g0s9zy4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13jjrztm2bim7g0s9zy4.png" alt="Left: Generated image for prompt  raw `a banksy painting` endraw  after fine-tuning for Vexx; Center: Generated image for prompt  raw `a mattisse painting` endraw  after fine-tuning for Vexx; Right: Generated image for prompt  raw `a picasso painting` endraw  after fine-tuning for Vexx." width="800" height="800"&gt;&lt;/a&gt;Left: Generated image for prompt &lt;code&gt;a banksy painting&lt;/code&gt; after fine-tuning for Vexx; Center: Generated image for prompt &lt;code&gt;a mattisse painting&lt;/code&gt; after fine-tuning for Vexx; Right: Generated image for prompt &lt;code&gt;a picasso painting&lt;/code&gt; after fine-tuning for Vexx.&lt;/p&gt;

&lt;p&gt;Now, after learning Vexx, does our model still remember Leon Löwentraut?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0vtgrgulikhf78e8xs8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0vtgrgulikhf78e8xs8.png" alt="Generated image for  raw `a lnl painting` endraw  after fine-tuning for Vexx." width="800" height="257"&gt;&lt;/a&gt;Generated image for &lt;code&gt;a lnl painting&lt;/code&gt; after fine-tuning for Vexx.&lt;/p&gt;

&lt;h2&gt;
  
  
  How can we cure DreamBooth’s amnesia?
&lt;/h2&gt;

&lt;p&gt;To solve the problem of forgetting Leon Löwentraut while learning Vexx, we included the images of Leon Löwentraut in the prior preservation loss during fine-tuning. This is equivalent to continuing to fine-tune on Löwentraut while fine-tuning on Vexx, though more weakly than the original Löwentraut fine-tuning. It works best to reuse the actual images of the style rather than the images the model generated.&lt;/p&gt;

&lt;p&gt;So, now we can generate all the artists we encounter in our travels. After teaching the Stable Diffusion model to create images in the style of Leon Löwentraut from the prompt &lt;code&gt;a lnl painting&lt;/code&gt;, we wanted to create images of our favourite mate tea bottle. So, again we used the Leon Löwentraut fine-tuned model as initialization and trained it to create images of &lt;a href="https://de.wikipedia.org/wiki/Mio_Mio_Mate" rel="noopener noreferrer"&gt;Mio Mio Mate&lt;/a&gt; for &lt;code&gt;a sks bottle&lt;/code&gt; (giving it a unique &lt;code&gt;RARE_IDENTIFIER&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmn4suef7au4s0ib0xk5.png%250A" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmn4suef7au4s0ib0xk5.png%250A" alt="Left: Picture taken of a mio mio mate bottle, used for fine-tuning; Center: Generated image by Stable Diffusion with prompt:  raw `a mio mio mate bottle` endraw ; Right: Generated image by Stable Diffusion with prompt:  raw `a sks bottle` endraw ." width="800" height="400"&gt;&lt;/a&gt;Left: Picture taken of a mio mio mate bottle, used for fine-tuning; Center: Generated image by Stable Diffusion with prompt: &lt;code&gt;a mio mio mate bottle&lt;/code&gt;; Right: Generated image by Stable Diffusion with prompt: &lt;code&gt;a sks bottle&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Again, this works for the new object, yet the model doesn’t quite remember how to produce images for Leon Löwentraut under &lt;code&gt;a lnl painting&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fms84pxnlx0jxa646eheb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fms84pxnlx0jxa646eheb.png" alt="Left: Generated image for  raw `a sks bottle` endraw  after fine-tuning for Mio Mio bottle; Right: Generated image for  raw `a lnl painting` endraw  after fine-tuning for Mio Mio bottle." width="800" height="400"&gt;&lt;/a&gt;Left: Generated image for &lt;code&gt;a sks bottle&lt;/code&gt; after fine-tuning for Mio Mio bottle; Right: Generated image for &lt;code&gt;a lnl painting&lt;/code&gt; after fine-tuning for Mio Mio bottle.&lt;/p&gt;

&lt;p&gt;To solve this issue, we can again use the previous images of Leon Löwentraut in the prior preservation loss. This helps the model remember his style. Yet, art styles with similar geometric features (like Picasso) are then not as accurately reproduced. This makes intuitive sense and is also the reason DreamBooth was introduced in the first place. Building on this, we need to incorporate not only the images of Leon Löwentraut into the prior preservation loss but also images of paintings in general, i.e., additionally include the previously learned objects/styles and their categories.&lt;/p&gt;

&lt;h2&gt;
  
  
  BIG Metamodel: Fine-tune Stable Diffusion to your favorite artist and dog
&lt;/h2&gt;

&lt;p&gt;Now, piecing together the above ideas raises the question: how do we allocate the images for the prior preservation loss, given a batch that always consists of &lt;code&gt;N&lt;/code&gt; instance images and &lt;code&gt;N&lt;/code&gt; prior preservation loss images? The following intuitive split works great:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use half of the prior preservation loss images for the current category and its previously learned instances

&lt;ul&gt;
&lt;li&gt;50% of those are generated images for the category&lt;/li&gt;
&lt;li&gt;Remaining 50% equally divided among previously used instance images&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Split the other half equally among the previously trained categories; for every such category, split its available images into:

&lt;ul&gt;
&lt;li&gt;50% generated images for category (prompt is e.g. &lt;code&gt;a painting&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Remaining 50% equally divided among previously used instance images&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
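&lt;p&gt;The allocation rule above can be sketched as follows (our own illustrative reading of the split, not the Executor’s actual code):&lt;/p&gt;

```python
# Allocate n prior-preservation slots: half to the current category, half
# split equally among the other previously trained categories; within each
# category's share, half are generated category images and half are divided
# among that category's previously learned instance images.

def split_prior_images(n, current_category, categories):
    """categories maps each category name to its previously learned instances.

    Returns {category: {'generated': slots, 'instances': slots}}.
    """
    others = [c for c in categories if c != current_category]
    shares = {current_category: n / 2}
    for c in others:
        shares[c] = n / 2 / len(others)
    return {c: {'generated': s / 2, 'instances': s / 2}
            for c, s in shares.items()}

# e.g. a metamodel that knows two bottles, two dogs and one painting,
# now learning another dog:
split = split_prior_images(
    16, 'dog',
    {'bottle': ['b1', 'b2'], 'dog': ['d1', 'd2'], 'painting': ['p1']},
)
```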

&lt;p&gt;To illustrate this, let’s assume we have a metamodel which has learnt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Two objects for category &lt;code&gt;bottle&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Two for &lt;code&gt;dog&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;One for &lt;code&gt;painting&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To learn another &lt;code&gt;dog&lt;/code&gt;, the top-level split between the categories and the splits within the individual categories are as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmll96533m736ihq1m43.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmll96533m736ihq1m43.png" alt=" " width="800" height="176"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Visualizing it as a pie chart, this is the split of all images for the prior preservation loss:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0jkithbp8xw0loosny5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0jkithbp8xw0loosny5.png" alt=" " width="800" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To abstract that logic away, we created an &lt;a href="https://docs.jina.ai/concepts/executor/" rel="noopener noreferrer"&gt;Executor&lt;/a&gt; to quickly fine-tune private (i.e. owned by a specific user) models for specific objects/styles, as well as create public and private metamodels. To do that it exposes the following &lt;a href="https://docs.jina.ai/concepts/executor/add-endpoints/" rel="noopener noreferrer"&gt;endpoints&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/finetune&lt;/code&gt; endpoint:

&lt;ul&gt;
&lt;li&gt;Fine-tunes a model for a particular style or object which is private (&lt;code&gt;private model&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Incrementally fine-tunes a model for various styles and objects which is only accessible by a particular user (&lt;code&gt;private metamodel&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Incrementally fine-tunes a model for various styles and objects which is accessible for everyone and to which everyone can contribute (&lt;code&gt;metamodel&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;/generate&lt;/code&gt; endpoint:

&lt;ul&gt;
&lt;li&gt;Generates images for any of above models as well as for a &lt;code&gt;pretrained&lt;/code&gt; model&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Hitchhiker's Guide to Building Your Own Metamodel
&lt;/h2&gt;

&lt;p&gt;So how do you train your metamodel?&lt;/p&gt;

&lt;p&gt;First, fit a private model to find the right learning rate and number of training steps. A low learning rate of &lt;code&gt;1e-6&lt;/code&gt; works best across different styles and objects. We found that starting from 200 training steps and slowly increasing towards 600 is the best way to find the sweet spot between fitting and overfitting for objects and styles. To recreate faces, we recommend starting from 600 training steps and increasing to 1,200.&lt;/p&gt;

&lt;p&gt;The second and final step is to reuse the same parameters for the request but change the &lt;code&gt;target_model&lt;/code&gt; to &lt;code&gt;meta&lt;/code&gt; or &lt;code&gt;private_meta&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now you have your (private) metamodel. In a script, fine-tuning is made very simple as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from jina import Client, DocumentArray 
import hubble 
# specify the path to the images 
path_to_instance_images = '/path/to/instance/images' 
# specify the category of the images, this could be e.g. 'painting', 'dog', 'bottle', etc. 
category = 'category of the objects' 
# 'private' for training private model from pretrained model, 'meta' for training metamodel 
target_model = 'private' 


# some custom parameters for the training 
max_train_steps = 300 
learning_rate = 1e-6 
docs = DocumentArray.from_files(f'{path_to_instance_images}/**') 

for doc in docs: 
    doc.load_uri_to_blob() 
    doc.uri = None client = 

Client(host='grpc://host_big_executor:port_big_executor') 

identifier\_doc = client.post( on='/finetune', inputs=docs, parameters={ 'jwt': { 'token': hubble.get\_token(), }, 'category': category, 'target\_model': target\_model, 'learning\_rate': learning\_rate, 'max\_train\_steps': max\_train\_steps, }, ) print(f'Finetuning was successful. The identifier for the object is "{identifier\_doc\[0\].text}"')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;With our new metamodel we taught Stable Diffusion to create images of the Mio Mio Mate tea bottle, a sparkling water bottle from a local manufacturer, a &lt;a href="https://nuphy.com/products/air75" rel="noopener noreferrer"&gt;NuPhy Air75&lt;/a&gt; keyboard, and an office desk chair, as well as artwork in the styles of Leon Löwentraut and Vexx:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnib4chn0kbt1q4eqlc8y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnib4chn0kbt1q4eqlc8y.png" alt=" " width="800" height="525"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note how for both bottles a hand appears holding them. Each bottle had six images used for fine-tuning, and for each only one of those images showed a hand holding the bottle. Yet, the model has somewhat collapsed to always showing this hand. This shows the importance of not only high-quality but also diverse, representative images. Here are the images of a hand holding the bottle that we used for fine-tuning:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F05kfl2cw3yy4b88uz28o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F05kfl2cw3yy4b88uz28o.png" alt="Left: Picture with hand used to fine-tune Stable Diffusion to sparkling water bottle; Right: Picture with hand used to fine-tune Stable Diffusion to mate bottle." width="800" height="397"&gt;&lt;/a&gt;Left: Picture with hand used to fine-tune Stable Diffusion to sparkling water bottle; Right: Picture with hand used to fine-tune Stable Diffusion to mate bottle.&lt;/p&gt;

&lt;p&gt;The model isn’t just able to memorize the objects, but also learns how newly-learned objects and styles interact:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxk4p477ayqsutbpk4l2h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxk4p477ayqsutbpk4l2h.png" alt="Left: Sparkling water bottle in the style of Vexx, generated with prompt:  raw `a qzq painting of a rvt bottle` endraw ; Right: Mate bottle in Leon Löwentraut's style, generated with prompt:  raw `a lnl painting of a pyr bottle` endraw ." width="800" height="391"&gt;&lt;/a&gt;Left: Sparkling water bottle in the style of Vexx, generated with prompt: &lt;code&gt;a qzq painting of a rvt bottle&lt;/code&gt;; Right: Mate bottle in Leon Löwentraut's style, generated with prompt: &lt;code&gt;a lnl painting of a pyr bottle&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvxdloe96076z9rc1a7p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvxdloe96076z9rc1a7p.png" alt="Left: NuPhy Air75 keyboard in style of Vexx, generated with prompt:  raw `a qzq painting of a sph keyboard` endraw ; Right: NuPhy Air75 keyboard in style of Leon Löwentraut, generated with prompt:  raw `a lnl painting of a sph keyboard` endraw ." width="800" height="392"&gt;&lt;/a&gt;Left: NuPhy Air75 keyboard in style of Vexx, generated with prompt: &lt;code&gt;a qzq painting of a sph keyboard&lt;/code&gt;; Right: NuPhy Air75 keyboard in style of Leon Löwentraut, generated with prompt: &lt;code&gt;a lnl painting of a sph keyboard&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Last but not least, we trained it to create images of Joschka's company dog, Briscoe:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3p87nlsupp5nic5mlx1h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3p87nlsupp5nic5mlx1h.png" alt="Left: A picture of Briscoe used for fine-tuning; Center: A generated image of Briscoe with prompt:  raw `a brc dog` endraw ; Right: A painting of Briscoe in the style of Leon Löwentraut generated with prompt:  raw `a lnl` endraw   raw `painting of a brc dog` endraw ." width="800" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s next?
&lt;/h2&gt;

&lt;p&gt;In the future, it would be interesting to see whether additionally applying &lt;a href="https://textual-inversion.github.io/" rel="noopener noreferrer"&gt;Textual Inversion&lt;/a&gt; to get better prompts for generating new images enhances performance. This might also change how previously-learned objects are forgotten.&lt;/p&gt;

&lt;p&gt;We could also explore other angles, like why previously learned objects and styles get overwritten, by understanding if the similarity in prompts is an issue or if semantic similarity of the new objects is a strong predictor of forgetting. The former can be solved by better sampling of the rare identifiers, using &lt;a href="https://arxiv.org/abs/2201.12086" rel="noopener noreferrer"&gt;BLIP&lt;/a&gt; to automatically generate captions, or adapting textual inversion to incorporate the forgetting effect.&lt;/p&gt;

&lt;p&gt;Another question is at what point the metamodel’s continual learning leads to overfitting to previously-learned objects and styles, since the model is continuously trained to minimize the loss on the images generated for them. Here it’s worth optimizing the allocation of images for the prior preservation loss in order to push the number of newly learnable objects even further.&lt;/p&gt;

&lt;p&gt;You can also start playing around with it yourself in &lt;a href="https://colab.research.google.com/github/jina-ai/big_creative_ai/blob/main/big_metamodel.ipynb" rel="noopener noreferrer"&gt;Google Colab&lt;/a&gt; or check out the &lt;a href="https://github.com/jina-ai/big_creative_ai" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Author
&lt;/h2&gt;

&lt;p&gt;Joschka Braun, Alex C-G&lt;/p&gt;

&lt;h2&gt;
  
  
  Original Link
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://jina.ai/news/how-to-personalize-stable-diffusion-for-all-the-things/" rel="noopener noreferrer"&gt;https://jina.ai/news/how-to-personalize-stable-diffusion-for-all-the-things/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>startup</category>
      <category>career</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Improving Search Quality for Non-English Queries with Fine-tuned Multilingual CLIP Models</title>
      <dc:creator>guoliwu</dc:creator>
      <pubDate>Fri, 10 Feb 2023 13:56:26 +0000</pubDate>
      <link>https://dev.to/guoliwu/improving-search-quality-for-non-english-queries-with-fine-tuned-multilingual-clip-models-2n9d</link>
      <guid>https://dev.to/guoliwu/improving-search-quality-for-non-english-queries-with-fine-tuned-multilingual-clip-models-2n9d</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ojZaCqIY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/12/Jina-AI-Website-Banners-Templates--63-.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ojZaCqIY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/12/Jina-AI-Website-Banners-Templates--63-.png" alt="" width="880" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since early 2021, &lt;a href="https://openai.com/blog/clip/"&gt;CLIP-style models&lt;/a&gt; have been the backbone of &lt;a href="https://jina.ai/news/what-is-multimodal-deep-learning-and-what-are-the-applications/"&gt;multimodal AI&lt;/a&gt;. They work by embedding inputs from more than one kind of media into a common high-dimensional vector space, using different models for different modalities. These different models are &lt;em&gt;co-trained&lt;/em&gt; with multimodal data. For CLIP-models, this means images with captions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--98en9v0M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/12/fashion-axes.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--98en9v0M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/12/fashion-axes.png" alt="" width="880" height="648"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A highly schematic representation of how CLIP embeddings make it possible to associate texts with images.&lt;/p&gt;

&lt;p&gt;The result? A pair of models that embed images and texts close to each other if the text is descriptive of the image, or the image contains things that match the text. So if we have a picture of a skirt and the word “Rock” (German for “skirt”), they would be close together, while the word “Hemd” (German for “shirt”) would be closer to a picture of a shirt.&lt;/p&gt;
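&lt;p&gt;“Close together” here just means high vector similarity, usually cosine similarity. A toy sketch with made-up 3-dimensional embeddings (real CLIP embeddings have hundreds of dimensions):&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy embeddings, purely illustrative:
skirt_image = [0.9, 0.1, 0.0]
rock_text = [0.8, 0.2, 0.1]   # "Rock" (German for "skirt")
hemd_text = [0.1, 0.9, 0.2]   # "Hemd" (German for "shirt")

# The skirt image should be closer to "Rock" than to "Hemd".
assert cosine_similarity(skirt_image, rock_text) > cosine_similarity(skirt_image, hemd_text)
```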

&lt;h2&gt;
  
  
  Towards multilingual CLIP
&lt;/h2&gt;

&lt;p&gt;However, CLIP text models have mostly been trained on English data, and that’s a big problem: The world is full of people who don’t speak English.&lt;/p&gt;

&lt;p&gt;Very recently, a few non-English and multilingual CLIP models have appeared, using various sources of training data. In this article, we’ll evaluate a multilingual CLIP model’s performance in a language other than English, and show how you can improve it even further using &lt;a href="https://github.com/jina-ai/finetuner"&gt;Jina AI’s Finetuner&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To make this happen, we’re collaborating with Toloka, a leading provider of data procurement services for machine learning, to create a dataset of images with high-quality German-language descriptions written by humans.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does multilingual CLIP work?
&lt;/h2&gt;

&lt;p&gt;Multilingual CLIP is any CLIP model trained with more than one language. So that could be English+French, German+English, or even Klingon+Elvish.&lt;/p&gt;

&lt;p&gt;We’re going to look at a model that &lt;a href="https://laion.ai/"&gt;LAION&lt;/a&gt; has trained with a broad multilingual dataset: the &lt;a href="https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k"&gt;&lt;code&gt;xlm-roberta-base-ViT-B-32&lt;/code&gt;&lt;/a&gt; CLIP model, which uses the &lt;a href="https://github.com/google-research/vision_transformer"&gt;&lt;code&gt;ViT-B/32&lt;/code&gt;&lt;/a&gt; image encoder and the &lt;a href="https://huggingface.co/xlm-roberta-large"&gt;&lt;code&gt;XLM-RoBERTa&lt;/code&gt;&lt;/a&gt; multilingual language model. Both of these are pre-trained:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ViT-B/32&lt;/code&gt;, using the &lt;a href="https://github.com/Alibaba-MIIL/ImageNet21K"&gt;ImageNet-21k&lt;/a&gt; dataset&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;XLM-RoBERTa&lt;/code&gt;, using a multi-terabyte dataset of text from the Common Crawl, containing over 100 languages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, from the outset, multilingual CLIP is different because it uses a multilingual text encoder, but can (and generally does) use the same image encoders as monolingual models.&lt;/p&gt;

&lt;p&gt;LAION then co-trained the two encoders with the multilingual &lt;a href="https://laion.ai/blog/laion-5b"&gt;&lt;code&gt;laion5b&lt;/code&gt;&lt;/a&gt; dataset, which contains 5.85 billion image-text pairs: 2.2 billion of these pairs are labelled in 100+ non-English languages, with the rest in English or containing text that can’t be nailed down to any one language (like place names or other proper nouns). These are taken from a sampling of images and their &lt;a href="https://www.w3schools.com/tags/att_img_alt.asp"&gt;HTML alt-text&lt;/a&gt; in the &lt;a href="https://commoncrawl.org/"&gt;Common Crawl web archive&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HctfOH8c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/12/alt-txt-image.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HctfOH8c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/12/alt-txt-image.png" alt="" width="880" height="876"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some browsers will let you see the alt-text if you move your mouse over an image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Q5m-d6r8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/12/Screenshot-2022-12-13-at-13.12.55-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Q5m-d6r8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/12/Screenshot-2022-12-13-at-13.12.55-1.png" alt="" width="858" height="144"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How an alt-text is encoded in HTML.&lt;/p&gt;

&lt;p&gt;This dataset isn’t balanced in the sense that no-one has tried to ensure that data for one language is comparable in size or scope to the data for any other. English still dominates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deep dive of the tokenizer inside multilingual models
&lt;/h3&gt;

&lt;p&gt;So, how is a multilingual text encoder different from a bog-standard monolingual one? One big difference is how it handles &lt;strong&gt;tokenization&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Text transformer models like &lt;code&gt;XLM-RoBERTa&lt;/code&gt; all start by tokenizing input texts — breaking them up into smaller parts — and replacing each part with an input vector constructed as part of the initial training. These input vectors are strung together and passed to the model to create an embedding vector.&lt;/p&gt;

&lt;p&gt;You might expect these smaller parts to match &lt;em&gt;words&lt;/em&gt;, and sometimes they do. But looking for words by just checking for spaces and punctuation doesn’t capture the fact that &lt;em&gt;call&lt;/em&gt;, &lt;em&gt;calls&lt;/em&gt;, &lt;em&gt;calling&lt;/em&gt;, and &lt;em&gt;called&lt;/em&gt; are not four totally different words, just like &lt;em&gt;small&lt;/em&gt;, &lt;em&gt;smaller&lt;/em&gt;, and &lt;em&gt;smallest&lt;/em&gt;, or &lt;em&gt;annoy&lt;/em&gt;, &lt;em&gt;annoyed&lt;/em&gt;, &lt;em&gt;annoyingly&lt;/em&gt;. In practice, this entire class of model uses, at least partly, a technique called &lt;em&gt;subword tokenization&lt;/em&gt;, which uses the statistical properties of sequences of characters to decide what units are the “right-size” for learning.&lt;/p&gt;

&lt;p&gt;It’s not really based in any linguistic theory, but doing it this way has many advantages for machine learning. Think of the suffix &lt;em&gt;-ed&lt;/em&gt; in English. You might expect that a “right-sized” statistical tokenizer would notice that many English words end in -ed, and break those words into two parts:&lt;/p&gt;

&lt;p&gt; &lt;code&gt;called → call -ed&lt;/code&gt;&lt;br&gt;&lt;br&gt;
 &lt;code&gt;asked  → ask  -ed&lt;/code&gt;&lt;br&gt;&lt;br&gt;
 &lt;code&gt;worked → work -ed&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;And this makes sense, &lt;em&gt;most&lt;/em&gt; of the time. But not always:&lt;/p&gt;

&lt;p&gt; &lt;code&gt;weed → we -ed&lt;/code&gt;&lt;br&gt;&lt;br&gt;
 &lt;code&gt;bed  → b  -ed&lt;/code&gt;&lt;br&gt;&lt;br&gt;
 &lt;code&gt;seed → se -ed&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Large language models are very robust, and they can learn that “weed” has a meaning different from “we” + “-ed”. Using this kind of tokenization, even new words that were never part of the pre-training data get a distinct representation for the model to learn.&lt;/p&gt;
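&lt;p&gt;The mis-splits above are easy to reproduce. Here is a toy sketch (our own illustration, not any real tokenizer) of a rule that blindly strips a trailing &lt;em&gt;-ed&lt;/em&gt;:&lt;/p&gt;

```python
def naive_ed_split(word):
    """Blindly split off a trailing '-ed'; no statistics involved."""
    if word.endswith("ed"):
        return [word[:-2], "-ed"]
    return [word]

# Reasonable for verbs:
for w in ["called", "asked", "worked"]:
    print(w, naive_ed_split(w))

# Wrong for words that merely happen to end in "ed":
for w in ["weed", "bed", "seed"]:
    print(w, naive_ed_split(w))
```

&lt;p&gt;Real subword tokenizers avoid committing to fixed rules like this by choosing splits statistically, which is exactly what lets a model learn that “weed” is not “we” + “-ed”.&lt;/p&gt;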

&lt;p&gt;Nonetheless, the more that the tokenization matches meaningful units of language, the faster and better the model learns.&lt;/p&gt;

&lt;p&gt;Let’s take a concrete example. The image below is from the data provided by Toloka, with the German caption “&lt;em&gt;Leichte Damenjacke Frühling Herbst braun&lt;/em&gt;” (“Light women's jacket spring autumn brown”):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0rWtWitc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/12/image-14.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0rWtWitc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/12/image-14.png" alt="“_Leichte Damenjacke Frühling Herbst braun”_" width="880" height="171"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we pass this German phrase to &lt;code&gt;XLM-RoBERTa&lt;/code&gt;’s tokenizer, we get a very different result from when we pass it to a comparable tokenizer used for an English-only model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;multilingual-CLIP: leicht|e|Damen|jack|e|Frühling|Herbst|bra|un

english-only-CLIP: le|ich|te|dam|en|jac|ke|fr|ü|h|ling|her|bst|braun
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tokens found by the multilingual tokenizer much more closely match our intuitions about meaningful units in German, while the English-only-trained tokenizer produces almost random chunks. Yes, it is still possible for a large language model to learn from badly tokenized data, if it’s consistent, but it will be slower and/or less accurate.&lt;/p&gt;
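&lt;p&gt;The contrast above can be mimicked with a greedy longest-match subword tokenizer over two toy vocabularies. Everything below is hypothetical: real models such as &lt;code&gt;XLM-RoBERTa&lt;/code&gt; learn their subword vocabularies statistically from training data rather than having them written by hand.&lt;/p&gt;

```python
def subword_tokenize(text, vocab):
    """Greedy longest-match subword tokenization: at each position,
    take the longest piece found in the vocabulary, falling back to
    single characters for unknown material."""
    text = text.lower().replace(" ", "")
    tokens, i = [], 0
    while i != len(text):  # i only ever advances, one piece at a time
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown single character
            i += 1
    return tokens

# Toy vocabulary that, like a multilingual model, has seen German subwords:
multi_vocab = {"leicht", "e", "damen", "jacke"}
# Toy vocabulary that, like an English-only model, has not:
en_vocab = {"le", "ich", "te", "dam", "en", "jac", "ke"}

print(subword_tokenize("Leichte Damenjacke", multi_vocab))
# ['leicht', 'e', 'damen', 'jacke']
print(subword_tokenize("Leichte Damenjacke", en_vocab))
# ['le', 'ich', 'te', 'dam', 'en', 'jac', 'ke']
```

&lt;p&gt;A vocabulary rich in German subwords yields the meaningful pieces; a vocabulary built from English text can only cobble the phrase together from short, nearly random fragments.&lt;/p&gt;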

&lt;p&gt;In contrast, the English equivalent — a word-for-word translation — is clearly better tokenized by the English-only tokenizer, but is not so badly tokenized by the multilingual one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;multilingual-CLIP: light|women|'|s|ja|cket|spring|a|utum|n|brown

english-only-CLIP: light|women|'s|jacket|spring|autumn|brown
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even at this first step in producing text embeddings, we can see how much difference a multilingual language model makes to a multilingual CLIP model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multilingual vs. monolingual CLIP on search quality
&lt;/h2&gt;

&lt;p&gt;Large language models are famously good at transfer learning. For example, if a monolingual English-only CLIP model has learned what “jacket” means, you can further train it, with very few additional examples, to know that the German word “Jacke” means the same thing. Then, it can carry all its knowledge about the English word “jacket” over to German.&lt;/p&gt;

&lt;p&gt;It is possible that a model already trained on English could be retrained for German with less data than it would take to train a new German model from scratch.&lt;/p&gt;

&lt;p&gt;Therefore, it’s worth asking: &lt;strong&gt;How much do we really gain using a model trained to be multilingual from the outset?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this article, we will use the German fashion dataset provided by Toloka to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Compare the zero-shot performance (i.e. out-of-the-box, without fine-tuning) of the multilingual CLIP model &lt;code&gt;xlm-roberta-base-ViT-B-32&lt;/code&gt; and the English-only equivalent &lt;code&gt;clip-vit-base-patch32&lt;/code&gt;. These two use the same image embedding model, but different text embedding models.&lt;/li&gt;
&lt;li&gt;Attempt to improve both models by using a part of the German dataset to fine-tune them.&lt;/li&gt;
&lt;li&gt;Compare the fine-tuned models using the same metrics, so we can both contrast non-fine-tuned and fine-tuned models, and contrast the English-only and multilingual models after adaptation to the German data.&lt;/li&gt;
&lt;li&gt;Show how much advantage, if any, is gained from a multilingual CLIP model.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Experiment Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The German Fashion12k dataset
&lt;/h3&gt;

&lt;p&gt;We have collaborated with Toloka to curate a 12,000 item dataset of fashion images drawn from e-commerce websites, to which human annotators have added descriptive captions in German. Toloka has made the data &lt;a href="https://github.com/Toloka/Fashion12K_german_queries"&gt;available to the public on GitHub&lt;/a&gt;, but you can also download it from Jina directly in DocArray format by following the instructions in the &lt;a href="https://jina.ai/news/improving-search-quality-non-english-queries-fine-tuned-multilingual-clip-models/#download-the-dataset-via-docarray"&gt;next section&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The images are a subset of the &lt;a href="https://github.com/xthan/fashion-200k"&gt;xthan/fashion-200k dataset&lt;/a&gt;, and we commissioned their human annotation via Toloka’s crowdsourcing platform. Annotation was done in two steps. First, Toloka passed the 12,000 images to annotators in its large international user community, who added descriptive captions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6clMH_2X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aem5r510qb03godhcwwu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6clMH_2X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aem5r510qb03godhcwwu.png" alt="The Toloka app showing an item of clothing to a user and asking for a description." width="880" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The app prompted users to write descriptions that follow a common pattern, partially enforced by a simple pattern matcher. Specifically:&lt;/p&gt;

&lt;p&gt;Write a search query that would find this product: type, your guess about the material, where it might be worn, color, texture, details. […]&lt;/p&gt;

&lt;p&gt;Requirements for the query:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;·&lt;/strong&gt; At least SIX words&lt;br&gt;&lt;br&gt;
&lt;strong&gt;·&lt;/strong&gt; Words that are separated ONLY by spaces (or ", ")&lt;br&gt;&lt;br&gt;
&lt;strong&gt;·&lt;/strong&gt; Do NOT use "this is/these are"&lt;/p&gt;
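&lt;p&gt;A check like this is simple to implement. The sketch below encodes only the three rules quoted above; the actual pattern matcher Toloka used is not public, so this is purely illustrative:&lt;/p&gt;

```python
import re

def is_valid_query(text):
    """Toy check for the three caption rules quoted above."""
    words = re.split(r", | ", text.strip())
    if len(words) in range(6):          # fewer than six words
        return False
    if any(w == "" or "," in w for w in words):
        return False                    # separators other than " " or ", "
    if re.search(r"\b(this is|these are)\b", text, re.IGNORECASE):
        return False                    # forbidden "this is/these are"
    return True

print(is_valid_query("Leichte Damenjacke Frühling Herbst braun warm"))  # True
print(is_valid_query("this is a brown jacket for autumn"))              # False
```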

&lt;p&gt;Then, in the second stage, other, randomly chosen users validated each description.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2HAaot35--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o58qlkkh8cd6rqcp7hqo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2HAaot35--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o58qlkkh8cd6rqcp7hqo.png" alt="Validation screen in the Toloka app. The app presents the user with a text description created by someone else and asks if it’s an appropriate description, inappropriate description, or if the image failed to load." width="880" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some examples from the resulting dataset:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aYyV7Ct---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ekkuugzaouvh2nit4ua6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aYyV7Ct---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ekkuugzaouvh2nit4ua6.png" alt="Image description" width="671" height="496"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of the 12,000 image-text pairs in the data from Toloka, we randomly selected 10,000 for training and held out the remaining 2,000 for evaluation. Because some items of clothing are very similar, there are a few duplicate descriptions. However, since there are 11,582 unique descriptions, we didn’t consider this an important factor in using the data.&lt;/p&gt;
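&lt;p&gt;The split and the duplicate check can be sketched in a few lines. The toy data here is hypothetical and merely stands in for the 12,000 real image-caption pairs:&lt;/p&gt;

```python
import random

def split_dataset(pairs, n_train=10_000, seed=42):
    """Shuffle reproducibly, then take the first n_train for training."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    return pairs[:n_train], pairs[n_train:]

def count_unique_captions(pairs):
    """How many distinct captions appear among (image, caption) pairs."""
    return len({caption for _image, caption in pairs})

# Hypothetical toy data: 10 images, only 4 distinct captions.
toy = [(f"img_{i}", f"caption {i % 4}") for i in range(10)]
train, test = split_dataset(toy, n_train=8)
print(len(train), len(test), count_unique_captions(toy))  # 8 2 4
```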
&lt;h3&gt;
  
  
  Download the dataset via DocArray
&lt;/h3&gt;

&lt;p&gt;The German Fashion12k dataset is available for free use by the Jina AI community. After logging into Jina AI Cloud, you can download it directly in &lt;a href="https://docarray.jina.ai/"&gt;DocArray&lt;/a&gt; format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;train_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DocumentArray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'DE-Fashion-Image-Text-Multimodal-train'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;show_progress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;eval_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DocumentArray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'DE-Fashion-Image-Text-Multimodal-test'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;show_progress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Load the multilingual CLIP model
&lt;/h3&gt;

&lt;p&gt;Because CLIP models are actually two different models that have been trained together, we have to load them as two models.&lt;/p&gt;

&lt;p&gt;In this article, we will use the &lt;a href="https://finetuner.jina.ai/"&gt;Finetuner interface&lt;/a&gt;. To use the &lt;code&gt;xlm-roberta-base-ViT-B-32&lt;/code&gt; CLIP model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;finetuner&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;build_model&lt;/span&gt;

&lt;span class="n"&gt;mCLIP_text_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;build_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'xlm-roberta-base-ViT-B-32::laion5b_s13b_b90k'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;select_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'clip-text'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;mCLIP_vision_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;build_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'xlm-roberta-base-ViT-B-32::laion5b_s13b_b90k'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;select_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'clip-vision'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For models supported directly by Jina AI, you can load them &lt;em&gt;by name&lt;/em&gt;, without having to directly deal with downloading or deserialization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Load the English CLIP model
&lt;/h3&gt;

&lt;p&gt;For comparison, you can access the English-only &lt;code&gt;ViT-B-32::openai&lt;/code&gt; CLIP model in the same way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;finetuner&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;build_model&lt;/span&gt;

&lt;span class="n"&gt;enCLIP_text_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;build_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'ViT-B-32::openai'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;select_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'clip-text'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;enCLIP_vision_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;build_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'ViT-B-32::openai'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;select_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'clip-vision'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Evaluate the zero-shot performance
&lt;/h2&gt;

&lt;p&gt;We measured the zero-shot performance of both the Multilingual CLIP model and the English-only one on the German Fashion dataset: that is, how well they perform as downloaded, without additional training, on the 2,000 items we held out for evaluation.&lt;/p&gt;

&lt;p&gt;We embedded the text descriptions in the evaluation data and used them to search for matches among the embedded images in the evaluation data, taking the top 20 matches for each text description. We then performed a number of standard statistical tests on the results, including &lt;a href="https://en.wikipedia.org/wiki/Mean_reciprocal_rank"&gt;Mean Reciprocal Rank&lt;/a&gt; (mRR), &lt;a href="https://stats.stackexchange.com/questions/127041/mean-average-precision-vs-mean-reciprocal-rank"&gt;Mean Average Precision&lt;/a&gt; (mAP), &lt;a href="https://en.wikipedia.org/wiki/Discounted_cumulative_gain"&gt;Discounted Cumulative Gain&lt;/a&gt; (DCG), and the share of queries that return the exact image whose description matches the query (labeled “&lt;strong&gt;&lt;em&gt;Hits&lt;/em&gt;&lt;/strong&gt;”).&lt;/p&gt;
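&lt;p&gt;For readers unfamiliar with these metrics, here is a self-contained sketch of how they can be computed over ranked result lists with binary relevance (our own simplified formulation, not the exact evaluation code we ran):&lt;/p&gt;

```python
import math

def reciprocal_rank(ranked_ids, relevant_id):
    """1/rank of the first relevant result, 0 if it never appears."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of 0/1 relevances."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def hits_at_k(results_per_query, k=20):
    """Share of queries whose correct image appears in the top k."""
    found = [relevant in ranked[:k] for ranked, relevant in results_per_query]
    return sum(found) / len(found)

print(reciprocal_rank(["img3", "img7", "img1"], "img7"))  # 0.5
print(dcg([1, 0, 1]))                                     # 1.5
print(hits_at_k([(["a", "b", "c"], "b"), (["d", "e", "f"], "z")], k=3))  # 0.5
```

&lt;p&gt;Averaging these per-query values over all 2,000 evaluation queries gives mRR, mean DCG, and the “Hits” share reported below.&lt;/p&gt;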

&lt;p&gt;The performance results are:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ISRorKdq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vemnr38e6tlrqvpoi2p0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ISRorKdq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vemnr38e6tlrqvpoi2p0.png" alt="Image description" width="631" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Not very surprisingly, the English CLIP model performed extremely poorly on German data.  Below are three examples from the evaluation set of queries in German, and the images it found to match:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3dblh-L3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zmq2jho4d29y82zm37qg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3dblh-L3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zmq2jho4d29y82zm37qg.png" alt="Image description" width="880" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even though German is a relatively small part of the multilingual model’s training set, that is clearly more than enough to make a ten-fold difference in performance on German queries, lifting the CLIP model’s value from basically none to mediocre.&lt;/p&gt;

&lt;h2&gt;
  
  
  Improve the search quality via fine-tuning
&lt;/h2&gt;

&lt;p&gt;One of the main insights of large-model neural-network engineering is that it’s easier to start with models that are trained on general-purpose data and then further train them on domain-specific data, than it is to train models on domain-specific data from scratch. This process is called “fine-tuning” and it can provide very significant performance improvements over using models like CLIP &lt;em&gt;as is&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Fine-tuning can be a tricky process, and gains are highly dependent on the domain and the dataset used for further training.&lt;/p&gt;

&lt;h3&gt;
  
  
  Specify hyperparameters
&lt;/h3&gt;

&lt;p&gt;Fine-tuning requires a selection of hyperparameters that require some understanding of deep learning processes, and a full discussion of hyperparameter selection is beyond the scope of this article. We used the following values, based on empirical practice working with CLIP models:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OqjO3viE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/btwpaw4ac3gun895kttx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OqjO3viE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/btwpaw4ac3gun895kttx.png" alt="Image description" width="299" height="194"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These hyperparameters are part of the command below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Specify the evaluation data
&lt;/h3&gt;

&lt;p&gt;We fine-tuned using the data split described previously:  10,000 items were used as training data, and 2,000 as evaluation data. In order to evaluate models at the end of each training epoch, we turned the evaluation data into a “query” and “index” dataset. The “query” data consists of the German text descriptions in the evaluation data, and the “index” data contains the images.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;eval_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
    &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'finetuner_label'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt;
    &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'finetuner_label'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt;

&lt;span class="n"&gt;query_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DocumentArray&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;eval_data&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;index_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DocumentArray&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;eval_data&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are also passed to the fine-tuning command.&lt;/p&gt;

&lt;h3&gt;
  
  
  Put everything together in one call
&lt;/h3&gt;

&lt;p&gt;Running the command below uploads the training and evaluation data and fine-tunes the &lt;code&gt;xlm-roberta-base-ViT-B-32&lt;/code&gt; model on &lt;a href="https://cloud.jina.ai/"&gt;Jina AI Cloud&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;finetuner&lt;/span&gt;

&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;finetuner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'xlm-roberta-base-ViT-B-32::laion5b_s13b_b90k'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;toloka_train_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;eval_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;toloka_eval_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;learning_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'CLIPLoss'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'cuda'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;EvaluationCallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;query_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;toloka_query_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;index_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;toloka_index_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'clip-text'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;index_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'clip-vision'&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;WandBLogger&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fine-tuning process may take considerable time, depending on the model and the amount of data. For this dataset and these models, it took roughly half an hour. Once fine-tuning is complete, we can compare the models’ performance at querying.&lt;/p&gt;

&lt;h2&gt;
  
  
  Qualitative study on fine-tuned models
&lt;/h2&gt;

&lt;p&gt;For example, here are the top four results for the query “&lt;em&gt;Spitzen-Midirock Teilfutter Schwarz&lt;/em&gt;” (”&lt;em&gt;Lace midi skirt partial lining black&lt;/em&gt;”):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--faphWqqJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0s8k8auul2par1rbd334.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--faphWqqJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0s8k8auul2par1rbd334.png" alt="Image description" width="880" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This kind of qualitative analysis gives us a sense for how fine-tuning improves the model’s performance. Before tuning, the model was able to return images of skirts that matched the description, but it also returned images of different items of clothing made of the same materials. It was insufficiently attentive to the most important part of the query.&lt;/p&gt;

&lt;p&gt;After fine-tuning, this query consistently returns skirts, and all four results match the description. That is not to say that every query returns only correct matches, but that on direct inspection we can see that it has a far better understanding of what the query is asking for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quantitative study on fine-tuned models
&lt;/h2&gt;

&lt;p&gt;To make more concrete comparisons, we need to evaluate our models more formally over a collection of test items. We did this by passing each model test queries drawn from the evaluation data. The model then returned a set of results, on which we performed the same standard statistical tests as in the zero-shot evaluation.&lt;/p&gt;

&lt;p&gt;Here are the results for the Multilingual CLIP model, using the same measure of the top 20 results of each query:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--US8PRR0---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lg9qzgv8rzgp1fzcqdkf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--US8PRR0---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lg9qzgv8rzgp1fzcqdkf.png" alt="Image description" width="532" height="271"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The results show that fine-tuning has a significant effect in improving results for Multilingual CLIP, although not a spectacular one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can English CLIP benefit from German data?
&lt;/h3&gt;

&lt;p&gt;We also decided to check whether the English-only CLIP model would improve if we fine-tuned it with German data: given a chance, it might catch up in performance with a pre-trained multilingual model. The results were interesting. We include the Multilingual CLIP results in this table for comparison:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KwOQUhWM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m5scy9wiibz1lr3j38et.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KwOQUhWM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m5scy9wiibz1lr3j38et.png" alt="Image description" width="730" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using German training data, we were able to bring a vast improvement to the English-only CLIP model, although not enough to bring it level with even the zero-shot Multilingual CLIP model. Mean average precision for the English-only model jumped 420%, compared to 31% for Multilingual CLIP, although the overall performance of the monolingual model was still much worse.&lt;/p&gt;
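&lt;p&gt;Percentages like these are simply relative changes in the metric. The values below are hypothetical, chosen only to illustrate the calculation:&lt;/p&gt;

```python
def pct_gain(before, after):
    """Relative improvement of a metric, in percent."""
    return 100.0 * (after - before) / before

# Hypothetical mAP values before and after fine-tuning:
print(round(pct_gain(0.05, 0.26)))  # 420
```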

&lt;h3&gt;
  
  
  Does more labeled data improve the search quality?
&lt;/h3&gt;

&lt;p&gt;We also ran multiple fine-tuning experiments with differing amounts of training data, on both the Multilingual and English-only CLIP models, to see how effective using more data was.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FPfthdcq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/12/Average-precision-for-Multilingual-and-English-CLIP--1-.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FPfthdcq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/12/Average-precision-for-Multilingual-and-English-CLIP--1-.svg" alt="" width="568" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In both cases, we see that most of the gain comes from the first few thousand items of training data, with gains coming more slowly after the initial fast learning. This confirms a conclusion Jina AI has already published.&lt;/p&gt;

&lt;p&gt;Read &lt;strong&gt;&lt;em&gt;&lt;a href="https://jina.ai/news/fine-tuning-with-low-budget-and-high-expectations/"&gt;Fine-tuning with Low Budget and High Expectations&lt;/a&gt;&lt;/em&gt;&lt;/strong&gt; for more discussion of the impressive results you can get with Finetuner and relatively little new training data.&lt;/p&gt;

&lt;p&gt;Adding more data may still improve results, but much more slowly. And in the case of fine-tuning the English-only CLIP model to handle German queries, we see the performance improvement level off at under 10,000 new items of data. It seems unlikely that we could ever train the English-only CLIP model to equal the Multilingual CLIP model on German data, at least not with these kinds of methods.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;What lessons can we take from all this?&lt;/p&gt;

&lt;h3&gt;
  
  
  Multilingual CLIP is the first choice for non-English queries
&lt;/h3&gt;

&lt;p&gt;The Multilingual CLIP model, trained from scratch with multilingual data, outperforms comparable English-only CLIP models by a very large margin on the German data we used. The same conclusion will likely apply to other non-English languages.&lt;/p&gt;

&lt;p&gt;Even in an unfair competition, where we fine-tuned the English model and vastly improved its performance on German data, the Multilingual CLIP model without further training outperformed it by a large margin.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fine-tuning improves search quality with little data
&lt;/h3&gt;

&lt;p&gt;We were shocked to see the English-only model improve its handling of German so much, and we could have gotten nearly the same result using half as much data. The basic assumptions that go into fine-tuning are clearly very robust if they can teach German to an English-only model with only a few thousand examples.&lt;/p&gt;

&lt;p&gt;On the other hand, we struggled to improve the performance of Multilingual CLIP, even with a fairly large quantity of high-quality, human-annotated training data. Although Finetuner makes a clear difference, you very rapidly reach the upper bound of how much you can improve a model that’s already pretty good.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6kYjgSF5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/12/chart--1-.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6kYjgSF5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://jina-ai-gmbh.ghost.io/content/images/2022/12/chart--1-.svg" alt="" width="600" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Trouble-free fine-tuning using Finetuner
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://finetuner.jina.ai/"&gt;Finetuner&lt;/a&gt; is easy enough to use that we could construct and perform all the experiments in this article in a few days. Although it does take some understanding of deep learning to make the best configuration choices, Finetuner greatly reduces the boring labor of running and paying attention to large-scale neural network models to mere parameter setting.&lt;/p&gt;

</description>
      <category>search</category>
      <category>finetune</category>
      <category>clip</category>
    </item>
    <item>
      <title>Want to Search Inside Videos Like a Pro? CLIP-as-service Can Help</title>
      <dc:creator>guoliwu</dc:creator>
      <pubDate>Thu, 09 Feb 2023 15:05:20 +0000</pubDate>
      <link>https://dev.to/guoliwu/want-to-search-inside-videos-like-a-pro-clip-as-service-can-help-lio</link>
      <guid>https://dev.to/guoliwu/want-to-search-inside-videos-like-a-pro-clip-as-service-can-help-lio</guid>
      <description>&lt;p&gt;Wouldn’t it be great if you could search through a video the way you search through a text?&lt;/p&gt;

&lt;p&gt;Imagine opening a digitized film, just hitting &lt;em&gt;ctrl-f&lt;/em&gt; and typing “Santa”, then getting all the parts of the video with Santa Claus in it. Or just going to the command line and using the &lt;code&gt;grep&lt;/code&gt; command:&lt;/p&gt;

&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;grep &lt;span class="hljs-string"&gt;"Santa Claus"&lt;/span&gt; Santa_Claus_conquers_The_Martians.mp4&lt;br&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Normally, this would be impossible, or only possible if you had already gone through the film and carefully labeled all the parts with a Santa in them. But with Jina AI and CLIP-as-service, you can create a &lt;strong&gt;video grep&lt;/strong&gt; command for MP4 files with just a few Python functions and a standard computer setup. There is no need for a GPU and no complex AI tech stack to install: just off-the-shelf, open-source Python libraries, with Jina AI Cloud doing all the heavy lifting.&lt;/p&gt;

&lt;p&gt;This has immediate applications for anyone who has video data: film archivists, stock image vendors, news photographers, or even regular people who just keep videos from their cellphones around and post them to social media.&lt;/p&gt;

&lt;h2&gt;
&lt;span&gt;&lt;/span&gt;&lt;span&gt;Preliminaries&lt;/span&gt;&lt;span&gt;&lt;/span&gt;
&lt;/h2&gt;

&lt;p&gt;You need Python 3, and you might want to create a new virtual environment before starting. Then, install a few components at the command line with &lt;code&gt;pip&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pip install clip_client &lt;span class="hljs-string"&gt;"docarray[full]&amp;gt;=0.20.0"&lt;/span&gt;&lt;br&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This installs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jina AI’s &lt;a href="https://docarray.jina.ai/" rel="noopener noreferrer"&gt;DocArray library&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Jina AI’s &lt;a href="https://clip-as-service.jina.ai/" rel="noopener noreferrer"&gt;CLIP-as-service&lt;/a&gt; client&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You'll also need an account for CLIP-as-service. If you don't already have one, there are &lt;a href="https://docs.jina.ai/jina-ai-cloud/login/" rel="noopener noreferrer"&gt;instructions in the Jina AI documentation&lt;/a&gt;. Once you have an account, you will need a token. You can get one from &lt;a href="https://cloud.jina.ai/settings/tokens" rel="noopener noreferrer"&gt;your token settings page at Jina AI Cloud&lt;/a&gt;, or at the command line:&lt;/p&gt;

&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&amp;gt; jina auth login ⤶&lt;br&gt;&lt;br&gt;Your browser is going to open the login page.&lt;br&gt;If this fails please open the following link: https://jina-ai.us.auth0.com/au....&lt;br&gt;🔐 Successfully logged &lt;span class="hljs-keyword"&gt;in&lt;/span&gt; to Jina AI as....&lt;br&gt;&lt;br&gt;&amp;gt; jina auth token create video_search ⤶&lt;br&gt;╭───────────── 🎉 New token created ─────────────╮&lt;br&gt;│ 54f0f0ef5d514ca1908698fc6d9555a5               │&lt;br&gt;│                                                │&lt;br&gt;│ You can &lt;span class="hljs-built_in"&gt;set&lt;/span&gt; it as an env var JINA_AUTH_TOKEN   │&lt;br&gt;╰────── ☝️  This token is only shown once! ───────╯&lt;br&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Your token should look something like this: &lt;code&gt;54f0f0ef5d514ca1908698fc6d9555a5&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Keep your token in an environment variable, or a Python variable if you're using a notebook. You will need it later.&lt;/p&gt;
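&lt;p&gt;For example, if you saved the token in the &lt;code&gt;JINA_AUTH_TOKEN&lt;/code&gt; environment variable suggested by the CLI output above, you can read it back in Python. This is a minimal sketch; the fallback placeholder is just the illustrative token from this article:&lt;/p&gt;

```python
import os

# Read the CLIP-as-service token saved earlier; fall back to the
# illustrative placeholder token from this article if the variable is unset.
jina_auth_token = os.environ.get("JINA_AUTH_TOKEN", "54f0f0ef5d514ca1908698fc6d9555a5")
print(jina_auth_token)
```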

&lt;h2&gt;
&lt;span&gt;&lt;/span&gt;&lt;span&gt;Getting the Data&lt;/span&gt;&lt;span&gt;&lt;/span&gt;
&lt;/h2&gt;

&lt;p&gt;Loading MP4 video takes just one line of code with DocArray:&lt;/p&gt;

&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="hljs-keyword"&gt;from&lt;/span&gt; docarray &lt;span class="hljs-keyword"&gt;import&lt;/span&gt; Document&lt;br&gt;&lt;br&gt;video_uri = &lt;span class="hljs-string"&gt;"https://archive.org/download/santa-clause-conquers-the-martians/Santa%20Clause%20Conquers%20The%20Martians.ia.mp4"&lt;/span&gt;&lt;br&gt;&lt;br&gt;video_data = Document(uri=video_uri).load_uri_to_video_tensor()&lt;br&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Loading the trailer to the 1964 film, &lt;em&gt;Santa Claus Conquers the Martians&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This downloads the trailer to the public domain film &lt;em&gt;&lt;a href="https://www.imdb.com/title/tt0058548/" rel="noopener noreferrer"&gt;Santa Claus Conquers the Martians&lt;/a&gt;&lt;/em&gt; from the &lt;a href="https://archive.org/details/santa-clause-conquers-the-martians" rel="noopener noreferrer"&gt;Internet Archive&lt;/a&gt;. You can substitute another URL or a file path to a local file to use your own video instead.&lt;/p&gt;

&lt;p&gt;The video itself is stored as a &lt;code&gt;numpy&lt;/code&gt; array in the &lt;code&gt;Document.tensor&lt;/code&gt; attribute of the object:&lt;/p&gt;

&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;print(video_data.tensor.shape)&lt;br&gt;&lt;br&gt;(&lt;span class="hljs-number"&gt;4264&lt;/span&gt;, &lt;span class="hljs-number"&gt;640&lt;/span&gt;, &lt;span class="hljs-number"&gt;464&lt;/span&gt;, &lt;span class="hljs-number"&gt;3&lt;/span&gt;)&lt;br&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The video has 4,264 frames (the first dimension of the tensor), each measuring 640 pixels by 464 pixels (the second and third dimensions), and each pixel has three color channels (conventional RGB in the fourth dimension).&lt;/p&gt;
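&lt;p&gt;To make those dimensions concrete, here is a minimal numpy sketch of unpacking the shape. The array here is a zero-filled stand-in for the real tensor, and the 24 frames-per-second playback rate is an assumption for illustration only; the actual rate depends on the file:&lt;/p&gt;

```python
import numpy as np

# Stand-in for video_data.tensor: (frames, height, width, RGB channels).
video_tensor = np.zeros((4264, 640, 464, 3), dtype=np.uint8)

n_frames, height, width, channels = video_tensor.shape
print(n_frames, height, width, channels)  # 4264 640 464 3

# Assuming 24 frames per second, the trailer runs roughly three minutes:
print(round(n_frames / 24), "seconds")  # 178 seconds
```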

&lt;p&gt;You can play the video in a notebook with the &lt;code&gt;Document.display()&lt;/code&gt; method:&lt;/p&gt;

&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;video_data.display()&lt;br&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://archive.org/download/santa-clause-conquers-the-martians/Santa%20Clause%20Conquers%20The%20Martians.ia.mp4" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffiles.mdnice.com%2Fuser%2F39412%2Fb88fca35-e102-4cd3-9291-858b9445062d.png" alt="click on the image to Watch the video" width="741" height="510"&gt;&lt;/a&gt;click on the image to Watch the video&lt;/p&gt;

&lt;p&gt;You can also use DocArray to view individual frames by their frame number. For example, frame #1400:&lt;/p&gt;

&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="hljs-keyword"&gt;import&lt;/span&gt; numpy&lt;br&gt;&lt;br&gt;Document(tensor=numpy.rot90(video_data.tensor[&lt;span class="hljs-number"&gt;1400&lt;/span&gt;], &lt;span class="hljs-number"&gt;-1&lt;/span&gt;)).display()&lt;br&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjina-ai-gmbh.ghost.io%2Fcontent%2Fimages%2F2022%2F12%2Fframe1400.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjina-ai-gmbh.ghost.io%2Fcontent%2Fimages%2F2022%2F12%2Fframe1400.png" alt="Frame 1400 of the trailer to _Santa Claus conquers the Martians_" width="640" height="464"&gt;&lt;/a&gt;Frame 1400 of the trailer to &lt;em&gt;Santa Claus conquers the Martians&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And we can extract clips from the video by giving specific start and end frame numbers:&lt;/p&gt;

&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;clip_data = video_data.tensor[&lt;span class="hljs-number"&gt;3000&lt;/span&gt;:&lt;span class="hljs-number"&gt;3300&lt;/span&gt;]&lt;br&gt;Document(tensor=clip_data).save_video_tensor_to_file(&lt;span class="hljs-string"&gt;"clip.mp4"&lt;/span&gt;)&lt;br&gt;Document(uri=&lt;span class="hljs-string"&gt;"clip.mp4"&lt;/span&gt;).display()&lt;br&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://jina-ai-gmbh.ghost.io/content/media/2022/12/clip-1.mp4" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjina-ai-gmbh.ghost.io%2Fcontent%2Fimages%2F2022%2F12%2Fmedia-thumbnail-ember331.jpg" alt="click the image, Watch the video" width="640" height="464"&gt;&lt;/a&gt;click the image, Watch the video&lt;/p&gt;

&lt;h2&gt;
&lt;span&gt;&lt;/span&gt;&lt;span&gt;Extracting Keyframes&lt;/span&gt;&lt;span&gt;&lt;/span&gt;
&lt;/h2&gt;

&lt;p&gt;The procedure we’re using to search in videos is to extract &lt;em&gt;keyframes&lt;/em&gt; and then perform our search on just those still images.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is a keyframe?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A keyframe is the first frame after a break in a video's smooth frame-to-frame transitions; this corresponds to a &lt;em&gt;cut&lt;/em&gt; in film editing. We can identify keyframes by going through the video frame by frame and comparing each frame to the next. If the difference between two consecutive frames exceeds some threshold, we take that to mean there was a cut, and the later frame is a keyframe.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjina-ai-gmbh.ghost.io%2Fcontent%2Fimages%2F2022%2F11%2FKeyframe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjina-ai-gmbh.ghost.io%2Fcontent%2Fimages%2F2022%2F11%2FKeyframe.png" alt="" width="800" height="174"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A visual illustration of keyframes.&lt;/p&gt;
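&lt;p&gt;DocArray handles this detection for you, as shown below, but the idea can be sketched in a few lines of numpy. This is an illustrative toy, not DocArray's actual implementation, and the threshold value is an assumption you would need to tune per video:&lt;/p&gt;

```python
import numpy as np

def find_keyframes(frames, threshold=30.0):
    """Return indices of frames that follow a cut.

    frames: array of shape (n_frames, height, width, channels).
    A frame counts as a keyframe when its mean absolute pixel
    difference from the previous frame exceeds the threshold.
    """
    keyframes = [0]  # the first frame always starts a shot
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(float) - frames[i - 1].astype(float)).mean()
        if diff > threshold:
            keyframes.append(i)
    return keyframes

# Toy video: 10 black frames, then 10 white frames, i.e. one cut at frame 10.
video = np.concatenate([
    np.zeros((10, 4, 4, 3), dtype=np.uint8),
    np.full((10, 4, 4, 3), 255, dtype=np.uint8),
])
print(find_keyframes(video))  # [0, 10]
```

&lt;p&gt;Real cut detectors are more sophisticated (comparing color histograms, for instance, to tolerate camera motion), but the frame-differencing idea is the same.&lt;/p&gt;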

&lt;p&gt;DocArray will automatically collect keyframes as it loads the video with the method &lt;code&gt;Document.load_uri_to_video_tensor()&lt;/code&gt; and store them in the &lt;code&gt;Document.tags&lt;/code&gt; dictionary under the key &lt;code&gt;keyframe_indices&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;print(video_data.tags[&lt;span class="hljs-string"&gt;'keyframe_indices'&lt;/span&gt;])&lt;br&gt;&lt;br&gt;[&lt;span class="hljs-number"&gt;0&lt;/span&gt;, &lt;span class="hljs-number"&gt;25&lt;/span&gt;, &lt;span class="hljs-number"&gt;196&lt;/span&gt;, &lt;span class="hljs-number"&gt;261&lt;/span&gt;, &lt;span class="hljs-number"&gt;325&lt;/span&gt;, &lt;span class="hljs-number"&gt;395&lt;/span&gt;, &lt;span class="hljs-number"&gt;478&lt;/span&gt;, &lt;span class="hljs-number"&gt;534&lt;/span&gt;, &lt;span class="hljs-number"&gt;695&lt;/span&gt;, &lt;span class="hljs-number"&gt;728&lt;/span&gt;, &lt;span class="hljs-number"&gt;840&lt;/span&gt;, &lt;span class="hljs-number"&gt;1019&lt;/span&gt;, &lt;span class="hljs-number"&gt;1059&lt;/span&gt;, &lt;span class="hljs-number"&gt;1131&lt;/span&gt;, &lt;span class="hljs-number"&gt;1191&lt;/span&gt;, &lt;span class="hljs-number"&gt;1245&lt;/span&gt;, &lt;span class="hljs-number"&gt;1340&lt;/span&gt;, &lt;span class="hljs-number"&gt;1389&lt;/span&gt;, &lt;span class="hljs-number"&gt;1505&lt;/span&gt;, &lt;span class="hljs-number"&gt;1573&lt;/span&gt;, &lt;span class="hljs-number"&gt;1631&lt;/span&gt;, &lt;span class="hljs-number"&gt;1674&lt;/span&gt;, &lt;span class="hljs-number"&gt;1750&lt;/span&gt;, &lt;span class="hljs-number"&gt;1869&lt;/span&gt;, &lt;span class="hljs-number"&gt;1910&lt;/span&gt;, &lt;span class="hljs-number"&gt;2010&lt;/span&gt;, &lt;span class="hljs-number"&gt;2105&lt;/span&gt;, &lt;span class="hljs-number"&gt;2184&lt;/span&gt;, &lt;span class="hljs-number"&gt;2248&lt;/span&gt;, &lt;span class="hljs-number"&gt;2335&lt;/span&gt;, &lt;span class="hljs-number"&gt;2585&lt;/span&gt;, &lt;span class="hljs-number"&gt;2618&lt;/span&gt;, &lt;span class="hljs-number"&gt;2648&lt;/span&gt;, &lt;span class="hljs-number"&gt;2706&lt;/span&gt;, &lt;span class="hljs-number"&gt;2756&lt;/span&gt;, &lt;span class="hljs-number"&gt;2788&lt;/span&gt;, &lt;span 
class="hljs-number"&gt;2906&lt;/span&gt;, &lt;span class="hljs-number"&gt;2950&lt;/span&gt;, &lt;span class="hljs-number"&gt;3050&lt;/span&gt;, &lt;span class="hljs-number"&gt;3100&lt;/span&gt;, &lt;span class="hljs-number"&gt;3128&lt;/span&gt;, &lt;span class="hljs-number"&gt;3190&lt;/span&gt;, &lt;span class="hljs-number"&gt;3216&lt;/span&gt;, &lt;span class="hljs-number"&gt;3314&lt;/span&gt;, &lt;span class="hljs-number"&gt;3356&lt;/span&gt;, &lt;span class="hljs-number"&gt;3421&lt;/span&gt;, &lt;span class="hljs-number"&gt;3514&lt;/span&gt;, &lt;span class="hljs-number"&gt;3586&lt;/span&gt;, &lt;span class="hljs-number"&gt;3760&lt;/span&gt;, &lt;span class="hljs-number"&gt;3828&lt;/span&gt;, &lt;span class="hljs-number"&gt;4078&lt;/span&gt;]&lt;br&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The frame numbers of the keyframes are stored in the &lt;code&gt;Document.tags&lt;/code&gt; dictionary under the key &lt;code&gt;keyframe_indices&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
&lt;span&gt;&lt;/span&gt;&lt;span&gt;Performing Search&lt;/span&gt;&lt;span&gt;&lt;/span&gt;
&lt;/h2&gt;

&lt;p&gt;First, extract all the keyframes as images, and put each one into its own &lt;code&gt;Document&lt;/code&gt;. Then compile all the frames into a &lt;code&gt;DocumentArray&lt;/code&gt; object:&lt;/p&gt;

&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="hljs-keyword"&gt;from&lt;/span&gt; docarray &lt;span class="hljs-keyword"&gt;import&lt;/span&gt; Document, DocumentArray&lt;br&gt;&lt;span class="hljs-keyword"&gt;from&lt;/span&gt; numpy &lt;span class="hljs-keyword"&gt;import&lt;/span&gt; rot90&lt;br&gt;&lt;br&gt;keyframe_indices = video_data.tags[&lt;span class="hljs-string"&gt;'keyframe_indices'&lt;/span&gt;]&lt;br&gt;keyframes = DocumentArray()&lt;br&gt;&lt;span class="hljs-keyword"&gt;for&lt;/span&gt; idx &lt;span class="hljs-keyword"&gt;in&lt;/span&gt; range(&lt;span class="hljs-number"&gt;0&lt;/span&gt;, len(keyframe_indices) - &lt;span class="hljs-number"&gt;1&lt;/span&gt;):&lt;br&gt; keyframe_number = keyframe_indices[idx]&lt;br&gt;    keyframe_tensor = rot90(video_data.tensor[keyframe_number], &lt;span class="hljs-number"&gt;-1&lt;/span&gt;)&lt;br&gt;    clip_indices = {&lt;br&gt;        &lt;span class="hljs-string"&gt;'start'&lt;/span&gt;: str(keyframe_number),&lt;br&gt;        &lt;span class="hljs-string"&gt;'end'&lt;/span&gt;: str(keyframe_indices[idx + &lt;span class="hljs-number"&gt;1&lt;/span&gt;]),&lt;br&gt;    }&lt;br&gt;    keyframe = Document(tags=clip_indices, tensor=keyframe_tensor)&lt;br&gt;    keyframes.append(keyframe)&lt;br&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The code above uses the &lt;code&gt;Document.tags&lt;/code&gt; dictionary to store the frame number (as &lt;code&gt;start&lt;/code&gt;) and the frame number of the next keyframe (as &lt;code&gt;end&lt;/code&gt;) so that we can extract a video clip corresponding to that keyframe.&lt;/p&gt;

&lt;p&gt;Then access CLIP-as-service, passing it the query text – in the example &lt;em&gt;"Santa Claus"&lt;/em&gt; – and the collection of keyframe images, and it will return the images ranked by how well they match the query text:&lt;/p&gt;

&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="hljs-keyword"&gt;from&lt;/span&gt; docarray &lt;span class="hljs-keyword"&gt;import&lt;/span&gt; Document, DocumentArray&lt;br&gt;&lt;span class="hljs-keyword"&gt;from&lt;/span&gt; clip_client &lt;span class="hljs-keyword"&gt;import&lt;/span&gt; Client&lt;br&gt;&lt;br&gt;server_url = &lt;span class="hljs-string"&gt;"grpcs://api.clip.jina.ai:2096"&lt;/span&gt;&lt;br&gt;&lt;br&gt;&lt;span class="hljs-comment"&gt;# substitute your own token in the line below!&lt;/span&gt;&lt;br&gt;jina_auth_token = &lt;span class="hljs-string"&gt;"54f0f0ef5d514ca1908698fc6d9555a5"&lt;/span&gt;&lt;br&gt;&lt;br&gt;client = Client(server_url,&lt;br&gt;                credential={&lt;span class="hljs-string"&gt;"Authorization"&lt;/span&gt;: jina_auth_token})&lt;br&gt;query = Document(text=&lt;span class="hljs-string"&gt;"Santa Claus"&lt;/span&gt;, matches=keyframes)&lt;br&gt;ranked_result = client.rank([query])[&lt;span class="hljs-number"&gt;0&lt;/span&gt;]&lt;br&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We transmit the query and the keyframe images to Jina AI Cloud, where CLIP-as-service calculates an embedding vector for the text query and for each keyframe. It then measures the distance between each keyframe vector and the text query vector, and returns the keyframes ordered by their proximity to the text query in the embedding space.&lt;/p&gt;

&lt;p&gt;You can see this represented in the figure below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjina-ai-gmbh.ghost.io%2Fcontent%2Fimages%2F2022%2F11%2Fsanta-and-martians-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjina-ai-gmbh.ghost.io%2Fcontent%2Fimages%2F2022%2F11%2Fsanta-and-martians-1.png" alt="" width="674" height="693"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CLIP translates texts and images into vectors in a common embedding space where the distance between them reflects their semantic similarity. The embedding vectors of images with Santa Claus are much closer to the vector for the text "Santa Claus" than other images.&lt;/p&gt;
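&lt;p&gt;The ranking step amounts to a nearest-neighbor search in that embedding space. Here is a minimal sketch with made-up three-dimensional vectors and cosine similarity as the metric; real CLIP embeddings have hundreds of dimensions, and the function name here is hypothetical:&lt;/p&gt;

```python
import numpy as np

def rank_by_similarity(query_vec, candidate_vecs):
    """Return candidate indices sorted best-match-first by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    scores = c @ q  # cosine similarity of each candidate to the query
    return [int(i) for i in np.argsort(-scores)]

# Toy 3-d "embeddings": candidate 2 points almost the same way as the query.
query = np.array([1.0, 0.0, 0.0])
candidates = np.array([
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [0.5, 0.5, 0.0],   # partial match
    [0.9, 0.1, 0.0],   # close match
])
print(rank_by_similarity(query, candidates))  # [2, 1, 0]
```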

&lt;p&gt;The query reorders the keyframe images in &lt;code&gt;Document.matches&lt;/code&gt; in order from the best match to the worst.&lt;/p&gt;

&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;print(ranked_result.matches[&lt;span class="hljs-number"&gt;0&lt;/span&gt;].tags)&lt;br&gt;&lt;br&gt;{&lt;span class="hljs-string"&gt;'start'&lt;/span&gt;: &lt;span class="hljs-string"&gt;'2105'&lt;/span&gt;, &lt;span class="hljs-string"&gt;'end'&lt;/span&gt;: &lt;span class="hljs-string"&gt;'2184'&lt;/span&gt;}&lt;br&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can see in the &lt;code&gt;Document.tags&lt;/code&gt; section that we've retained the information about this keyframe: It is frame #2105 and the next keyframe is at #2184. With this information, we can get the short video clip that this matches:&lt;/p&gt;

&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;match = ranked_result.matches[&lt;span class="hljs-number"&gt;0&lt;/span&gt;]&lt;br&gt;start_frame = int(match.tags[&lt;span class="hljs-string"&gt;'start'&lt;/span&gt;])&lt;br&gt;end_frame = int(match.tags[&lt;span class="hljs-string"&gt;'end'&lt;/span&gt;])&lt;br&gt;clip_data = video_data.tensor[start_frame:end_frame] &lt;br&gt;Document(tensor=clip_data).save_video_tensor_to_file(&lt;span class="hljs-string"&gt;"match.mp4"&lt;/span&gt;)&lt;br&gt;Document(uri=&lt;span class="hljs-string"&gt;"match.mp4"&lt;/span&gt;).display()&lt;br&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://jina-ai-gmbh.ghost.io/content/media/2022/12/match.mp4" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjina-ai-gmbh.ghost.io%2Fcontent%2Fimages%2F2022%2F12%2Fmedia-thumbnail-ember857.jpg" alt="click the image, Watch the video" width="640" height="464"&gt;&lt;/a&gt;click the image, Watch the video&lt;/p&gt;

&lt;p&gt;The top five clips all contain Santa Claus:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffiles.mdnice.com%2Fuser%2F39412%2F98c20fcb-c940-4dd7-9ba5-9a9eed591579.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffiles.mdnice.com%2Fuser%2F39412%2F98c20fcb-c940-4dd7-9ba5-9a9eed591579.png" alt="" width="771" height="164"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This quick and simple technique brings very sophisticated multimodal AI to users with only a standard computer setup. CLIP-as-service is a powerful tool for anyone who needs to search through large volumes of digital media to find what they’re looking for. It saves time and helps you get the most out of your digital media collection.&lt;/p&gt;

&lt;p&gt;So the next time you need to find Santa Claus, CLIP-as-service is here to help you look!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjina-ai-gmbh.ghost.io%2Fcontent%2Fimages%2F2022%2F11%2FDALL-E-2022-11-28-18.01.17---Santa-Claus-with-Martians--painted-in-the-style-of-Norman-Rockwell.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjina-ai-gmbh.ghost.io%2Fcontent%2Fimages%2F2022%2F11%2FDALL-E-2022-11-28-18.01.17---Santa-Claus-with-Martians--painted-in-the-style-of-Norman-Rockwell.png" alt="“Santa Claus with Martians, painted in the style of Norman Rockwell” according to DALL-E 2." width="800" height="873"&gt;&lt;/a&gt;“Santa Claus with Martians, painted in the style of Norman Rockwell” according to &lt;a href="https://openai.com/dall-e-2/" rel="noopener noreferrer"&gt;DALL-E 2&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
&lt;span&gt;&lt;/span&gt;&lt;span&gt;Author:&lt;/span&gt;&lt;span&gt;&lt;/span&gt;
&lt;/h3&gt;

&lt;p&gt;Jie Fu, Scott Martens&lt;/p&gt;

&lt;h3&gt;
&lt;span&gt;&lt;/span&gt;&lt;span&gt;Original link:&lt;/span&gt;&lt;span&gt;&lt;/span&gt;
&lt;/h3&gt;

&lt;p&gt;https://jina.ai/news/guide-using-opentelemetry-jina-monitoring-tracing-applications/&lt;/p&gt;

</description>
      <category>marketing</category>
    </item>
  </channel>
</rss>
