<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alessandro T.</title>
    <description>The latest articles on DEV Community by Alessandro T. (@trincadev).</description>
    <link>https://dev.to/trincadev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1407028%2F35ebea9b-3ec3-40b4-88eb-6b2041cd8814.jpeg</url>
      <title>DEV Community: Alessandro T.</title>
      <link>https://dev.to/trincadev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/trincadev"/>
    <language>en</language>
    <item>
      <title>My Ghost Writer and lite.koboldai.net, an overview</title>
      <dc:creator>Alessandro T.</dc:creator>
      <pubDate>Thu, 03 Jul 2025 20:01:14 +0000</pubDate>
      <link>https://dev.to/trincadev/my-ghost-writer-e-litekoboldainet-una-panoramica-4a8g</link>
      <guid>https://dev.to/trincadev/my-ghost-writer-e-litekoboldainet-una-panoramica-4a8g</guid>
      <description>&lt;h1&gt;
  
  
  Integrating My Ghost Writer with lite.koboldai.net: An In-Depth Technical Analysis
&lt;/h1&gt;

&lt;p&gt;Some time ago I started drafting a text. Partly out of professional curiosity, partly out of sheer boredom, I started thinking about which AI applications might be feasible, beyond the obvious text generation by prompting an LLM.&lt;/p&gt;

&lt;p&gt;In particular, I noticed that "small" &lt;a href="https://en.wikipedia.org/wiki/Large_language_model" rel="noopener noreferrer"&gt;LLMs (Large Language Models)&lt;/a&gt; tend to repeat themselves and to insert duplicate words. For this reason I looked for an open-source project I could run on my PC to spot duplicate words: I found nothing useful, or at least nothing that did what I wanted.&lt;/p&gt;

&lt;p&gt;This led to the creation of &lt;strong&gt;&lt;a href="https://github.com/trincadev/my_ghost_writer" rel="noopener noreferrer"&gt;My Ghost Writer&lt;/a&gt;&lt;/strong&gt;, an open-source project that I am now integrating into &lt;strong&gt;lite.koboldai.net&lt;/strong&gt; – a dependency-free web interface written in JS and HTML for &lt;a href="https://github.com/LostRuins/koboldcpp" rel="noopener noreferrer"&gt;KoboldCpp&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;a href="https://github.com/lostruins/lite.koboldai.net/" rel="noopener noreferrer"&gt;lite.koboldai.net&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/lostruins/lite.koboldai.net/" rel="noopener noreferrer"&gt;lite.koboldai.net&lt;/a&gt; è un'interfaccia web senza dipendenze progettata per l'uso come backend per &lt;a href="https://www.cloudflare.com/learning/ai/what-is-large-language-model/" rel="noopener noreferrer"&gt;modelli linguistici di grandi dimensioni (LLM)&lt;/a&gt; come KoboldCpp.&lt;br&gt;
Funziona interamente nel browser (non richiede installazione) ed è confezionata come un singolo file HTML statico:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple modes: Story mode, Chat mode, Instruction mode, and Adventure mode for different types of AI interaction.&lt;/li&gt;
&lt;li&gt;Broad compatibility: Works with KoboldAI Client, KoboldCpp, and AI Horde; supports both local and remote models.&lt;/li&gt;
&lt;li&gt;Creative tools: Includes a text editor, image generation via Stable Diffusion, and support for character sheets and scenarios.&lt;/li&gt;
&lt;li&gt;User-friendly: Easy to use, with customizable UI styles and features like auto-save, text-to-speech, and repeat/edit options.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's a good option if you want a lightweight, flexible interface for storytelling, role-playing, or AI-assisted writing.&lt;br&gt;
However, the code structure is a bit messy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A monolithic &lt;code&gt;index.html&lt;/code&gt; with over 26,000 lines of JS, CSS, and HTML.&lt;/li&gt;
&lt;li&gt;Only plain JS, no TypeScript of course.&lt;/li&gt;
&lt;li&gt;The embedded third-party JS code is outdated.&lt;/li&gt;
&lt;li&gt;E2E tests are missing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Problem with WordSearch in lite.koboldai.net
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;WordSearch&lt;/code&gt; (based on my &lt;a href="https://github.com/LostRuins/lite.koboldai.net/pull/115" rel="noopener noreferrer"&gt;first implementation&lt;/a&gt;) in lite.koboldai.net simply performs a text search to detect duplicates, which has significant limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It also matches irrelevant text fragments (e.g. the single letter "a", even where it appears inside other words).&lt;/li&gt;
&lt;li&gt;It cannot distinguish between semantically different words (e.g. "the" vs. "they").&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Solution: NLP Stemming with My Ghost Writer
&lt;/h2&gt;

&lt;p&gt;To solve this problem, I reimplemented the duplicate detection logic using &lt;strong&gt;NLP stemming&lt;/strong&gt; (via the &lt;a href="https://tartarus.org/martin/PorterStemmer/" rel="noopener noreferrer"&gt;Porter Stemming&lt;/a&gt; algorithm, already included in lite.koboldai.net), which reduces words to their &lt;strong&gt;root form&lt;/strong&gt; (e.g. "running" → "run"). This approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Groups &lt;strong&gt;semantically related words&lt;/strong&gt; (e.g. "run", "running", "ran").
&lt;/li&gt;
&lt;li&gt;Reduces false positives by focusing on &lt;strong&gt;real duplicates&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Supports both &lt;strong&gt;manual input&lt;/strong&gt; and &lt;strong&gt;file upload&lt;/strong&gt; for flexibility.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Current Features and Limitations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Main Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Duplicate word detection&lt;/strong&gt; via stemming.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thesaurus (work in progress)&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Powered by calls to &lt;a href="https://www.wordsapi.com/" rel="noopener noreferrer"&gt;WordsAPI&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Optional data persistence with a local &lt;strong&gt;MongoDB&lt;/strong&gt; database.
&lt;/li&gt;
&lt;li&gt;Limited to common terms ⚠️; it does not (yet) support proper nouns or multi-word expressions.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  Technologies Used
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Backend&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python 3.10+&lt;/strong&gt; with &lt;strong&gt;FastAPI&lt;/strong&gt; to run the web application.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structlog&lt;/strong&gt; for logging and error handling.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Poetry&lt;/strong&gt; for dependency management.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker&lt;/strong&gt; for containerization.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Frontend&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vanilla JavaScript&lt;/strong&gt; (no framework, because of the integration with lite.koboldai.net).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Playwright&lt;/strong&gt; for end-to-end (E2E) testing.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/trincadev/my_ghost_writer" rel="noopener noreferrer"&gt;Repository GitHub di My Ghost Writer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/lostruins/lite.koboldai.net" rel="noopener noreferrer"&gt;lite.koboldai.net&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.wordsapi.com/" rel="noopener noreferrer"&gt;WordsAPI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>nlp</category>
      <category>writing</category>
    </item>
    <item>
      <title>My Ghost Writer and lite.koboldai.net, an overview</title>
      <dc:creator>Alessandro T.</dc:creator>
      <pubDate>Thu, 03 Jul 2025 19:59:40 +0000</pubDate>
      <link>https://dev.to/trincadev/my-ghost-writer-and-litekoboldainet-an-overview-ol0</link>
      <guid>https://dev.to/trincadev/my-ghost-writer-and-litekoboldainet-an-overview-ol0</guid>
      <description>&lt;h1&gt;
  
  
  My Ghost Writer and lite.koboldai.net, an overview
&lt;/h1&gt;

&lt;p&gt;Some time ago I started drafting a text. Out of professional curiosity and sheer boredom, I wondered what kind of AI applications were feasible beyond the obvious text generation via prompts to LLMs.&lt;/p&gt;

&lt;p&gt;In particular, I noticed that smaller &lt;a href="https://en.wikipedia.org/wiki/Large_language_model" rel="noopener noreferrer"&gt;LLMs (Large Language Models)&lt;/a&gt; tend to repeat themselves and insert duplicate words. This led me to search for an open-source project I could run on my PC to identify duplicate words – but I found nothing useful, or at least nothing that did exactly what I wanted.&lt;/p&gt;

&lt;p&gt;This ultimately led to the creation of &lt;strong&gt;&lt;a href="https://github.com/trincadev/my_ghost_writer" rel="noopener noreferrer"&gt;My Ghost Writer&lt;/a&gt;&lt;/strong&gt;, an open-source project now being integrated into &lt;strong&gt;lite.koboldai.net&lt;/strong&gt; – a lightweight, dependency-free web interface for &lt;a href="https://github.com/LostRuins/koboldcpp" rel="noopener noreferrer"&gt;KoboldCpp&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://github.com/lostruins/lite.koboldai.net/" rel="noopener noreferrer"&gt;lite.koboldai.net&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/lostruins/lite.koboldai.net/" rel="noopener noreferrer"&gt;lite.koboldai.net&lt;/a&gt; is a dependency-free web interface designed as a backend for &lt;a href="https://www.cloudflare.com/learning/ai/what-is-large-language-model/" rel="noopener noreferrer"&gt;large language models (LLM)&lt;/a&gt; like KoboldCpp.&lt;br&gt;
It runs entirely in the browser (no installation required) and is packaged as a single static HTML file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multiple modes&lt;/strong&gt;: Story mode, Chat mode, Instruction mode, and Adventure mode for different types of AI interaction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broad compatibility&lt;/strong&gt;: Works with KoboldAI Client, KoboldCpp, and AI Horde; supports both local and remote models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Creative tools&lt;/strong&gt;: Includes a text editor, image generation via Stable Diffusion, and support for character sheets and scenarios.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User-friendly&lt;/strong&gt;: Easy to use, customizable UI styles, and features like auto-save, text-to-speech, and repeat/edit options.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's a great option if you want a lightweight, flexible interface for storytelling, role-playing, or AI-assisted writing.&lt;br&gt;
However, the code structure is somewhat messy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A monolithic &lt;code&gt;index.html&lt;/code&gt; with over 26,000 lines of JS, CSS, and HTML.&lt;/li&gt;
&lt;li&gt;Only vanilla JS, no TypeScript obviously.&lt;/li&gt;
&lt;li&gt;Outdated third-party JS code.&lt;/li&gt;
&lt;li&gt;Missing E2E tests.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Problem with WordSearch in lite.koboldai.net
&lt;/h2&gt;

&lt;p&gt;The initial version of &lt;code&gt;WordSearch&lt;/code&gt; (based on my &lt;a href="https://github.com/LostRuins/lite.koboldai.net/pull/115" rel="noopener noreferrer"&gt;first implementation&lt;/a&gt;) in lite.koboldai.net used simple text search to detect duplicates, but had significant limitations:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identified irrelevant text fragments (e.g., the single letter "a" even when embedded in other words).&lt;/li&gt;
&lt;li&gt;Couldn't distinguish between semantically different words (e.g., "the" vs. "they").&lt;/li&gt;
&lt;/ul&gt;
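&lt;p&gt;The contrast between the two behaviours can be illustrated with a few lines of Python (a standalone sketch, not the actual WordSearch code):&lt;/p&gt;

```python
import re

text = "They said the theme of the story was they themselves."

# Naive substring search: also matches "the" inside "they", "theme", "themselves"
naive_hits = text.lower().count("the")   # 6 matches

# Word-level search: extract whole words first, then compare exact tokens
words = re.findall(r"[a-z']+", text.lower())
word_hits = words.count("the")           # 2 matches

print(naive_hits, word_hits)  # → 6 2
```

&lt;p&gt;Splitting on word boundaries before comparing is what removes the single-letter and embedded-word false positives described above.&lt;/p&gt;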

&lt;h2&gt;
  
  
  The Solution: NLP Stemming with My Ghost Writer
&lt;/h2&gt;

&lt;p&gt;To solve this, I reimplemented the duplicate detection logic using &lt;strong&gt;NLP stemming&lt;/strong&gt; (via the &lt;a href="https://tartarus.org/martin/PorterStemmer/" rel="noopener noreferrer"&gt;Porter Stemming&lt;/a&gt; algorithm, already included in lite.koboldai.net), which reduces words to their &lt;strong&gt;root form&lt;/strong&gt; (e.g., "running" → "run"). This approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Groups &lt;strong&gt;semantically related words&lt;/strong&gt; (e.g., "run", "running", "ran").&lt;/li&gt;
&lt;li&gt;Reduces false positives by focusing on &lt;strong&gt;real duplicates&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Supports both &lt;strong&gt;manual input&lt;/strong&gt; and &lt;strong&gt;file upload&lt;/strong&gt; for flexibility.&lt;/li&gt;
&lt;/ul&gt;
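&lt;p&gt;As an illustration of the idea, here is a deliberately crude suffix-stripper – not the real Porter algorithm the project relies on – used to group repeated word forms under one stem:&lt;/p&gt;

```python
import re
from collections import defaultdict

def crude_stem(word):
    # Toy suffix stripping for illustration; a proper Porter stemmer handles
    # many more cases, and irregular forms like "ran" need lemmatization.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)]
            if len(word) > 2 and word[-1] == word[-2]:
                word = word[:-1]  # e.g. "running" -> "runn" -> "run"
            break
    return word

def duplicate_groups(text, min_count=2):
    # Map each stem to the character offsets where its word forms appear
    groups = defaultdict(list)
    for match in re.finditer(r"[a-z']+", text.lower()):
        groups[crude_stem(match.group())].append(match.start())
    return {stem: hits for stem, hits in groups.items() if len(hits) >= min_count}

text = "He was running fast. He runs every day, and running keeps him fit."
print(duplicate_groups(text))  # "running"/"runs"/"running" land under one stem
```

&lt;p&gt;Keeping the character offsets per stem is what lets a UI highlight every occurrence of a duplicated word family in the draft.&lt;/p&gt;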

&lt;h2&gt;
  
  
  Current Features and Limitations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Main Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Duplicate word detection&lt;/strong&gt; via stemming.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thesaurus (work in progress)&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Powered by calls to &lt;a href="https://www.wordsapi.com/" rel="noopener noreferrer"&gt;WordsAPI&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Optional data persistence with a local &lt;strong&gt;MongoDB&lt;/strong&gt; database.&lt;/li&gt;
&lt;li&gt;Limited to common terms ⚠️, doesn't support (for now) proper nouns or multi-word expressions.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Technologies Used
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Backend&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python 3.10+&lt;/strong&gt; with &lt;strong&gt;FastAPI&lt;/strong&gt; to run the webapp.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structlog&lt;/strong&gt; for logging and error handling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Poetry&lt;/strong&gt; for dependency management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker&lt;/strong&gt; for containerization.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Frontend&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vanilla JavaScript&lt;/strong&gt; (no framework due to integration with lite.koboldai.net).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Playwright&lt;/strong&gt; for end-to-end (E2E) testing.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/trincadev/my_ghost_writer" rel="noopener noreferrer"&gt;GitHub Repository for My Ghost Writer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/lostruins/lite.koboldai.net" rel="noopener noreferrer"&gt;lite.koboldai.net&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.wordsapi.com/" rel="noopener noreferrer"&gt;WordsAPI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>nlp</category>
      <category>ai</category>
      <category>writing</category>
    </item>
    <item>
      <title>AI Pronunciation Trainer</title>
      <dc:creator>Alessandro T.</dc:creator>
      <pubDate>Mon, 16 Dec 2024 20:33:20 +0000</pubDate>
      <link>https://dev.to/trincadev/ai-pronunciation-trainer-4ep6</link>
      <guid>https://dev.to/trincadev/ai-pronunciation-trainer-4ep6</guid>
      <description>&lt;p&gt;In questo articolo presento progetto a cui sto lavorando attualmente: &lt;a href="https://github.com/trincadev/ai-pronunciation-trainer" rel="noopener noreferrer"&gt;AI Pronunciation Trainer&lt;/a&gt; (online &lt;a href="https://huggingface.co/spaces/aletrn/ai-pronunciation-trainer" rel="noopener noreferrer"&gt;qui&lt;/a&gt;), uno strumento progettato per aiutarvi a migliorare la vostra pronuncia utilizzando la potenza dell'intelligenza artificiale. Questo progetto è un refactor dell'originale &lt;a href="https://github.com/Thiagohgl/ai-pronunciation-trainer" rel="noopener noreferrer"&gt;AI Pronunciation Trainer&lt;/a&gt; di &lt;a href="https://github.com/Thiagohgl" rel="noopener noreferrer"&gt;Thiagohgl&lt;/a&gt; a cui ho fatto diversi miglioramenti per rendere lo strumento più efficace e facile da usare.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it is and what it does
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/spaces/aletrn/ai-pronunciation-trainer" rel="noopener noreferrer"&gt;AI Pronunciation Trainer&lt;/a&gt; è uno strumento che utilizza l'intelligenza artificiale per valutare la vostra pronuncia e fornire feedback, aiutandovi a migliorare e a essere compresi più chiaramente. Utilizza i modelli &lt;a href="https://github.com/snakers4/silero-models" rel="noopener noreferrer"&gt;Silero STT / TTS&lt;/a&gt;, &lt;a href="https://openai.com/index/whisper/" rel="noopener noreferrer"&gt;openai whisper&lt;/a&gt; e &lt;a href="https://github.com/SYSTRAN/faster-whisper" rel="noopener noreferrer"&gt;faster whisper&lt;/a&gt; per le funzionalità di speech-to-text (Silero permette anche di fare text-to-speech), garantendo una valutazione della pronuncia accurata e affidabile.&lt;/p&gt;

&lt;h3&gt;
  
  
  Refactor: updated frontend and backend libraries
&lt;/h3&gt;

&lt;p&gt;About the backend:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://pytorch.org/" rel="noopener noreferrer"&gt;PyTorch&lt;/a&gt; is now at version 2.6.x&lt;/li&gt;
&lt;li&gt;Updated the Silero German Speech-to-Text model to fix a bug that prevented the use of PyTorch versions later than 1.13.x.&lt;/li&gt;
&lt;li&gt;Improved the Python backend tests using the &lt;a href="https://en.wikipedia.org/wiki/Mutation_testing" rel="noopener noreferrer"&gt;mutation testing&lt;/a&gt; suite &lt;a href="https://cosmic-ray.readthedocs.io" rel="noopener noreferrer"&gt;Cosmic Ray&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Fixed a &lt;a href="https://github.com/Thiagohgl/ai-pronunciation-trainer/issues/14" rel="noopener noreferrer"&gt;bug&lt;/a&gt; where &lt;a href="https://huggingface.co/docs/transformers/model_doc/whisper" rel="noopener noreferrer"&gt;whisper&lt;/a&gt; did not read the end timestamp of the last word in the recording correctly (in the end I solved it by using the &lt;a href="https://pypi.org/project/openai-whisper/" rel="noopener noreferrer"&gt;openai whisper pip package&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Added support for the &lt;a href="https://pypi.org/project/faster-whisper/" rel="noopener noreferrer"&gt;faster whisper pip package&lt;/a&gt;:

&lt;ul&gt;
&lt;li&gt;it avoids &lt;code&gt;None&lt;/code&gt; values on the &lt;code&gt;end_ts&lt;/code&gt; timestamp of the last word in the recording, unlike the whisper output produced with the HuggingFace pipeline&lt;/li&gt;
&lt;li&gt;it can detect long stretches of silence via &lt;a href="https://github.com/snakers4/silero-vad" rel="noopener noreferrer"&gt;silero-vad&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Furthermore, regarding the frontend:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Updated the JavaScript libraries to the latest versions of jQuery (3.7.1) and Bootstrap (5.3.3)&lt;/li&gt;
&lt;li&gt;New frontend based on &lt;a href="https://gradio.app" rel="noopener noreferrer"&gt;Gradio&lt;/a&gt; 5.x&lt;/li&gt;
&lt;li&gt;Added E2E tests with &lt;a href="https://playwright.dev" rel="noopener noreferrer"&gt;Playwright&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Added the ability to write, read and of course evaluate a custom sentence&lt;/li&gt;
&lt;li&gt;Guided tour for new users with &lt;a href="https://github.com/kamranahmedse/driver.js/" rel="noopener noreferrer"&gt;driver.js&lt;/a&gt; and &lt;a href="https://www.gradio.app/guides/custom-CSS-and-JS" rel="noopener noreferrer"&gt;custom css/javascript inside Gradio blocks&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Playback of individual words in the recording, followed by the 'ideal' pronunciation of the same word read by the Text-to-Speech engine&lt;/li&gt;
&lt;li&gt;Also added an in-browser Text-to-Speech feature (on Windows 11 it only works if the English and German language packs are installed)&lt;/li&gt;
&lt;li&gt;Custom webApp frontend - improved the CSS style on mobile devices&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Online version: the HuggingFace Space demo
&lt;/h3&gt;

&lt;p&gt;You can try my project online on my &lt;a href="https://huggingface.co/spaces/aletrn/ai-pronunciation-trainer" rel="noopener noreferrer"&gt;HuggingFace Space&lt;/a&gt;. This online demo lets you experience the tool's capabilities without any installation or configuration. The HuggingFace Space provides a convenient and accessible way to test AI Pronunciation Trainer and see how it can help you improve your pronunciation. Please be patient: sometimes it is a bit slow, or asleep if nobody has used it for a while (locally it is much faster, especially if you have a powerful computer). There is also an &lt;a href="https://aletrn-ai-pronunciation-trainer.hf.space" rel="noopener noreferrer"&gt;embedded version of the HuggingFace Space&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Work
&lt;/h2&gt;

&lt;p&gt;Although it works quite well, there is of course room for improvement. Here are some of the future enhancements I plan to implement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Get feedback from the author of the original project on my documentation and changes&lt;/li&gt;
&lt;li&gt;Ask the author of the original project for some explanations about the architectural and functional choices he made&lt;/li&gt;
&lt;li&gt;Evaluate switching from PyTorch to ONNX Runtime&lt;/li&gt;
&lt;li&gt;Add more E2E tests with Playwright&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I believe &lt;a href="https://huggingface.co/spaces/aletrn/ai-pronunciation-trainer" rel="noopener noreferrer"&gt;AI Pronunciation Trainer&lt;/a&gt; is a useful tool for anyone who wants to improve their pronunciation on their own. With the power of AI and the improvements made during the refactor, this tool provides accurate and reliable feedback to help you speak more clearly and confidently. I invite you to try the &lt;a href="https://huggingface.co/spaces/aletrn/ai-pronunciation-trainer" rel="noopener noreferrer"&gt;HuggingFace Space demo&lt;/a&gt; and see how this project can help you on your journey to better pronunciation.&lt;/p&gt;

</description>
      <category>python</category>
      <category>javascript</category>
      <category>pytorch</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>AI Pronunciation Trainer</title>
      <dc:creator>Alessandro T.</dc:creator>
      <pubDate>Mon, 16 Dec 2024 20:31:01 +0000</pubDate>
      <link>https://dev.to/trincadev/ai-pronunciation-trainer-3nbm</link>
      <guid>https://dev.to/trincadev/ai-pronunciation-trainer-3nbm</guid>
      <description>&lt;h1&gt;
  
  
  AI Pronunciation Trainer
&lt;/h1&gt;

&lt;p&gt;In this article, I present the project I am working on: &lt;a href="https://github.com/trincadev/ai-pronunciation-trainer" rel="noopener noreferrer"&gt;AI Pronunciation Trainer&lt;/a&gt; (online &lt;a href="https://huggingface.co/spaces/aletrn/ai-pronunciation-trainer" rel="noopener noreferrer"&gt;here&lt;/a&gt;), a tool designed to help you improve your pronunciation using the power of artificial intelligence. This project is a refactor of the original &lt;a href="https://github.com/Thiagohgl/ai-pronunciation-trainer" rel="noopener noreferrer"&gt;AI Pronunciation Trainer&lt;/a&gt; by &lt;a href="https://github.com/Thiagohgl" rel="noopener noreferrer"&gt;Thiagohgl&lt;/a&gt; to which I have made several improvements to make the tool more effective and easier to use.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it is and what it does
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/spaces/aletrn/ai-pronunciation-trainer" rel="noopener noreferrer"&gt;AI Pronunciation Trainer&lt;/a&gt; is a tool that uses AI to evaluate your pronunciation and provide feedback, helping you to improve and be understood more clearly. It leverages the &lt;a href="https://github.com/snakers4/silero-models" rel="noopener noreferrer"&gt;Silero STT / TTS&lt;/a&gt;, &lt;a href="https://openai.com/index/whisper/" rel="noopener noreferrer"&gt;openai whisper&lt;/a&gt; and &lt;a href="https://github.com/SYSTRAN/faster-whisper" rel="noopener noreferrer"&gt;faster whisper&lt;/a&gt; models for speech-to-text functionalities (Silero does also text-to-speech), ensuring accurate and reliable pronunciation assessment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Refactor: upgraded frontend and backend libraries
&lt;/h3&gt;

&lt;p&gt;About the backend:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Updated &lt;a href="https://pytorch.org/" rel="noopener noreferrer"&gt;PyTorch&lt;/a&gt; at version 2.6.x&lt;/li&gt;
&lt;li&gt;Updated Silero German Speech-to-Text model to resolve a bug that prevented the use of PyTorch versions later than 1.13.x&lt;/li&gt;
&lt;li&gt;Improved backend tests with the &lt;a href="https://en.wikipedia.org/wiki/Mutation_testing" rel="noopener noreferrer"&gt;mutation test suite&lt;/a&gt; &lt;a href="https://cosmic-ray.readthedocs.io" rel="noopener noreferrer"&gt;Cosmic Ray&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Fixed a &lt;a href="https://github.com/Thiagohgl/ai-pronunciation-trainer/issues/14" rel="noopener noreferrer"&gt;bug&lt;/a&gt; with &lt;a href="https://huggingface.co/docs/transformers/model_doc/whisper" rel="noopener noreferrer"&gt;whisper&lt;/a&gt; not properly transcribing the end timestamp for the last word in the recorded audio (in the end I solved it switching to the &lt;a href="https://pypi.org/project/openai-whisper/" rel="noopener noreferrer"&gt;openai whisper python pip package&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Added &lt;a href="https://pypi.org/project/faster-whisper/" rel="noopener noreferrer"&gt;faster whisper&lt;/a&gt; model support:

&lt;ul&gt;
&lt;li&gt;it avoids &lt;code&gt;None&lt;/code&gt; values on &lt;code&gt;end_ts&lt;/code&gt; timestamps for the last elements, unlike the HuggingFace Whisper's output&lt;/li&gt;
&lt;li&gt;it uses &lt;a href="https://github.com/snakers4/silero-vad" rel="noopener noreferrer"&gt;silero-vad&lt;/a&gt; to detect long silences within the audio&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
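&lt;p&gt;The last-word timestamp workaround can be sketched as follows; the word-dict layout and field names (&lt;code&gt;start_ts&lt;/code&gt;, &lt;code&gt;end_ts&lt;/code&gt;) are assumptions for illustration, not the project's actual data structures:&lt;/p&gt;

```python
def fill_last_end_ts(words, audio_duration):
    # `words` is a list of dicts like {"word": ..., "start_ts": ..., "end_ts": ...},
    # a hypothetical layout standing in for a transcription pipeline's output
    if words and words[-1].get("end_ts") is None:
        last = words[-1]
        # Fall back to the clip duration, never earlier than the word's start
        last["end_ts"] = max(audio_duration, last["start_ts"])
    return words

transcript = [
    {"word": "hello", "start_ts": 0.0, "end_ts": 0.4},
    {"word": "world", "start_ts": 0.5, "end_ts": None},  # the problematic last word
]
print(fill_last_end_ts(transcript, audio_duration=1.2)[-1]["end_ts"])  # → 1.2
```

&lt;p&gt;A fallback like this keeps per-word playback working even when the final word's end timestamp is missing from the model output.&lt;/p&gt;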

&lt;p&gt;Furthermore, regarding the frontend:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Updated the JavaScript libraries using the latest versions of jQuery (3.7.1) and Bootstrap (5.3.3)&lt;/li&gt;
&lt;li&gt;New frontend based on &lt;a href="https://gradio.app" rel="noopener noreferrer"&gt;Gradio&lt;/a&gt; 5.x&lt;/li&gt;
&lt;li&gt;Added E2E tests with &lt;a href="https://playwright.dev" rel="noopener noreferrer"&gt;Playwright&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Added the ability to insert custom sentences to read and evaluate&lt;/li&gt;
&lt;li&gt;Onboarding tour for new users made with &lt;a href="https://github.com/kamranahmedse/driver.js/" rel="noopener noreferrer"&gt;driver.js&lt;/a&gt; and &lt;a href="https://www.gradio.app/guides/custom-CSS-and-JS" rel="noopener noreferrer"&gt;custom css/javascript in Gradio blocks&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Playback of individual words in the recording followed by the 'ideal' pronunciation of the same word read by the Text-to-Speech engine&lt;/li&gt;
&lt;li&gt;Also added an in-browser Text-to-Speech functionality (on Windows 11, it only works if the English and German language packs are installed)&lt;/li&gt;
&lt;li&gt;Custom webApp frontend - improved CSS style on mobile devices&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Online version: the HuggingFace Space Demo
&lt;/h2&gt;

&lt;p&gt;You can try it online using the &lt;a href="https://huggingface.co/spaces/aletrn/ai-pronunciation-trainer" rel="noopener noreferrer"&gt;HuggingFace Space&lt;/a&gt;. This online demo allows you to experience the tool's capabilities without any installation or configuration. The HuggingFace Space provides a convenient and accessible way to test the AI Pronunciation Trainer and see how it can help you improve your pronunciation. Please be patient: sometimes it is a bit slow, or asleep if nobody has used it for a while (locally it is much faster, especially if you have a powerful computer). There is also an &lt;a href="https://aletrn-ai-pronunciation-trainer.hf.space" rel="noopener noreferrer"&gt;embedded version of my HuggingFace Space&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Work
&lt;/h2&gt;

&lt;p&gt;Although this tool works pretty well, there are still some areas for improvement. Here are some of the future enhancements I plan to implement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Receive feedback from the original project author (&lt;a href="https://github.com/Thiagohgl" rel="noopener noreferrer"&gt;Thiago Lobato&lt;/a&gt;) on my documentation and changes&lt;/li&gt;
&lt;li&gt;Ask the original author for explanations on the architectural and functional choices he made&lt;/li&gt;
&lt;li&gt;Explore transitioning &lt;a href="https://pytorch.org/" rel="noopener noreferrer"&gt;PyTorch&lt;/a&gt; to &lt;a href="https://onnxruntime.ai/" rel="noopener noreferrer"&gt;onnxruntime&lt;/a&gt; (if possible)&lt;/li&gt;
&lt;li&gt;Re-add the docker container (if possible)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I believe &lt;a href="https://huggingface.co/spaces/aletrn/ai-pronunciation-trainer" rel="noopener noreferrer"&gt;AI Pronunciation Trainer&lt;/a&gt; is a valuable tool for anyone looking to improve their pronunciation. With the power of AI and the improvements made in the refactoring project, this tool provides accurate and reliable feedback to help you speak more clearly and confidently. I invite you to try the &lt;a href="https://huggingface.co/spaces/aletrn/ai-pronunciation-trainer" rel="noopener noreferrer"&gt;HuggingFace Space demo&lt;/a&gt; and understand how this little project can help you on your journey to better pronunciation.&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>javascript</category>
      <category>pytorch</category>
    </item>
    <item>
      <title>LISA+SamGIS adapted for HuggingFace ZeroGPU hardware</title>
      <dc:creator>Alessandro T.</dc:creator>
      <pubDate>Wed, 14 Aug 2024 21:15:40 +0000</pubDate>
      <link>https://dev.to/trincadev/lisasamgis-adattato-ad-hardware-huggingface-zerogpu-f11</link>
      <guid>https://dev.to/trincadev/lisasamgis-adattato-ad-hardware-huggingface-zerogpu-f11</guid>
      <description>&lt;h1&gt;
  
  
  LISA+SamGIS adapted for HuggingFace ZeroGPU hardware
&lt;/h1&gt;

&lt;p&gt;For a basic understanding of my project, see &lt;a href="https://trinca.tornidor.com/it/projects/samgis-segment-anything-applied-to-GIS" rel="noopener noreferrer"&gt;this page&lt;/a&gt; and &lt;a href="https://trinca.tornidor.com/it/projects/lisa-adapted-for-samgis" rel="noopener noreferrer"&gt;this one&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Today, instead, I am writing about my new demo running on &lt;a href="https://huggingface.co/zero-gpu-explorers" rel="noopener noreferrer"&gt;ZeroGPU&lt;/a&gt; hardware. Note that &lt;a href="https://huggingface.co/zero-gpu-explorers" rel="noopener noreferrer"&gt;ZeroGPU Spaces&lt;/a&gt; are currently in beta. &lt;a href="https://huggingface.co/subscribe/pro" rel="noopener noreferrer"&gt;PRO&lt;/a&gt; users and &lt;a href="https://huggingface.co/enterprise" rel="noopener noreferrer"&gt;Enterprise organizations&lt;/a&gt; can create their own ZeroGPU Spaces under their own name. A monthly payment is also required to keep the right to use ZeroGPU hardware.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I initially ran into problems caused by the &lt;code&gt;spaces.GPU&lt;/code&gt; decorator applied to an inappropriate function whose execution time was too long, causing timeouts. I solved this by debugging and applying the decorator only to the functions that actually needed the GPU.&lt;/li&gt;
&lt;li&gt;Custom frontend: I don't like &lt;a href="https://svelte.dev/" rel="noopener noreferrer"&gt;svelte&lt;/a&gt; (the js library chosen by the Gradio team) very much, and above all I already have a well-established project written in &lt;a href="https://vuejs.org/" rel="noopener noreferrer"&gt;vuejs&lt;/a&gt; and &lt;a href="https://vitejs.dev/" rel="noopener noreferrer"&gt;vite&lt;/a&gt; that I want to re-use. I solved this by &lt;a href="https://huggingface.co/docs/hub/spaces-dependencies" rel="noopener noreferrer"&gt;installing the Debian package&lt;/a&gt; for nodejs 18, then installing the dependencies and building the nodejs project directly from within the &lt;code&gt;app.py&lt;/code&gt; file using &lt;code&gt;subprocess.run()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
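The build-from-app.py trick above can be sketched like this (a minimal sketch: the directory name and npm scripts are assumptions, not the actual Space layout):

```python
import subprocess


def run_step(cmd, cwd="."):
    """Run one build step, raising on failure (check=True)."""
    return subprocess.run(cmd, cwd=cwd, check=True, capture_output=True, text=True)


def build_frontend(frontend_dir="static"):
    # hypothetical directory and scripts; adapt to the real vuejs/vite project
    run_step(["npm", "install"], cwd=frontend_dir)
    run_step(["npm", "run", "build"], cwd=frontend_dir)
```

In a Space this would run once at startup, before the Gradio app is launched.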

&lt;p&gt;Note that I'm using a 48-hour timeout period before pausing my Space. Any interaction after that could take a while before the Space restarts.&lt;/p&gt;

&lt;p&gt;Last but not least, the demo page is online &lt;a href="https://huggingface.co/spaces/aletrn/samgis-lisa-on-zero" rel="noopener noreferrer"&gt;here (Gradio interface)&lt;/a&gt; and &lt;a href="https://aletrn-samgis-lisa-on-zero.hf.space/lisa" rel="noopener noreferrer"&gt;here (my custom SPA page)&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>llm</category>
    </item>
    <item>
      <title>LISA+SamGIS on ZeroGPU HuggingFace hardware</title>
      <dc:creator>Alessandro T.</dc:creator>
      <pubDate>Wed, 14 Aug 2024 21:15:25 +0000</pubDate>
      <link>https://dev.to/trincadev/lisasamgis-on-zerogpu-huggingface-hardware-23k0</link>
      <guid>https://dev.to/trincadev/lisasamgis-on-zerogpu-huggingface-hardware-23k0</guid>
      <description>&lt;h1&gt;
  
  
  LISA+SamGIS on ZeroGPU HuggingFace hardware
&lt;/h1&gt;

&lt;p&gt;See &lt;a href="https://trinca.tornidor.com/projects/samgis-segment-anything-applied-to-GIS" rel="noopener noreferrer"&gt;this&lt;/a&gt; and &lt;a href="https://trinca.tornidor.com/projects/lisa-adapted-for-samgis" rel="noopener noreferrer"&gt;this page&lt;/a&gt; for a basic understanding of what my project is about.&lt;/p&gt;

&lt;p&gt;Today instead I'm writing about my new demo on a &lt;a href="https://huggingface.co/zero-gpu-explorers" rel="noopener noreferrer"&gt;ZeroGPU&lt;/a&gt; Space. Note that &lt;a href="https://huggingface.co/zero-gpu-explorers" rel="noopener noreferrer"&gt;ZeroGPU Spaces&lt;/a&gt; is currently in beta. &lt;a href="https://huggingface.co/subscribe/pro" rel="noopener noreferrer"&gt;PRO&lt;/a&gt; users and &lt;a href="https://huggingface.co/enterprise" rel="noopener noreferrer"&gt;Enterprise organizations&lt;/a&gt; can host their own ZeroGPU Spaces under their namespaces. A monthly subscription is also required to keep the right to use ZeroGPU hardware.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I solved some problems caused by the &lt;code&gt;spaces.GPU&lt;/code&gt; decorator applied to a function whose execution time was too long, causing a timeout. After some debugging I ended up using &lt;code&gt;spaces.GPU&lt;/code&gt; only on the functions that really needed GPU acceleration.&lt;/li&gt;
&lt;li&gt;I don't like &lt;a href="https://svelte.dev/" rel="noopener noreferrer"&gt;svelte&lt;/a&gt; (the js library chosen by the Gradio team) very much, and I already have a &lt;a href="https://vuejs.org/" rel="noopener noreferrer"&gt;vuejs&lt;/a&gt;/&lt;a href="https://vitejs.dev/" rel="noopener noreferrer"&gt;vite&lt;/a&gt; frontend project that I can re-use. I solved this by installing the nodejs 18 &lt;a href="https://huggingface.co/docs/hub/spaces-dependencies" rel="noopener noreferrer"&gt;Debian package&lt;/a&gt; and starting the nodejs build from within the &lt;code&gt;app.py&lt;/code&gt; file using &lt;code&gt;subprocess.run()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
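The decorator fix above can be sketched in a few lines (the &lt;code&gt;spaces.GPU&lt;/code&gt; decorator is the real HuggingFace one; the function names and the local no-op fallback are illustrative):

```python
try:
    import spaces  # available inside HuggingFace ZeroGPU Spaces
    gpu = spaces.GPU
except ImportError:  # local development fallback: a no-op decorator
    def gpu(fn):
        return fn


def prepare_request(payload: dict) -> dict:
    """CPU-only pre-processing: left undecorated so it cannot hit the GPU timeout."""
    return {"inputs": payload}


@gpu  # the GPU is reserved only while this function runs
def run_inference(model_inputs: dict) -> str:
    # the heavy model call would go here (placeholder body)
    return f"processed {len(model_inputs)} field(s)"
```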

&lt;p&gt;Note that I'm using a timeout period of 48h before pausing my Space. Any interaction after that could take a while until the Space restarts.&lt;/p&gt;

&lt;p&gt;Last but not least, my online demo is &lt;a href="https://huggingface.co/spaces/aletrn/samgis-lisa-on-zero" rel="noopener noreferrer"&gt;here (Gradio interface)&lt;/a&gt; and &lt;a href="https://aletrn-samgis-lisa-on-zero.hf.space/lisa" rel="noopener noreferrer"&gt;here (my custom SPA page)&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>llm</category>
    </item>
    <item>
      <title>SamGIS - Some notes on Segment Anything</title>
      <dc:creator>Alessandro T.</dc:creator>
      <pubDate>Mon, 27 May 2024 18:27:17 +0000</pubDate>
      <link>https://dev.to/trincadev/samgis-alcuni-appunti-su-segment-anything-144p</link>
      <guid>https://dev.to/trincadev/samgis-alcuni-appunti-su-segment-anything-144p</guid>
      <description>&lt;h1&gt;
  
  
  SamGIS - Some notes on Segment Anything
&lt;/h1&gt;

&lt;p&gt;I refer you to my English notes on &lt;a href="https://dev.to/projects/notes-about-segment-anything"&gt;Segment Anything&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  About image embedding re-use and SamGIS
&lt;/h2&gt;

&lt;p&gt;After re-reading this paper I understood that I could improve SamGIS's efficiency by storing and re-using the image embeddings.&lt;/p&gt;

&lt;p&gt;I implemented this change in &lt;a href="https://docs.ml-trinca.tornidor.com/#version-1-3-0" rel="noopener noreferrer"&gt;SamGIS version 1.3.0&lt;/a&gt;. Some test data from the &lt;a href="https://huggingface.co/spaces/aletrn/samgis" rel="noopener noreferrer"&gt;SamGIS demo&lt;/a&gt; I used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;first request: 5.42s

&lt;ul&gt;
&lt;li&gt;instantiated &lt;a href="https://github.com/CASIA-IVA-Lab/FastSAM" rel="noopener noreferrer"&gt;fastsam&lt;/a&gt; model&lt;/li&gt;
&lt;li&gt;created image from the webmap (I'm using OpenStreetMap as tiles provider and Mapnik as webmap layer)&lt;/li&gt;
&lt;li&gt;created the image embedding&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;second request: 0.41s&lt;/li&gt;

&lt;li&gt;from third to seventh request: ~0.34s&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Keep in mind that making one request immediately after another keeps the duration low, probably because of caching during tiles download on the backend side. Waiting more than 10 minutes seems to invalidate the cache: in that case &lt;a href="https://github.com/geopandas/contextily" rel="noopener noreferrer"&gt;contextily&lt;/a&gt; (the GeoPandas library I use as tiles client) took from 0.5s to 1.5s, during my tests, to download the tiles.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Espandere qui per il dettaglio del payload delle chiamate di test.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"bbox"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"ne"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"lat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;46.236615111857255&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"lng"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;9.519996643066408&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"sw"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"lat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;46.13405108959001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"lng"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;9.29821014404297&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;146&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"point"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"lat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;46.18483299780137&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"lng"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;9.418864745562386&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"zoom"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"source_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"OpenStreetMap"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
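The embedding re-use described above can be sketched as a small in-memory cache keyed by the request extent (the key shape and names are illustrative, not the actual SamGIS implementation):

```python
# cache of image embeddings, keyed by the map extent of the request
_embedding_cache: dict = {}


def get_image_embedding(bbox: tuple, zoom: int, compute_fn):
    """Return a cached embedding when the same map extent was already
    processed; otherwise compute it once and store it."""
    key = (bbox, zoom)
    if key not in _embedding_cache:
        _embedding_cache[key] = compute_fn(bbox, zoom)
    return _embedding_cache[key]
```

This is why only the first request pays the full encoder cost, while the following ones on the same extent stay fast.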
&lt;h2&gt;
  
  
  About zero-shot text-to-mask conversion: LISA and SamGIS
&lt;/h2&gt;

&lt;p&gt;The original version of SAM can also use simple free-form natural language text prompts. For a practical use of this feature, see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/IDEA-Research/Grounded-Segment-Anything" rel="noopener noreferrer"&gt;Grounded-SAM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/dvlab-research/LISA" rel="noopener noreferrer"&gt;LISA&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course you may also be interested in my &lt;a href="https://trinca.tornidor.com/projects/lisa-adapted-for-samgis" rel="noopener noreferrer"&gt;work integrating LISA with SamGIS&lt;/a&gt; and the corresponding &lt;a href="https://huggingface.co/spaces/aletrn/samgis-lisa-on-cuda" rel="noopener noreferrer"&gt;demo&lt;/a&gt;. I have to keep it paused because of costs, but I am requesting the use of a free GPU from HuggingFace.&lt;/p&gt;

&lt;p&gt;If you find my &lt;a href="https://huggingface.co/spaces/aletrn/samgis-lisa-on-cuda" rel="noopener noreferrer"&gt;project&lt;/a&gt; interesting, please like or comment on the &lt;a href="https://huggingface.co/spaces/aletrn/samgis-lisa-on-cuda/discussions/1" rel="noopener noreferrer"&gt;HuggingFace GPU resource request thread&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>computervision</category>
      <category>machinelearning</category>
      <category>maps</category>
    </item>
    <item>
      <title>SamGIS - Some notes about Segment Anything</title>
      <dc:creator>Alessandro T.</dc:creator>
      <pubDate>Mon, 27 May 2024 18:27:09 +0000</pubDate>
      <link>https://dev.to/trincadev/samgis-some-notes-about-segment-anything-5a3</link>
      <guid>https://dev.to/trincadev/samgis-some-notes-about-segment-anything-5a3</guid>
      <description>&lt;h1&gt;
  
  
  SamGIS - Some notes about Segment Anything
&lt;/h1&gt;

&lt;h2&gt;
  
  
  From the &lt;a href="https://arxiv.org/abs/2304.02643" rel="noopener noreferrer"&gt;Segment Anything paper&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;"&lt;a href="https://github.com/facebookresearch/segment-anything" rel="noopener noreferrer"&gt;SAM&lt;/a&gt;" is a &lt;a href="https://aws.amazon.com/what-is/foundation-models/" rel="noopener noreferrer"&gt;foundation model&lt;/a&gt; aiming for performing "zero-shot" image segmentation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it's built and trained on a large image dataset with a massive amount of segmentation masks&lt;/li&gt;
&lt;li&gt;the SAM team proposes the "promptable" segmentation task, where the goal is to return a valid segmentation mask given any segmentation prompt.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since this model should perform "zero-shot" segmentation, it must support flexible prompts, needs to compute masks in amortized real-time to allow interactive use, and must be ambiguity-aware. This is the model architecture:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;source 1: an image encoder computes an image embedding&lt;/li&gt;
&lt;li&gt;source 2: a fast prompt encoder embeds prompts&lt;/li&gt;
&lt;li&gt;output: a fast mask decoder combines these two sources to predict segmentation masks&lt;/li&gt;
&lt;/ol&gt;
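The three components above can be sketched as a toy data flow (shapes and operations are illustrative only, not the real SAM code):

```python
import numpy as np

rng = np.random.default_rng(0)


def image_encoder(image: np.ndarray) -> np.ndarray:
    # source 1: one embedding per image, computed once (toy reduction)
    return image.mean(axis=(0, 1))


def prompt_encoder(points: np.ndarray) -> np.ndarray:
    # source 2: fast, runs once per prompt (toy reduction)
    return points.mean(axis=0)


def mask_decoder(image_emb: np.ndarray, prompt_emb: np.ndarray) -> np.ndarray:
    # output: combines the two sources into per-location scores (toy outer product)
    return np.outer(image_emb, prompt_emb)


image = rng.random((8, 8, 3))
emb = image_encoder(image)  # heavy step, reusable across many prompts
mask = mask_decoder(emb, prompt_encoder(rng.random((2, 3))))
```

The key property the sketch preserves is that the expensive image embedding is computed once and then combined with many cheap prompt embeddings.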

&lt;p&gt;Because annotation masks are not abundant online, especially high-quality ones, the SAM developers opted for a "data engine", developing both the model and the dataset annotations (from a manual stage to semi-automated to fully automated). Images in SA-1B span a geographically and economically diverse set of countries, and the authors found that SAM performs similarly across different groups of people.&lt;/p&gt;

&lt;h3&gt;
  
  
  Segment Anything Tasks
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Task
&lt;/h4&gt;

&lt;p&gt;Here the SAM team translates prompts from NLP to segmentation (selecting/de-selecting points, boxes, masks, free-form text). Just like a language model should output a coherent response to an ambiguous prompt, the promptable segmentation task should return a valid segmentation mask given any prompt.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pre-Training
&lt;/h4&gt;

&lt;p&gt;The promptable segmentation task suggests a natural pre-training algorithm that simulates a sequence of prompts (e.g., points, boxes, masks) for each training sample and compares the model’s mask predictions against the ground truth.&lt;/p&gt;

&lt;h3&gt;
  
  
  Segment Anything Model
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Image encoder
&lt;/h4&gt;

&lt;p&gt;The algorithm uses a MAE (&lt;a href="https://arxiv.org/abs/2111.06377" rel="noopener noreferrer"&gt;"Masked Autoencoders Are Scalable Vision Learners"&lt;/a&gt;) pre-trained Vision Transformer (&lt;a href="https://arxiv.org/abs/2010.11929" rel="noopener noreferrer"&gt;ViT&lt;/a&gt;) minimally adapted to process &lt;a href="https://arxiv.org/abs/2203.16527" rel="noopener noreferrer"&gt;high resolution inputs&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Prompt encoder
&lt;/h4&gt;

&lt;p&gt;SAM supports two sets of prompts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sparse (points, boxes, text)&lt;/li&gt;
&lt;li&gt;dense (masks)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SAM &lt;a href="https://arxiv.org/abs/2006.10739" rel="noopener noreferrer"&gt;handles points and boxes via positional encodings&lt;/a&gt; summed with &lt;a href="https://arxiv.org/abs/2103.00020" rel="noopener noreferrer"&gt;learned embeddings for each prompt type&lt;/a&gt;. Dense prompts (i.e., masks) are embedded using convolutions and summed element-wise with the image embedding.&lt;/p&gt;

&lt;h4&gt;
  
  
  Mask decoder
&lt;/h4&gt;

&lt;p&gt;The mask decoder efficiently maps the image embedding, prompt embeddings, and an output token to a mask. This design employs a modification of a &lt;a href="https://arxiv.org/abs/1706.03762" rel="noopener noreferrer"&gt;Transformer decoder block&lt;/a&gt; followed by a dynamic mask prediction head. The decoder block uses prompt self-attention and cross-attention in two directions (prompt-to-image embedding and vice versa) to update all embeddings. After running two blocks, the procedure upsamples the image embedding and an MLP maps the output token to a dynamic linear classifier, which then computes the mask foreground probability at each image location.&lt;/p&gt;

&lt;h4&gt;
  
  
  Resolving ambiguity
&lt;/h4&gt;

&lt;p&gt;With a single output the model would merge masks together on an ambiguous prompt; to avoid this, it can predict more than one output mask for a single prompt. Three masks address most common cases (nested masks are often at most three deep: whole, part, and subpart). During training, the procedure backprops only the minimum loss over the masks. To rank masks, the model predicts a confidence score (i.e., estimated IoU) for each mask.&lt;/p&gt;
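The minimum-loss trick can be sketched in a few lines (a toy L1 loss on flat mask lists; in SAM the loss is a real mask loss and the ranking score an estimated IoU):

```python
def l1(mask, target):
    """Toy per-pixel loss on flat 0/1 lists."""
    return sum(abs(a - b) for a, b in zip(mask, target))


def best_of_n_loss(predicted_masks, target, loss_fn=l1):
    """Backprop target for an ambiguous prompt: keep only the loss of the
    closest of the (typically 3) predicted masks."""
    return min(loss_fn(m, target) for m in predicted_masks)
```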

&lt;h2&gt;
  
  
  About image embedding re-use and SamGIS
&lt;/h2&gt;

&lt;p&gt;After reading this paper I understood that I could improve the SamGIS software design by storing and re-using the image embeddings.&lt;/p&gt;

&lt;p&gt;I implemented this change in &lt;a href="https://docs.ml-trinca.tornidor.com/#version-1-3-0" rel="noopener noreferrer"&gt;SamGIS version 1.3.0&lt;/a&gt;. Some test data from the &lt;a href="https://huggingface.co/spaces/aletrn/samgis" rel="noopener noreferrer"&gt;SamGIS demo&lt;/a&gt; I used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;first request: 5.42s

&lt;ul&gt;
&lt;li&gt;instantiated &lt;a href="https://github.com/CASIA-IVA-Lab/FastSAM" rel="noopener noreferrer"&gt;fastsam&lt;/a&gt; model&lt;/li&gt;
&lt;li&gt;created image from webmap (I'm using &lt;a href="https://www.openstreetmap.org/" rel="noopener noreferrer"&gt;OpenStreetMap&lt;/a&gt; as tiles provider and Mapnik as map layer)&lt;/li&gt;
&lt;li&gt;created image embedding&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;second request: 0.41s&lt;/li&gt;

&lt;li&gt;from third to seventh request: ~0.34s&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Note that making one request immediately after another keeps the request duration low, probably because of caching during tiles download on the backend side. Waiting more than 10 minutes instead seems to invalidate the cache: in that case &lt;a href="https://github.com/geopandas/contextily" rel="noopener noreferrer"&gt;contextily&lt;/a&gt; (the GeoPandas library that I use as a tiles client) added from 0.5s to 1.5s, during my tests, to download the tiles.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Click here to show my test request payload
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"bbox"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"ne"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"lat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;46.236615111857255&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"lng"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;9.519996643066408&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"sw"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"lat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;46.13405108959001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"lng"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;9.29821014404297&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;146&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"point"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"lat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;46.18483299780137&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"lng"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;9.418864745562386&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"zoom"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"source_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"OpenStreetMap"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
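In plain Python, the embedding re-use described above can be sketched with functools.lru_cache over a function whose arguments identify the map extent (an illustrative stand-in, not the actual SamGIS code):

```python
from functools import lru_cache


@lru_cache(maxsize=32)
def image_embedding(ne: tuple, sw: tuple, zoom: int) -> str:
    # stand-in for the expensive encoder pass over the webmap image;
    # the arguments mirror the bbox/zoom fields of the request payload
    return f"embedding for {ne}/{sw} at zoom {zoom}"
```

Repeated requests on the same bbox and zoom hit the cache and skip the encoder, matching the first-vs-subsequent request timings measured above.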
&lt;h2&gt;
  
  
  About Zero-Shot Text-to-Mask: LISA and SamGIS
&lt;/h2&gt;

&lt;p&gt;SAM can use also simple free-form text prompts. For a practical use of this feature, see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/IDEA-Research/Grounded-Segment-Anything" rel="noopener noreferrer"&gt;Grounded-SAM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/dvlab-research/LISA" rel="noopener noreferrer"&gt;LISA&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course you might also be interested in my &lt;a href="https://trinca.tornidor.com/projects/lisa-adapted-for-samgis" rel="noopener noreferrer"&gt;integration work of LISA with SamGIS&lt;/a&gt; and its &lt;a href="https://huggingface.co/spaces/aletrn/samgis-lisa-on-cuda" rel="noopener noreferrer"&gt;demo&lt;/a&gt;. I need to keep it paused because of cost, but I am requesting the use of a free GPU from HuggingFace.&lt;/p&gt;

&lt;p&gt;If you like my &lt;a href="https://huggingface.co/spaces/aletrn/samgis-lisa-on-cuda" rel="noopener noreferrer"&gt;project&lt;/a&gt;, please like or comment on the &lt;a href="https://huggingface.co/spaces/aletrn/samgis-lisa-on-cuda/discussions/1" rel="noopener noreferrer"&gt;HuggingFace GPU resource request thread&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>computervision</category>
      <category>machinelearning</category>
      <category>maps</category>
    </item>
    <item>
      <title>LISA integrated into SamGIS</title>
      <dc:creator>Alessandro T.</dc:creator>
      <pubDate>Sun, 26 May 2024 14:38:43 +0000</pubDate>
      <link>https://dev.to/trincadev/lisa-integrato-in-samgis-1f82</link>
      <guid>https://dev.to/trincadev/lisa-integrato-in-samgis-1f82</guid>
      <description>&lt;h1&gt;
  
  
  LISA integrated into SamGIS
&lt;/h1&gt;

&lt;p&gt;Image segmentation is a crucial task in computer vision, where the goal is to perform &lt;a href="https://www.ibm.com/topics/instance-segmentation" rel="noopener noreferrer"&gt;"instance segmentation"&lt;/a&gt; of a given object. I have already worked on a project about this, &lt;a href="https://trinca.tornidor.com/it/projects/samgis-segment-anything-applied-to-GIS" rel="noopener noreferrer"&gt;SamGIS&lt;/a&gt;. A logical next step would be integrating the ability to recognize objects through text prompts. This apparently simple activity actually differs from what SamGIS does with &lt;a href="https://segment-anything.com/" rel="noopener noreferrer"&gt;Segment Anything&lt;/a&gt; (the machine learning backend used by SamGIS). While "SAM" does not categorize what it identifies, starting from a written prompt requires knowing which classes of objects exist in the image under analysis. A &lt;a href="https://arxiv.org/abs/2305.11175" rel="noopener noreferrer"&gt;visual language model&lt;/a&gt; (or VLM) that works well for this task is &lt;a href="https://github.com/dvlab-research/LISA" rel="noopener noreferrer"&gt;LISA&lt;/a&gt;. LISA's authors based their work on &lt;a href="https://segment-anything.com/" rel="noopener noreferrer"&gt;Segment Anything&lt;/a&gt; and &lt;a href="https://llava-vl.github.io/" rel="noopener noreferrer"&gt;Llava&lt;/a&gt;, an LLM with multimodal capabilities (it can process both text instructions and images). By leveraging LISA's "reasoned segmentation" abilities, SamGIS can perform "zero-shot" analyses, i.e. without specific, specialized prior training in geological, geomorphological or photogrammetric fields.&lt;/p&gt;

&lt;h2&gt;
  
  
  Input text prompts and their geojson outputs
&lt;/h2&gt;

&lt;p&gt;I can't show this part here on dev.to, so I refer you to the &lt;a href="https://trinca.tornidor.com/it/projects/lisa-adapted-for-samgis#prompts-testuali-d-input-e-relativi-geojson-di-output" rel="noopener noreferrer"&gt;dedicated page on my blog&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Duration of segmentation tasks
&lt;/h2&gt;

&lt;p&gt;At the moment, a prompt that also asks for an explanation of what was identified in the image slows down the analysis considerably. The same analysis prompt run on the same image but without explanation requests is processed much faster. Tests containing explanation requests run in more than 60 seconds, while without them the duration is around or below 4 seconds, using the HuggingFace "Nvidia T4 Small" hardware profile with 4 vCPU, 15 GB RAM and 16 GB VRAM.&lt;/p&gt;

&lt;h2&gt;
  
  
  Software architecture
&lt;/h2&gt;

&lt;p&gt;From a technical and architectural point of view, the &lt;a href="https://huggingface.co/spaces/aletrn/samgis-lisa-on-cuda" rel="noopener noreferrer"&gt;demo&lt;/a&gt; consists of a frontend similar to the one in the &lt;a href="https://huggingface.co/spaces/aletrn/samgis" rel="noopener noreferrer"&gt;SamGIS&lt;/a&gt; demo. There is no drawing toolbar: it is replaced by a text box for natural language requests. The backend uses a FastAPI-based API that invokes an ad hoc function based on LISA.&lt;/p&gt;

&lt;p&gt;I had to pause the demo because of the GPU cost, but I am requesting the use of a free GPU from HuggingFace. Feel free to contact me on LinkedIn for a live demonstration, to ask for more information or further clarifications.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>computervision</category>
    </item>
    <item>
      <title>LISA adapted to SamGIS</title>
      <dc:creator>Alessandro T.</dc:creator>
      <pubDate>Sun, 26 May 2024 14:38:30 +0000</pubDate>
      <link>https://dev.to/trincadev/lisa-adapted-to-samgis-c5k</link>
      <guid>https://dev.to/trincadev/lisa-adapted-to-samgis-c5k</guid>
      <description>&lt;h1&gt;
  
  
  LISA adapted to SamGIS
&lt;/h1&gt;

&lt;p&gt;Image segmentation is a crucial task in computer vision, where the goal is to extract the &lt;a href="https://www.ibm.com/topics/instance-segmentation" rel="noopener noreferrer"&gt;instance segmentation mask&lt;/a&gt; for a desired object within the image. I've already worked on a project, &lt;a href="https://trinca.tornidor.com/projects/samgis-segment-anything-applied-to-GIS" rel="noopener noreferrer"&gt;SamGIS&lt;/a&gt;, that focuses on this particular application of computer vision. A logical progression now would be incorporating the ability to recognize objects through text prompts. This apparently simple activity is actually quite different from what &lt;a href="https://segment-anything.com/" rel="noopener noreferrer"&gt;Segment Anything&lt;/a&gt; (the ML backend used by SamGIS) does. In fact "SAM" outputs neither descriptions nor categorizations for its input images; starting from a written prompt, on the contrary, requires understanding which classes of objects exist in the image under analysis. A &lt;a href="https://arxiv.org/abs/2305.11175" rel="noopener noreferrer"&gt;visual language model&lt;/a&gt; (or VLM) that performs well for this task is &lt;a href="https://github.com/dvlab-research/LISA" rel="noopener noreferrer"&gt;LISA&lt;/a&gt;. LISA's authors built their work on top of &lt;a href="https://segment-anything.com/" rel="noopener noreferrer"&gt;Segment Anything&lt;/a&gt; and &lt;a href="https://llava-vl.github.io/" rel="noopener noreferrer"&gt;Llava&lt;/a&gt;, a large language model with multimodal capabilities (it can process both text prompts and images). By leveraging LISA's "reasoned segmentation" abilities, SamGIS can now conduct "zero-shot" analyses, meaning it can operate without specific or specialist prior training in geological, geomorphological, or photogrammetric fields.&lt;/p&gt;

&lt;h2&gt;
  
  
  Some input text prompts with their geojson outputs
&lt;/h2&gt;

&lt;p&gt;I can't show this part on dev.to, so I refer you to my &lt;a href="https://trinca.tornidor.com/projects/lisa-adapted-for-samgis#some-input-text-prompts-with-their-geojson-outputs" rel="noopener noreferrer"&gt;blog page&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Duration of segmentation tasks
&lt;/h2&gt;

&lt;p&gt;At the moment, a prompt that also asks for an explanation of the segmentation task slows down the analysis considerably. The same prompt on the same image, without "descriptive" or "explanatory" questions, finishes much faster: tests with explanatory text take more than 60 seconds, while without it the duration is between 3 and 8 seconds, using the HuggingFace hardware profile "Nvidia T4 Small" with 4 vCPU, 15 GB RAM and 16 GB VRAM.&lt;/p&gt;

&lt;h2&gt;
  
  
  Software architecture
&lt;/h2&gt;

&lt;p&gt;Technically and architecturally, the &lt;a href="https://huggingface.co/spaces/aletrn/samgis-lisa-on-cuda" rel="noopener noreferrer"&gt;demo&lt;/a&gt; consists of a frontend page similar to the &lt;a href="https://huggingface.co/spaces/aletrn/samgis" rel="noopener noreferrer"&gt;SamGIS&lt;/a&gt; demo. Instead of the drawing toolbar there is a text prompt for natural-language requests, with some selectable examples displayed at the top of the page. The backend is a FastAPI-based API that calls a custom LISA function wrapper.&lt;/p&gt;

&lt;p&gt;Unfortunately I had to pause my demo because of GPU costs, but I am requesting the use of a free GPU from HuggingFace. Please feel free to reach out to me on LinkedIn for a live demonstration or to ask for more information.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>computervision</category>
    </item>
    <item>
      <title>What I learned while developing SamGIS with LISA (so far)</title>
      <dc:creator>Alessandro T.</dc:creator>
      <pubDate>Sun, 26 May 2024 13:57:08 +0000</pubDate>
      <link>https://dev.to/trincadev/cosa-ho-imparato-durante-lo-sviluppo-di-samgis-con-lisa-finora-40m</link>
      <guid>https://dev.to/trincadev/cosa-ho-imparato-durante-lo-sviluppo-di-samgis-con-lisa-finora-40m</guid>
      <description>&lt;h1&gt;
  
  
  What I learned while developing SamGIS with LISA (so far)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Reading the publications related to the projects I work on
&lt;/h2&gt;

&lt;p&gt;To improve my understanding of my machine learning project, I decided to read the papers on which LISA and Segment Anything are based. Besides some theoretical background on LLMs, I noticed that the modular architecture of "SAM" makes it possible to create and re-use image embeddings. Since SamGIS didn't work this way initially, I formulated a hypothesis about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Debugging, measurement and optimization: a hypothesis about the image embedding
&lt;/h2&gt;

&lt;p&gt;At this point I continued my debugging work by measuring the duration of the individual steps during the execution of SamGIS functions. Creating an image embedding is a fairly expensive operation, so it is advantageous to save and re-use it (I verified that implementing my hypothesis would improve the software's performance). Using the HuggingFace hardware profile "Nvidia T4 Small" (with 4 vCPU, 15 GB RAM and 16 GB VRAM), it's possible to save about 1 second on every inference after the first when the same image is used (i.e., without changing the tile provider or the geographical area).&lt;/p&gt;

&lt;h2&gt;
  
  
  The role of LLMs with prompts having different characteristics
&lt;/h2&gt;

&lt;p&gt;LISA inherits the language generation capabilities of multimodal LLMs such as &lt;a href="https://llava-vl.github.io/" rel="noopener noreferrer"&gt;Llava&lt;/a&gt;. These models excel at handling complex reasoning, world knowledge, explanatory answers and multi-turn conversations. They are powerful tools for bridging the gap between text and visual understanding.&lt;/p&gt;

&lt;p&gt;LISA makes it possible to perform &lt;a href="https://trinca.tornidor.com/it/projects/lisa-adapted-for-samgis#prompts-testuali-d-input-e-relativi-geojson-di-output" rel="noopener noreferrer"&gt;rather complex reasoning&lt;/a&gt; during image segmentation (e.g. "identify the houses near the trees..." vs "identify the houses...") without any particular performance degradation. On the contrary, requests asking for an explanation of why ("explain why") the segmentation task is done in a certain way will have much higher execution times (on the order of minutes).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://trinca.tornidor.com/it/projects/lisa-adapted-for-samgis#durata-dei-task-di-segmentazione" rel="noopener noreferrer"&gt;More details are available here&lt;/a&gt; about these improvements following the changes described, and about the performance differences across the various cases when using SamGIS with LISA.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>learning</category>
      <category>programming</category>
      <category>llm</category>
    </item>
    <item>
      <title>What I learnt from development on LISA with SamGIS (So far)</title>
      <dc:creator>Alessandro T.</dc:creator>
      <pubDate>Sun, 26 May 2024 13:52:16 +0000</pubDate>
      <link>https://dev.to/trincadev/what-i-learnt-from-development-on-lisa-with-samgis-so-far-5eon</link>
      <guid>https://dev.to/trincadev/what-i-learnt-from-development-on-lisa-with-samgis-so-far-5eon</guid>
      <description>&lt;h1&gt;
  
  
  What I learnt from development on LISA with SamGIS (So far)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Read publications related to the projects I work on
&lt;/h2&gt;

&lt;p&gt;To improve my understanding of my machine learning project, I decided to read the papers on which &lt;a href="https://arxiv.org/abs/2308.00692" rel="noopener noreferrer"&gt;LISA&lt;/a&gt; and &lt;a href="https://arxiv.org/abs/2304.02643" rel="noopener noreferrer"&gt;Segment Anything&lt;/a&gt; are based. Besides some theoretical background on LLMs, I noticed that the modular architecture of "SAM" makes it possible to save and re-use image embeddings. Since SamGIS didn't work this way initially, I formulated a hypothesis about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Debugging, measurement and optimization: an image embedding hypothesis
&lt;/h2&gt;

&lt;p&gt;At this point I continued my debugging work by measuring the duration of individual steps during the execution of SamGIS functions. Creating an image embedding is quite an expensive operation, so it is advantageous to save and re-use it (I verified that implementing my hypothesis would improve the software's performance). Using the HuggingFace hardware profile "Nvidia T4 Small" (with 4 vCPU, 15 GB RAM and 16 GB VRAM), it's possible to save almost 1 second on every inference after the first when the same image is used (i.e., without changing the tile provider or the geographical area).&lt;/p&gt;

&lt;h2&gt;
  
  
  The role of LLMs with prompts having different characteristics
&lt;/h2&gt;

&lt;p&gt;LISA inherits the language generation capabilities of multi-modal LLMs such as &lt;a href="https://llava-vl.github.io/" rel="noopener noreferrer"&gt;Llava&lt;/a&gt;. These models excel at handling complex reasoning, world knowledge, explanatory answers and multi-turn conversations. They’re powerful tools for bridging the gap between text and visual understanding.&lt;/p&gt;

&lt;p&gt;LISA allows you to perform &lt;a href="https://trinca.tornidor.com/projects/lisa-adapted-for-samgis#some-input-text-prompts-with-their-geojson-outputs" rel="noopener noreferrer"&gt;rather complex reasoning&lt;/a&gt; during image segmentation (e.g. "identify the houses near the trees..." vs "identify the houses...") without any particular performance degradation. On the contrary, requests asking for an explanation of why ("explain why") the segmentation task is done in a certain way will have much higher execution times (on the order of minutes).&lt;/p&gt;

&lt;p&gt;There are &lt;a href="https://trinca.tornidor.com/projects/lisa-adapted-for-samgis#duration-of-segmentation-tasks" rel="noopener noreferrer"&gt;more details here&lt;/a&gt; about these improvements following the changes described, and about the performance differences across the various cases when using SamGIS with LISA.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>learning</category>
      <category>programming</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
