<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Stephan Janssen</title>
    <description>The latest articles on DEV Community by Stephan Janssen (@stephanj).</description>
    <link>https://dev.to/stephanj</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F499111%2Ffe24f5de-2c71-42a9-a88d-8d65aa691f61.jpeg</url>
      <title>DEV Community: Stephan Janssen</title>
      <link>https://dev.to/stephanj</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/stephanj"/>
    <language>en</language>
    <item>
      <title>LLM Inference using 100% Modern Java ☕️🔥</title>
      <dc:creator>Stephan Janssen</dc:creator>
      <pubDate>Mon, 21 Oct 2024 18:37:50 +0000</pubDate>
      <link>https://dev.to/stephanj/llm-inference-using-100-modern-java-30i2</link>
      <guid>https://dev.to/stephanj/llm-inference-using-100-modern-java-30i2</guid>
      <description>&lt;p&gt;In the rapidly evolving world of (Gen)AI, Java developers now have powerful new (LLM Inference) tools at their disposal: &lt;a href="https://github.com/mukel/llama3.java" rel="noopener noreferrer"&gt;Llama3.java&lt;/a&gt; and &lt;a href="https://github.com/tjake/Jlama" rel="noopener noreferrer"&gt;JLama&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;These projects bring the capabilities of large language models (LLMs) to the Java ecosystem, offering an exciting opportunity for developers to integrate advanced language processing into their applications.&lt;/p&gt;

&lt;p&gt;Here's an example of Llama3.java providing inference for the &lt;a href="https://plugins.jetbrains.com/plugin/24169-devoxxgenie" rel="noopener noreferrer"&gt;DevoxxGenie&lt;/a&gt; IDEA plugin.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/pDafHplEVPk"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  The JLama Project
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/tjake/Jlama" rel="noopener noreferrer"&gt;JLama&lt;/a&gt; (a 100% Java inference engine) is developed by Jake Luciani and supports a whole range of LLM's : &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gemma &amp;amp; Gemma 2 Models&lt;/li&gt;
&lt;li&gt;Llama &amp;amp; Llama2 &amp;amp; Llama3 Models&lt;/li&gt;
&lt;li&gt;Mistral &amp;amp; Mixtral Models&lt;/li&gt;
&lt;li&gt;Qwen2 Models&lt;/li&gt;
&lt;li&gt;GPT-2 Models&lt;/li&gt;
&lt;li&gt;BERT Models&lt;/li&gt;
&lt;li&gt;BPE Tokenizers&lt;/li&gt;
&lt;li&gt;WordPiece Tokenizers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's his Devoxx Belgium 2024 presentation with more information and demos.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/p-p_oRjEVow"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;From a feature perspective, this is the most advanced Java implementation currently available. It even supports LLM sharding at the layer and attention-head level 🤩&lt;/p&gt;

&lt;p&gt;Features include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Paged Attention&lt;/li&gt;
&lt;li&gt;Mixture of Experts&lt;/li&gt;
&lt;li&gt;Tool Calling&lt;/li&gt;
&lt;li&gt;Generate Embeddings&lt;/li&gt;
&lt;li&gt;Classifier Support&lt;/li&gt;
&lt;li&gt;Huggingface SafeTensors model and tokenizer format&lt;/li&gt;
&lt;li&gt;Support for F32, F16, BF16 types&lt;/li&gt;
&lt;li&gt;Support for Q8, Q4 model quantization&lt;/li&gt;
&lt;li&gt;Fast GEMM operations&lt;/li&gt;
&lt;li&gt;Distributed Inference!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://github.com/tjake/Jlama" rel="noopener noreferrer"&gt;JLama&lt;/a&gt; requires Java 20 or later and utilises the new Vector API for faster inference.&lt;/p&gt;

&lt;p&gt;You can easily run JLama on your computer; on Apple Silicon, make sure you have an ARM-based SDK.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export JAVA_HOME=/Library/Java/JavaVirtualMachines/liberica-jdk-21.jdk/Contents/Home
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can start the inference service by running JLama with the restapi parameter and the optional auto-download flag.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;jlama restapi tjake/TinyLlama-1.1B-Chat-v1.0-Jlama-Q4 --auto-download
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will download the model if you haven't already done so.&lt;/p&gt;
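&lt;p&gt;Once the service is up, you can talk to it over HTTP. Here's a minimal sketch, assuming JLama's default port (8080) and an OpenAI-style chat completions path; check the JLama docs for the exact endpoint of your version:&lt;/p&gt;

```shell
# Hypothetical request to a locally running JLama REST API.
# Port and path are assumptions based on JLama's OpenAI-compatible API.
JLAMA_URL="http://localhost:8080/chat/completions"
PAYLOAD='{"messages": [{"role": "user", "content": "Explain the JVM in one sentence."}]}'
# The curl call is guarded so the sketch is safe to run without a server.
curl -s -X POST "$JLAMA_URL" -H "Content-Type: application/json" -d "$PAYLOAD" || true
```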

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6v17hx0n4rr6hqy8cknl.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6v17hx0n4rr6hqy8cknl.jpeg" alt="Experimental JLama and DevoxxGenie integration" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvc1b1lb9vfhkd2g5wxp0.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvc1b1lb9vfhkd2g5wxp0.jpeg" alt="Alina and Alfonso at Devoxx Belgium 2024" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Llama3.java Project
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/mukel/llama3.java" rel="noopener noreferrer"&gt;Llama3.java&lt;/a&gt; is also a 100% Java implementation developed by Alfonso² Peterssen and inspired by Andrej Karpathy. &lt;/p&gt;

&lt;p&gt;Features include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single file, no dependencies&lt;/li&gt;
&lt;li&gt;GGUF format parser&lt;/li&gt;
&lt;li&gt;Llama 3 tokenizer based on minbpe&lt;/li&gt;
&lt;li&gt;Llama 3 inference with Grouped-Query Attention&lt;/li&gt;
&lt;li&gt;Support Llama 3.1 (ad-hoc RoPE scaling) and 3.2 (tie word embeddings)&lt;/li&gt;
&lt;li&gt;Support for Q8_0 and Q4_0 quantizations&lt;/li&gt;
&lt;li&gt;Fast matrix-vector multiplication routines for quantized tensors using Java's Vector API&lt;/li&gt;
&lt;li&gt;Simple CLI with --chat and --instruct modes.&lt;/li&gt;
&lt;li&gt;GraalVM's Native Image support (early-access builds available)&lt;/li&gt;
&lt;li&gt;AOT model pre-loading for instant time-to-first-token&lt;/li&gt;
&lt;/ul&gt;
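&lt;p&gt;The single-file CLI from the list above can be launched along these lines; note that the model file name below is a placeholder and the exact flags may differ between releases:&lt;/p&gt;

```shell
# Illustrative launch of the single-file Llama3.java CLI in chat mode.
# The model file name is a placeholder; flags may vary by release.
MODEL_FILE="Llama-3.2-1B-Instruct-Q4_0.gguf"
# Guarded so the sketch is safe to run without the source file or model.
java --enable-preview --add-modules jdk.incubator.vector \
     Llama3.java --model "$MODEL_FILE" --chat || true
```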

&lt;p&gt;Here's the Devoxx Belgium 2024 presentation by Alfonso and Alina.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/zgAMxC7lzkc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  Llama3.java + (OpenAI) REST API
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/mukel/llama3.java" rel="noopener noreferrer"&gt;Llama3.java&lt;/a&gt; doesn't have any REST interface so I decided to contribute that part ❤️&lt;/p&gt;

&lt;p&gt;I've added a Spring Boot wrapper around the core &lt;a href="https://github.com/mukel/llama3.java" rel="noopener noreferrer"&gt;Llama3.java&lt;/a&gt; library, allowing developers to easily set up and run an OpenAI-compatible REST API for text generation and chat completions. The goal is to use this as the 100% Java inference engine for the IDEA &lt;a href="https://plugins.jetbrains.com/plugin/24169-devoxxgenie" rel="noopener noreferrer"&gt;DevoxxGenie&lt;/a&gt; plugin, allowing local inference using a complete Java solution.&lt;/p&gt;

&lt;p&gt;The code is available on &lt;a href="https://github.com/stephanj/Llama3JavaChatCompletionService" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For the time being, I've copied the Llama3.java source code into my project, but ideally this should be integrated as a Maven dependency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Features
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;OpenAI-compatible API: The project implements an API that mimics OpenAI's chat completions endpoint, making it easy to integrate with existing applications.&lt;/li&gt;
&lt;li&gt;Support for GGUF Models: Llama3.java can work with GGUF (GPT-Generated Unified Format) models, which are optimised for efficiency and performance.&lt;/li&gt;
&lt;li&gt;Vector API Utilization: The project leverages Java's incubator Vector API for improved performance on matrix operations.&lt;/li&gt;
&lt;li&gt;Cross-Platform Compatibility: While optimized for Apple Silicon (M1/M2/M3), the project can run on various platforms with the appropriate Java SDK.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Getting Started
&lt;/h3&gt;

&lt;p&gt;To get started with Llama3.java, follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Setup: Ensure you have a compatible Java SDK installed. For Apple Silicon users, an ARM-compliant SDK is recommended.&lt;/li&gt;
&lt;li&gt;Build: Use Maven to build the project with "mvn clean package".&lt;/li&gt;
&lt;li&gt;Download a Model: Obtain a GGUF model from the Hugging Face model hub and place it in the 'models' directory.&lt;/li&gt;
&lt;li&gt;Configure: Update the application.properties file with your model details and server settings.&lt;/li&gt;
&lt;li&gt;Run: Start the Spring Boot application using the provided Java command.&lt;/li&gt;
&lt;/ol&gt;
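&lt;p&gt;As a command-line sketch, the steps above look roughly like this; the jar name is illustrative, so check the artifact name your build actually produces:&lt;/p&gt;

```shell
# Steps 1-5 above as commands. Build and run are guarded so the sketch
# can be executed safely even where Maven or the jar are unavailable.
MODEL_DIR="models"
mkdir -p "$MODEL_DIR"                      # step 3: a GGUF model goes here
mvn clean package || true                  # step 2: build the project
# step 5: run the Spring Boot app with the incubator Vector API enabled
java --enable-preview --add-modules jdk.incubator.vector \
     -jar target/llama3-chat-completion-service.jar || true
```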

&lt;h2&gt;
  
  
  DevoxxGenie
&lt;/h2&gt;

&lt;p&gt;When the &lt;a href="https://github.com/mukel/llama3.java" rel="noopener noreferrer"&gt;Llama3.java&lt;/a&gt; Spring Boot application is running, you can use &lt;a href="https://plugins.jetbrains.com/plugin/24169-devoxxgenie" rel="noopener noreferrer"&gt;DevoxxGenie&lt;/a&gt; for local inference 🤩&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17rqyqrp7b0vznv1xig3.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17rqyqrp7b0vznv1xig3.jpeg" alt="DevoxxGenie" width="800" height="1197"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Future Directions
&lt;/h3&gt;

&lt;p&gt;The next step is to move the MatMul bottleneck to the GPU using TornadoVM. Beyond that, the roadmap includes: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Externalise Llama3.java as a Maven dependency (if/when available)&lt;/li&gt;
&lt;li&gt;Add GPU support using TornadoVM &lt;/li&gt;
&lt;li&gt;GraalVM native versions 🍏&lt;/li&gt;
&lt;li&gt;LLM sharding capabilities&lt;/li&gt;
&lt;li&gt;Support for different models: BitNets &amp;amp; Ternary Models&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/mukel/llama3.java" rel="noopener noreferrer"&gt;Llama3.java&lt;/a&gt; and &lt;a href="https://github.com/tjake/Jlama" rel="noopener noreferrer"&gt;JLama&lt;/a&gt; represents a significant step forward in bringing large language model capabilities to the Java ecosystem. By providing an easy-to-use, OpenAI-compatible API and leveraging Java's latest performance features, this project opens up new possibilities for AI-driven applications in Java.&lt;/p&gt;

&lt;p&gt;Whether you're building a chatbot, a content generation tool, or any application that could benefit from advanced language processing, Llama3.java and JLama offer a promising solution. &lt;/p&gt;

&lt;p&gt;As these projects continue to evolve and optimise, they are well worth keeping an eye on for Java developers interested in the cutting edge of AI technology.&lt;/p&gt;

&lt;p&gt;Exciting times for Java Developers! ☕️🔥❤️&lt;/p&gt;

&lt;p&gt;~ Stephan Janssen&lt;/p&gt;

</description>
      <category>java</category>
      <category>llm</category>
      <category>llama3</category>
    </item>
    <item>
      <title>The Power of Full Project Context using LLM's</title>
      <dc:creator>Stephan Janssen</dc:creator>
      <pubDate>Wed, 03 Jul 2024 08:10:25 +0000</pubDate>
      <link>https://dev.to/stephanj/the-power-of-full-project-context-using-llms-463c</link>
      <guid>https://dev.to/stephanj/the-power-of-full-project-context-using-llms-463c</guid>
      <description>&lt;p&gt;I've tried integrating RAG into the DevoxxGenie plugin, but why limit myself to just some parts found through similarity search when I can go all out?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;RAG is so June 2024 😂&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here's a mind-blowing secret: most of the latest features in the Devoxx Genie plugin were essentially 'developed' by the latest Claude 3.5 Sonnet large language model using the entire project code base as prompt context 🧠 🤯&lt;/p&gt;

&lt;p&gt;It's like having an expert senior developer guiding the development process, suggesting 100% correct implementations for the following Devoxx Genie features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Allow a streaming response to be stopped&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Keep selected LLM provider after settings page&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Auto complete commands&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add files based on filtered text&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Show file icons in list&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Show plugin version number in settings page with GitHub link&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Support for higher timeout values&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Show progress bar and token usage bar&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I promptly cancelled my OpenAI subscription and gave my credit card details to Anthropic...&lt;/p&gt;

&lt;h2&gt;
  
  
  Full Project Context
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;A Quantum Leap Beyond GitHub Copilot&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Imagine having your entire project at your AI assistant's fingertips. That's now a reality with the latest version of the Devoxx Genie IDEA plugin together with cloud-based models like Claude Sonnet 3.5. &lt;/p&gt;

&lt;p&gt;BTW How long will it take until we can do this with local models?! &lt;/p&gt;

&lt;h2&gt;
  
  
  Add full project to prompt
&lt;/h2&gt;

&lt;p&gt;The latest version of the plugin allows you to add the full project to your prompt; your entire codebase now becomes part of the AI's context. This feature offers a depth of understanding that traditional code completion tools can only dream of.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzjapt9yhylw5692gq7xk.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzjapt9yhylw5692gq7xk.jpg" alt="Full Project Context"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Smart Model Selection and Cost Estimation
&lt;/h2&gt;

&lt;p&gt;The language model dropdown is not just a list anymore, it's your 'compass' for smart model selection 🤩 👇🏼&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;See available context window sizes for each cloud model&lt;/li&gt;
&lt;li&gt;View associated costs upfront&lt;/li&gt;
&lt;li&gt;Make data-driven decisions on which model to use for your project&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikg0q81jr411jsqivt2m.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikg0q81jr411jsqivt2m.jpg" alt="Smart Model DropDown"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Visualizing Your Context Usage
&lt;/h2&gt;

&lt;p&gt;Leverage the prompt cost calculator for precise budget management: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track token usage with a progress bar&lt;/li&gt;
&lt;li&gt;Get real-time updates on how much of the context window you're using&lt;/li&gt;
&lt;/ul&gt;
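&lt;p&gt;The cost figure behind that bar is simple arithmetic: tokens divided by a million, times the provider's price per million input tokens. A back-of-the-envelope sketch, where the price below is illustrative (check your provider's current pricing):&lt;/p&gt;

```shell
# Estimate the prompt cost shown by the token usage bar.
# PRICE_PER_MTOK is an illustrative USD price per million input tokens.
TOKENS=70000
PRICE_PER_MTOK=3.00
COST=$(awk -v t="$TOKENS" -v p="$PRICE_PER_MTOK" 'BEGIN { printf "%.2f", t / 1000000 * p }')
echo "$COST USD"
```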

&lt;p&gt;Calculate token cost with Claude Sonnet 3.5&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6u8kv9csjhl6hiwj70h.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6u8kv9csjhl6hiwj70h.jpg" alt="Claude Sonnet 3.5"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Calculate cost with Google Gemini 1.5 Flash&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsd05dj2q61eune0drg0d.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsd05dj2q61eune0drg0d.jpg" alt="Gemini 1.5 Flash"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9a0vsxjb7ypyf1zppcb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9a0vsxjb7ypyf1zppcb.jpg" alt="Project Added"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Cloud Models Overview
&lt;/h2&gt;

&lt;p&gt;Via the plugin settings pages you can see the "Token Cost &amp;amp; Context Window" for all the available cloud models. In an upcoming release you will be able to update this table. I should probably also support the context windows of local models... #PullRequestsAreWelcome &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwgiz7qkz3b8kgdyjvqq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwgiz7qkz3b8kgdyjvqq.jpg" alt="Cloud Models Overview"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Handling Massive Projects?
&lt;/h2&gt;

&lt;p&gt;"But wait, my project is HUGE!" you might say 😅 Fear not. We've got options:&lt;/p&gt;

&lt;h3&gt;
  
  
  Leverage Gemini's Massive Context:
&lt;/h3&gt;

&lt;p&gt;Gemini's colossal 1 million token window isn't just big, it's massive. We're talking about the capacity to ingest approximately 30,000 lines of code in a single prompt. That's enough to digest many codebases, from the tiniest scripts to some decently large projects. &lt;br&gt;
But if that's not enough, you have more options...&lt;/p&gt;
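&lt;p&gt;That 30,000-line figure implies a generous budget of roughly 33 tokens per line of code:&lt;/p&gt;

```shell
# Sanity check of the figure above: token budget per line of code
# if a 1,000,000-token window holds roughly 30,000 lines.
WINDOW_TOKENS=1000000
LINES_OF_CODE=30000
TOKENS_PER_LINE=$(( WINDOW_TOKENS / LINES_OF_CODE ))
echo "$TOKENS_PER_LINE tokens per line"
```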

&lt;p&gt;BTW Google will be releasing 2M and even 10M token windows in the near future.&lt;/p&gt;

&lt;h3&gt;
  
  
  Smart Filtering:
&lt;/h3&gt;

&lt;p&gt;The new "Copy Project" plugin settings panel lets you&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exclude specific directories &lt;/li&gt;
&lt;li&gt;Filter by file extensions&lt;/li&gt;
&lt;li&gt;Remove JavaDocs to slim down your context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fekwilqkm2lmb659zjl14.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fekwilqkm2lmb659zjl14.jpg" alt="Smart Filtering"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Selective Inclusion
&lt;/h3&gt;

&lt;p&gt;Right-click to add only the most relevant parts of your project to the context and/or clipboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdvikarhvwbov100wmrt.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdvikarhvwbov100wmrt.jpg" alt="Right Click Options"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also copy your project to the clipboard, allowing you to paste your project code into an external chat window. This is a useful technique for sharing and collaborating on code 👍🏼 &lt;/p&gt;

&lt;h2&gt;
  
  
  The Power of Full Context: A Real-World Example
&lt;/h2&gt;

&lt;p&gt;The DevoxxGenie project itself, at about 70K tokens, fits comfortably within most high-end LLM context windows. This allows for incredibly nuanced interactions – we're talking advanced queries and feature requests that leave tools like GitHub Copilot scratching their virtual heads!&lt;/p&gt;
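&lt;p&gt;To put that in perspective: against Claude 3.5 Sonnet's 200K-token context window, a 70K-token project uses only about a third of the available space, leaving plenty of room for the conversation itself:&lt;/p&gt;

```shell
# Fraction of a 200K-token context window used by a 70K-token project.
PROJECT_TOKENS=70000
WINDOW_TOKENS=200000
PCT_USED=$(( PROJECT_TOKENS * 100 / WINDOW_TOKENS ))
echo "$PCT_USED% of the window used"
```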

&lt;h2&gt;
  
  
  Conclusion: Stepping into the Future of Development
&lt;/h2&gt;

&lt;p&gt;With Claude 3.5 Sonnet, Devoxx Genie isn't just another developer tool... it's a glimpse into the future of software engineering. As we eagerly await Claude 3.5 Opus, one thing is clear: we're witnessing a paradigm shift in AI-augmented programming.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Alan Turing, were he here today, might just say we've taken a significant leap towards AGI (for developers with Claude Sonnet 3.5)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Welcome to the cutting edge of AI-assisted development - welcome to DevoxxGenie 🚀&lt;/p&gt;

&lt;p&gt;&lt;a href="https://x.com/devoxx.genie" rel="noopener noreferrer"&gt;X Twitter&lt;/a&gt; - &lt;a href="https://github.com/devoxx/DevoxxGenieIDEAPlugin" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; - &lt;a href="https://plugins.jetbrains.com/plugin/24169-devoxxgenie" rel="noopener noreferrer"&gt;IntelliJ MarketPlace&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devoxxgenie</category>
      <category>claudeai</category>
      <category>idea</category>
      <category>intelli</category>
    </item>
    <item>
      <title>Devoxx Genie Plugin : an Update</title>
      <dc:creator>Stephan Janssen</dc:creator>
      <pubDate>Tue, 28 May 2024 11:32:10 +0000</pubDate>
      <link>https://dev.to/stephanj/devoxx-genie-plugin-an-update-53hg</link>
      <guid>https://dev.to/stephanj/devoxx-genie-plugin-an-update-53hg</guid>
      <description>&lt;p&gt;When I invited Anton Arhipov from JetBrains to present during the Devoxx Belgium 2023 keynote their early Beta AI Assistant, I was eager to learn if they would support local modals, as shown in the screenshot above. &lt;/p&gt;

&lt;p&gt;After seven months without any related news, it seemed unlikely that this would happen. So, I decided to develop my own IDEA plugin to support as many local and even cloud-based LLMs as possible. "DevoxxGenie" was born ❤️&lt;/p&gt;

&lt;p&gt;&lt;a href="https://plugins.jetbrains.com/plugin/24169-devoxxgenie" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwoqwyguem353aswy0thk.png" alt="IntelliJ Marketplace" width="400" height="79"&gt;&lt;/a&gt;&lt;br&gt;
Of course, I conducted a market study and couldn't find any plugins that were fully developed in Java. Even GitHub Copilot, which doesn't allow you to select a local LLM, is primarily developed in Kotlin and native code. More importantly, such plugins are often closed source.&lt;/p&gt;

&lt;p&gt;I had already built up substantial LLM expertise by integrating &lt;a href="https://github.com/langchain4j/langchain4j" rel="noopener noreferrer"&gt;LangChain4J&lt;/a&gt; into the CFP.DEV web app, as well as developing Devoxx Insights (using Python) in early 2023. More recently, I created &lt;a href="https://github.com/stephanj/rag-genie" rel="noopener noreferrer"&gt;RAG Genie&lt;/a&gt;, which allows you to debug your RAG steps using Langchain4J and Spring Boot.&lt;/p&gt;
&lt;h2&gt;
  
  
  Swing Development
&lt;/h2&gt;

&lt;p&gt;I had never developed an IDEA plugin so I started studying some existing plugins to understand how they work. I noticed that some use a local web server, allowing them to more easily output the LLM response in HTML and stream it to the plugin.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foiw9v99tqnibkpa1mnho.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foiw9v99tqnibkpa1mnho.jpeg" alt="Trying to understand how the IDEA plugins work" width="800" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I wanted to start with a simple input prompt and focus on using the "good-old" JEditorPane Swing component which does support basic HTML rendering. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbtt1n1dmc1a3ofe5o1o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbtt1n1dmc1a3ofe5o1o.png" alt="JEditorPane rendering HTML" width="800" height="898"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By asking the LLM to respond in Markdown, I could parse the Markdown so each document node could be rendered to HTML while adding extra styling and UI components. For example, code blocks include an easy-to-use "copy-to-clipboard" button or an "insert code" button (as shown in the screenshot above).&lt;/p&gt;
&lt;h2&gt;
  
  
  Focus on Local LLM's
&lt;/h2&gt;

&lt;p&gt;I focused on supporting &lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;, &lt;a href="https://gpt4all.io/index.html" rel="noopener noreferrer"&gt;GPT4All&lt;/a&gt;, and &lt;a href="https://lmstudio.ai/" rel="noopener noreferrer"&gt;LMStudio&lt;/a&gt;, all of which run smoothly on a Mac computer. Many of these tools are user-friendly wrappers around &lt;a href="https://github.com/ggerganov/llama.cpp" rel="noopener noreferrer"&gt;Llama.cpp&lt;/a&gt;, allowing easy model downloads and providing a REST interface to query the available models.&lt;br&gt;
Last week, I also added &lt;a href="https://jan.ai" rel="noopener noreferrer"&gt;"👋🏼 Jan"&lt;/a&gt; support because HuggingFace has endorsed this provider out-of-the-box.&lt;/p&gt;
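&lt;p&gt;All of these providers expose a local REST interface the plugin can query. For example, a direct request to Ollama's generate endpoint (assuming a server on Ollama's default port 11434) looks like this:&lt;/p&gt;

```shell
# Query a local Ollama server directly (default port 11434).
# Guarded with || true so the sketch is safe to run without a server.
OLLAMA_URL="http://localhost:11434/api/generate"
PAYLOAD='{"model": "llama3", "prompt": "Why is the JVM fast?", "stream": false}'
curl -s -X POST "$OLLAMA_URL" -d "$PAYLOAD" || true
```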
&lt;h2&gt;
  
  
  Cloud LLM's, why not?
&lt;/h2&gt;

&lt;p&gt;Because I use ChatGPT on a daily basis and occasionally experiment with Anthropic Claude, I quickly decided to also support LLM cloud providers. A couple of weeks ago, Google released Gemini with API keys for Europe, so I promptly integrated those too. With support for OpenAI, Anthropic, Groq, Mistral, DeepInfra, and Gemini, I believe I have covered all the major players in the field.&lt;br&gt;
Please let me know if I'm missing any!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F45tillr4derhzrqmp7db.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F45tillr4derhzrqmp7db.png" alt="Snopshot of theDevoxxGenie LLM settings page" width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Configurable Chat Memory
&lt;/h2&gt;

&lt;p&gt;The size of the chat memory can now be configured in v0.1.14 in the Settings page. This makes sense when you use an LLM which has a large context window, for example Gemini with 1M tokens.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmaddhg5un5gjv213toak.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmaddhg5un5gjv213toak.png" alt="Chat Memory" width="800" height="102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The beauty of chat memory supporting different LLM providers is that with a single prompt, you can ask one model to review some code, then switch to another model to review the previous model's answer 🤩&lt;/p&gt;
&lt;h2&gt;
  
  
  Multi-LLM Collaborative Review
&lt;/h2&gt;

&lt;p&gt;The end result is a "Multi-LLM Collaborative Review" process, leveraging multiple large language models to sequentially review and evaluate each other's responses, facilitating a more comprehensive and nuanced analysis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsazj1dbubvgcch6z3zma.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsazj1dbubvgcch6z3zma.png" alt="Multi-LLM Collaborative Review" width="800" height="568"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The results are really fascinating. For example, I asked Mistral how I could improve a certain Java class and then had OpenAI (GPT-4o) review Mistral's response! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcnxricmfa6071kq0dw21.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcnxricmfa6071kq0dw21.png" alt="Mistral 8x7B using Groq" width="800" height="787"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I then switched to OpenAI GPT-4o and asked it to review the Mistral response.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1lnvsef014b3lfmqjaa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1lnvsef014b3lfmqjaa.png" alt="GPT-4o using OpenAI" width="800" height="785"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This all results in better code (refactoring) suggestions 🚀 &lt;/p&gt;
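&lt;p&gt;The review chain described above can be sketched in plain Java. Note that &lt;code&gt;LlmClient&lt;/code&gt; and the chaining logic below are illustrative assumptions, not the plugin's actual implementation:&lt;/p&gt;

```java
import java.util.List;

// Hypothetical sketch of a multi-LLM collaborative review chain.
// The LlmClient interface is an illustrative stand-in for whatever
// client abstraction (OpenAI, Mistral, Groq, ...) is actually used.
interface LlmClient {
    String complete(String prompt);
}

class CollaborativeReview {

    // The first model answers the question; each subsequent model
    // reviews and refines the previous answer.
    static String reviewChain(String question, List<LlmClient> models) {
        String answer = models.get(0).complete(question);
        for (LlmClient reviewer : models.subList(1, models.size())) {
            answer = reviewer.complete(
                "Review and improve the following answer to \"" + question + "\":\n" + answer);
        }
        return answer;
    }
}
```

&lt;p&gt;Because each reviewer only sees the question plus the previous answer, models from different providers can be mixed freely in the chain.&lt;/p&gt;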
&lt;h2&gt;
  
  
  Streaming Responses
&lt;/h2&gt;

&lt;p&gt;The latest version of &lt;a href="https://github.com/devoxx/DevoxxGenieIDEAPlugin/" rel="noopener noreferrer"&gt;DevoxxGenie&lt;/a&gt; (v0.1.14) now also supports the option to stream the results directly to the plugin, enhancing real-time interaction and responsiveness. &lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/V8KopHVz8zY"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;It's still a beta feature because I need to find a way to add "Copy to Clipboard" or "Insert into Code" buttons at the start of each code block. I do accept PRs, so if you know how to make this happen, some community ❤️ would be very welcome.&lt;/p&gt;
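&lt;p&gt;Conceptually, streaming boils down to appending tokens to a buffer and repainting the UI on every partial update. A minimal sketch, assuming a callback-based client (the class and method names are illustrative, not the plugin's actual API):&lt;/p&gt;

```java
import java.util.function.Consumer;

// Minimal sketch of streaming token handling: the LLM client is assumed
// to deliver tokens one by one, and each partial result is pushed to the
// UI (e.g. the chat panel) via a callback.
class StreamingResponseHandler {

    private final StringBuilder buffer = new StringBuilder();
    private final Consumer<String> onPartialUpdate;

    StreamingResponseHandler(Consumer<String> onPartialUpdate) {
        this.onPartialUpdate = onPartialUpdate;
    }

    // Called for every token as it arrives from the model.
    void onToken(String token) {
        buffer.append(token);
        onPartialUpdate.accept(buffer.toString()); // e.g. repaint the chat panel
    }

    String fullResponse() {
        return buffer.toString();
    }
}
```

&lt;p&gt;The tricky part mentioned above is detecting code-block boundaries inside this incremental stream so that action buttons can be inserted at the right place.&lt;/p&gt;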

&lt;h2&gt;
  
  
  Program Structure Interface Driven (PSI) Context Prompt
&lt;/h2&gt;

&lt;p&gt;Another new feature I developed for v0.1.14 is support for "smart(er) prompt context" using Program Structure Interface (PSI). PSI is the layer in the IntelliJ Platform responsible for parsing files and creating the syntactic and semantic code model of a project.&lt;/p&gt;

&lt;p&gt;PSI allows me to populate the prompt with more information about a class without the user having to add the extra info themselves. It's similar to the Abstract Syntax Tree (AST) in Java, but PSI has extra knowledge about the project structure, externally used libraries, search features, and much more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmn7a5wtwmp58d4c2ybf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmn7a5wtwmp58d4c2ybf.png" alt="AST Settings" width="800" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a result, the PSIAnalyzerService class (currently Java-focused) can automatically inject more code details into the chat prompt.&lt;/p&gt;
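&lt;p&gt;Outside the IDE, plain reflection can illustrate the idea: gather a class's declared method signatures and prepend them to the user's prompt. This is only an analogy of what PSIAnalyzerService does — the real implementation uses PSI, which knows far more (project structure, libraries, usages) than reflection ever could:&lt;/p&gt;

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.stream.Collectors;

// Reflection-based analogy of PSI-driven context: collect a class's
// declared method signatures and inject them into the chat prompt as
// extra context. Illustrative only; the plugin uses PSI, not reflection.
class PromptContextBuilder {

    static String describeClass(Class<?> clazz) {
        String methods = Arrays.stream(clazz.getDeclaredMethods())
                .map(Method::toString)
                .sorted()
                .collect(Collectors.joining("\n"));
        return "Class " + clazz.getName() + " declares:\n" + methods;
    }

    // Prepend the class description so the LLM sees it before the question.
    static String withContext(String userPrompt, Class<?> clazz) {
        return describeClass(clazz) + "\n\n" + userPrompt;
    }
}
```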

&lt;p&gt;PSI-driven context prompts are really another way to introduce some basic Retrieval Augmented Generation (RAG) into the equation 💪🏻 &lt;/p&gt;

&lt;h2&gt;
  
  
  What's next?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Auto completion??
&lt;/h3&gt;

&lt;p&gt;I'm not a big fan of TAB-based auto completion, where the editor is constantly bombarded with code suggestions that often don't make sense. Because the plugin is LLM-agnostic, it would also be much harder to implement, given the lack of speed and quality when using local LLMs. However, it could make sense to support this with the currently smarter cloud-based LLMs. &lt;/p&gt;

&lt;h3&gt;
  
  
  RAG support?
&lt;/h3&gt;

&lt;p&gt;Embedding your IDEA project files using a RAG service could make sense, but this would probably need to happen outside of the plugin because of the storage and background processing it requires. I've noticed that existing plugins use an external Docker image which includes some kind of REST service. Suggestions are welcome.&lt;/p&gt;

&lt;h3&gt;
  
  
  "JIRA" support?
&lt;/h3&gt;

&lt;p&gt;Wouldn't it be great if you could paste a (JIRA) issue and the plugin figured out how to fix/resolve it? A bit like what Devin was promised to do... &lt;/p&gt;

&lt;h3&gt;
  
  
  Compile &amp;amp; Run Unit tests?
&lt;/h3&gt;

&lt;p&gt;When you ask the plugin to write a unit test, the plugin could also compile the suggested code and even run it (using a REPL?). That would be an interesting R&amp;amp;D exercise IMHO.&lt;/p&gt;
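&lt;p&gt;The JDK actually ships a REPL API that could be a starting point for this: JShell. A minimal sketch of evaluating a generated snippet and inspecting the result (error handling and multi-snippet parsing are left out):&lt;/p&gt;

```java
import jdk.jshell.JShell;
import jdk.jshell.SnippetEvent;
import java.util.List;

// Sketch of the "compile and run" idea using the JDK's built-in JShell
// API: evaluate a code snippet and return its value as a string.
class SnippetRunner {

    static String eval(String code) {
        try (JShell shell = JShell.create()) {
            List<SnippetEvent> events = shell.eval(code);
            // value() is null when the snippet failed to compile or threw.
            return events.get(0).value();
        }
    }
}
```

&lt;p&gt;Running a full LLM-generated unit test would of course also need the project's classpath and a test runner, but JShell shows the core loop is feasible without leaving the JVM.&lt;/p&gt;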

&lt;h3&gt;
  
  
  Introduce Agents
&lt;/h3&gt;

&lt;p&gt;All of the above will most likely result in introducing smart(er) agents that perform some extra LLM magic using shell scripts and/or Docker services... &lt;/p&gt;

&lt;h2&gt;
  
  
  Community Support
&lt;/h2&gt;

&lt;p&gt;As of this writing, the plugin has already been downloaded 1,127 times. The actual number is likely higher because the &lt;a href="https://github.com/devoxx/DevoxxGenieIDEAPlugin/" rel="noopener noreferrer"&gt;Devoxx Genie GitHub project&lt;/a&gt; also publishes plugin builds in the releases, allowing users to manually install them in their IDEA.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfwfnctdqf1vvth52i10.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfwfnctdqf1vvth52i10.png" alt="IntelliJ Marketplace Downloads" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'm hoping the project will gain more traction and that the developer community will step up to help with new features or even bug fixes. This was one of the main reasons for open-sourcing the project.&lt;br&gt;
"We ❤️ Open Source" 😜&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4d61mjsnqzdlssvqthgi.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4d61mjsnqzdlssvqthgi.jpeg" alt="We Love Open Source" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devoxx</category>
      <category>genai</category>
      <category>openai</category>
      <category>ollama</category>
    </item>
  </channel>
</rss>
