Large Language Models (LLMs) have fundamentally changed how we interact with artificial intelligence, powering everything from advanced coding assistants to everyday conversational bots. However, relying on cloud-based giants like OpenAI or Google means sending your private data over the internet and paying for ongoing API costs.
The solution is running AI locally. Highly efficient models like Microsoft's Phi-3 are leading this revolution, proving that you no longer need massive data centers to achieve incredible results. Phi-3 delivers state-of-the-art reasoning and performance while being compact enough to run smoothly on your own private GPU server.
To harness the power of models like Phi-3 on an Ubuntu 24.04 GPU server, we will use Ollama. Ollama is a powerful, developer-friendly tool that dramatically simplifies downloading, managing, and running open-source LLMs. Instead of wrestling with complex Python environments, complicated dependencies, and manually loading model weights, Ollama acts as a streamlined local API server. It handles the heavy lifting of model inference in the background, automatically utilizing your server's GPU to ensure fast generation speeds.
Finally, an AI model isn't very accessible if you can only interact with it through a command-line terminal. That's where the WebUI comes in. In this tutorial, we will outline how to build a lightweight, Flask-based Web User Interface that connects directly to your local Ollama instance. This setup will give you a sleek, browser-based chat experience—just like ChatGPT—but hosted entirely on your own hardware. By the end of this guide, you will have a fully functional, highly performant AI ecosystem that offers complete control over your privacy, performance, and costs.
Note: This article provides a high-level overview of the architecture and steps. If you want to see the exact code, terminal commands, and scripts to build this, please visit the website linked at the bottom of this post!
Prerequisites
Before diving into the setup, ensure you have the following:
- An Ubuntu 24.04 server with an NVIDIA GPU.
- A non-root user with sudo privileges.
- NVIDIA drivers installed on your server.
1. Install Ollama
Ollama provides a streamlined, lightweight environment for running powerful large language models (such as Phi-3) entirely locally. It simplifies the AI lifecycle by automatically managing model downloads, caching, and API serving. Plus, setting it up on Ubuntu 24.04 is a remarkably quick and frictionless process.
Ollama offers an automated shell script that installs everything you need in just a few moments straight from your terminal. Once the installation script sets up the necessary user groups and background services, all you have to do is reload your environment. From there, you can enable the Ollama service to start on boot and verify it is actively running by pinging its local port.
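The steps above can be sketched as follows. The install script URL and the port (11434) are Ollama's documented defaults:

```shell
# Download and run the official Ollama install script
curl -fsSL https://ollama.com/install.sh | sh

# Enable the service to start on boot, then start it now
sudo systemctl enable ollama
sudo systemctl start ollama

# Verify the server is listening on its default local port
curl http://localhost:11434
```

A healthy instance replies to that last request with "Ollama is running".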
2. Run a Phi-3 Model with Ollama
With the Ollama backend up and running, you can now load Phi-3: a compact, high-speed model ideal for local deployment on GPU-powered servers.
You can initiate the Phi-3 model directly through Ollama's built-in run command. If Phi-3 isn’t already on your system, Ollama handles the heavy lifting by downloading the model locally and immediately launching an interactive terminal session. This allows you to type messages and watch Phi-3 respond in real-time. Once you are finished testing its reasoning capabilities, you can safely exit the session to return to your standard prompt.
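Concretely, the interactive session looks like this (the tag `phi3` pulls the default Phi-3 variant from the Ollama model library):

```shell
# Downloads the model on first use, then opens an interactive chat
ollama run phi3

# Inside the session, type a message and press Enter to get a reply;
# type /bye to exit and return to your shell prompt.
```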
3. Use Ollama Programmatically with cURL
Beyond the terminal, Ollama excels via its local REST API—which is perfect for:
- Building custom tools.
- Integrating AI into your existing apps.
- Testing complex prompts programmatically.
You can interact with the Ollama backend by sending a prompt to the Phi-3 model via a standard HTTP POST request. By sending a JSON payload containing the model name and your prompt, you bypass the terminal entirely. Ollama will return a structured JSON response containing the AI's generated text, context metrics, and processing duration, making it incredibly easy to parse for automated workflows.
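A minimal example of such a request, assuming the default port 11434 and the `phi3` model tag; setting `"stream": false` returns the full answer as a single JSON object rather than a token stream:

```shell
# POST a prompt to Ollama's generate endpoint
curl http://localhost:11434/api/generate -d '{
  "model": "phi3",
  "prompt": "Explain quantum computing in one sentence.",
  "stream": false
}'
```

The JSON response carries the generated text in its `response` field, alongside timing fields such as `total_duration`.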
4. Create a Flask WebUI for Chat Interface
Make interacting with Phi-3 more intuitive by creating a simple Flask WebUI. This interface lets you send prompts to the model, view responses instantly, and experience Phi-3 like a local ChatGPT!
Prepare the Environment: First, set up and activate an isolated Python virtual environment to keep your system clean, then install Flask and the necessary request libraries.
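For example, assuming Python 3 is available on the server (it ships with Ubuntu 24.04) and `venv` is your chosen environment name:

```shell
# Create and activate an isolated virtual environment
python3 -m venv venv
source venv/bin/activate

# Install Flask and the requests library
pip install flask requests
```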
Build the Backend: Create a Flask application script. This script will set up a local web server and define a route that takes user input from the web page and forwards it to Ollama's local API generation endpoint.
Design the Frontend: Create a template folder and add an HTML index file. This file will serve as your visual chat interface, featuring a text area for inputs, a send button, and a JavaScript function to asynchronously fetch and display the AI's response.
Launch the UI: Start the Flask application and open your server's IP address on the designated port in your web browser.
You can now type a prompt, for example, "Give me 3 startup ideas related to climate change", click send, and watch Phi-3 generate answers instantly, right in your WebUI!
Final Thoughts
Running Phi-3 locally on Ubuntu 24.04 with Ollama gives you:
- Complete control over AI inference.
- Zero API keys or rate limits.
- Full privacy for your sensitive data.

With just a conceptual understanding of these steps, you are well on your way to installing Ollama, launching Phi-3, and creating a custom Flask WebUI for real-time chat.
