Add llama.cpp and LLM service sections, update HF section, consolidate redundant sections
README.md
@@ -1,7 +1,7 @@
# Local LLaMA Server Setup Documentation

_TL;DR_: A comprehensive guide to setting up a fully local and private language model server equipped with the following:

- Inference Engine ([Ollama](https://github.com/ollama/ollama), [vLLM](https://github.com/vllm-project/vllm))
- Inference Engine ([Ollama](https://github.com/ollama/ollama), [llama.cpp](https://github.com/ggml-org/llama.cpp), [vLLM](https://github.com/vllm-project/vllm))
- Chat Platform ([Open WebUI](https://github.com/open-webui/open-webui))
- Text-to-Speech Server ([OpenedAI Speech](https://github.com/matatonic/openedai-speech), [Kokoro FastAPI](https://github.com/remsky/Kokoro-FastAPI))
- Text-to-Image Server ([ComfyUI](https://github.com/comfyanonymous/ComfyUI))
@@ -16,31 +16,33 @@ _TL;DR_: A comprehensive guide to setting up a fully local and private language
- [Prerequisites](#prerequisites)
- [Docker](#docker)
- [HuggingFace CLI](#huggingface-cli)
- [Managing Models](#managing-models)
- [Download Models](#download-models)
- [Delete Models](#delete-models)
- [General](#general)
- [Drivers](#drivers)
- [Startup Script](#startup-script)
- [Scheduling Startup Script](#scheduling-startup-script)
- [Configuring Script Permissions](#configuring-script-permissions)
- [Auto-Login](#auto-login)
- [Inference Engine](#inference-engine)
- [Ollama](#ollama)
- [llama.cpp](#llamacpp)
- [vLLM](#vllm)
- [Serving a Different Model](#serving-a-different-model)
- [Managing Models](#managing-models)
- [Creating a Service](#creating-a-service)
- [Open WebUI Integration](#open-webui-integration)
- [Comparison](#comparison)
- [Ollama vs. llama.cpp](#ollama-vs-llamacpp)
- [vLLM vs. Ollama/llama.cpp](#vllm-vs-ollamallamacpp)
- [Chat Platform](#chat-platform)
- [Open WebUI](#open-webui)
- [Text-to-Speech Server](#text-to-speech-server)
- [OpenedAI Speech](#openedai-speech)
- [Downloading Voices](#downloading-voices)
- [Open WebUI Integration](#open-webui-integration-1)
- [Kokoro FastAPI](#kokoro-fastapi)
- [Open WebUI Integration](#open-webui-integration-2)
- [Comparison](#comparison-1)
- [Open WebUI Integration](#open-webui-integration-1)
- [Comparison](#comparison)
- [Text-to-Image Server](#text-to-image-server)
- [ComfyUI](#comfyui)
- [Open WebUI Integration](#open-webui-integration-3)
- [Open WebUI Integration](#open-webui-integration-2)
- [SSH](#ssh)
- [Firewall](#firewall)
- [Remote Access](#remote-access)
@@ -50,28 +52,27 @@ _TL;DR_: A comprehensive guide to setting up a fully local and private language
- [Local DNS](#local-dns)
- [Third-Party VPN Integration](#third-party-vpn-integration)
- [Verifying](#verifying)
- [Ollama](#ollama-1)
- [vLLM](#vllm-1)
- [Inference Engine](#inference-engine-1)
- [Open WebUI](#open-webui-1)
- [OpenedAI Speech](#openedai-speech-1)
- [Kokoro FastAPI](#kokoro-fastapi-1)
- [Text-to-Speech Server](#text-to-speech-server-1)
- [ComfyUI](#comfyui-1)
- [Updating](#updating)
- [General](#general-1)
- [Nvidia Drivers \& CUDA](#nvidia-drivers--cuda)
- [Ollama](#ollama-2)
- [vLLM](#vllm-2)
- [Ollama](#ollama-1)
- [llama.cpp](#llamacpp-1)
- [vLLM](#vllm-1)
- [Open WebUI](#open-webui-2)
- [OpenedAI Speech](#openedai-speech-2)
- [Kokoro FastAPI](#kokoro-fastapi-2)
- [OpenedAI Speech](#openedai-speech-1)
- [Kokoro FastAPI](#kokoro-fastapi-1)
- [ComfyUI](#comfyui-2)
- [Troubleshooting](#troubleshooting)
- [`ssh`](#ssh-1)
- [Nvidia Drivers](#nvidia-drivers)
- [Ollama](#ollama-3)
- [vLLM](#vllm-3)
- [Ollama](#ollama-2)
- [vLLM](#vllm-2)
- [Open WebUI](#open-webui-3)
- [OpenedAI Speech](#openedai-speech-3)
- [OpenedAI Speech](#openedai-speech-2)
- [Monitoring](#monitoring)
- [Notes](#notes)
- [Software](#software)
@@ -83,7 +84,7 @@ _TL;DR_: A comprehensive guide to setting up a fully local and private language

This repository outlines the steps to set up a server for running local language models. It uses Debian specifically, but most Linux distros should follow a very similar process. It aims to be a guide for Linux beginners like me who are setting up a server for the first time.

The process involves installing the requisite drivers, setting the GPU power limit, setting up auto-login, and scheduling the `init.bash` script to run at boot. All these settings are based on my ideal setup for a language model server that runs most of the day, but a lot can be customized to suit your needs. For example, you can use any OpenAI-compatible server like [`llama.cpp`](https://github.com/ggerganov/llama.cpp) or [LM Studio](https://lmstudio.ai) instead of Ollama or vLLM.
The process involves installing the requisite drivers, setting the GPU power limit, setting up auto-login, and scheduling the `init.bash` script to run at boot. All these settings are based on my ideal setup for a language model server that runs most of the day, but a lot can be customized to suit your needs.

## Priorities
@@ -156,8 +157,10 @@ Docker is a containerization platform that allows you to run applications in iso

### HuggingFace CLI

📖 [**Documentation**](https://huggingface.co/docs/huggingface_hub/main/en/guides/cli)

> [!NOTE]
> This prerequisite only applies to those who want to use vLLM as an inference engine.
> Only needed for llama.cpp/vLLM.

- Create a new virtual environment:
@@ -168,6 +171,7 @@ Docker is a containerization platform that allows you to run applications in iso
```
pip install -U "huggingface_hub[cli]"
```
- Create an authentication token on https://huggingface.co
- Log in to HF Hub:
```
huggingface-cli login
```
@@ -180,8 +184,36 @@ Docker is a containerization platform that allows you to run applications in iso

The result should be your username.

> [!TIP]
> If you intend to install vLLM via Python (manually) instead of Docker, and you don't intend to use HuggingFace with any other backend aside from vLLM, I'd recommend using the same virtual environment for `huggingface_hub` and `vllm`, say `vllm-env`.

#### Managing Models

Models can be downloaded either to the default location (`.cache/huggingface/hub`) or to any local directory you specify. Where the model is stored can be defined using the `--local-dir` command line flag. Not specifying this will result in the model being stored in the default location. Storing the model in the folder where the packages for the inference engine are stored is good practice - this way, everything you need to run inference on a model is stored in the same place. However, if you use the same models with multiple backends frequently (e.g. using Qwen_QwQ-32B-Q4_K_M.gguf with both llama.cpp and vLLM), the default location is probably best.

First, activate the virtual environment that contains `huggingface_hub`:
```
source hf-env/bin/activate
```

#### Download Models

Models are downloaded using their HuggingFace tag. Here, we'll use bartowski/Qwen_QwQ-32B-GGUF as an example. To download a model, run:
```
huggingface-cli download bartowski/Qwen_QwQ-32B-GGUF Qwen_QwQ-32B-Q4_K_M.gguf --local-dir models
```
Ensure that you are in the correct directory when you run this.
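
If you plan to serve the same weights with different engines, it can help to see both download shapes side by side. The sketch below is illustrative - the repo IDs are examples, and `--local-dir` is only needed when you want the files outside the default HF cache:
```bash
# Single GGUF file for llama.cpp, stored next to the engine (example repo/file):
huggingface-cli download bartowski/Qwen_QwQ-32B-GGUF Qwen_QwQ-32B-Q4_K_M.gguf --local-dir models

# Full safetensors repo for vLLM, left in the default cache (~/.cache/huggingface/hub):
huggingface-cli download Qwen/QwQ-32B
```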

#### Delete Models

To delete a model in the specified location, run:
```
rm <model_name>
```

To delete a model in the default location, run:
```
huggingface-cli delete-cache
```

This will start an interactive session where you can remove models from the HuggingFace directory. In case you've been saving models in a location other than `.cache/huggingface`, deleting models from there will free up space, but the metadata will remain in the HF cache until it is deleted properly. This can be done via the above command, but you can also simply delete the model directory from `.cache/huggingface/hub`.
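
Relatedly, `huggingface_hub` ships a cache scanner that shows what's occupying space, which is handy before running the interactive deletion above:
```bash
# List cached repos, revisions, and their sizes before deciding what to delete
huggingface-cli scan-cache
```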

## General

Update the system by running the following commands:
@@ -190,8 +222,6 @@ sudo apt update
```
sudo apt upgrade
```

## Drivers

Now, we'll install the required GPU drivers that allow programs to utilize their compute capabilities.

**Nvidia GPUs**
@@ -305,10 +335,14 @@ When the server boots up, we want it to automatically log in to a user account a

## Inference Engine

The inference engine is one of the primary components of this setup. It is code that takes model files containing weights and makes it possible to get useful outputs from them. This guide allows a choice between Ollama and vLLM - both are mature, production-grade inference engines with different priorities and strengths.
The inference engine is one of the primary components of this setup. It is code that takes model files containing weights and makes it possible to get useful outputs from them. This guide allows a choice between llama.cpp, vLLM, and Ollama - all of these are popular inference engines with different priorities and strengths (note: Ollama uses llama.cpp under the hood and is essentially a CLI wrapper). It can be daunting to jump straight into the deep end with command line arguments in llama.cpp and vLLM. If you're a power user and enjoy the flexibility afforded by tight control over serving parameters, using either llama.cpp or vLLM will be a wonderful experience, and the choice really comes down to the quantization format you prefer. However, if you're a beginner or aren't yet comfortable with this, Ollama can be a convenient stopgap while you build the skills you need - or the end of the line, if you decide what you already know is enough!

### Ollama

🌟 [**GitHub**](https://github.com/ollama/ollama)
📖 [**Documentation**](https://github.com/ollama/ollama/tree/main/docs)
🔧 [**Engine Arguments**](https://github.com/ollama/ollama/blob/main/docs/modelfile.md)

Ollama will be installed as a service, so it runs automatically at boot.

- Download Ollama from the official repository:
@@ -337,12 +371,66 @@ We want our API endpoint to be reachable by the rest of the LAN. For Ollama, thi
> [!TIP]
> If you installed Ollama manually or don't use it as a service, remember to run `ollama serve` to properly start the server. Refer to [Ollama's troubleshooting steps](#ollama-3) if you encounter an error.
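
The LAN-exposure step referenced above boils down to setting the `OLLAMA_HOST` environment variable for the service. A minimal sketch, assuming the standard install created an `ollama` systemd service:
```bash
# Add an environment override so the API listens on all interfaces
sudo systemctl edit ollama.service
# In the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"
# Then apply the change:
sudo systemctl daemon-reload
sudo systemctl restart ollama
```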

### llama.cpp

🌟 [**GitHub**](https://github.com/ggml-org/llama.cpp)
📖 [**Documentation**](https://github.com/ggml-org/llama.cpp/tree/master/docs)
🔧 [**Engine Arguments**](https://github.com/ggml-org/llama.cpp/tree/master/examples/server)

- Clone the llama.cpp GitHub repository:
```
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
```
- Build the binary:

**CPU**
```
cmake -B build
cmake --build build --config Release
```

**CUDA**
```
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
```
For systems looking to use Metal, Vulkan, or other low-level graphics APIs, view the complete [llama.cpp build documentation](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md) to leverage accelerated inference.
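
Once the build finishes, a quick sanity check is to point `llama-server` (the OpenAI-compatible server binary) at a GGUF file. The model path, port, and GPU offload value below are placeholders - adjust them to your setup:
```bash
# Run from the llama.cpp directory after a successful build.
# -m: any GGUF you downloaded earlier; -ngl 99 offloads as many layers as possible to the GPU.
./build/bin/llama-server -m models/Qwen_QwQ-32B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 -ngl 99
```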

### vLLM

🌟 [**GitHub**](https://github.com/vllm-project/vllm)
📖 [**Documentation**](https://docs.vllm.ai/en/stable/index.html)
🔧 [**Engine Arguments**](https://docs.vllm.ai/en/stable/serving/engine_args.html)

vLLM comes with its own OpenAI-compatible API that we can use just like Ollama. Where Ollama runs GGUF model files, vLLM can run AWQ, GPTQ, GGUF, BitsAndBytes, and safetensors (the default release type) natively.

> [!NOTE]
> By default, vLLM uses HuggingFace's model download destination (`~/.cache/huggingface/hub`). Adding and removing models is easiest done via the HuggingFace CLI.

**Manual Installation (Recommended)**

- Create a directory and virtual environment for vLLM:
```
mkdir vllm
cd vllm
python3 -m venv .venv
source .venv/bin/activate
```

- Install vLLM using `pip`:
```
pip install vllm
```

- Serve with your desired flags. It uses port 8000 by default, but I'm using port 8556 here so it doesn't conflict with any other services:
```
vllm serve <model> --port 8556
```

- To use as a service, add the following block to `init.bash` to serve vLLM on startup:
```
source .venv/bin/activate
vllm serve <model> --port 8556
```
> Replace `<model>` with your desired model tag, copied from HuggingFace.
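
As a concrete, illustrative example of what a tuned serve command can look like - the model tag and flag values here are placeholders, not recommendations:
```bash
# Cap the context window and leave a little VRAM headroom for other services
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --port 8556 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```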

**Docker Installation**
@@ -358,34 +446,7 @@ vLLM comes with its own OpenAI-compatible API that we can use just like Ollama.
> Replace `<your_hf_hub_token>` with your HuggingFace Hub token and `<model>` with your desired model tag, copied from HuggingFace.

**Manual Installation**

- Create and activate a new virtual environment for vLLM's packages:
```
python3 -m venv vllm-env
source vllm-env/bin/activate
```

- Install vLLM using `pip`:
```
pip install vllm
```

- Serve vLLM with your desired flags. It uses port 8000 by default, but I'm using port 8556 here so it doesn't conflict with any other services:
```
vllm serve <model> --port 8556
```

- To use as a service, add the following block to `init.bash` to serve vLLM on startup:
```
source vllm-env/bin/activate
vllm serve <model> --port 8556
```
> Replace `<model>` with your desired model tag, copied from HuggingFace.

#### Serving a Different Model

Assuming you served the vLLM OpenAI-compatible API via Docker:
To serve a different model:

- First stop the existing container:
@@ -409,63 +470,137 @@ Assuming you served the vLLM OpenAI-compatible API via Docker:
```
  --model <model>
```

The reference for vLLM's engine arguments (context length, quantization type, etc.) can be found [here](https://docs.vllm.ai/en/v0.6.1/models/engine_args.html).

### Creating a Service

> [!NOTE]
> Only needed for manual installations of llama.cpp/vLLM.

#### Managing Models

While the above steps will help you get up and running with an OpenAI-compatible LLM server, they will not help with this server persisting after you close your terminal window or restart your physical server. Docker can achieve this with the `-d` ("detach") flag, but here we're running plain binaries and Python processes. To do this, we must start the inference engine from a `.service` file that runs alongside the Linux operating system at boot, ensuring that it is available whenever the server is on.

Running the Docker command with a model name should download and then load your desired model. However, there are times when you may want to download from HuggingFace manually and delete old models to free up space.
Let's call the service we're about to build `llm-server.service`. We'll assume all models are in the `models` child directory - you can change this as you need to.

First, activate the virtual environment that contains `huggingface_hub`:
```
source hf-env/bin/activate
```
1. Create the `systemd` service file:
```bash
sudo nano /etc/systemd/system/llm-server.service
```

Models are downloaded using their HuggingFace tag. Here, we'll use Qwen/Qwen2-VL-2B-Instruct as an example. To download a model, run:
```
huggingface-cli download Qwen/Qwen2-VL-2B-Instruct
```
2. Configure the service file:

To delete a model, run:
```
huggingface-cli delete-cache
```
This will start an interactive session where you can remove models from the HuggingFace directory. In case you've been saving models in a location other than `.cache/huggingface`, deleting models from there will free up space, but the metadata will remain in the HF cache until it is deleted properly. This can be done via the above command, but you can also simply delete the model directory from `.cache/huggingface/hub`.

You can find the HuggingFace CLI documentation [here](https://huggingface.co/docs/huggingface_hub/main/en/guides/cli).

#### Open WebUI Integration

**llama.cpp**
```ini
[Unit]
Description=LLM Server Service
After=network.target

[Service]
User=<user>
Group=<user>
WorkingDirectory=/home/<user>/llama.cpp/build/bin/
ExecStart=/home/<user>/llama.cpp/build/bin/llama-server \
  --port <port> \
  --host 0.0.0.0 \
  -m /home/<user>/llama.cpp/models/<model> \
  --no-webui # [other engine arguments]
Restart=always
RestartSec=10s

[Install]
WantedBy=multi-user.target
```

**vLLM**
```ini
[Unit]
Description=LLM Server Service
After=network.target

[Service]
User=<user>
Group=<user>
WorkingDirectory=/home/<user>/vllm/
ExecStart=/bin/bash -c 'source .venv/bin/activate && vllm serve /home/<user>/vllm/models/<model> --port <port> --host 0.0.0.0'
Restart=always
RestartSec=10s

[Install]
WantedBy=multi-user.target
```
> Replace `<user>`, `<port>`, and `<model>` with your Linux username, desired port for serving, and desired model respectively.

3. Reload the `systemd` daemon:
```bash
sudo systemctl daemon-reload
```
4. Run the service:

If the service hasn't been enabled before:
```
sudo systemctl enable llm-server.service
sudo systemctl start llm-server
```

If the service is already enabled and you've just changed it:
```
sudo systemctl restart llm-server
```
5. (Optional) Check the service's status:
```bash
sudo systemctl status llm-server
```
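
If the status check shows the unit failing, the service logs usually say why (a bad model path, out-of-memory, a typo in the engine arguments). A quick way to follow them:
```bash
# Stream the service's logs; Ctrl+C to stop
sudo journalctl -u llm-server -f
```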

### Open WebUI Integration

> [!NOTE]
> Only needed for llama.cpp/vLLM.

Navigate to `Admin Panel > Settings > Connections` and set the following values:

- Enable OpenAI API
- API Base URL: `http://host.docker.internal:8556/v1`
- API Base URL: `http://host.docker.internal:<port>/v1`
- API Key: `anything-you-like`
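
Before saving these settings, it's worth confirming the endpoint Open WebUI will talk to is actually up; both llama.cpp's `llama-server` and vLLM expose a model listing at `/v1/models`:
```bash
# Should return a JSON list containing your served model
curl http://localhost:<port>/v1/models
```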

### Comparison
### Ollama vs. llama.cpp

Why you may pick Ollama over vLLM:
- **Simplicity**: Ollama's one-click install process and CLI to manage models is easier to use.
- **Open WebUI first-class citizen**: The UI supports Ollama actions like pulling and removing models natively.
- **Efficient GPU-CPU splitting**: Running models that don't fit entirely within the GPU is trivial.
- **Best GGUF implementation**: Runs GGUF better than any other inference engine, thanks to `llama.cpp`.

| **Aspect** | **Ollama (Wrapper)** | **llama.cpp (Vanilla)** |
| -------------------------- | ------------------------------------------------------------- | ----------------------------------------------------------------------------------------- |
| **Installation/Setup** | One-click install & CLI model management | Requires manual setup/configuration |
| **Open WebUI Integration** | First-class citizen | Requires OpenAI-compatible endpoint setup |
| **Model Switching** | Native model-switching via server | Requires manual port management or [llama-swap](https://github.com/mostlygeek/llama-swap) |
| **Customizability** | Limited: Modelfiles are cumbersome | Full control over parameters via CLI |
| **Transparency** | Defaults may override model parameters (e.g., context length) | Full transparency in parameter settings |
| **GGUF Support** | Inherits llama.cpp's best-in-class implementation | Best GGUF implementation |
| **GPU-CPU Splitting** | Inherits llama.cpp's efficient splitting | Trivial GPU-CPU splitting out-of-the-box |

Why you may pick vLLM over Ollama:
- **Supports vision models**: vLLM already supports vision LMs like Qwen 2.5 VL and LLaMA 3.2. This feature is not supported by `llama.cpp`, but Ollama has separately implemented support (only for `llama3.2-vision` as of Jan 12, 2025).
- **Faster GPU inference**: Using GGUFs vs. native safetensors leaves performance on the table. vLLM is also better for distributed inference across multiple GPUs.
- **Broader compatibility**: vLLM can run model quantization types (AWQ, GPTQ, BnB, etc.) aside from just GGUF.
---

You can always keep both and decide to spin them up at your discretion. Using Ollama as a service offers no degradation in experience because unused models are offloaded from VRAM after some time. Using vLLM as a service keeps a model in memory, so I wouldn't use this alongside Ollama in an automated, always-on fashion unless it was your primary inference engine. Essentially,
### vLLM vs. Ollama/llama.cpp

| **Feature** | **vLLM** | **Ollama/llama.cpp** |
| ----------------------- | -------------------------------------------- | ------------------------------------------------------------------------------------- |
| **Vision Models** | Supports Qwen 2.5 VL, Llama 3.2 Vision, etc. | Ollama supports some vision models, llama.cpp does not support any (via llama-server) |
| **Quantization** | Supports AWQ, GPTQ, BnB, etc. | Only supports GGUF |
| **Multi-GPU Inference** | Yes | Yes |
| **Tensor Parallelism** | Yes | No |

In summary,

- **Ollama**: Best for those who want an "it just works" experience.
- **llama.cpp**: Best for those who want total control over their inference servers and are familiar with engine arguments.
- **vLLM**: Best for those who want (i) to run non-GGUF quantizations of models, (ii) multi-GPU inference using tensor parallelism, or (iii) to use vision models.

Using Ollama as a service offers no degradation in experience because unused models are offloaded from VRAM after some time. Using vLLM or llama.cpp as a service keeps a model in memory, so I wouldn't use this alongside Ollama in an automated, always-on fashion unless it was your primary inference engine. Essentially,

| Primary Engine | Secondary Engine | Run SE as service? |
| -------------- | ---------------- | ------------------ |
| Ollama | vLLM | No |
| vLLM | Ollama | Yes |
| Ollama | llama.cpp/vLLM | No |
| llama.cpp/vLLM | Ollama | Yes |
## Chat Platform

### Open WebUI

Open WebUI is a web-based interface for managing Ollama models and chats, and provides a beautiful, performant UI for communicating with your models. You will want to do this if you want to access your models from a web interface. If you're fine with using the command line or want to consume models through a plugin/extension, you can skip this step.
🌟 [**GitHub**](https://github.com/open-webui/open-webui)
📖 [**Documentation**](https://docs.openwebui.com)

Open WebUI is a web-based interface for managing models and chats, and provides a beautiful, performant UI for communicating with your models. You will want to do this if you want to access your models from a web interface. If you're fine with using the command line or want to consume models through a plugin/extension, you can skip this step.

To install without Nvidia GPU support, run the following command:
@@ -477,7 +612,7 @@ For Nvidia GPUs, run the following command:
```
sudo docker run -d -p 3000:8080 --gpus all --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:cuda
```

You can access it by navigating to `http://localhost:3000` in your browser or `http://<server_IP>:3000` from another device on the same network. There's no need to add this to the `init.bash` script as Open WebUI will start automatically at boot via Docker Engine.
You can access it by navigating to `http://localhost:3000` in your browser or `http://<server_ip>:3000` from another device on the same network. There's no need to add this to the `init.bash` script as Open WebUI will start automatically at boot via Docker Engine.
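
If the page doesn't come up, the container name used in the command above (`open-webui`) makes it easy to check on:
```bash
# Confirm the container is running, then tail its logs if it isn't healthy
sudo docker ps --filter name=open-webui
sudo docker logs -f open-webui
```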

Read more about Open WebUI [here](https://github.com/open-webui/open-webui).
@@ -491,6 +626,8 @@ Read more about Open WebUI [here](https://github.com/open-webui/open-webui).

### OpenedAI Speech

🌟 [**GitHub**](https://github.com/matatonic/openedai-speech)

OpenedAI Speech is a text-to-speech server that wraps [Piper TTS](https://github.com/rhasspy/piper) and [Coqui XTTS v2](https://docs.coqui.ai/en/latest/models/xtts.html) in an OpenAI-compatible API. This is great because it plugs in easily to the Open WebUI interface, giving your models the ability to speak their responses.

> As of v0.17 (compared to v0.10), OpenedAI Speech features a far more straightforward and automated Docker installation, making it easy to get up and running.
@@ -561,17 +698,10 @@ We'll use Piper here because I haven't found any good resources for high quality
> Replace `<openedai_speech_container_ID>` and `<open_webui_container_ID>` with the container IDs you identified.

#### Open WebUI Integration

Navigate to `Admin Panel > Settings > Audio` and set the following values:

- Text-to-Speech Engine: `OpenAI`
- API Base URL: `http://host.docker.internal:8000/v1`
- API Key: `anything-you-like`
- Set Model: `tts-1` (for Piper) or `tts-1-hd` (for XTTS)

### Kokoro FastAPI

🌟 [**GitHub**](https://github.com/remsky/Kokoro-FastAPI)

Kokoro FastAPI is a text-to-speech server that wraps around and provides OpenAI-compatible API inference for [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M), a state-of-the-art TTS model. The documentation for this project is fantastic and covers most, if not all, of the use cases for the project itself.

To install Kokoro-FastAPI, run
@@ -583,16 +713,24 @@ sudo docker compose up --build

The server can be used in two ways: an API and a UI. By default, the API is served on port 8880 and the UI is served on port 7860.

#### Open WebUI Integration
### Open WebUI Integration

Navigate to `Admin Panel > Settings > Audio` and set the following values:

**OpenedAI Speech**
- Text-to-Speech Engine: `OpenAI`
- API Base URL: `http://host.docker.internal:8000/v1`
- API Key: `anything-you-like`
- Set Model: `tts-1` (for Piper) or `tts-1-hd` (for XTTS)

**Kokoro FastAPI**
- Text-to-Speech Engine: `OpenAI`
- API Base URL: `http://host.docker.internal:8880/v1`
- API Key: `anything-you-like`
- Set Model: `kokoro`
- Response Splitting: None (this is crucial - Kokoro uses a novel audio splitting system)

The server can be used in two ways: an API and a UI. By default, the API is served on port 8880 and the UI is served on port 7860.

### Comparison
@@ -612,6 +750,9 @@ Kokoro's performance makes it an ideal candidate for regular use as a voice assi

### ComfyUI

🌟 [**GitHub**](https://github.com/comfyanonymous/ComfyUI)
📖 [**Documentation**](https://docs.comfy.org)

ComfyUI is a popular open-source graph-based tool for generating images using image generation models such as Stable Diffusion XL, Stable Diffusion 3, and the Flux family of models.

- Clone and navigate to the repository:
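
The clone commands elided by the diff hunk presumably follow the usual pattern for the repository linked above - a sketch:
```bash
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
```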

@@ -708,20 +849,20 @@ On the client:

Setting up a firewall is essential for securing your server. The Uncomplicated Firewall (UFW) is a simple and easy-to-use firewall for Linux. You can use UFW to allow or deny incoming and outgoing traffic to and from your server.

- Install UFW:
```bash
sudo apt install ufw
```
- Allow SSH, HTTPS, and any other ports you need (note that `ufw allow` takes one rule at a time):
```bash
sudo ufw allow ssh
sudo ufw allow https
sudo ufw allow 80
sudo ufw allow 8080
# [your other services' ports]
```
```
sudo ufw allow 22,443,3000,11434,8556,80,8000,8080,8188/tcp
```
Here, we're allowing SSH (port 22), HTTPS (port 443), Open WebUI (port 3000), Ollama (port 11434), vLLM (port 8556), HTTP (port 80), OpenedAI Speech (8000), Docker (port 8080), and ComfyUI (port 8188). You can add or remove ports as needed.
Here, we're allowing SSH (port 22), HTTPS (port 443), HTTP (port 80), and Docker (port 8080) to start. You can add or remove ports as needed. Ensure you add ports for services that you end up using - both from this guide and in general.
- Enable UFW:
```bash
sudo ufw enable
```
- Check the status of UFW:
```bash
sudo ufw status
```
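
If you'd rather not expose every service to anything that can reach the machine, UFW can scope a rule to your LAN; the subnet below is an example - substitute your own:
```bash
# Allow Open WebUI (port 3000) only from the local 192.168.1.0/24 subnet
sudo ufw allow from 192.168.1.0/24 to any port 3000 proto tcp
```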

@@ -732,7 +873,7 @@ Refer to [this guide](https://www.digitalocean.com/community/tutorials/how-to-se

## Remote Access

Remote access refers to the ability to access your server outside of your home network. For example, when you leave the house, you aren't going to be able to access `http://(your_server_ip)`, because your network has changed from your home network to some other network (either your mobile carrier's or a local network in some other place). This means that you won't be able to access the services running on your server. There are many solutions on the web that solve this problem and we'll explore some of the easiest-to-use here.
Remote access refers to the ability to access your server outside of your home network. For example, when you leave the house, you aren't going to be able to access `http://<your_server_ip>`, because your network has changed from your home network to some other network (either your mobile carrier's or a local network in some other place). This means that you won't be able to access the services running on your server. There are many solutions on the web that solve this problem and we'll explore some of the easiest-to-use here.

### Tailscale
@@ -783,69 +924,36 @@ To use a Mullvad exit on one of your devices, first find the exit node you want

This section isn't strictly necessary by any means - if you use all the elements in the guide, a good experience in Open WebUI means you've succeeded with the goal of the guide. However, it can be helpful to test the disparate installations at different stages in this process.

> [!NOTE]
> Depending on the machine you conduct these verification tests from, remember to interchange your server's IP address and `localhost` as required.

### Inference Engine

### Ollama

To test your Ollama installation and endpoint, run:
To test your OpenAI-compatible server endpoint, run:
```
curl http://localhost:11434/api/generate -d '{
curl http://localhost:<port>/v1/completions -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}'
```
> Replace `llama2` with your preferred model.

Since the OLLAMA_HOST environment variable is set to 0.0.0.0, it's easy to access Ollama from anywhere on the network. To do so, simply update the `localhost` reference in your URL or command to match the IP address of your server.

Refer to [Ollama's REST API docs](https://github.com/ollama/ollama/blob/main/docs/api.md) for more information on the entire API.

### vLLM

> [!NOTE]
> This example assumes that vLLM runs on port 8556. Change it to your desired port if different.

To test your vLLM installation and endpoint, run:
```
curl http://localhost:8556/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'
```
> Replace `facebook/opt-125m` with your preferred model.
> Replace `<port>` with the actual port of your server and `llama2` with your preferred model. If your physical server is different from the machine you're executing the above command on, replace `localhost` with the IP of the physical server.

### Open WebUI

Visit `http://localhost:3000`. If you're greeted by the authentication page, you've successfully installed Open WebUI.

### OpenedAI Speech
### Text-to-Speech Server

To test your TTS server and endpoint, run the following command:
To test your TTS server, run the following command:
```
curl -s http://localhost:8000/v1/audio/speech -H "Content-Type: application/json" -d '{
curl -s http://localhost:<port>/v1/audio/speech -H "Content-Type: application/json" -d '{
  "input": "The quick brown fox jumped over the lazy dog."}' > speech.mp3
```

If you see the `speech.mp3` file in the OpenedAI Speech directory, you should be good to go. If you're paranoid like I am, test it using a player like `aplay`. Run the following commands (from the OpenedAI Speech directory):
If you see the `speech.mp3` file in the directory you ran the command from, you should be good to go. If you're paranoid, test it using a player like `aplay`. Run the following commands:
```
sudo apt install alsa-utils # provides `aplay`
aplay speech.mp3
```

### Kokoro FastAPI

To test the API, run:
```
curl -s http://localhost:8880/v1/audio/speech -H "Content-Type: application/json" -d '{
  "input": "The quick brown fox jumped over the lazy dog."}' > speech.mp3
```

To test the web UI, visit `http://localhost:7860`.
Kokoro FastAPI: To test the web UI, visit `http://localhost:7860`.

### ComfyUI
@@ -879,11 +987,22 @@ Rerun the command that installs Ollama - it acts as an updater too:
```
curl -fsSL https://ollama.com/install.sh | sh
```

### llama.cpp

Enter your llama.cpp folder and run the following commands:
```
cd llama.cpp
git pull
# Rebuild according to your setup - uncomment `-DGGML_CUDA=ON` for CUDA support
cmake -B build # -DGGML_CUDA=ON
cmake --build build --config Release
```

### vLLM

For a manual installation, enter your virtual environment and update via `pip`:
```
source vllm-env/bin/activate
source vllm/.venv/bin/activate
pip install vllm --upgrade
```
@@ -1015,8 +1134,9 @@ This is my first foray into setting up a server and ever working with Linux so t

- I chose Debian because it is, apparently, one of the most stable Linux distros. I also went with an XFCE desktop environment because it is lightweight and I wasn't yet comfortable going full command line.
- Use a regular user for auto-login; don't log in as root unless you have a specific reason.
- To switch to root in the command line without switching users, run `sudo -i`.
- If something using a Docker container doesn't work, try running `sudo docker ps -a` to see if the container is running. If it isn't, try running `sudo docker compose up -d` again. If it is and isn't working, try running `sudo docker restart (container_ID)` to restart the container.
- If something using a Docker container doesn't work, try running `sudo docker ps -a` to see if the container is running. If it isn't, try running `sudo docker compose up -d` again. If it is and isn't working, try running `sudo docker restart <container_id>` to restart the container.
- If something isn't working no matter what you do, try rebooting the server. It's a common solution to many problems. Try this before spending hours troubleshooting. Sigh.
- While it takes some time to get comfortable with, using an inference engine like llama.cpp or vLLM (as compared to Ollama) is really the way to go to squeeze the maximum performance out of your hardware. If you're reading this guide in the first place and haven't already thrown up your hands and used a cloud provider, it's a safe assumption that you care about the ethos of hosting all this stuff locally. Thus, get your experience as close to a cloud provider's as it can be by optimizing your server.

### Hardware
@@ -1073,9 +1193,7 @@ Docs:

## Acknowledgements

Cheers to all the fantastic work done by the open-source community. This guide wouldn't exist without the effort of the many contributors to the projects and guides referenced here.

To stay up-to-date on the latest developments in the field of machine learning, LLMs, and other vision/speech models, check out [r/LocalLLaMA](https://reddit.com/localllama).
Cheers to all the fantastic work done by the open-source community. This guide wouldn't exist without the effort of the many contributors to the projects and guides referenced here. To stay up-to-date on the latest developments in the field of machine learning, LLMs, and other vision/speech models, check out [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/).

> [!NOTE]
> Please star any projects you find useful and consider contributing to them if you can. Stars on this guide would also be appreciated if you found it helpful, as it helps others find it too.