If you're writing an integration with Ollama, or just want to change how it behaves on your device, there are several environment variables you can set to configure how Ollama operates.
Source (2025-01-29): https://github.com/ollama/ollama/blob/main/envconfig/config.go
OLLAMA_HOST - the scheme and host Ollama listens on (and that clients connect to). Can be http or https. Defaults to http://127.0.0.1:11434. A client-side resolution sketch follows this group of settings.
OLLAMA_ORIGINS - comma-separated list of allowed request origins. http(s)://localhost, 127.0.0.1, 0.0.0.0, app://, file://, tauri://, and vscode-webview:// are always appended to whatever you set.
OLLAMA_MODELS - path to the models directory. Defaults to $HOME/.ollama/models.
OLLAMA_KEEP_ALIVE - how long models stay loaded in memory after a request. Negative values are treated as infinite, zero as no keep alive. Defaults to 5 minutes.
OLLAMA_LOAD_TIMEOUT - stall-detection timeout while a model is loading. Zero or negative values are treated as infinite. Defaults to 5 minutes.
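As a concrete example, here is a minimal Go sketch of how a client integration might resolve the server address the same way: use OLLAMA_HOST if it is set, otherwise fall back to the default above. ollamaBaseURL is just an illustrative helper name, not part of any Ollama SDK.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"strings"
)

// ollamaBaseURL resolves the server address: OLLAMA_HOST if set,
// otherwise the documented default.
func ollamaBaseURL() string {
	host := os.Getenv("OLLAMA_HOST")
	if host == "" {
		return "http://127.0.0.1:11434"
	}
	// Tolerate bare host:port values by assuming plain http.
	if !strings.HasPrefix(host, "http://") && !strings.HasPrefix(host, "https://") {
		host = "http://" + host
	}
	return strings.TrimRight(host, "/")
}

func main() {
	// Hit the version endpoint to confirm the resolved address is reachable.
	resp, err := http.Get(ollamaBaseURL() + "/api/version")
	if err != nil {
		fmt.Println("cannot reach Ollama:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("Ollama responded:", resp.Status)
}
```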
Boolean flags. These accept 1, t, T, TRUE, true, True, 0, f, F, FALSE, false, False; any other value is an error, because they are parsed with Go's strconv.ParseBool (https://pkg.go.dev/strconv#ParseBool). A parsing sketch follows the list below.
OLLAMA_DEBUG - enables additional debug output.
OLLAMA_FLASH_ATTENTION - enables the experimental flash attention feature.
OLLAMA_KV_CACHE_TYPE - quantization type for the K/V cache (takes a string value rather than a boolean).
OLLAMA_NOHISTORY - disables readline history.
OLLAMA_NOPRUNE - disables pruning of model blobs on startup.
OLLAMA_SCHED_SPREAD - allows scheduling models across all GPUs.
OLLAMA_INTEL_GPU - enables experimental Intel GPU detection.
OLLAMA_MULTIUSER_CACHE - optimizes prompt caching for multi-user scenarios.
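Here is a small Go sketch of how one of these flags can be checked with the same parsing rules; OLLAMA_DEBUG is used only as an example, and the empty-means-false fallback is an assumption rather than a guarantee of Ollama's exact behaviour.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

func main() {
	// strconv.ParseBool accepts exactly the values listed above.
	raw := os.Getenv("OLLAMA_DEBUG")
	if raw == "" {
		fmt.Println("OLLAMA_DEBUG not set, treating as false")
		return
	}
	debug, err := strconv.ParseBool(raw)
	if err != nil {
		fmt.Printf("invalid boolean %q: %v\n", raw, err)
		return
	}
	fmt.Println("debug enabled:", debug)
}
```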
Library Settings
OLLAMA_LLM_LIBRARY - Set LLM library to bypass autodetection
CUDA_VISIBLE_DEVICES - Set which NVIDIA devices are visible
HIP_VISIBLE_DEVICES - Set which AMD devices are visible by numeric ID
ROCR_VISIBLE_DEVICES - Set which AMD devices are visible by UUID or numeric ID
GPU_DEVICE_ORDINAL - Set which AMD devices are visible by numeric ID
HSA_OVERRIDE_GFX_VERSION - Override the gfx used for all detected AMD GPUs
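These are ordinary process environment variables, so one way to apply them is when launching the server. The sketch below starts ollama serve restricted to the first NVIDIA GPU and with an AMD gfx override; the specific values are placeholders for whatever matches your hardware.

```go
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	cmd := exec.Command("ollama", "serve")
	// Placeholder values: pick the device IDs / gfx version for your system.
	cmd.Env = append(os.Environ(),
		"CUDA_VISIBLE_DEVICES=0",
		"HSA_OVERRIDE_GFX_VERSION=10.3.0",
	)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}
```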
Performance Settings
OLLAMA_NUM_PARALLEL - Maximum number of parallel requests. Defaults to unlimited.
OLLAMA_MAX_LOADED_MODELS - Maximum number of loaded models per GPU. Defaults to unlimited.
OLLAMA_MAX_QUEUE - Maximum number of queued requests. Defaults to 512.
OLLAMA_MAX_VRAM - Maximum amount of VRAM that can be consumed per GPU. Defaults to unlimited.
OLLAMA_GPU_OVERHEAD - VRAM to reserve per GPU. Defaults to 0.
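The numeric settings follow the usual pattern of falling back to a default when the variable is unset or invalid. A rough Go illustration of that pattern, using OLLAMA_MAX_QUEUE and its documented default of 512 (uintFromEnv is a hypothetical helper, not Ollama's actual code):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// uintFromEnv reads an unsigned integer from the environment,
// returning fallback when the variable is unset or not a valid number.
func uintFromEnv(key string, fallback uint64) uint64 {
	raw := os.Getenv(key)
	if raw == "" {
		return fallback
	}
	v, err := strconv.ParseUint(raw, 10, 64)
	if err != nil {
		return fallback
	}
	return v
}

func main() {
	fmt.Println("max queue:", uintFromEnv("OLLAMA_MAX_QUEUE", 512))
}
```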
Proxy Settings
HTTP_PROXY - HTTP proxy
HTTPS_PROXY - HTTPS proxy
NO_PROXY - No proxy
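Ollama is a Go program, and Go's standard HTTP transport reads these proxy variables via http.ProxyFromEnvironment. The sketch below prints which proxy, if any, would be used for an outbound request; the registry URL is just an illustrative target.

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	req, _ := http.NewRequest("GET", "https://registry.ollama.ai", nil)
	// ProxyFromEnvironment consults HTTP_PROXY, HTTPS_PROXY and NO_PROXY.
	proxyURL, err := http.ProxyFromEnvironment(req)
	if err != nil {
		fmt.Println("proxy configuration error:", err)
		return
	}
	if proxyURL == nil {
		fmt.Println("no proxy will be used")
		return
	}
	fmt.Println("requests would go through:", proxyURL)
}
```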