Local-inference runtime tunable from environment variables

10 May 2026 v1.9.0 improvement

performance
deployment

Self-hosted operators can now tune the bundled local-inference runtime directly from .env. CPU and memory limits, the maximum number of concurrently loaded models, the request queue depth, and whether flash-attention kernels are used are all configurable without editing compose files. The GPU overlay ships its own GPU-friendly defaults, which do not conflict with CPU-oriented settings in .env.

Quick presets for small VPS, medium server, workstation, and large GPU deployments are documented in .env.example and the developer guide.
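
As a rough illustration of what such tuning might look like for a small CPU-only host, here is a minimal .env sketch. The variable names and values below are hypothetical placeholders, not the names or defaults this release uses; .env.example remains the authoritative reference.

```shell
# Hypothetical variable names and values, for illustration only.
# Check .env.example for the names and defaults this release actually uses.

# CPU cores made available to the runtime
LOCAL_INFERENCE_CPU_LIMIT=2

# Memory ceiling for the runtime
LOCAL_INFERENCE_MEMORY_LIMIT=4g

# Maximum number of models kept loaded at the same time
LOCAL_INFERENCE_MAX_LOADED_MODELS=1

# Request queue depth
LOCAL_INFERENCE_MAX_QUEUE=32

# Flash-attention kernels: 1 to enable, 0 to disable
LOCAL_INFERENCE_FLASH_ATTENTION=0
```

Comments sit on their own lines rather than after the values, so the fragment is not affected by differences in how env-file parsers treat inline comments.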

No action required. Defaults match the prior release.