It may also be good to say a little more here about wrapping and orchestration tools for the inference server.
Enterprise, production-focused wrappers also exist, such as NVIDIA's Dynamo.
https://buttondown.com/ultradune/archive/eval-008-nvidia-just-open-sourced-an-inference/
https://github.com/ai-dynamo/dynamo
Also: the Edera engineer blog on GPU issues at scale over time.
Links
To use for reference but not include in writing:
To include in page:
https://www.glukhov.org/llm-hosting/llama-cpp/llama-server-router-mode/