It may also be good to talk a little more here about wrapping and orchestration tools for the inference server.

Enterprise, production-focused wrappers exist as well, such as NVIDIA's Dynamo.

https://buttondown.com/ultradune/archive/eval-008-nvidia-just-open-sourced-an-inference/

https://github.com/ai-dynamo/dynamo

Also: the Edera engineer blog on GPU issues at scale over time.

to use for reference but not include in writing

to include in page

https://www.glukhov.org/llm-hosting/llama-cpp/llama-server-router-mode/