Links
to use for reference but not include in writing
to include in page
casey and ai expert: https://www.youtube.com/watch?v=Sp1EmFRDquA
llm wiki idea: https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f
what to do with all this
https://www.dbreunig.com/2026/05/04/10-lessons-for-agentic-coding.html
https://www.dbreunig.com/2024/02/01/pursuing-quiet-ai.html
https://steipete.me/posts/2025/shipping-at-inference-speed
prediction/observation https://www.dbreunig.com/2026/03/26/winchester-mystery-house.html
approach https://mariozechner.at/posts/2026-03-25-thoughts-on-slowing-the-fuck-down/
tooling
https://eugeneyan.com/writing/working-with-ai/
https://www.dbreunig.com/2024/10/18/the-3-ai-use-cases-gods-interns-and-cogs.html#cogs
https://buttondown.com/ultradune/archive/eval-008-nvidia-just-open-sourced-an-inference/
The inference stack is being decomposed. A year ago, you picked one engine and it handled everything. Now we have specialized layers — execution engines (vLLM, SGLang, TGI), orchestration frameworks (Dynamo), structured generation engines (XGrammar), and quantization toolchains (TorchAO). Each layer is independently optimizable.
I need to look into: XGrammar + jump-forward decoding
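Placeholder until I read up properly: a toy sketch of the jump-forward idea as I currently understand it. When the output constraint leaves exactly one valid continuation, that text can be appended without a model call. Everything here is an assumption for illustration: it works on characters rather than tokens, the constraint is a small prefix-free set of allowed strings rather than a real grammar, and `first_option` is a hypothetical stand-in for the model. It is not XGrammar's API.

```python
def allowed_next(prefix, choices):
    """Characters that keep `prefix` extendable to at least one allowed string."""
    return {c[len(prefix)] for c in choices
            if c.startswith(prefix) and len(c) > len(prefix)}


def first_option(prefix, options):
    """Hypothetical stand-in for a model call: always takes the first option."""
    return options[0]


def constrained_decode(choices, pick, jump_forward=True):
    """Decode one string from `choices`, counting simulated model calls.

    Assumes `choices` is non-empty and prefix-free
    (no allowed string is a prefix of another).
    """
    out, model_calls = "", 0
    while out not in choices:
        options = sorted(allowed_next(out, choices))
        if jump_forward and len(options) == 1:
            out += options[0]          # forced by the constraint: skip the model
        else:
            model_calls += 1           # otherwise ask the (stand-in) model
            out += pick(out, options)
    return out, model_calls


if __name__ == "__main__":
    choices = ["true", "false", "null"]
    print(constrained_decode(choices, first_option, jump_forward=True))
    print(constrained_decode(choices, first_option, jump_forward=False))
```

With these three choices the stand-in model is consulted once when jumping forward and five times without it, and both runs produce the same string. Real engines do this over token sequences and full grammars, so the savings depend on how much of the output the grammar actually forces.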
https://read.theaimerge.com/p/the-smartest-ai-engineers-will-bet
Across the industry, only a small number of teams have managed to move beyond pilots and demos. And when systems fail, it’s rarely because of the model itself. It’s the engineering around the model: how systems are designed, monitored, tested, and improved over time. These are the same problems software teams have always faced, but made harder this time, due to the non-deterministic behavior of AI Systems.
https://jarvislabs.ai/blog/expert-parallelism-mixed-strategies-vllm
https://read.theaimerge.com/p/understanding-llm-inference
https://www.vectara.com/glossary-of-llm-terms
https://huggingface.co/collections/nityan/mustread-papers
https://blog.ngxson.com/easier-to-understand-what-is-transformer
https://www.mercity.ai/blog-post/guide-to-fine-tuning-llms-with-lora-and-qlora/
https://rajatpandit.com/ai-infrastructure/the-integer-moment/
https://rajatpandit.com/insights/
https://huggingface.co/docs/transformers/en/kv_cache
tools and projects
https://github.com/Tiiny-AI/PowerInfer
https://github.com/microsoft/LLMLingua
neat summary of MiniMax model insights https://www.linkedin.com/posts/sebastianraschka_the-minimax-m2-series-was-one-of-the-most-share-7465419259985174529-p3ub/
ArXiv In-depth Analysis – Medium
need to try bringing these in
https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices
https://newsletter.pragmaticengineer.com/p/what-is-inference-engineering
https://github.com/ikawrakow/ik_llama.cpp/blob/main/docs/parameters.md
nice overview, not sure where to put
https://www.weka.io/learn/guide/ai-ml/what-is-llm/
https://www.clarifai.com/blog/llm-model-architecture/
https://huggingface.co/blog/dvilasuero/choosing-best-open-source-ai-models
https://medium.com/@sahin.samia/llm-inference-optimization-techniques-a-comprehensive-analysis-1c434e85ba7c
THIS https://read.theaimerge.com/p/understanding-llm-optimization-techniques
should see if I want to add these in
https://read.theaimerge.com/p/the-ai-engineers-guide-to-inference
https://read.theaimerge.com/p/understanding-llm-inference
https://jarvislabs.ai/blog/vllm-sglang-trtllm-comparison
https://buttondown.com/ultradune/archive/eval-007-the-great-moe-shift-how-mixture-of/
https://bhavishyapandit9.substack.com/p/deep-dive-into-quantization-of-llms
https://www.clarifai.com/blog/llm-inference-optimization/
I want to make a link-dump section for technical deep dives at some point
What I Wish Someone Had Told Me About Tensor Computation Libraries | George Ho
What's a Tensor? - YouTube
Understanding ATen: PyTorch's tensor library | Red Hat Developer
ggml/docs/gguf.md at master · ggml-org/ggml
GGML Deep Dive I: Environment Setup | xsxszab.github.io - GGML Deep Dive VII: Tensor Representation and Memory Layout | xsxszab.github.io
vLLM Deep Dive I: Intro and Environment Setup | xsxszab.github.io
Local LLM Inference : llama.cpp, GGUF, Quantizations and GGML Explained
The AI/ML Engineer's starter guide to GPU Programming
Very simple to understand: RoPE, 2D-RoPE, M-RoPE | ngxson's blog
Introduction to ggml | ngxson's blog
https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/
https://thinkingmachines.ai/blog/lora/
https://aman.ai/primers/ai/token-sampling/
https://builtin.com/artificial-intelligence/transformer-neural-network
https://hamzaelshafie.bearblog.dev/blog/
https://themlsurgeon.substack.com/p/transformer-attention-backwards
https://huggingface.co/blog/Kseniase/testtimecompute
would be neat to train one day
https://github.com/karpathy/nanochat
BEST-Route: Adaptive LLM Routing with Test-Time Optimal Compute
Adaptive LLM Routing Under Budget Constraints
Think Anywhere in Code Generation
SkillOpt: Self-Evolving Agent Skills