to use for reference but not include in writing

to include in page

casey and ai expert: https://www.youtube.com/watch?v=Sp1EmFRDquA

llm wiki idea: https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f

what to do with all this:
https://www.dbreunig.com/2026/05/04/10-lessons-for-agentic-coding.html
https://www.dbreunig.com/2024/02/01/pursuing-quiet-ai.html
https://steipete.me/posts/2025/shipping-at-inference-speed

prediction/observation https://www.dbreunig.com/2026/03/26/winchester-mystery-house.html

approach https://mariozechner.at/posts/2026-03-25-thoughts-on-slowing-the-fuck-down/

tooling:
https://eugeneyan.com/writing/working-with-ai/
https://www.dbreunig.com/2024/10/18/the-3-ai-use-cases-gods-interns-and-cogs.html#cogs

https://buttondown.com/ultradune/archive/eval-008-nvidia-just-open-sourced-an-inference/

The inference stack is being decomposed. A year ago, you picked one engine and it handled everything. Now we have specialized layers — execution engines (vLLM, SGLang, TGI), orchestration frameworks (Dynamo), structured generation engines (XGrammar), and quantization toolchains (TorchAO). Each layer is independently optimizable.

I need to look into: XGrammar + jump-forward decoding
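Rough note to self on what jump-forward decoding means, as far as I understand it so far: when the output grammar allows exactly one continuation (for example the fixed keys and punctuation of a JSON schema), the engine can append that text directly instead of asking the model for it. The sketch below is a toy at character level with a scripted stand-in for the model. `decode_with_jump_forward`, `scripted_model`, and the template format are all made up for illustration and are not XGrammar's API; real engines work on tokens and compiled grammars.

```python
from typing import Callable, List, Tuple

# Hypothetical toy template, not XGrammar's real API.
# ("lit", text): text the schema forces, so it is appended without the model.
# ("gen", stop): free text sampled one character at a time until `stop` appears.
Segment = Tuple[str, str]


def decode_with_jump_forward(
    template: List[Segment], sample_char: Callable[[str], str]
) -> Tuple[str, int]:
    """Return the decoded string and the number of model calls used."""
    out: List[str] = []
    model_calls = 0
    for kind, value in template:
        if kind == "lit":
            # Jump forward: only one continuation is valid here, so skip the model.
            out.append(value)
            continue
        while True:
            ch = sample_char("".join(out))
            model_calls += 1
            if ch == value:
                # The stop character ends the free segment; the next literal emits it.
                break
            out.append(ch)
    return "".join(out), model_calls


def scripted_model(script: str) -> Callable[[str], str]:
    """Stand-in for a real model: replays a fixed character script."""
    chars = iter(script)
    return lambda prefix: next(chars)


TEMPLATE: List[Segment] = [
    ("lit", '{"name": "'),
    ("gen", '"'),
    ("lit", '", "age": '),
    ("gen", "}"),
    ("lit", "}"),
]

text, calls = decode_with_jump_forward(TEMPLATE, scripted_model('Ada"36}'))
print(text)
print(calls, "model calls for", len(text), "characters")
```

In this toy run only the free-text parts (the name and the age) cost a model call; everything the template forces is copied in for free, which is the saving jump-forward decoding is after.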

https://read.theaimerge.com/p/the-smartest-ai-engineers-will-bet

Across the industry, only a small number of teams have managed to move beyond pilots and demos. And when systems fail, it’s rarely because of the model itself. It’s the engineering around the model: how systems are designed, monitored, tested, and improved over time. These are the same problems software teams have always faced, but made harder this time, due to the non-deterministic behavior of AI Systems.

https://jarvislabs.ai/blog/expert-parallelism-mixed-strategies-vllm

https://read.theaimerge.com/p/understanding-llm-inference

https://www.vectara.com/glossary-of-llm-terms

https://huggingface.co/collections/nityan/mustread-papers

https://blog.ngxson.com/easier-to-understand-what-is-transformer

https://www.mercity.ai/blog-post/guide-to-fine-tuning-llms-with-lora-and-qlora/

https://rajatpandit.com/ai-infrastructure/the-integer-moment/

https://rajatpandit.com/insights/

https://huggingface.co/docs/transformers/en/kv_cache

tools and projects:
https://github.com/Tiiny-AI/PowerInfer
https://github.com/microsoft/LLMLingua

neat summary on mini max model insights https://www.linkedin.com/posts/sebastianraschka_the-minimax-m2-series-was-one-of-the-most-share-7465419259985174529-p3ub/

ArXiv In-depth Analysis – Medium

xlite-dev/Awesome-LLM-Inference: 📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉

chenhongyu2048/LLM-inference-optimization-paper: Summary of some awesome work for optimizing LLM inference

need to try bringing these in

Compare ik_llama.cpp, vLLM, llama.cpp, and ktransformers engines · Issue #11 · ubergarm/r1-ktransformers-guide

https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices
https://newsletter.pragmaticengineer.com/p/what-is-inference-engineering
https://github.com/ikawrakow/ik_llama.cpp/blob/main/docs/parameters.md

nice overview, not sure where to put:
https://www.weka.io/learn/guide/ai-ml/what-is-llm/
https://www.clarifai.com/blog/llm-model-architecture/
https://huggingface.co/blog/dvilasuero/choosing-best-open-source-ai-models
https://medium.com/@sahin.samia/llm-inference-optimization-techniques-a-comprehensive-analysis-1c434e85ba7c

THIS https://read.theaimerge.com/p/understanding-llm-optimization-techniques

should see if I want to add these in:
https://read.theaimerge.com/p/the-ai-engineers-guide-to-inference
https://read.theaimerge.com/p/understanding-llm-inference
https://jarvislabs.ai/blog/vllm-sglang-trtllm-comparison
https://buttondown.com/ultradune/archive/eval-007-the-great-moe-shift-how-mixture-of/
https://bhavishyapandit9.substack.com/p/deep-dive-into-quantization-of-llms
https://www.clarifai.com/blog/llm-inference-optimization/

https://medium.com/@jsshankar/demystifying-production-inference-serving-for-large-language-models-in-2026-7cfeea701b53

What I Wish Someone Had Told Me About Tensor Computation Libraries | George Ho
What's a Tensor? - YouTube
Understanding ATen: PyTorch's tensor library | Red Hat Developer
ggml/docs/gguf.md at master · ggml-org/ggml

GGML Deep Dive I: Environment Setup | xsxszab.github.io - GGML Deep Dive VII: Tensor Representaion and Memory Layout | xsxszab.github.io

vLLM Deep Dive I: Intro and Environment Setup | xsxszab.github.io

Local LLM Inference : llama.cpp, GGUF, Quantizations and GGML Explained

The AI/ML Engineer's starter guide to GPU Programming

Very simple to understand: RoPE, 2D-RoPE, M-RoPE | ngxson's blog

Introduction to ggml | ngxson's blog

https://youtu.be/vW30o4U9BFE

https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/

https://thinkingmachines.ai/blog/lora/

https://aman.ai/primers/ai/token-sampling/

https://aman.ai/

https://builtin.com/artificial-intelligence/transformer-neural-network

https://hamzaelshafie.bearblog.dev/blog/

https://themlsurgeon.substack.com/p/transformer-attention-backwards

https://blog.ml.cmu.edu/

https://huggingface.co/blog/Kseniase/testtimecompute

would be neat to train one day

https://github.com/karpathy/nanochat

BEST-Route: Adaptive LLM Routing with Test-Time Optimal Compute
Adaptive LLM Routing Under Budget Constraints
Think Anywhere in Code Generation
SkillOpt: Self-Evolving Agent Skills