Research Engineer, LLM Inference

Sage Ahrac

I am a Research Engineer at IBM Research and an MSc student at Tel Aviv University, fortunate to be advised by Mor Geva. I work on distributed LLM inference serving, especially KV-cache management, scheduling, and cache-aware routing. Separately, I am drawn to mechanistic questions around Mixture-of-Experts routing geometry and latent-space monitoring. I also like building small cloud apps and agentic tools that sometimes work.

Selected work

Router-expert geometry in sparse MoEs

Preprint on geometric coupling between routers and experts in sparse mixture-of-experts models.

arXiv

GPU-less render serving in vLLM

Added vllm launch render, highlighted in the vLLM v0.18.0 release.

v0.18.0, PR

LoRA-aware KV-cache routing in llm-d

Added LoRA-aware prefix-cache routing for multi-adapter inference.

Blog, RFC

Sparse tensors in Daft

Built a sparse-tensor delta encoding for autonomous-driving data in Daft.

Daft post, Medium