Traditional web services handle quick, stateless requests measured in milliseconds. LLM inference workloads are different: routing decisions must account for the current state of each model server, including queue depth, memory utilization, loaded adapters, and the ability to handle requests of different criticality levels. InferRoute adds a Featherless-powered classifier in front of that routing layer, plus a band.ai cost-control agent that continuously watches the budget knob.

Istio is now fully conformant with Gateway API Inference Extension (GIE) v1.0.1 and promotes GIE support to beta. GIE separates platform admin concerns from AI/ML team concerns through two resources: InferencePool (the platform admin resource) and InferenceObjective (the AI/ML owner resource). InferRoute rides exactly that wave, adding a Featherless intelligence layer and band.ai budget governance on top of the freshest K8s-native inference primitive. It's not a toy demo: it maps directly to the production problem of routing requests based on KV-cache utilization, pending queue depth, and prefix affinity to minimize time-to-first-token, with the cost control that enterprises actually need.
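
To make the routing problem concrete, here is a minimal sketch of how an endpoint picker might weigh KV-cache utilization, pending queue depth, and prefix affinity. It is an illustration only, not InferRoute's or GIE's actual scorer: the `Endpoint` fields, the `score` function, the weights, and the `max_queue` cap are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    """Hypothetical snapshot of one model server's state."""
    name: str
    kv_cache_utilization: float  # fraction of KV cache in use, 0.0 to 1.0
    queue_depth: int             # requests waiting on this server
    cached_prefixes: set = field(default_factory=set)


def score(ep, prefix_key, max_queue=32, w_kv=0.4, w_queue=0.4, w_prefix=0.2):
    """Return a score in [0, 1]; higher is better.

    The weights and max_queue are illustrative assumptions, not tuned values.
    """
    kv_headroom = 1.0 - ep.kv_cache_utilization
    queue_headroom = 1.0 - min(ep.queue_depth, max_queue) / max_queue
    prefix_hit = 1.0 if prefix_key in ep.cached_prefixes else 0.0
    return w_kv * kv_headroom + w_queue * queue_headroom + w_prefix * prefix_hit


def pick_endpoint(endpoints, prefix_key):
    """Pick the endpoint with the highest combined score."""
    return max(endpoints, key=lambda ep: score(ep, prefix_key))


endpoints = [
    Endpoint("pod-a", 0.90, 12, {"sys-prompt-1"}),  # warm prefix, but busy
    Endpoint("pod-b", 0.30, 2),                     # idle, no warm prefix
    Endpoint("pod-c", 0.35, 3, {"sys-prompt-1"}),   # nearly idle, warm prefix
]

best = pick_endpoint(endpoints, "sys-prompt-1")
print(best.name)
```

With these sample numbers the warm, lightly loaded `pod-c` wins for a request that shares the cached prefix, while a request with an unseen prefix would go to the least loaded server, `pod-b`. A real picker would read these signals from live server metrics rather than static values.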