Recent Updates
We ship improvements weekly. Here's what we've been building.
Unified Activity Feed
Console activity is now a single workload + inference stream instead of a billing-only feed, so operators can see request flow, infra events, and model activity in one timeline.
Org Residency + Multi-Provider Routing
Routing now combines org residency controls with provider-aware selection and managed-first fallback. This landed alongside expanded proprietary route support and safer adapter error handling in streaming paths.
37-Model Self-Hosted Catalogue
The self-hosted catalogue expanded to 37 models with cleaner canonical naming and undeployed model filtering, giving teams a larger deployable set without exposing unfinished inventory.
Preview/Staging Infrastructure Hardening
Preview environments now use wildcard TLS, improved metadata-sidecar startup ordering, and automated cluster bootstrap. Staging control plane migration and cost optimization updates are also in place.
APAC Coverage Expansion
Proprietary routing now includes Thailand and Malaysia regions, with geo-nearest fallback-chain fixes so requests remain residency-aware while still failing over safely across APAC.
Model UX Refresh
Model catalogue and detail pages were overhauled with improved sorting, copyable IDs, better quickstart examples, clearer availability indicators, and stronger mobile responsiveness.
Data-Driven Provider Routing
Provider routing moved to Redis-backed data configuration with per-region adapter seeding, enabling faster model/provider updates without code deploys and cleaner operations runbooks.
Admin Console
Operator dashboard is now live: real-time health, GPU agent visibility, cross-org deployment tracking, and live inference metrics. Role-gated to admin users.
Auth Hardening
Fixed a session endpoint that was inadvertently skipping JWT validation in non-local environments. User roles are now resolved from the database rather than from JWT claims.
Phase 2 Go-Live
Cleared final production blockers: context-length formatting fixed for ≥1M token models, image tags pinned across all Helm charts, orchestrator cross-cluster config wired up.
18-Model Managed Catalogue
Router catalogue finalised at 18 models: Llama 4 Scout, Qwen 3.5 32B, BGE-M3 embeddings, Llama 3.2 Vision, and more. Every model has a dedicated GPU tier, pricing spec, and latency grade.
Router Fee Transparency
The 5.5% routing fee is now a first-class field in the pricing API. Two-debit reconciliation separates user charges from internal margin, giving a clean audit trail from day one.
vLLM v0.17.1 + Prefix Caching
Upgraded from v0.6.6 to v0.17.1 with KV-cache prefix caching enabled by default. Agents and chat workloads with repeated system prompts see meaningful TTFT improvements.
Gemini on Router
Google Gemini is now routable via the APAC Router alongside Claude and open-weight models. Provider normalisation is handled in the gateway: same API, no client changes needed.
HTTP/2 Router Pooling
Go router now maintains persistent HTTP/2 connection pools to inference backends, eliminating per-request TLS handshakes; p50 router overhead drops from ~300ms to ~12ms.
Cross-Cluster Orchestration
Orchestrator can now manage vLLM pods on a separate data-plane cluster from a central control plane. Groundwork for active-active multi-region without per-region control planes.
Phase 2 Platform Launch
The APAC Model Router and unified inference API are live. One OpenAI-compatible endpoint for proprietary and open models, 18 managed open models, GPU workloads, and embeddings. Streaming, tool use, and vision supported.
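Because the unified API speaks the OpenAI wire format, a request can be sketched with nothing but the standard library. The base URL, key, and model ID below are placeholders for illustration, not documented values; any OpenAI-compatible client pointed at the router's base URL works the same way.

```python
import json

def chat_request(base_url: str, api_key: str, model: str, prompt: str):
    """Build an OpenAI-compatible /chat/completions request (sketch)."""
    url = f"{base_url}/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # set True for token streaming
    })
    return url, headers, body

# Placeholder endpoint and model ID; proprietary and open models share one shape.
url, headers, body = chat_request(
    "https://router.example.com/v1", "sk-...", "llama-4-scout", "Hello")
```

The same request shape covers streaming, tool use, and vision; only the payload fields change.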
TTFB Metering & Geo Attribution
Time-to-first-byte and client geo-attribution now captured on every inference request. Latency percentiles visible in the console, per-model, per-city, graded from Excellent (<50ms) to Poor (>300ms).
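The grading logic can be sketched as a simple threshold map. Only the Excellent (<50ms) and Poor (>300ms) boundaries come from this update; the intermediate bands below are illustrative assumptions.

```python
def grade_ttfb(ms: float) -> str:
    """Map a TTFB measurement to a console grade (intermediate bands assumed)."""
    if ms < 50:
        return "Excellent"
    if ms < 150:   # assumed boundary
        return "Good"
    if ms <= 300:  # assumed boundary
        return "Fair"
    return "Poor"
```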
Team Accounts
Organisations are now a first-class concept. Invite teammates, share API keys, and manage billing under a single account. New members are auto-assigned to your org on first login.
v1.0.0 Production Baseline
Clean-slate production baseline cut and deployed. Backend, router, orchestrator, and console all on stable release tags. Zero-downtime rolling updates enabled; no more maintenance windows for deploys.
Model Weight Loader
vLLM pods now pull model weights from GCS on startup via a dedicated init container; Workload Identity means no credentials in the pod spec. 8B models cold-start in under 3 minutes; 70B in ~9.
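The pattern looks roughly like the fragment below. Names, paths, and the loader image are illustrative, not the actual chart contents.

```yaml
initContainers:
  - name: fetch-weights
    image: gcr.io/example/weight-loader:latest   # hypothetical image
    command: ["gsutil", "-m", "cp", "-r", "gs://models/llama-3-8b", "/models"]
    volumeMounts:
      - name: model-cache
        mountPath: /models
# The pod's Kubernetes service account is bound to a GCP service account via
# Workload Identity, so no key file appears anywhere in the spec.
```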
Kubernetes Actuator
Orchestrator now creates and destroys vLLM Deployment + Service pairs directly via the K8s API. Pod lifecycle is fully managed: startup → health probing → scale-to-zero → wake-on-request.
Circuit Breaker per Backend
Go router now tracks per-backend failure rates and opens a circuit on sustained errors. Unhealthy backends are excluded from routing decisions without manual intervention.
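The core state machine is small. This is a minimal sketch of the idea, not the router's Go implementation; the failure threshold and cooldown are illustrative assumptions.

```python
import time

class CircuitBreaker:
    """Per-backend breaker sketch: open after sustained failures,
    allow a trial request again after a cooldown (half-open)."""

    def __init__(self, max_failures: int = 5, cooldown_s: float = 30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            self.opened_at = None       # success closes the circuit
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # open the circuit

    def available(self) -> bool:
        # Closed, or open long enough to permit a trial request.
        if self.opened_at is None:
            return True
        return time.monotonic() - self.opened_at >= self.cooldown_s
```

The router would simply skip any backend whose breaker reports unavailable when building the candidate set.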
Go Router Scaffolding
Inference router rebuilt in Go for lower overhead and tighter connection pooling. Model registry lives in Redis with a gRPC + HTTP/JSON bridge. Stateless, horizontally scalable from day one.
Latency Probe Network
Distributed probe workers now measure p50/p95/p99 TTFT from multiple APAC cities and write results back via a service-token-authenticated ingest endpoint. Console dashboards pull live data.
API Key Management
Full API key lifecycle in the console: create, label, set rate limits, and revoke programmatic keys. Per-key sliding-window rate limiting is enforced at the edge before any inference work starts.
Async Metering Pipeline
Billing events now flow through NATS before landing in Postgres. This decouples inference latency from billing writes, so high-throughput bursts no longer cause request slowdown.
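The decoupling can be illustrated with an in-process analogue: the hot path only enqueues, and a worker drains events to durable storage. In the real pipeline the queue is NATS and the sink is Postgres; both are replaced with stand-ins here.

```python
import queue
import threading

events = queue.Queue()   # stand-in for the NATS subject
stored = []              # stand-in for the Postgres billing table

def metering_worker():
    while True:
        event = events.get()
        if event is None:        # shutdown sentinel
            break
        stored.append(event)     # the (slow) durable write happens off-path

def record_usage(request_id: str, tokens: int) -> None:
    # Called from the inference path: O(1) enqueue, never blocks on the DB.
    events.put({"request_id": request_id, "tokens": tokens})

worker = threading.Thread(target=metering_worker)
worker.start()
record_usage("req-1", 1234)
events.put(None)
worker.join()
```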
Provider Gateway Alpha
Anthropic and OpenAI adapters are live in a new provider gateway service. It normalises tool use, streaming, and model-ID remapping across providers: one integration pattern for all of them.
Wallet Pre-flight Checks
Inference requests now check wallet balance before routing. Zero-balance requests fail fast with a clear 402 rather than completing and leaving a negative balance.
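The check itself is trivial; the point is where it runs, before any routing or GPU work. The balance lookup and error body below are illustrative assumptions, not the actual API schema.

```python
def preflight(balance_cents: int):
    """Wallet pre-flight sketch: reject before any inference work starts."""
    if balance_cents <= 0:
        # Fail fast with HTTP 402 Payment Required and a clear error body.
        return 402, {"error": {"code": "insufficient_balance",
                               "message": "Top up your wallet to continue."}}
    return 200, {}
```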
Tensor Parallel on 70B Models
Building on our vLLM production milestone earlier this month, tensor parallel inference is now stable across multi-GPU nodes. Llama 70B and Qwen 32B deploy cleanly across 2× A100s.
Embedding Models
BGE Large, BGE-M3, and multilingual E5 now available as managed embedding models. Same OpenAI-compatible `/v1/embeddings` endpoint, drop-in replacement for teams moving off US-based providers.
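Because the endpoint is OpenAI-compatible, migrating usually means swapping only the base URL. The request shape can be sketched as follows; the base URL and model ID are placeholders, not documented values.

```python
import json

def embeddings_request(base_url: str, texts: list, model: str = "bge-m3"):
    """Build an OpenAI-compatible /embeddings request (sketch)."""
    url = f"{base_url}/embeddings"
    body = json.dumps({"model": model, "input": texts})
    return url, body

# Multilingual inputs are the point of models like BGE-M3 and E5.
url, body = embeddings_request("https://router.example.com/v1",
                               ["hello world", "สวัสดี"])
```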
Usage Analytics
Token counts, cost breakdown, and latency percentiles now in the console. Filter by model, date range, or API key. First step toward per-project budgets.
Rate Limiting Hardened
Per-API-key token windows tightened and now enforced at the edge. Sliding-window algorithm replaces fixed buckets, burst-friendly but protects against runaway loops.
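The difference from fixed buckets is that the window slides continuously, so a burst at a bucket boundary can't double the effective limit. A minimal sketch of the idea (limits and window sizes here are illustrative, not real tiers):

```python
from collections import deque
import time

class SlidingWindowLimiter:
    """Allow at most `limit` events in any rolling `window_s` seconds."""

    def __init__(self, limit: int, window_s: float):
        self.limit = limit
        self.window_s = window_s
        self.hits = deque()  # timestamps of allowed events, oldest first

    def allow(self, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        # Evict timestamps that have slid out of the window.
        while self.hits and now - self.hits[0] >= self.window_s:
            self.hits.popleft()
        if len(self.hits) >= self.limit:
            return False  # over the rolling limit: reject
        self.hits.append(now)
        return True
```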
