Superhuman, which serves over 40 million daily users, teamed with Databricks to replace its homegrown large language model serving stack. Together they built an inference platform that sustains over 200,000 queries per second with sub-second P99 latency and four nines of reliability. The new system addresses key scaling, load-balancing, and operational challenges, streamlining developer workflows and improving cloud cost efficiency.
- Maintains sub-second P99 latency at peak 200K+ QPS with 99.99% uptime
- Introduces intelligent load balancing via an Endpoint Discovery Service to prevent hotspots
- Implements dynamic, asymmetric autoscaling tuned through joint shadow testing
Infrastructure signal
To support a peak inference load exceeding 200,000 queries per second under strict latency and reliability service-level objectives, Superhuman and Databricks co-designed a new serving infrastructure. A major innovation is a lightweight control plane, the Endpoint Discovery Service (EDS), which continuously monitors Kubernetes endpoints and feeds a custom load-balancing strategy based on the “power of two choices”: each request is routed to the less loaded of two randomly sampled pods. Sampling just two candidates is enough to smooth out the uneven load distribution, and the resulting latency spikes, that traditional round-robin balancing exhibits under high load.
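The routing rule itself is compact. Below is a minimal Go sketch of power-of-two-choices selection over an endpoint set of the kind the EDS might maintain; the Endpoint type, the use of in-flight request counts as the load signal, and all names are illustrative assumptions, since the EDS internals have not been published.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync/atomic"
)

// Endpoint models a serving pod as a discovery service might track it; the
// fields and the in-flight counter as a load signal are assumptions here.
type Endpoint struct {
	Addr     string
	inFlight atomic.Int64 // outstanding requests on this pod
}

// pickPowerOfTwo samples two distinct endpoints uniformly at random and
// returns the one with fewer in-flight requests. Comparing just two
// candidates avoids both the hotspots of random/round-robin routing and
// the cost of scanning the whole endpoint set on every request.
func pickPowerOfTwo(eps []*Endpoint) *Endpoint {
	if len(eps) == 1 {
		return eps[0]
	}
	i := rand.Intn(len(eps))
	j := rand.Intn(len(eps) - 1)
	if j >= i {
		j++ // shift to guarantee two distinct candidates
	}
	a, b := eps[i], eps[j]
	if b.inFlight.Load() < a.inFlight.Load() {
		return b
	}
	return a
}

func main() {
	eps := []*Endpoint{{Addr: "10.0.0.1"}, {Addr: "10.0.0.2"}, {Addr: "10.0.0.3"}}
	for req := 0; req < 9; req++ {
		ep := pickPowerOfTwo(eps)
		ep.inFlight.Add(1) // decremented when the response completes (omitted)
		fmt.Printf("request %d -> %s\n", req, ep.Addr)
	}
}
```

Because each decision compares only two endpoints, routing stays O(1) no matter how large the pod pool grows, which is what makes the policy practical at 200K+ QPS.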
Beyond load balancing, the system autoscales dynamically with traffic. The scaling policy is deliberately asymmetric: aggressive scale-ups react swiftly to sudden traffic surges, while conservative scale-downs prevent disruptive oscillation and keep tail latency stable. Container readiness times and image pull delays were also reduced so that scaling events themselves do not degrade latency, ensuring a smooth ramp-up during peak demand.
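To make the asymmetry concrete, here is a hypothetical Go sketch of such a policy: scale-ups jump toward demand immediately, capped by a growth factor, while scale-downs wait out a stabilization window and shed replicas in bounded steps. The thresholds, window, and per-replica QPS target are placeholders, not Superhuman's production values.

```go
package main

import (
	"fmt"
	"time"
)

// AutoscalerConfig captures an asymmetric policy: scale up immediately and
// in large steps, scale down slowly and only after a stabilization window.
type AutoscalerConfig struct {
	TargetQPSPerReplica float64       // desired steady-state load per replica
	MaxScaleUpFactor    float64       // e.g. 2.0 allows doubling in one step
	MaxScaleDownStep    int           // shed at most N replicas per decision
	ScaleDownWindow     time.Duration // how long load must stay low first
}

type Autoscaler struct {
	cfg          AutoscalerConfig
	replicas     int
	lowLoadSince time.Time // zero value: load is not currently low
}

// Decide returns the new replica count for the observed aggregate QPS.
func (a *Autoscaler) Decide(qps float64, now time.Time) int {
	desired := int(qps/a.cfg.TargetQPSPerReplica) + 1 // one replica of headroom

	switch {
	case desired > a.replicas:
		// Aggressive scale-up: jump straight toward demand, capped only
		// by the configured growth factor.
		if limit := int(float64(a.replicas) * a.cfg.MaxScaleUpFactor); desired > limit {
			desired = limit
		}
		a.replicas = desired
		a.lowLoadSince = time.Time{} // reset the scale-down clock
	case desired < a.replicas:
		// Conservative scale-down: wait out the stabilization window,
		// then shed a bounded number of replicas per decision.
		if a.lowLoadSince.IsZero() {
			a.lowLoadSince = now
		} else if now.Sub(a.lowLoadSince) >= a.cfg.ScaleDownWindow {
			step := a.replicas - desired
			if step > a.cfg.MaxScaleDownStep {
				step = a.cfg.MaxScaleDownStep
			}
			a.replicas -= step
		}
	default:
		a.lowLoadSince = time.Time{}
	}
	return a.replicas
}

func main() {
	a := &Autoscaler{
		cfg: AutoscalerConfig{
			TargetQPSPerReplica: 1000,
			MaxScaleUpFactor:    2.0,
			MaxScaleDownStep:    2,
			ScaleDownWindow:     5 * time.Minute,
		},
		replicas: 10,
	}
	now := time.Now()
	for _, qps := range []float64{9000, 25000, 60000, 30000, 8000} {
		fmt.Printf("qps=%6.0f -> replicas=%d\n", qps, a.Decide(qps, now))
		now = now.Add(time.Minute)
	}
}
```

The same asymmetry can also be expressed declaratively through the behavior field of a Kubernetes autoscaling/v2 HorizontalPodAutoscaler, which supports separate scale-up and scale-down stabilization windows and policies.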
Developer impact
The shift from a DIY serving stack maintained by an internal machine learning infrastructure team to Databricks' managed model serving platform significantly reduces the operational burden on developers. It eliminates the months of manual performance tuning and capacity planning previously required for every model iteration, freeing developers to focus on model quality and user-facing features instead of infrastructure firefighting.
Joint product and engineering collaboration between Superhuman and Databricks ensured alignment on ambitious latency and quality SLAs from the outset. Extensive shadow testing, replaying production traffic against the new platform before cutover, enabled iterative tuning of autoscaling parameters and load-balancing logic and built confidence in platform stability. This partnership model illustrates how embedding infrastructure expertise in the development lifecycle fosters reliability and speeds the deployment of AI capabilities at scale.
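Shadow testing of this kind typically relies on traffic mirroring: production requests are duplicated to the candidate platform while only the primary's response is returned to the caller. The Go sketch below shows one minimal way to do that with a reverse proxy; the endpoint URLs and the fire-and-forget mirroring goroutine are illustrative assumptions, not details of the actual rollout.

```go
package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// newShadowProxy forwards every request to the primary backend and replays a
// copy against a candidate deployment in the background. Shadow responses are
// discarded; only their latency and error behavior would be recorded.
func newShadowProxy(primary, shadow *url.URL) http.Handler {
	proxy := httputil.NewSingleHostReverseProxy(primary)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Buffer the body so it can be sent to both backends.
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "read error", http.StatusBadRequest)
			return
		}
		r.Body = io.NopCloser(bytes.NewReader(body))

		// Fire-and-forget mirror; the user-facing path never waits on it.
		method, path := r.Method, r.URL.Path
		go func() {
			req, err := http.NewRequest(method, shadow.String()+path, bytes.NewReader(body))
			if err != nil {
				return
			}
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				log.Printf("shadow error: %v", err)
				return
			}
			resp.Body.Close() // latency/error metrics would be recorded here
		}()

		proxy.ServeHTTP(w, r)
	})
}

func main() {
	primary, _ := url.Parse("http://primary.internal:8080")
	shadow, _ := url.Parse("http://candidate.internal:8080")
	log.Fatal(http.ListenAndServe(":9000", newShadowProxy(primary, shadow)))
}
```

Comparing the mirrored backend's latency distribution against the primary's is what lets autoscaling and load-balancing parameters be tuned against real traffic without user-visible risk.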
What teams should watch
Teams running real-time AI inference at scale should watch the pairing of custom Kubernetes endpoint monitoring (via an Endpoint Discovery Service) with the power-of-two-choices load-balancing algorithm: it can sharply reduce the tail-latency hotspots that emerge when traffic fluctuates rapidly. Just as important are asymmetric autoscaling policies, swift on scale-up and cautious on scale-down, to avoid latency spikes during traffic transitions.
Cloud cost efficiency follows from the same combination: dynamic scaling plus load-aware routing lets the infrastructure right-size itself without risking quality regressions or SLA violations. Teams should also consider infrastructure partnerships that embed joint ownership of latency and reliability targets, using shadow testing against real-world traffic to continuously tune platform parameters and surface edge cases early.