Cloudflare has scaled its Security Insights scanning capacity more than tenfold, from 10 to more than 100 scans per second. The higher throughput allows faster detection of security risks across millions of accounts and zones without extra hardware investment, addressing long-standing limits on scan frequency and coverage.
- Throughput scaled to 120+ scans per second through batch processing and workload segregation.
- Kafka consumer patterns redesigned to maximize concurrency within partition order constraints.
- Postgres query optimizations and API stabilizations reduced backend bottlenecks.
Infrastructure signal
Cloudflare reworked its scanning infrastructure around a limit in how Apache Kafka delivers ordered messages: within a consumer group, each partition is read by only one consumer at a time, so the partition count caps parallelism. Instead of adding Kafka partitions, which would have increased broker resource usage, the team introduced batch consumption and parallel goroutine processing within each consumer. This allowed multiple scans to be processed concurrently while maintaining message order per partition.
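The batch-and-goroutines pattern described above can be sketched roughly as follows. This is a minimal illustration, not Cloudflare's code: `Message`, `ProcessBatch`, `ErrScanFailed`, and the worker limit are all hypothetical names, and the Kafka client is left out so the concurrency logic stands alone. The idea it shows is that a consumer handles one fetched batch in parallel, then advances its offset only after every message in the batch has succeeded.

```go
package main

import (
	"errors"
	"sync"
)

// ErrScanFailed is a hypothetical error a scan handler might return.
var ErrScanFailed = errors.New("scan failed")

// Message is a hypothetical stand-in for one Kafka record from a partition.
type Message struct {
	Offset int64
	ZoneID string
}

// ProcessBatch runs handle on every message of a batch concurrently, with at
// most maxWorkers goroutines in flight, and waits for all of them to finish.
// On success it returns the next offset to commit (last offset + 1).
// If any handler fails it returns -1 and one of the errors, so the caller
// can skip the commit and let the whole batch be redelivered.
// An empty batch returns -1 and a nil error: there is nothing to commit.
func ProcessBatch(batch []Message, maxWorkers int, handle func(Message) error) (int64, error) {
	if len(batch) == 0 {
		return -1, nil
	}
	if maxWorkers < 1 {
		maxWorkers = 1
	}

	sem := make(chan struct{}, maxWorkers) // bounds concurrency and memory
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error // first error recorded, not necessarily lowest offset
	)

	for _, msg := range batch {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxWorkers scans are running
		go func(m Message) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := handle(m); err != nil {
				mu.Lock()
				if firstErr == nil {
					firstErr = err
				}
				mu.Unlock()
			}
		}(msg)
	}

	wg.Wait()
	if firstErr != nil {
		return -1, firstErr
	}
	return batch[len(batch)-1].Offset + 1, nil
}
```

Committing once per batch, rather than per message, is what keeps the offset moving forward in partition order even though the scans inside the batch finish in any order. The cost is that a failure or crash causes the whole batch to be processed again.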
Additionally, the architecture was split into 'fast lane' and 'slow lane' consumer groups based on expected processing time. This separation ensured that slow, resource-intensive scans would not block faster ones. Data persistence was optimized by improving Postgres query efficiency and stabilizing the internal API that stores scan results, reducing timeouts and crashes without increasing hardware.
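One way to picture the lane split is a small router that classifies each scan by its expected processing time. The sketch below is an assumption-laden illustration: `Lane`, `ScanJob`, `Router`, and the threshold are hypothetical, and the source does not say how Cloudflare estimates processing time or where the routing decision is made.

```go
package main

import "time"

// Lane names a hypothetical consumer group that a scan is routed to.
type Lane string

const (
	FastLane Lane = "fast"
	SlowLane Lane = "slow"
)

// ScanJob is a hypothetical scan request with an estimated processing time.
type ScanJob struct {
	ScanType string
	Expected time.Duration
}

// Router assigns jobs to lanes. Jobs expected to take longer than
// Threshold go to the slow lane; everything else goes to the fast lane.
type Router struct {
	Threshold time.Duration
}

// LaneFor returns the lane for a single job.
func (r Router) LaneFor(job ScanJob) Lane {
	if job.Expected > r.Threshold {
		return SlowLane
	}
	return FastLane
}

// Split groups jobs by lane, preserving input order within each lane.
func (r Router) Split(jobs []ScanJob) map[Lane][]ScanJob {
	lanes := map[Lane][]ScanJob{FastLane: nil, SlowLane: nil}
	for _, job := range jobs {
		lane := r.LaneFor(job)
		lanes[lane] = append(lanes[lane], job)
	}
	return lanes
}
```

With a two-second threshold, for example, a 100 ms scan and a 30-second scan would land in different lanes, so the long scan cannot hold up the short one.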
Developer impact
Developers gained a faster, more resilient scanning backend that handles higher load without system failures or frequent process crashes. Batch message processing came with trade-offs: the code had to tolerate reprocessing a batch after a mid-batch crash, and memory usage rose because more messages are held in flight at once. In exchange, overall throughput and fault tolerance improved.
The distinct consumer lanes simplified debugging and performance monitoring by isolating slow-processing workloads. Faster API responses and lower database query latency removed backend bottlenecks, so security insights reach developers more reliably and sooner.
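Reprocessing after a mid-batch crash is only safe if handling the same message twice has no extra effect. A minimal sketch of that idea is below, using an in-memory store keyed by a scan ID. `ResultStore`, `SaveOnce`, and the ID format are hypothetical; the source does not describe how Cloudflare deduplicates results. In Postgres, a comparable effect could be had with `INSERT ... ON CONFLICT DO NOTHING` on a unique key.

```go
package main

import "sync"

// ResultStore is a hypothetical in-memory stand-in for the results backend.
// It keeps at most one result per scan ID.
type ResultStore struct {
	mu      sync.Mutex
	results map[string]string
}

// NewResultStore returns an empty store.
func NewResultStore() *ResultStore {
	return &ResultStore{results: make(map[string]string)}
}

// SaveOnce stores result under scanID only if no result exists yet.
// It reports whether this call wrote the result, so a redelivered
// message becomes a harmless no-op instead of a duplicate row.
func (s *ResultStore) SaveOnce(scanID, result string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.results[scanID]; exists {
		return false
	}
	s.results[scanID] = result
	return true
}

// Get returns the stored result for scanID, if any.
func (s *ResultStore) Get(scanID string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	result, ok := s.results[scanID]
	return result, ok
}

// Len returns the number of stored results.
func (s *ResultStore) Len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.results)
}
```

If a consumer crashes after saving half a batch and the batch is redelivered, the already-saved scans are skipped and only the missing ones are written.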
What teams should watch
Engineering teams should monitor metrics related to Kafka consumer lag and memory usage to ensure batch processing does not introduce unseen delays or resource saturation. Observability into consumer group performance split by 'fast lane' and 'slow lane' is critical to maintain balanced load distribution and identify any queueing issues that could regress scan frequency.
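Consumer lag is the gap between the newest offset in a partition and the offset the group has committed. The sketch below shows one way to total it per consumer group and flag lanes that fall behind. The type names, group names, and threshold are hypothetical, and the offsets would in practice come from a Kafka client or metrics exporter, which is not shown here.

```go
package main

import "sort"

// PartitionOffsets holds the two offsets needed to compute lag for one
// partition: the log end offset and the group's committed offset.
type PartitionOffsets struct {
	LogEnd    int64
	Committed int64
}

// Lag returns how many messages the group is behind on this partition.
// It never goes below zero, since the two offsets may be sampled at
// slightly different times.
func (p PartitionOffsets) Lag() int64 {
	if lag := p.LogEnd - p.Committed; lag > 0 {
		return lag
	}
	return 0
}

// GroupLag sums lag across all partitions consumed by one group.
func GroupLag(parts []PartitionOffsets) int64 {
	var total int64
	for _, p := range parts {
		total += p.Lag()
	}
	return total
}

// LaggingGroups returns, in sorted order, the names of the groups whose
// total lag exceeds threshold.
func LaggingGroups(lagByGroup map[string]int64, threshold int64) []string {
	var out []string
	for group, lag := range lagByGroup {
		if lag > threshold {
			out = append(out, group)
		}
	}
	sort.Strings(out)
	return out
}
```

Tracking the fast and slow lanes as separate groups makes it visible when only one of them is queueing, which a single combined lag figure would hide.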
Database teams should keep tracking query performance and Postgres health to prevent persistence bottlenecks. Because scan results are persisted in a central relational store, query optimization and index tuning remain essential as throughput grows. API teams should also watch request timeouts and error rates as throughput expands, so that regressions are caught before they erode the gains.