Why is Queue Time more important than Backend Request Time to auto scaling?

It seems like Backend Request latency is sufficient for auto scaling purposes. This post explores why that isn't true and why we must look deeper at Queue Time.

What is Queue Time?

Queue time is part of the backend response time of an HTTP request. Specifically, it is the time a request spends waiting between the load balancer/router and your application code. A small amount of queue time is expected (up to 10ms), but excessive queue time is a sign of a larger issue with your application.

What is Backend Response Time?

Backend response time is the total time it takes for your application to process a request. Looking at the full backend request life cycle shows the steps involved: requests to your application hit an AWS load balancer and are directed to a random Heroku router, which in turn directs them to the dynos associated with your application. More detail on Heroku routing can be found in their documentation.

[Figure: Routing diagram for Heroku applications]

What impacts Queue Time on Heroku?

Insufficient Dynos

Heroku's routers randomly assign requests to available dynos. If no dynos are available, requests queue up and queue time increases. This is the most common cause of increased request queuing.

Insufficient Web Server Processes/Threads

Web servers like gunicorn or Puma use processes or threads to make the most of a dyno's resources (CPU and memory). Start with 3-5 processes/threads and adjust from there. Ideally, you won't see R14 (Memory quota exceeded) or R15 (Memory quota vastly exceeded) errors on your application's Metrics page in Heroku.

Puma thread count can be adjusted in your application's Procfile or in a configuration file.
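For example, a Procfile entry can pass the thread count directly to Puma via its `-t min:max` flag (the value 5 here is illustrative; tune it for your app):

```
web: bundle exec puma -t 5:5 -p $PORT
```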

For gunicorn, the number of processes can be set by adjusting the WEB_CONCURRENCY environment variable.
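Gunicorn reads WEB_CONCURRENCY as its default worker count, so on Heroku you can set it as a config var (the value 3 is illustrative):

```
heroku config:set WEB_CONCURRENCY=3
```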

For other web servers you can check the Heroku documentation.

Slow App Code

In cases of extremely slow app code execution, it's possible for all threads/processes on all dynos to be busy at once. Hypothetically, imagine an application whose app code takes 30 seconds to execute, running on 2 dynos with 3 processes each: its theoretical maximum throughput is 6 simultaneous requests per 30-second period. Any additional requests in that period queue up and increase queue time.
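The capacity arithmetic in the scenario above can be sketched in a few lines (all numbers are illustrative, not measured values):

```python
def max_requests_per_period(dynos: int, processes_per_dyno: int) -> int:
    """Concurrent request slots available across all dynos in one app-time period."""
    return dynos * processes_per_dyno

def queued_requests(arrivals: int, dynos: int, processes_per_dyno: int) -> int:
    """Requests that must wait (and accrue queue time) in one app-time period."""
    capacity = max_requests_per_period(dynos, processes_per_dyno)
    return max(0, arrivals - capacity)

# 2 dynos x 3 processes with 30s app code -> 6 requests per 30-second period
print(max_requests_per_period(2, 3))  # 6
print(queued_requests(10, 2, 3))      # 4 requests queue up this period
```

Note that when app code is the bottleneck, increasing `arrivals` grows the queue no matter how you tune thresholds; only faster code or more capacity helps.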

This scenario can be diagnosed by checking for H12 (Request timeout) errors on the Metrics page of your Heroku application. Generally, scaling up your application dynos will not help in this situation. Often, these errors can be avoided by moving slow code to asynchronous worker processes like Sidekiq or Python-RQ.

Why use Queue Time to auto scale?

Backend Request Time is Queue Time plus App Time. Heroku's autoscaling for Performance dynos uses Backend Request Time to decide when to scale. Unfortunately, because that measure includes App Time, performance issues in app code trigger false-positive scaling, which costs you more in unneeded dynos.

Backend Request Time = Queue Time + App Time

By eliminating App Time from the auto scaling calculation and focusing on Queue Time, your application scales only when your current dyno count is insufficient to meet the needs of your traffic.
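A minimal sketch of a queue-time-based scaling decision looks like this. The 10ms upscale threshold echoes the "small amount of queue time is expected" figure above; the downscale threshold and dyno limits are illustrative assumptions, not Dynoscale's actual algorithm:

```python
UPSCALE_QUEUE_MS = 10.0    # scale up when average queue time exceeds this
DOWNSCALE_QUEUE_MS = 2.0   # scale down when comfortably below it (hysteresis)

def desired_dynos(current: int, avg_queue_ms: float,
                  min_dynos: int = 1, max_dynos: int = 10) -> int:
    """Pick a dyno count using only queue time, ignoring app time entirely."""
    if avg_queue_ms > UPSCALE_QUEUE_MS:
        return min(current + 1, max_dynos)   # requests are waiting: add capacity
    if avg_queue_ms < DOWNSCALE_QUEUE_MS:
        return max(current - 1, min_dynos)   # queue is empty: shed a dyno
    return current                            # within the band: hold steady

print(desired_dynos(2, 25.0))  # 3: requests are queuing, add a dyno
print(desired_dynos(3, 0.5))   # 2: queue is empty, save a dyno
print(desired_dynos(2, 5.0))   # 2: within the band, no change
```

Because slow app code does not inflate queue time on an idle fleet, this logic avoids the false-positive scaling described above.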

[Figure: Request life cycle for Heroku applications]
Launch with Dynoscale Today!
Launch Dynoscale for your Heroku Application and start saving up to 50% of your dyno costs today!
Heroku CLI
heroku addons:create dscale