Support gateways with multiple replicas#3960

Merged
jvstme merged 2 commits into master from gateway_replicas
Jun 16, 2026

@jvstme commented Jun 12, 2026

A gateway can now have multiple replicas for
improved availability.

type: gateway
name: example-gateway

backend: aws
region: eu-west-1

domain: example.com

certificate: null
replicas: 2

To balance requests between gateway replicas, add
DNS records for each replica or set up a load
balancer outside of dstack. Replica hostnames
are displayed in dstack CLI and UI.

$ dstack gateway list
 NAME             BACKEND          HOSTNAME        DOMAIN       DEFAULT  STATUS
 example-gateway                                   example.com  ✓        running
    replica=0     aws (eu-west-1)  34.244.128.46
    replica=1     aws (eu-west-1)  18.201.201.174
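
As a rough illustration of the DNS option, the sketch below checks that a hostname resolves to every replica address shown by `dstack gateway list`. It is not part of dstack: the function names and the hostname `service.example.com` are hypothetical, and it assumes one A record was added per replica.

```python
import socket
from typing import Callable, Iterable, Set


def resolve_ipv4(hostname: str) -> Set[str]:
    """Return the set of IPv4 addresses that `hostname` currently resolves to."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # For AF_INET, each entry ends with a (address, port) tuple.
    return {info[4][0] for info in infos}


def missing_replica_records(
    hostname: str,
    replica_ips: Iterable[str],
    resolver: Callable[[str], Set[str]] = resolve_ipv4,
) -> Set[str]:
    """Return the replica IPs that have no matching A record for `hostname`."""
    return set(replica_ips) - resolver(hostname)


if __name__ == "__main__":
    # Hypothetical hostname; the IPs are the replica hostnames from the listing.
    missing = missing_replica_records(
        "service.example.com", ["34.244.128.46", "18.201.201.174"]
    )
    print("replicas without a DNS record:", sorted(missing))
```

The resolver is injectable so the check can be exercised without real DNS lookups.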

Limitations:

  • Changing the number of replicas or redeploying
    replicas is not supported.
  • HTTPS is not supported. Use an external load
    balancer for TLS termination.
  • An unavailable gateway replica prevents any new
    services or service replicas from being added.
  • All replicas are bound to the same backend and
    region.
  • At most 3 replicas are allowed per gateway.

Implementation notes:

  • GatewayComputeModel now represents a gateway
    replica.
  • In this version, the terms "compute" and
    "replica" are used interchangeably. The plan is
    to switch to using exclusively "replica" later.
  • In this version, replica provisioning and
    termination are still done in the gateway
    pipeline, for all replicas at once. The plan is
    to introduce gateway replica pipelines later to
    allow for independent replica processing.

#3959

@jvstme jvstme requested a review from r4victor June 12, 2026 01:10
logger.debug(
    "%s replica %d: creating gateway compute", fmt(gateway_model), replica_num
)
gateway_compute_model = await gateways_services.create_gateway_compute(

Collaborator

What if one replica fails while others are provisioned successfully? The successfully provisioned replicas need to be cleaned up.

Collaborator Author

The gateway will enter the failed status, but any successfully provisioned replicas will remain in the database and can be deleted with dstack gateway delete.

This is consistent with the existing handling of single-replica gateway provisioning failures. If dstack creates an instance for a gateway but later fails to connect to it, the instance is not cleaned up automatically and can only be removed along with the failed gateway using dstack gateway delete.

While this behavior may be counterintuitive and worth revisiting, I think it will be easier to address after gateway replica statuses and pipelines are introduced, which I plan to implement in the next iteration.
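
The behavior described above can be sketched roughly as follows. This is a simplified model, not dstack's pipeline code: the `Gateway` class, the `submitted` status, and both function names are hypothetical, and replicas are created sequentially for clarity.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Gateway:
    # Hypothetical stand-in for the gateway record; not dstack's model.
    name: str
    replicas: int
    status: str = "submitted"  # assumed initial status
    provisioned: List[int] = field(default_factory=list)


def provision_gateway(gateway: Gateway, create_replica: Callable[[int], None]) -> None:
    """Provision all replicas at once; on failure, keep what was already created."""
    for replica_num in range(gateway.replicas):
        try:
            create_replica(replica_num)
        except Exception:
            # No automatic cleanup: earlier replicas stay recorded until
            # the whole gateway is deleted.
            gateway.status = "failed"
            return
        gateway.provisioned.append(replica_num)
    gateway.status = "running"


def delete_gateway(gateway: Gateway, terminate_replica: Callable[[int], None]) -> None:
    """Remove the gateway together with every replica that was provisioned."""
    for replica_num in gateway.provisioned:
        terminate_replica(replica_num)
    gateway.provisioned.clear()
```

In this model, a failure on the second replica leaves the first one recorded under a `failed` gateway, and only `delete_gateway` (the analogue of `dstack gateway delete`) removes it.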

" Set to `null` to disable. Defaults to `type: lets-encrypt`"
),
] = LetsEncryptGatewayCertificate()
replicas: Annotated[

Collaborator

Let's put a technical upper bound on the number of provisioned replicas, e.g. 20 (to avoid provisioning 1000 replicas in a loop).

Collaborator Author

Added a very conservative limit that has been tested and is more likely to work well. It is enforced on the server side so that it can be revisited as the implementation improves.
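
A server-side bound of this kind might look like the sketch below. It is only an illustration: `MAX_GATEWAY_REPLICAS`, `GatewayConfigError`, and `validate_replicas` are hypothetical names, and the value 3 is assumed from the limitations list in the PR description rather than read from the code.

```python
MAX_GATEWAY_REPLICAS = 3  # assumed value, taken from the limitations list


class GatewayConfigError(ValueError):
    """Hypothetical error type for rejected gateway configurations."""


def validate_replicas(replicas: int, max_replicas: int = MAX_GATEWAY_REPLICAS) -> int:
    """Reject replica counts outside the range the server is willing to provision."""
    if replicas < 1:
        raise GatewayConfigError("a gateway needs at least 1 replica")
    if replicas > max_replicas:
        raise GatewayConfigError(
            f"at most {max_replicas} replicas are allowed per gateway, got {replicas}"
        )
    return replicas
```

Keeping the bound in one server-side constant means it can be raised later without changing the configuration schema that clients use.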

Comment on lines +630 to +632
stats = await conn.get_stats(project_name, run_name)
if stats is None:  # Stats not fetched yet
    return None

Collaborator

If any one gateway replica becomes unavailable, autoscaling breaks. I expect this will need to be fixed for HA, so it is worth adding a TODO/FIXME.
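
To make the concern concrete, here is a minimal sketch of how an early `return None` inside a loop over gateway replica connections discards the stats from every other replica. `FakeConnection`, `collect_requests`, and the `requests` key are hypothetical; how the real code combines per-replica stats is an assumption.

```python
import asyncio
from typing import Dict, List, Optional


class FakeConnection:
    """Hypothetical stand-in for a connection to one gateway replica."""

    def __init__(self, stats: Optional[Dict[str, int]]) -> None:
        self._stats = stats

    async def get_stats(self, project_name: str, run_name: str) -> Optional[Dict[str, int]]:
        return self._stats


async def collect_requests(
    connections: List[FakeConnection], project_name: str, run_name: str
) -> Optional[int]:
    """Sum request counts across replicas, mirroring the early return in the diff."""
    total = 0
    for conn in connections:
        stats = await conn.get_stats(project_name, run_name)
        if stats is None:  # Stats not fetched yet, or the replica is unavailable
            # FIXME: one replica without stats hides the stats of all the others.
            return None
        total += stats["requests"]
    return total
```

With two healthy replicas the sum is returned, but adding a single connection whose stats are `None` makes the whole result `None`, which leaves the autoscaler with nothing to act on.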

" NOTE: if you just updated dstack from pre-0.19.25 to 0.19.25+,"
" expect to see this warning once for every running service replica"
),
for conn in connections:

@r4victor r4victor Jun 15, 2026

Collaborator

What happens if registration succeeds on some gateway replicas and fails on others? It seems the job won't be considered registered, so it won't be unregistered from the gateway replicas where registration succeeded. The same applies to service registration.

I'm not sure about the consequences.

Collaborator Author

Dangling service replicas are possible, but they shouldn't have any visible effect. Such replicas will be terminated by dstack (since registration failed on one of the gateway replicas), so successful gateway replicas won't be able to connect to them, and Nginx won't forward any requests to them. The dangling replicas will be removed from the gateway when the service is unregistered.

As for dangling services, they were already possible before if a service unregistration request failed. They are cleaned up when a service with the same name is registered again.

This is worth addressing, but I think it can wait until the state synchronization mechanism is introduced, so that we don't have to implement distributed transaction orchestration that will soon become obsolete.
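
The partial-registration scenario discussed in this thread can be modeled with the sketch below. It is not dstack's implementation: `register_everywhere` and the replica names are hypothetical, and stopping at the first failure is an assumption about the control flow.

```python
from typing import Callable, Iterable, Set, Tuple


def register_everywhere(
    gateway_replicas: Iterable[str],
    register: Callable[[str], None],
) -> Tuple[bool, Set[str]]:
    """Try to register a service replica on every gateway replica.

    Returns (registered, dangling). `registered` is True only if every
    gateway replica accepted the registration. `dangling` holds the gateway
    replicas that accepted it before the overall registration failed.
    """
    succeeded: Set[str] = set()
    for gateway_replica in gateway_replicas:
        try:
            register(gateway_replica)
        except Exception:
            # Assumed behavior: stop at the first failure without rolling back.
            return False, succeeded
        succeeded.add(gateway_replica)
    return True, set()
```

In this model the entries in `dangling` are never rolled back explicitly, which corresponds to the dangling replicas that are only removed from the gateway when the service is unregistered.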

@jvstme jvstme requested a review from r4victor June 16, 2026 00:15
@jvstme jvstme merged commit c2f32dd into master Jun 16, 2026
24 checks passed
@jvstme jvstme deleted the gateway_replicas branch June 16, 2026 12:08