[dss_bench] Tool to generate automatic graphs for q/s based on various parameters #1519

Open
the-glu wants to merge 1 commit into interuss:main from Orbitalize:dss_bench

the-glu commented Jun 18, 2026

Member

Follows #1518.

This PR adds a new tool to generate meaningful graphs to compare the performance of various scenarios.

We currently have Locust tests. They serve some purposes (mainly observing variations over time), but using them to validate performance is time-consuming and error-prone. We also tend to use different, incompatible parameters between tests.
An extra consideration is that CockroachDB distributes data differently on every run, so tests with NUM_USS and NUM_NODE greater than one must average performance across every DSS, not just the first one.

The framework proposed here aims to measure performance as a single data point: there is no change over time and, in theory, each test cleans up after itself. For example, one test creates and deletes a single operational intent (included in this PR).

Then we add a variant, which provides the X-axis of our graphs. Several variants are possible, for example the number of existing subscriptions or the number of workers. This PR includes an inter-USS latency context as an example.

Finally, an option is available to compare different images or different datastores (for example, a PR against master, or one datastore against another, which will be needed for Raft).

The framework automatically cleans up and runs 'start-locally' for every data point, then produces a graph. A JSON file is also stored for future use.
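
As a rough illustration of the per-data-point loop described above, here is a minimal sketch. The names `run_sweep`, `reset_environment`, and `measure` are hypothetical and not taken from this PR; `reset_environment` stands in for whatever cleans up and restarts the local deployment, and the JSON layout is only an assumption.

```python
import json
from pathlib import Path
from typing import Callable


def run_sweep(
    variant_values: list[float],
    reset_environment: Callable[[float], None],  # hypothetical: clean up and redeploy
    measure: Callable[[], float],  # hypothetical: run the test, return q/s
    output_path: Path,
) -> list[dict[str, float]]:
    """Collect one data point per variant value and save them as JSON."""
    points = []
    for value in variant_values:
        # Start from a fresh deployment so data points do not influence each other.
        reset_environment(value)
        points.append({"x": value, "qps": measure()})
    # Keep the raw numbers so graphs can be regenerated or compared later.
    output_path.write_text(json.dumps(points, indent=2))
    return points
```

Redeploying for every point is what makes a run slow, but it keeps each measurement independent of the previous one.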

The test is executed against all DSS at the same time and averaged.
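
To make the averaging concrete, the sketch below merges latency samples collected from several DSS instances and reports statistics both over successes only and with errors included. The function name `summarize`, the input layout, and the returned keys are assumptions for illustration; only the `merged`, `merged_errors`, and `with_errors` names mirror the diff excerpt quoted later in this review.

```python
import statistics


def summarize(
    successes_ms: dict[str, list[float]],
    errors_ms: dict[str, list[float]],
    duration_s: float,
) -> dict[str, float]:
    """Merge per-DSS latency samples and compute aggregate statistics.

    Assumes at least one successful sample overall.
    """
    merged = [v for samples in successes_ms.values() for v in samples]
    merged_errors = [v for samples in errors_ms.values() for v in samples]
    # Including failed queries avoids reporting only the "survivors".
    with_errors = merged + merged_errors
    per_dss_qps = [len(samples) / duration_s for samples in successes_ms.values()]
    return {
        "mean_qps_per_dss": statistics.fmean(per_dss_qps),
        "median_ms_successes": statistics.median(merged),
        "median_ms_with_errors": statistics.median(with_errors),
        "error_rate": len(merged_errors) / len(with_errors),
    }
```

Averaging over every DSS rather than the first one is what compensates for CockroachDB placing data differently on each run.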

Example graph with latest version:

image

This allows us to generate useful graphs, like this one showing how latency heavily impacts queries as simple as RID operational intents:

image

(⚠️ This graph was generated before the tool displayed errors)

Another example comparing the current master and the latest release on RID:

image

(⚠️ This graph was generated before the tool displayed errors)

This shows small variations (at least in terms of QPS), probably because I ran it on my machine while other processes were running. Tests should probably be run on a dedicated machine, as free from external influences as possible. The graphs shown here are for demonstration only.

Note that a run can take a significant amount of time, especially with database initialization at high latencies.

This PR is a first step; the goal is to add more tests and variants in future PRs, especially a RID ISA test with one subscription and one SCD test (based on flightinsubs).

Comment thread monitoring/loadtest/locust_files/RID.py Outdated
Comment thread monitoring/dss_bench/contexts/base.py Outdated
from monitoring.dss_bench.tests.base import BenchTest


def discover() -> dict[str, type[BenchTest]]:

Since this is already used twice, is it worth making a more reusable generic?

Suggested change
def discover() -> dict[str, type[BenchTest]]:
def discover[T]() -> Iterable[type[T]]:

(with usage {bt.name: bt for bt in discover()})
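
One possible shape for such a reusable helper is sketched below. It is an assumption, not the PR's implementation: it takes the base class as an argument and walks `__subclasses__()`, whereas the actual discovery mechanism is not shown here. It uses `TypeVar` rather than the `def discover[T]` syntax so that it also runs on Python versions before 3.12. `BenchTest` and `CreateDeleteOI` are stand-ins defined only for the example.

```python
from collections.abc import Iterable
from typing import TypeVar

T = TypeVar("T")


def discover(base: type[T]) -> Iterable[type[T]]:
    """Yield every direct or indirect subclass of `base` that has been imported."""
    for sub in base.__subclasses__():
        yield sub
        yield from discover(sub)


class BenchTest:  # stand-in for the real base class
    name: str = ""


class CreateDeleteOI(BenchTest):  # hypothetical concrete test
    name = "create_delete_oi"


# Usage as proposed in the comment above: build the name lookup at the call site.
tests = {bt.name: bt for bt in discover(BenchTest)}
```

Only subclasses whose modules have already been imported are found this way, so a real helper would probably still need to import the relevant package first.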

Comment on lines +8 to +9
scopes: list[str] = []
default: bool = True

name is probably sufficiently self-documenting, but I'm not sure what these are from inspection and this is a base class that will be used in (presumably) a number of places -- let's document what these are.

Comment thread monitoring/dss_bench/README.md

try:
    test.setup(session, base_url)
except Exception:

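This catch is overbroad: it swallows every setup failure, and a bare `except:` would even prevent the user from cancelling execution with KeyboardInterrupt (`except Exception` does not catch it, since KeyboardInterrupt derives from BaseException). It seems like we should be much narrower in the exceptions we catch. What exceptions would we want to accept and continue for here? Wouldn't we expect the setup to work, and want to stop a test as probably invalid if the setup wasn't successful?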
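
A narrower handler could look like the sketch below. `SetupError` and `run_setup` are hypothetical names introduced for illustration; the point is only that a single expected failure type is handled, while programming errors and Ctrl-C propagate and stop the run.

```python
class SetupError(RuntimeError):
    """Hypothetical error a test raises when its setup cannot complete."""


def run_setup(setup) -> bool:
    """Return True if setup succeeded, False on an expected setup failure."""
    try:
        setup()
    except SetupError as e:
        # The data point is invalid without a working setup, so report and skip it.
        print(f"Setup failed, skipping measurement: {e}")
        return False
    return True
```

Whether a failed setup should skip the data point, as here, or abort the whole run is a design decision for the PR; the sketch only shows the narrower catch.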

    test.action(session, base_url)
    latencies_ms.append((time.monotonic() - t0) * 1000.0)
    done += 1
except Exception:

This seems like an overbroad catch; could we just use query_and_describe to catch the right exceptions in the right circumstances and then check whether the query succeeded?
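
The idea of checking the outcome instead of catching everything can be sketched as follows. `query_and_describe` itself is not reproduced here because its signature is not shown in this conversation; `QueryResult` is a hypothetical stand-in for whatever it returns, and `measure` is an illustrative name.

```python
import time
from dataclasses import dataclass


@dataclass
class QueryResult:
    """Hypothetical stand-in for a described query."""

    status_code: int

    @property
    def success(self) -> bool:
        return 200 <= self.status_code < 300


def measure(action, iterations: int) -> tuple[list[float], list[float]]:
    """Time `action` repeatedly, keeping successful and failed latencies apart."""
    latencies_ms: list[float] = []
    error_latencies_ms: list[float] = []
    for _ in range(iterations):
        t0 = time.monotonic()
        result = action()  # expected to return a result rather than raise
        elapsed_ms = (time.monotonic() - t0) * 1000.0
        if result.success:
            latencies_ms.append(elapsed_ms)
        else:
            error_latencies_ms.append(elapsed_ms)
    return latencies_ms, error_latencies_ms
```

With this shape the measurement loop needs no `except` clause at all, and unexpected exceptions surface as real failures.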



def run_test(
    test: BenchTest, targets: list[tuple[str, str]], cfg: GlobalConfig

It's hard to figure out what "targets" is, requiring tracing through the code; let's just make a simple data structure so it's super clear:

from dataclasses import dataclass


@dataclass
class Target:
    base_url: str
    audience: str
Suggested change
test: BenchTest, targets: list[tuple[str, str]], cfg: GlobalConfig
test: BenchTest, targets: list[Target], cfg: GlobalConfig

...but, it doesn't seem like carrying audience is even necessary since it's a function of the base URL (using an AuthAdapter/UTMClientSession will take care of this automatically).


Each DSS node is published on the host at port 80<NN> where NN is the
2-digit global node index, and validates JWTs whose audience equals its
hostname dss<j>.uss<i>.localutm. We therefore hit http://localhost:80NN

The audience should be determinable by the FQDN, so we should just have DSS instances accept localhost as an audience like mock_uss instances do (multiple audiences are fine).

# survivorship bias of percentiles computed over successes only.
with_errors = merged + merged_errors

return {


2 participants