[dss_bench] Tool to generate automatic graphs for q/s based on various parameters#1519
the-glu wants to merge 1 commit into
| from monitoring.dss_bench.tests.base import BenchTest | ||
| def discover() -> dict[str, type[BenchTest]]: |
Since this is already used twice, is it worth making a more reusable generic?
| def discover() -> dict[str, type[BenchTest]]: | |
| def discover[T]() -> Iterable[type[T]]: |
(with usage {bt.name: bt for bt in discover()})
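A minimal sketch of what such a reusable helper could look like, assuming discovery means walking the already-imported subclasses of a base class. The `base` parameter and the stand-in classes below are hypothetical; the PR's actual discovery mechanism may work differently (for example by scanning a package).

```python
from collections.abc import Iterable
from typing import TypeVar

T = TypeVar("T")


def discover(base: type[T]) -> Iterable[type[T]]:
    """Yield every direct or indirect subclass of `base` that is currently imported."""
    seen: set[type] = set()
    stack = list(base.__subclasses__())
    while stack:
        cls = stack.pop()
        if cls in seen:
            continue
        seen.add(cls)
        yield cls
        stack.extend(cls.__subclasses__())


# Hypothetical stand-ins for BenchTest and two concrete tests.
class BenchTest:
    name: str = ""


class CreateDelete(BenchTest):
    name = "create_delete"


class Search(BenchTest):
    name = "search"


# Callers build whatever index they need from the generic iterable.
tests = {bt.name: bt for bt in discover(BenchTest)}
```

Passing the base class explicitly lets a type checker infer `T`, which a parameterless generic function cannot do.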
| scopes: list[str] = [] | ||
| default: bool = True |
name is probably sufficiently self-documenting, but I'm not sure what these are from inspection and this is a base class that will be used in (presumably) a number of places -- let's document what these are.
| try: | ||
| test.setup(session, base_url) | ||
| except Exception: |
This catch is very broad. (`except Exception` does let KeyboardInterrupt through, since that derives from BaseException, but it still swallows every other failure.) It seems like we should be much narrower in the exceptions we catch. What exceptions would we want to accept and continue for here? Wouldn't we expect the setup to work, and want to stop a test as probably invalid if the setup wasn't successful?
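A minimal sketch of a narrower catch, assuming setup signals its expected failures with a dedicated exception type. `SetupError` and `try_setup` are hypothetical names, not part of the PR.

```python
class SetupError(Exception):
    """Hypothetical: the only failure a test's setup() is expected to raise."""


def try_setup(setup) -> bool:
    """Run `setup`; return False if it failed in the expected way.

    Only SetupError is handled. Programming errors (e.g. TypeError) and
    KeyboardInterrupt propagate to the caller and stop the run.
    """
    try:
        setup()
    except SetupError as e:
        print(f"setup failed, marking test invalid: {e}")
        return False
    return True


def failing_setup() -> None:
    raise SetupError("DSS unreachable")


def interrupted_setup() -> None:
    raise KeyboardInterrupt


ok = try_setup(lambda: None)
failed = try_setup(failing_setup)
```

With this shape, the caller decides whether a `False` result aborts the whole run or only skips the affected data point.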
| test.action(session, base_url) | ||
| latencies_ms.append((time.monotonic() - t0) * 1000.0) | ||
| done += 1 | ||
| except Exception: |
This seems like an overbroad catch; could we just use query_and_describe to catch the right exceptions in the right circumstances and then check whether the query succeeded?
| def run_test( | ||
| test: BenchTest, targets: list[tuple[str, str]], cfg: GlobalConfig |
It's hard to figure out what "targets" is without tracing through the code; let's just make a simple data structure so it's super clear:
@dataclass
class Target:
    base_url: str
    audience: str
| test: BenchTest, targets: list[tuple[str, str]], cfg: GlobalConfig | |
| test: BenchTest, targets: list[Target], cfg: GlobalConfig |
...but, it doesn't seem like carrying audience is even necessary since it's a function of the base URL (using an AuthAdapter/UTMClientSession will take care of this automatically).
| Each DSS node is published on the host at port 80<NN> where NN is the | ||
| 2-digit global node index, and validates JWTs whose audience equals its | ||
| hostname dss<j>.uss<i>.localutm. We therefore hit http://localhost:80NN |
The audience should be determinable by the FQDN, so we should just have DSS instances accept localhost as an audience like mock_uss instances do (multiple audiences are fine).
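For reference, the addressing scheme described in the docstring under review can be sketched as follows. The function names are hypothetical, and whether indices start at 0 or 1 is not stated in the excerpt, so the sketch takes them as given.

```python
def node_base_url(global_index: int) -> str:
    """Host-side URL for a DSS node published on port 80<NN>."""
    if not 0 <= global_index <= 99:
        raise ValueError("global node index must fit in two digits")
    return f"http://localhost:80{global_index:02d}"


def node_audience(uss_index: int, dss_index: int) -> str:
    """JWT audience the node currently validates: its hostname dss<j>.uss<i>.localutm."""
    return f"dss{dss_index}.uss{uss_index}.localutm"


url = node_base_url(3)
audience = node_audience(uss_index=1, dss_index=2)
```

If DSS instances also accepted `localhost` as an audience, as suggested above, the second function would no longer be needed.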
| # survivorship bias of percentiles computed over successes only. | ||
| with_errors = merged + merged_errors | ||
| return { |
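To illustrate the survivorship bias mentioned in the code comment, here is a small sketch with made-up numbers. It assumes `merged` holds the latencies of successful requests and `merged_errors` the latencies of failed ones (for example timeouts); the `percentile` helper is hypothetical and uses the nearest-rank definition, which may not match the PR's implementation.

```python
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Illustrative latencies in milliseconds, not measured data.
merged = [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 12.0]
merged_errors = [5000.0, 5000.0]
with_errors = merged + merged_errors

p95_success_only = percentile(merged, 95)   # ignores the slow failures
p95_all = percentile(with_errors, 95)       # reflects what callers experienced
```

In this example the success-only p95 stays near the typical latency while the p95 over all requests jumps to the timeout value, which is the effect the comment warns about.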
Follow-up to #1518
This PR adds a new tool to generate meaningful graphs to compare the performance of various scenarios.
As of now, we do have Locust tests. They serve some purposes (mainly variations over time), but using them to validate performance can be time-consuming and prone to error. We also have a tendency to use various, incompatible parameters between tests.
An extra consideration is the fact that CockroachDB data is distributed differently between every run, meaning that tests with NUM_USS and NUM_NODE greater than one must average performance across every DSS, not just the first one.
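One possible reading of "average performance across every DSS" is sketched below: run the same benchmark against every instance concurrently and take the mean of the per-instance q/s. The `bench` callable and the URLs are hypothetical placeholders, and the tool itself may merge raw samples rather than average per-instance results.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable


def run_on_all(base_urls: list[str], bench: Callable[[str], float]) -> float:
    """Run `bench` against every base URL at the same time; return the mean q/s."""
    if not base_urls:
        raise ValueError("at least one DSS instance is required")
    with ThreadPoolExecutor(max_workers=len(base_urls)) as pool:
        per_instance = list(pool.map(bench, base_urls))
    return mean(per_instance)


# Fake benchmark: pretend each instance achieved a fixed q/s.
fake_qps = {
    "http://localhost:8001": 100.0,
    "http://localhost:8002": 80.0,
    "http://localhost:8003": 90.0,
}
average_qps = run_on_all(list(fake_qps), fake_qps.__getitem__)
```

Measuring only the first instance would report 100.0 here, while the average over all three is lower.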
The framework proposed here aims to measure performance as a single point: no change over time, and in theory, each test cleans up after itself. Example: a test that creates and deletes a single operational intent (included here as an example).
Then, we add a variant, which represents the X-axis of our graphs. These could be multiple; for example: the number of existing subscriptions, or the number of workers. This PR includes an inter-USS latency context as an example.
Finally, an option is available to compare different images or different datastores, with the idea of doing comparisons (for example, in a PR against master, or to compare performance between datastores, which will be needed for Raft).
The framework automatically cleans up and runs 'start-locally' for every data point, then produces a graph. A JSON file is also stored for future use.
The test is executed against all DSS at the same time and averaged.
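The overall loop described above (one measurement per variant value, results kept as JSON for later plotting) can be sketched roughly as follows. Everything here is an assumption for illustration: `sweep`, `measure`, and the JSON layout are hypothetical, and the redeployment step is only indicated by a comment.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def sweep(
    variant_values: list[float],
    measure: Callable[[float], float],
    out_path: Path,
) -> list[dict[str, float]]:
    """Measure q/s once per variant value and store the points as JSON."""
    points = []
    for x in variant_values:
        # The real tool would clean up and restart the local deployment here.
        points.append({"x": x, "qps": measure(x)})
    out_path.write_text(json.dumps({"points": points}, indent=2))
    return points


def fake_measure(latency_ms: float) -> float:
    """Stand-in for a benchmark run: q/s drops as added latency grows."""
    return 1000.0 / (1.0 + latency_ms)


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "results.json"
    points = sweep([0.0, 9.0], fake_measure, out)
    stored = json.loads(out.read_text())
```

Each entry in `points` corresponds to one data point on the graph, with the variant value on the X-axis.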
Example graph with latest version:
This allows us to generate useful graphs, like this one showing how latency heavily impacts queries as simple as RID operational intents:
(⚠️ This graph has been generated before displaying errors)
Another example comparing the current master and the latest release on RID:
(⚠️ This graph has been generated before displaying errors)
This shows small variations (at least in terms of QPS), probably explained by the fact that I ran it on my machine while other processes were running. Note that tests should probably be run on a dedicated machine, as free from external influences as possible. The graphs shown here are only for demonstration.
Note that a run can take a significant amount of time, especially with database initialization at high latencies.
This PR is a first test; the goal is to add more tests or variants in future PRs, especially a RID ISA with one subscription, and one SCD test (based on flightinsubs).