Skip to content

fix(core): self-heal etcd meta watch on transport reconnect#3062

Merged
imbajin merged 3 commits into
apache:masterfrom
dpol1:fix/3036-meta-listener-reconnect
Jun 23, 2026
Merged

fix(core): self-heal etcd meta watch on transport reconnect#3062
imbajin merged 3 commits into
apache:masterfrom
dpol1:fix/3036-meta-listener-reconnect

Conversation

@dpol1

@dpol1 dpol1 commented Jun 22, 2026

Copy link
Copy Markdown
Member

Purpose of the PR

CachedSchemaTransactionV2 registers a JVM-global watch on the meta store so that a schema change on one node clears the schema cache on the other nodes. That watch goes through EtcdMetaDriver.listen / listenPrefix, which handed jetcd the bare Consumer<WatchResponse> overload. That overload discards onError and onCompleted, so when jetcd ends a watch (e.g. after a transport reconnect tears down the gRPC stream) the listener stopped receiving events with no log line and no exception. The node kept serving stale schema and nothing reported it. PdMetaDriver does not have this problem because its KvClient already re-subscribes on error.

Main Changes

  • EtcdMetaDriver.listen / listenPrefix now register a Watch.Listener instead of the bare Consumer overload, surfacing both onError and onCompleted.
    • onError only logs at WARN level. jetcd 0.5.9 (WatchImpl.WatcherImpl.handleError) already retries recoverable errors internally by notifying onError and rescheduling resume() on the same watcher; re-subscribing from onError would therefore open a duplicate watch on every transient reconnect.
    • onCompleted schedules a re-subscribe after a 1 s backoff on a daemon thread. onCompleted is jetcd's terminal-close signal: the old watcher is already removed and will not recover, so a replacement watch is safe and necessary. This mirrors the self-heal PdMetaDriver already gets from KvClient.
    • If the re-subscribe itself throws (e.g. the endpoint is still unreachable), it is retried with the same backoff instead of abandoning recovery after a single failure.
  • Removed CachedSchemaTransactionV2.resetMetaListenerForReconnect(). It was a manual stopgap added in fix(server): sync hstore schema cache clears #3011 with no callers, and it never shipped in a tagged release. Now that the driver self-heals, the JVM-global register-once flag is correct to stay set; the manual reset has nothing left to do. The lifecycle comment on metaEventListenerRegistered is updated to describe the self-heal behaviour.
  • The MetaDriver interface is unchanged, so neither implementor needs edits beyond EtcdMetaDriver.

Known limitation: the re-subscribe opens a fresh watch without a stored revision, so any cache-clear events emitted during the short reconnect window are not replayed. This is the same behaviour as PdMetaDriver / KvClient. The change removes the permanent silent failure; it does not add gap replay.

Verifying these changes

  • Need tests and can be verified as follows:
    • New EtcdMetaDriverTest (JUnit + Mockito) mocks the jetcd Client/Watch via the package-private test constructor (no live etcd needed) and asserts:
      1. onCompleted triggers a re-subscribe (a second watch(...) call).
      2. onError does not trigger a re-subscribe (only logs), so no duplicate watch is opened.
      3. The re-created prefix watch preserves the original key and isPrefix WatchOption.
      4. onNext still delivers events to the consumer.
      5. A re-watch that throws once is retried and eventually succeeds (no permanent loss).
    • Local run: EtcdMetaDriverTest 5/5, CachedSchemaTransactionTest 18/18, MetaManagerSchemaCacheClearEventTest 6/6 after the stopgap removal.
    • Red-green: commenting out the re-subscribe in onCompleted makes the recovery tests fail; restoring it makes them pass.

Does this PR potentially affect the following parts?

  • Dependencies
  • Modify configurations
  • The public API
  • Other affects (typed here)
  • Nope

Documentation Status

  • Doc - TODO
  • Doc - Done
  • Doc - No Need

dpol1 added 2 commits June 22, 2026 18:35
)

EtcdMetaDriver.listen/listenPrefix handed jetcd a bare Consumer<WatchResponse>,
so a terminal watch error (e.g. after a transport reconnect) was swallowed and
the JVM-global schema-cache-clear listener died silently: a node stopped
receiving cross-node cache-clear events with no error or warning.

Switch to the Watch.Listener overload and re-subscribe on onError/onCompleted
via a daemon-backed backoff, mirroring the self-heal PdMetaDriver already gets
from KvClient. The driver watch now stays live across reconnects, so
CachedSchemaTransactionV2's register-once flag staying true is correct; the
unused resetMetaListenerForReconnect stopgap and its TODO are removed.

Add EtcdMetaDriverTest covering re-subscribe on error/completion and event
delivery, and register it in UnitTestSuite.
@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. bug Something isn't working tests Add or improve test cases labels Jun 22, 2026
imbajin

This comment was marked as low quality.

@dosubot dosubot Bot added the lgtm This PR has been approved by a maintainer label Jun 22, 2026

@imbajin imbajin left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking issues found in the etcd watch recovery path: retryable jetcd watch errors can now create duplicate active watchers, and a failed delayed re-watch attempt can stop recovery permanently. Please fix these before merging; I also left one non-blocking test-hardening note.

@dosubot dosubot Bot removed the lgtm This PR has been approved by a maintainer label Jun 23, 2026
Address review on apache#3062:

- jetcd 0.5.9 WatcherImpl.handleError already retries recoverable errors by
  notifying onError and rescheduling resume() on the same watcher. Re-subscribing
  from onError therefore opened a duplicate watch on every transient reconnect.
  Re-subscribe now happens only from onCompleted, jetcd's terminal-close signal,
  where the old watcher is already removed; onError only logs.
- Guard the scheduled re-watch: if the re-subscribe itself throws (endpoint still
  unreachable), it is retried with the same backoff instead of abandoning recovery
  after a single failure.
- Strengthen tests: assert onError does not re-watch, assert the re-created prefix
  watch preserves the key and prefix WatchOption, and cover a re-watch that throws
  once then succeeds.
@dpol1 dpol1 requested a review from imbajin June 23, 2026 07:50

@imbajin imbajin left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking: no. Summary: No obvious issues found in the current head. Evidence: git diff --check origin/master...HEAD; EtcdMetaDriverTest passed 5 tests.

Non-blocking follow-ups:

  • Please update the PR description to match the latest implementation: onError now only logs because jetcd retries recoverable errors itself, while re-watch is scheduled from terminal onCompleted; the test count is now 5, not 4.
  • Consider reducing recoverable watch onError logging from WARN with stack trace to INFO/DEBUG or a rate-limited log, since jetcd is expected to self-retry those transient failures.
  • If HugeGraph treats public core methods as a compatibility surface, consider keeping CachedSchemaTransactionV2.resetMetaListenerForReconnect() as a deprecated no-op or documented fallback. Given it was an uncalled stopgap that did not ship in a tagged release, I do not consider this blocking.

@dosubot dosubot Bot added the lgtm This PR has been approved by a maintainer label Jun 23, 2026
@imbajin imbajin merged commit 398a5ed into apache:master Jun 23, 2026
15 of 16 checks passed
@dpol1 dpol1 deleted the fix/3036-meta-listener-reconnect branch June 23, 2026 11:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working lgtm This PR has been approved by a maintainer size:L This PR changes 100-499 lines, ignoring generated files. tests Add or improve test cases

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Improvement] Re-register HStore schema cache clear listener after a Meta reconnect

2 participants