fix(core): self-heal etcd meta watch on transport reconnect by dpol1 · Pull Request #3062 · apache/hugegraph

dpol1 · 2026-06-22T17:00:58Z

Purpose of the PR

close [Improvement] Re-register HStore schema cache clear listener after a Meta reconnect #3036

CachedSchemaTransactionV2 registers a JVM-global watch on the meta store so that a schema change on one node clears the schema cache on the other nodes. That watch goes through EtcdMetaDriver.listen / listenPrefix, which handed jetcd the bare Consumer<WatchResponse> overload. That overload discards onError and onCompleted, so when jetcd ends a watch (e.g. after a transport reconnect tears down the gRPC stream) the listener stopped receiving events with no log line and no exception. The node kept serving stale schema and nothing reported it. PdMetaDriver does not have this problem because its KvClient already re-subscribes on error.

Main Changes

EtcdMetaDriver.listen / listenPrefix now register a Watch.Listener instead of the bare Consumer overload, surfacing both onError and onCompleted.
- onError only logs at WARN level. jetcd 0.5.9 (WatchImpl.WatcherImpl.handleError) already retries recoverable errors internally by notifying onError and rescheduling resume() on the same watcher; re-subscribing from onError would therefore open a duplicate watch on every transient reconnect.
- onCompleted schedules a re-subscribe after a 1 s backoff on a daemon thread. onCompleted is jetcd's terminal-close signal: the old watcher is already removed and will not recover, so a replacement watch is safe and necessary. This mirrors the self-heal PdMetaDriver already gets from KvClient.
- If the re-subscribe itself throws (e.g. the endpoint is still unreachable), it is retried with the same backoff instead of abandoning recovery after a single failure.
Removed CachedSchemaTransactionV2.resetMetaListenerForReconnect(). It was a manual stopgap added in fix(server): sync hstore schema cache clears #3011 with no callers, and it never shipped in a tagged release. Now that the driver self-heals, the JVM-global register-once flag is correct to stay set; the manual reset has nothing left to do. The lifecycle comment on metaEventListenerRegistered is updated to describe the self-heal behaviour.
The MetaDriver interface is unchanged, so neither implementor needs edits beyond EtcdMetaDriver.

Known limitation: the re-subscribe opens a fresh watch without a stored revision, so any cache-clear events emitted during the short reconnect window are not replayed. This is the same behaviour as PdMetaDriver / KvClient. The change removes the permanent silent failure; it does not add gap replay.

Verifying these changes

Need tests and can be verified as follows:
- New EtcdMetaDriverTest (JUnit + Mockito) mocks the jetcd Client/Watch via the package-private test constructor (no live etcd needed) and asserts:
  1. onCompleted triggers a re-subscribe (a second watch(...) call).
  2. onError does not trigger a re-subscribe (only logs), so no duplicate watch is opened.
  3. The re-created prefix watch preserves the original key and isPrefix WatchOption.
  4. onNext still delivers events to the consumer.
  5. A re-watch that throws once is retried and eventually succeeds (no permanent loss).
- Local run: EtcdMetaDriverTest 5/5, CachedSchemaTransactionTest 18/18, MetaManagerSchemaCacheClearEventTest 6/6 after the stopgap removal.
- Red-green: commenting out the re-subscribe in onCompleted makes the recovery tests fail; restoring it makes them pass.

Does this PR potentially affect the following parts?

Documentation Status

Doc - TODO
Doc - Done
Doc - No Need

) EtcdMetaDriver.listen/listenPrefix handed jetcd a bare Consumer<WatchResponse>, so a terminal watch error (e.g. after a transport reconnect) was swallowed and the JVM-global schema-cache-clear listener died silently: a node stopped receiving cross-node cache-clear events with no error or warning. Switch to the Watch.Listener overload and re-subscribe on onError/onCompleted via a daemon-backed backoff, mirroring the self-heal PdMetaDriver already gets from KvClient. The driver watch now stays live across reconnects, so CachedSchemaTransactionV2's register-once flag staying true is correct; the unused resetMetaListenerForReconnect stopgap and its TODO are removed. Add EtcdMetaDriverTest covering re-subscribe on error/completion and event delivery, and register it in UnitTestSuite.

imbajin

Blocking issues found in the etcd watch recovery path: retryable jetcd watch errors can now create duplicate active watchers, and a failed delayed re-watch attempt can stop recovery permanently. Please fix these before merging; I also left one non-blocking test-hardening note.

Address review on apache#3062: - jetcd 0.5.9 WatcherImpl.handleError already retries recoverable errors by notifying onError and rescheduling resume() on the same watcher. Re-subscribing from onError therefore opened a duplicate watch on every transient reconnect. Re-subscribe now happens only from onCompleted, jetcd's terminal-close signal, where the old watcher is already removed; onError only logs. - Guard the scheduled re-watch: if the re-subscribe itself throws (endpoint still unreachable), it is retried with the same backoff instead of abandoning recovery after a single failure. - Strengthen tests: assert onError does not re-watch, assert the re-created prefix watch preserves the key and prefix WatchOption, and cover a re-watch that throws once then succeeds.

imbajin

Blocking: no. Summary: No obvious issues found in the current head. Evidence: git diff --check origin/master...HEAD; EtcdMetaDriverTest passed 5 tests.

Non-blocking follow-ups:

Please update the PR description to match the latest implementation: onError now only logs because jetcd retries recoverable errors itself, while re-watch is scheduled from terminal onCompleted; the test count is now 5, not 4.
Consider reducing recoverable watch onError logging from WARN with stack trace to INFO/DEBUG or a rate-limited log, since jetcd is expected to self-retry those transient failures.
If HugeGraph treats public core methods as a compatibility surface, consider keeping CachedSchemaTransactionV2.resetMetaListenerForReconnect() as a deprecated no-op or documented fallback. Given it was an uncalled stopgap that did not ship in a tagged release, I do not consider this blocking.

dpol1 added 2 commits June 22, 2026 18:35

Merge branch 'master' into fix/3036-meta-listener-reconnect

8c19b62

dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. bug Something isn't working tests Add or improve test cases labels Jun 22, 2026

This comment was marked as low quality.

Sign in to view

dosubot Bot added the lgtm This PR has been approved by a maintainer label Jun 22, 2026

imbajin requested changes Jun 23, 2026

View reviewed changes

dosubot Bot removed the lgtm This PR has been approved by a maintainer label Jun 23, 2026

dpol1 requested a review from imbajin June 23, 2026 07:50

imbajin approved these changes Jun 23, 2026

View reviewed changes

dosubot Bot added the lgtm This PR has been approved by a maintainer label Jun 23, 2026

imbajin merged commit 398a5ed into apache:master Jun 23, 2026
15 of 16 checks passed

dpol1 deleted the fix/3036-meta-listener-reconnect branch June 23, 2026 11:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(core): self-heal etcd meta watch on transport reconnect#3062

fix(core): self-heal etcd meta watch on transport reconnect#3062
imbajin merged 3 commits into
apache:masterfrom
dpol1:fix/3036-meta-listener-reconnect

dpol1 commented Jun 22, 2026 •

edited

Loading

Uh oh!

This comment was marked as low quality.

Uh oh!

imbajin left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

imbajin left a comment •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

dpol1 commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose of the PR

Main Changes

Verifying these changes

Does this PR potentially affect the following parts?

Documentation Status

Uh oh!

This comment was marked as low quality.

Uh oh!

imbajin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

imbajin left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dpol1 commented Jun 22, 2026 •

edited

Loading

imbajin left a comment •

edited

Loading