fix(core): self-heal etcd meta watch on transport reconnect #3062
Merged
EtcdMetaDriver.listen/listenPrefix handed jetcd a bare Consumer<WatchResponse>, so a terminal watch error (e.g. after a transport reconnect) was swallowed and the JVM-global schema-cache-clear listener died silently: a node stopped receiving cross-node cache-clear events with no error or warning.

Switch to the Watch.Listener overload and re-subscribe on onError/onCompleted via a daemon-backed backoff, mirroring the self-heal PdMetaDriver already gets from KvClient. The driver watch now stays live across reconnects, so CachedSchemaTransactionV2's register-once flag staying true is correct; the unused resetMetaListenerForReconnect stopgap and its TODO are removed.

Add EtcdMetaDriverTest covering re-subscribe on error/completion and event delivery, and register it in UnitTestSuite.
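The silent failure described in this commit can be illustrated with a minimal sketch. It does not use jetcd: `Listener`, `fromConsumer`, and the class name are hypothetical stand-ins, and the adapter only assumes what the commit message states, namely that the consumer-only overload drops the error and completion signals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class SwallowedWatchSignals {

    /** Hypothetical stand-in for a watch listener with three callbacks (not jetcd's type). */
    interface Listener<T> {
        void onNext(T event);
        void onError(Throwable error);
        void onCompleted();
    }

    /** Wraps a bare consumer: events are forwarded, terminal signals are dropped. */
    static <T> Listener<T> fromConsumer(Consumer<T> consumer) {
        return new Listener<T>() {
            @Override
            public void onNext(T event) {
                consumer.accept(event);
            }

            @Override
            public void onError(Throwable error) {
                // Discarded: the caller is never told that the watch failed.
            }

            @Override
            public void onCompleted() {
                // Discarded: the caller is never told that the watch has ended.
            }
        };
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        Listener<String> listener = fromConsumer(seen::add);

        listener.onNext("schema-cache-clear");
        listener.onError(new IllegalStateException("stream reset"));
        listener.onCompleted();

        // Only the event is visible to the caller; both terminal signals vanished.
        System.out.println("events seen by the consumer: " + seen);
    }
}
```

With this shape, the caller has no hook from which to log or re-subscribe, which is consistent with the node going quiet with no error or warning.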
imbajin
requested changes
Jun 23, 2026
imbajin
left a comment
Member
Blocking issues found in the etcd watch recovery path: retryable jetcd watch errors can now create duplicate active watchers, and a failed delayed re-watch attempt can stop recovery permanently. Please fix these before merging; I also left one non-blocking test-hardening note.
Address review on apache#3062:
- jetcd 0.5.9 WatcherImpl.handleError already retries recoverable errors by notifying onError and rescheduling resume() on the same watcher. Re-subscribing from onError therefore opened a duplicate watch on every transient reconnect. Re-subscribe now happens only from onCompleted, jetcd's terminal-close signal, where the old watcher is already removed; onError only logs.
- Guard the scheduled re-watch: if the re-subscribe itself throws (endpoint still unreachable), it is retried with the same backoff instead of abandoning recovery after a single failure.
- Strengthen tests: assert onError does not re-watch, assert the re-created prefix watch preserves the key and prefix WatchOption, and cover a re-watch that throws once then succeeds.
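The recovery path described in this commit can be sketched as follows. This is not the EtcdMetaDriver code and does not call jetcd: `Listener`, `WatchClient`, `FlakyClient`, `SelfHealingWatch`, the thread name, and the millisecond backoff parameter are all hypothetical names chosen for illustration. The sketch only assumes the behaviour stated above: onError logs, onCompleted schedules a re-watch on a daemon thread, and a re-watch that throws is retried with the same backoff.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

public class SelfHealingWatch {

    /** Hypothetical stand-in for a three-callback watch listener (not jetcd's type). */
    interface Listener {
        void onNext(String event);
        void onError(Throwable error);
        void onCompleted();
    }

    /** Hypothetical stand-in for the watch client (not jetcd's API). */
    interface WatchClient {
        void watch(String key, Listener listener);
    }

    private final WatchClient client;
    private final long backoffMillis;
    // Daemon thread, so a pending re-watch never keeps the JVM alive.
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "meta-rewatch");
                t.setDaemon(true);
                return t;
            });

    SelfHealingWatch(WatchClient client, long backoffMillis) {
        this.client = client;
        this.backoffMillis = backoffMillis;
    }

    void listen(String key, Consumer<String> consumer) {
        client.watch(key, new Listener() {
            @Override
            public void onNext(String event) {
                consumer.accept(event);
            }

            @Override
            public void onError(Throwable error) {
                // Assumed to be retried by the client on the same watcher: log only.
                System.err.println("watch error on " + key + ": " + error);
            }

            @Override
            public void onCompleted() {
                // Terminal close: the old watch will not recover, so replace it.
                scheduleRewatch(key, consumer);
            }
        });
    }

    private void scheduleRewatch(String key, Consumer<String> consumer) {
        scheduler.schedule(() -> {
            try {
                listen(key, consumer);
            } catch (RuntimeException e) {
                // Guard: a failed attempt is rescheduled instead of ending recovery.
                System.err.println("re-watch failed on " + key + ", retrying: " + e);
                scheduleRewatch(key, consumer);
            }
        }, backoffMillis, TimeUnit.MILLISECONDS);
    }

    /** Fake client whose second watch attempt throws, to exercise the guard. */
    static final class FlakyClient implements WatchClient {
        final AtomicInteger attempts = new AtomicInteger();
        final CountDownLatch recovered = new CountDownLatch(1);
        volatile Listener last;

        @Override
        public void watch(String key, Listener listener) {
            int n = attempts.incrementAndGet();
            if (n == 2) {
                throw new IllegalStateException("endpoint unreachable");
            }
            last = listener;
            if (n == 3) {
                recovered.countDown();
            }
        }

        boolean awaitRecovered(long timeoutMillis) {
            try {
                return recovered.await(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        FlakyClient client = new FlakyClient();
        SelfHealingWatch driver = new SelfHealingWatch(client, 10);
        driver.listen("schema/", e -> System.out.println("event: " + e));

        client.last.onError(new RuntimeException("transient"));
        System.out.println("attempts after onError: " + client.attempts.get());

        client.last.onCompleted();
        System.out.println("recovered: " + client.awaitRecovered(5000)
                + ", attempts: " + client.attempts.get());
    }
}
```

In this sketch the watch count stays at one after onError, and after onCompleted the fake client sees a failed second attempt followed by a successful third, which corresponds to the "throws once then succeeds" case named in the test bullet.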
imbajin
approved these changes
Jun 23, 2026
Member
Blocking: no. Summary: No obvious issues found in the current head. Evidence: git diff --check origin/master...HEAD; EtcdMetaDriverTest passed 5 tests.
Non-blocking follow-ups:
- Please update the PR description to match the latest implementation: onError now only logs because jetcd retries recoverable errors itself, while re-watch is scheduled from terminal onCompleted; the test count is now 5, not 4.
- Consider reducing recoverable watch onError logging from WARN with stack trace to INFO/DEBUG or a rate-limited log, since jetcd is expected to self-retry those transient failures.
- If HugeGraph treats public core methods as a compatibility surface, consider keeping CachedSchemaTransactionV2.resetMetaListenerForReconnect() as a deprecated no-op or documented fallback. Given it was an uncalled stopgap that did not ship in a tagged release, I do not consider this blocking.
Purpose of the PR
CachedSchemaTransactionV2 registers a JVM-global watch on the meta store so that a schema change on one node clears the schema cache on the other nodes. That watch goes through EtcdMetaDriver.listen/listenPrefix, which handed jetcd the bare Consumer<WatchResponse> overload. That overload discards onError and onCompleted, so when jetcd ends a watch (e.g. after a transport reconnect tears down the gRPC stream) the listener stopped receiving events with no log line and no exception. The node kept serving stale schema and nothing reported it. PdMetaDriver does not have this problem because its KvClient already re-subscribes on error.

Main Changes
- EtcdMetaDriver.listen/listenPrefix now register a Watch.Listener instead of the bare Consumer overload, surfacing both onError and onCompleted.
- onError only logs at WARN level. jetcd 0.5.9 (WatchImpl.WatcherImpl.handleError) already retries recoverable errors internally by notifying onError and rescheduling resume() on the same watcher; re-subscribing from onError would therefore open a duplicate watch on every transient reconnect.
- onCompleted schedules a re-subscribe after a 1 s backoff on a daemon thread. onCompleted is jetcd's terminal-close signal: the old watcher is already removed and will not recover, so a replacement watch is safe and necessary. This mirrors the self-heal PdMetaDriver already gets from KvClient.
- Removed CachedSchemaTransactionV2.resetMetaListenerForReconnect(). It was a manual stopgap added in fix(server): sync hstore schema cache clears #3011 with no callers, and it never shipped in a tagged release. Now that the driver self-heals, the JVM-global register-once flag is correct to stay set; the manual reset has nothing left to do. The lifecycle comment on metaEventListenerRegistered is updated to describe the self-heal behaviour.
- The MetaDriver interface is unchanged, so neither implementor needs edits beyond EtcdMetaDriver.

Known limitation: the re-subscribe opens a fresh watch without a stored revision, so any cache-clear events emitted during the short reconnect window are not replayed. This is the same behaviour as PdMetaDriver/KvClient. The change removes the permanent silent failure; it does not add gap replay.

Verifying these changes
EtcdMetaDriverTest (JUnit + Mockito) mocks the jetcd Client/Watch via the package-private test constructor (no live etcd needed) and asserts:
- onCompleted triggers a re-subscribe (a second watch(...) call).
- onError does not trigger a re-subscribe (only logs), so no duplicate watch is opened.
- The re-created prefix watch preserves the key and the isPrefix WatchOption.
- onNext still delivers events to the consumer.

Test results: EtcdMetaDriverTest 5/5, CachedSchemaTransactionTest 18/18, MetaManagerSchemaCacheClearEventTest 6/6 after the stopgap removal.

Removing the re-subscribe from onCompleted makes the recovery tests fail; restoring it makes them pass.

Does this PR potentially affect the following parts?
Documentation Status
- Doc - TODO
- Doc - Done
- Doc - No Need