Background
The Integrate test job for HugeGraph Computer can take a long time even though the integration suite is small. The slow part is usually not the number of tests but a long wait in the message/input synchronization path.
An observed CI log repeatedly prints:
EtcdClient - Wait for keys with prefix 'BSP_WORKER_INPUT_DONE' and timeout 86400000ms, expect 1 keys but actual got 0 keys
The same log shows the worker entering the input step and starting to send vertex messages before the wait:
WorkerService inputstep started
MessageSendManager - Start sending message(type=VERTEX)
So the master is waiting for the worker's BSP_WORKER_INPUT_DONE signal, but the worker has not reached Bsp4Worker.workerInputDone() yet.
Initial code pointers
CI runs mvn test -P integrate-test -ntp in .github/workflows/computer-ci.yml.
The integrate-test profile includes IntegrateTestSuite, which currently contains SenderIntegrateTest.
SenderIntegrateTest has only a few cases, but testOneWorkerWithBusyClient() intentionally slows the send path by wrapping the client's send function with Thread.sleep(100).
WorkerInputManager.loadGraph() sends vertices and edges first. Only after it returns does WorkerService.inputstep() call bsp4Worker.workerInputDone().
ComputerOptions.BSP_WAIT_WORKERS_TIMEOUT and BSP_WAIT_MASTER_TIMEOUT default to 24 hours, so a hidden sender/session/input problem can become a very slow CI wait instead of a fast, actionable failure.
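To make the effect of the barrier timeout concrete, here is a minimal toy model of the ordering described above. It is not HugeGraph Computer code: InputBarrierSketch and masterSeesInputDone are hypothetical names, a CountDownLatch stands in for the etcd key wait, and Thread.sleep stands in for loadGraph(). It only illustrates that the master's wait is bounded by the configured timeout, so a short timeout turns a stalled load into a quick, visible failure.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Toy model of the input barrier; hypothetical, not project code. */
final class InputBarrierSketch {

    /**
     * Starts a simulated worker that "loads" for loadMillis and then signals,
     * while the caller plays the master and waits at most waitMillis.
     * Returns true only if the signal arrived before the timeout.
     */
    static boolean masterSeesInputDone(long loadMillis, long waitMillis) {
        CountDownLatch inputDone = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(loadMillis);   // stands in for loadGraph()
                inputDone.countDown();      // stands in for workerInputDone()
            } catch (InterruptedException e) {
                // An interrupted worker never signals the barrier.
                Thread.currentThread().interrupt();
            }
        }, "sketch-worker");
        worker.setDaemon(true);
        worker.start();
        try {
            // The master-side wait: bounded by the timeout, like the BSP wait.
            return inputDone.await(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            worker.interrupt();             // stop the worker if still loading
        }
    }
}
```

With a 24-hour waitMillis the second case below would simply hang for as long as the load is stuck; with a small value it returns false almost immediately, which is the behavior the integration tests want.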
Related prior symptom: #203 reported "The origin future must be null" in SenderIntegrateTest. That may be in the same control-message/future/session area, but this task is specifically about the slow CI wait and the fail-fast behavior and debuggability of the integration test.
Suggested investigation
Reproduce the integration suite with etcd available:
cd computer
mvn test -P integrate-test -Dtest=IntegrateTestSuite -ntp
Confirm which test case spends time before BSP_WORKER_INPUT_DONE. Start with SenderIntegrateTest#testOneWorkerWithBusyClient.
Check whether START/FINISH control futures in QueuedMessageSender can be left stale, completed late, or hidden behind the sender thread. The stack trace around futureRef in the old issue #203 ("[Bug] The origin future must be null") is a useful clue.
Make the test fail fast and print useful diagnostics. Possible directions:
set much smaller bsp.wait_workers_timeout / bsp.wait_master_timeout for integration tests;
add a JUnit/test-level timeout around each integration case;
dump worker/master thread states when the input barrier is not reached;
ensure sender exceptions propagate to both the worker future and the master-side wait;
replace the sleep-based busy-client simulation with a more deterministic back-pressure or blocked-client fixture.
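Two of the directions above (a test-level timeout and a thread dump when the barrier is not reached) can be combined in one small helper. The following is a rough sketch using only the JDK; FailFastSketch, runWithTimeout and dumpThreads are hypothetical names, and how this would be wired into SenderIntegrateTest is left open. JUnit's own timeout support could replace the executor part.

```java
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical test helper: bounded run plus thread dump on timeout. */
final class FailFastSketch {

    /** Formats the name, state and stack of every live thread. */
    static String dumpThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e :
             Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append('"').append(t.getName()).append("\" state=")
              .append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    /**
     * Runs body on a separate daemon thread. If it has not finished within
     * timeoutMillis, throws an AssertionError that carries a thread dump.
     */
    static void runWithTimeout(Runnable body, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "sketch-test-body");
            t.setDaemon(true);
            return t;
        });
        Future<?> future = executor.submit(body);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Capture the dump before shutdownNow() interrupts the body.
            String dump = dumpThreads();
            throw new AssertionError(
                "Not finished after " + timeoutMillis + "ms\n" + dump);
        } catch (ExecutionException e) {
            throw new AssertionError("Test body failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new AssertionError("Interrupted while waiting", e);
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A test case would wrap its body as FailFastSketch.runWithTimeout(() -> { ... }, 60_000), so that a stuck sender shows up as one failure with the stacks of the worker, master and sender threads instead of a long run of wait logs.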
Expected result
Integration tests should not spend many minutes printing only BSP_WORKER_INPUT_DONE wait logs.
If the sender/input path is broken, the test should fail quickly with an actionable error and enough thread/session state to locate the failing component.
The slow/busy-client path should have regression coverage so future changes do not reintroduce the long wait.
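For the regression coverage, one option is a client fixture that blocks deterministically instead of sleeping. The sketch below assumes the send function can be modeled as a java.util.function.Consumer, which may not match the real type wrapped in testOneWorkerWithBusyClient(); GatedSender and its methods are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical fixture: a send function that blocks until released. */
final class GatedSender<T> implements Consumer<T> {

    private final Consumer<T> delegate;
    private final CountDownLatch gate = new CountDownLatch(1);
    private final CountDownLatch entered = new CountDownLatch(1);
    private final AtomicInteger delivered = new AtomicInteger();

    GatedSender(Consumer<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void accept(T message) {
        entered.countDown();            // a caller has reached the gate
        try {
            gate.await();               // hold the send until release()
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while blocked", e);
        }
        delegate.accept(message);
        delivered.incrementAndGet();
    }

    /** Waits until some caller has entered accept() and reached the gate. */
    boolean awaitBlocked(long timeoutMillis) {
        try {
            return entered.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Opens the gate; all current and future sends proceed. */
    void release() {
        gate.countDown();
    }

    /** Number of messages passed on to the delegate so far. */
    int delivered() {
        return delivered.get();
    }
}
```

A test can then assert the state while the client is busy (nothing delivered, input barrier not yet signaled), call release(), and assert that the input step completes within a short bound. Unlike Thread.sleep(100), the outcome does not depend on how fast the CI machine is.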
Newcomer scope
This is a good newcomer task because the suspected area is narrow: one integration suite, the input-step barrier, and the message sender control future path. A complete fix does not need a large algorithm or distributed-runtime redesign; first improving timeout/diagnostics and then isolating the sender/session condition would already be valuable.