Commpath #182

Merged
mplegendre merged 55 commits into llnl:devel from rountree:commpath
Jun 5, 2026
Conversation

@rountree

Collaborator

Passed all tests on my clone. Let's see what it does over here.

Comment thread src/client/beboot/spindle_bootstrap.c Outdated
Comment thread src/client/client/client.c Outdated
Comment thread src/client/client/client.c Outdated
Comment thread src/client/client/intercept_readlink.c Outdated
Comment thread src/client/client/should_intercept.c Outdated
Comment thread src/client/client_comlib/client_api.c Outdated
Comment thread src/cobo/cobo.c Outdated
Comment thread src/cobo/cobo.c
Comment thread src/server/auditserver/ldcs_audit_server_handlers.c
Comment thread src/utils/parseloc.c Outdated
Previously,

chosen_realized_cachepath was copied into
set_intercept_readlink_cachepath()

chosen_realized_cachepath and chosen_parsed_cachepath were copied
into set_should_intercept_cachepath()

This PR removes both setter functions and makes the original
pointers global.
Removes chosen_cachepath and cachepath_bitindex from
  spindle_launch.h

Updates initialization of matching variables in ldcs_process_data.

determineValidCachePaths() moved from spindle_be.cc to
  ldcs_audit_server_process.c to get ldcs_process_data visibility.

Added #include "parseloc.h" to ldcs_audit_server_process.c to get
  declaration of determineValidCachePaths().

Relocated "parseloc.h" to src/utils so ldcs_audit_server_process.c
  could find it.

Trued up signedness mismatches exposed by making "parseloc.h" more
  visible, e.g., cachepath_bitidx is now uint64_t everywhere.
The three-message-reply response is now a single message with
two strings.  The symbolic version of the cachepath is no longer
communicated as it was not being used.
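
A minimal sketch of what a single reply carrying two strings could look like. The layout (two NUL-terminated strings stored back to back) and the helper names pack_two_strings() and unpack_two_strings() are assumptions made for illustration; the PR text does not describe the actual wire format.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: pack two NUL-terminated strings back to back into one
 * heap buffer. Returns the buffer (caller frees) and stores its size in *len. */
char *pack_two_strings(const char *a, const char *b, size_t *len)
{
    size_t la = strlen(a) + 1;   /* include the terminating NUL */
    size_t lb = strlen(b) + 1;
    char *buf = malloc(la + lb);
    if (!buf)
        return NULL;
    memcpy(buf, a, la);
    memcpy(buf + la, b, lb);
    *len = la + lb;
    return buf;
}

/* Hypothetical helper: recover pointers to both strings inside buf.
 * Returns 0 on success, -1 if buf does not hold two terminated strings. */
int unpack_two_strings(const char *buf, size_t len,
                       const char **a, const char **b)
{
    const char *end_a = memchr(buf, '\0', len);
    if (!end_a)
        return -1;
    size_t off = (size_t)(end_a - buf) + 1;
    if (off >= len)
        return -1;
    if (!memchr(buf + off, '\0', len - off))
        return -1;
    *a = buf;
    *b = buf + off;
    return 0;
}
```

Sending one buffer means the receiver handles one message instead of tracking three partial replies, at the cost of validating the buffer before using either string.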
New name is ldcs_audit_server_md_allreduce_AND().

If we get to the point where we're using other allreduce operations
we can solve the problem of duplicating the op list in md-land and
cobo-land.  For now, we're only using one op in md-land, so the
op can go into the function name.
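
For readers unfamiliar with the operation, here is a local sketch of an AND reduction over uint64_t bitmasks. It assumes each server contributes a mask with one bit per candidate cachepath, which is a guess based on the variable names in this PR. The functions and_reduce() and lowest_common_index() are hypothetical and are not the Spindle implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: combine per-server masks so that only bits set on every
 * server survive. With n == 0 this returns UINT64_MAX, the identity for AND. */
uint64_t and_reduce(const uint64_t *masks, size_t n)
{
    uint64_t acc = UINT64_MAX;
    for (size_t i = 0; i < n; i++)
        acc &= masks[i];
    return acc;
}

/* Hypothetical: index of the lowest set bit, or -1 if no bit is set.
 * Picking the lowest bit is an assumption about how a winner is chosen. */
int lowest_common_index(uint64_t mask)
{
    for (int i = 0; i < 64; i++)
        if (mask & (UINT64_C(1) << i))
            return i;
    return -1;
}
```

For example, masks 0x6, 0x7 and 0xC reduce to 0x4, so index 2 is the only candidate valid on all three servers.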
Unlikely it would ever make a difference, but this is much more
correct.
The theory is that eager clients are using an uninitialized
cachepath variable.  Delaying the consensus should make the
failure happen more often.
"sending message of type: request_location_path"
is now
"sending message of type: CHOSEN_CACHEPATH_REQUEST"
Known to affect the symbolic form of candidate cachepaths.  Not
sure that's ever being used, but it's fixed now.
_message_type_to_str() can now be used in cobo_fe_comm.c.

ldcs_audit_server_fe_broadcast() now reports message type.

Only two messages are expected to be routed through there, but
it's the correct way to report it.
Cleanup now takes both commpath and cachepath as prefixes for
removing files created by Spindle.
The original LDCS_LOCATION_MOD checked to see if there were
multiple servers running on a node and, if so, modified the
location string so that each server had its own location.

The code did not handle the case where the directory above
the requested directory was not writeable, e.g., if the user
passed in --location=/tmp, the code would try to create a
directory /tmp-00 for the first server.  That fails.

With commpath and cachepath replacing location, and with new
initialization paths, the existing code would modify only
commpath after the commpath directory had been created.

If the multiple-server case needs to be supported, commpath- and
cachepath-specific code needs to be added back in.
That configure parameter is no longer supported.

Replaced with
        --with-cachepaths=/tmp/commpath/cachepath
        --with-commpath=/tmp/commpath
Replaced assert() with return -1 in:
    src/client/beboot/spindle_bootstrap.c
    src/client/client/client.c

Removed assert() with no replacement in:
    src/client/client/client.c
    src/client/client/intercept_readlink.c
    src/client/client/should_intercept.c
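
The general shape of the assert()-to-return change, as a sketch. copy_cachepath() is a hypothetical function written for illustration and does not appear in the Spindle sources. The likely motivation, stated here as an assumption, is that client code runs inside the user's application, where an abort is worse than a reported error.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical example: copy src into dst, reporting failure to the caller
 * instead of aborting the process. Returns 0 on success, -1 on error. */
int copy_cachepath(char *dst, size_t dst_size, const char *src)
{
    /* Before: assert(dst != NULL && src != NULL); */
    if (dst == NULL || src == NULL)
        return -1;

    size_t n = strlen(src);

    /* Before: assert(n < dst_size); */
    if (n >= dst_size)
        return -1;

    memcpy(dst, src, n + 1);   /* n + 1 copies the terminating NUL */
    return 0;
}
```

Callers then have to check the return value, which is the price of not aborting.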

Created Issue llnl#187 to remove debugging code in send_cachepath_query()

send_cachepath_query() now has delay_between_retries of 0.1 seconds and max 1000 retries.
  Also returns immediately in case of network errors.
  Also uses spindle_strdup() instead of strdup().
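
A sketch of the retry behavior described above: a bounded number of attempts, a fixed delay between them, and an immediate return on a network error. All names here (query_status, query_fn, query_with_retries() and the two stubs) are hypothetical, and the real send_cachepath_query() may be structured differently. With the values from this PR the call would use 1000 attempts and a delay of 100000000 ns (0.1 seconds).

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <time.h>

enum query_status { QUERY_OK = 0, QUERY_NOT_READY = 1, QUERY_NET_ERROR = -1 };

/* Hypothetical callback type: performs one query attempt. */
typedef enum query_status (*query_fn)(void *ctx);

/* Try the query up to max_retries times, sleeping delay_ns between attempts.
 * Returns 0 on success, -1 on network error or when attempts run out. */
int query_with_retries(query_fn query, void *ctx, int max_retries, long delay_ns)
{
    struct timespec delay = { delay_ns / 1000000000L, delay_ns % 1000000000L };

    for (int attempt = 0; attempt < max_retries; attempt++) {
        enum query_status st = query(ctx);
        if (st == QUERY_OK)
            return 0;
        if (st == QUERY_NET_ERROR)
            return -1;              /* do not retry on network errors */
        nanosleep(&delay, NULL);    /* answer not ready yet: wait, then retry */
    }
    return -1;
}

/* Example stub: "not ready" until the int counter in ctx reaches zero. */
enum query_status ready_after_n(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return QUERY_NOT_READY;
    }
    return QUERY_OK;
}

/* Example stub: counts calls in ctx and always reports a network error. */
enum query_status net_error_stub(void *ctx)
{
    (*(int *)ctx)++;
    return QUERY_NET_ERROR;
}
```

With these numbers a client waits at most roughly 100 seconds before giving up, ignoring the time spent in the query itself.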

src/utils/parseloc.c
src/server/auditserver/ldcs_audit_server_handlers.c
    Removed/reclassified logging statements.
@mplegendre mplegendre merged commit 7636cfb into llnl:devel Jun 5, 2026
6 checks passed

3 participants