Skip to content

Local Aware IBM#1378

Merged
sbryngelson merged 112 commits into
MFlowCode:masterfrom
danieljvickers:local-aware-ibm
Jun 1, 2026
Merged

Local Aware IBM#1378
sbryngelson merged 112 commits into
MFlowCode:masterfrom
danieljvickers:local-aware-ibm

Conversation

@danieljvickers
Copy link
Copy Markdown
Member

@danieljvickers danieljvickers commented Apr 23, 2026

Description

The current code has an issue scaling much past 10k particles due to limitations in the MPIAllReduceSum in the IB force computation. This PR attempts to alleviate this by limiting the number of IBs any given rank can be aware of to its neighbors. This turns the AllReduce compute to a MPI neighbor computation, removing the communication bottlneck. To support this, a massive overhaul of IB ownership between ranks was required.

Type of change

  • Refactor

Testing

TBD

Checklist

  • I added or updated tests for new behavior

AI code reviews

Reviews are not triggered automatically. To request a review, comment on the PR:

  • @coderabbitai review — incremental review (new changes only)
  • @coderabbitai full review — full review from scratch
  • /review — Qodo review
  • /improve — Qodo code suggestions
  • @claude full review — Claude full review (also triggers on PR open/reopen/ready)
  • Add label claude-full-review — Claude full review via label

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 23, 2026

Claude Code Review

Head SHA: 9879df3

Files changed:

  • 21
  • src/simulation/m_ibm.fpp
  • src/simulation/m_start_up.fpp
  • src/simulation/m_collisions.fpp
  • src/simulation/m_global_parameters.fpp
  • src/common/m_constants.fpp
  • src/common/m_derived_types.fpp
  • src/common/m_mpi_common.fpp
  • src/post_process/m_data_output.fpp
  • src/simulation/m_ib_patches.fpp
  • src/simulation/m_time_steppers.fpp

Findings:

Critical: Stack overflow in s_reduce_ib_patch_array (src/simulation/m_start_up.fpp)

subroutine s_reduce_ib_patch_array()
    type(ib_patch_parameters), dimension(num_ib_patches_max) :: patch_ib_gbl

patch_ib_gbl is a local (stack-allocated) variable dimensioned by num_ib_patches_max, which was just raised from 50,000 to 2,050,000 in src/common/m_constants.fpp. ib_patch_parameters contains a character(LEN=pathlen_max) field (pathlen_max=400) plus ~500 bytes of numeric fields — roughly 900 bytes per element. At 2,050,000 elements this is ~1.8 GB on the call stack, which will segfault on every platform (typical stack limits are 8–64 MB). This variable must be heap-allocated (allocate/deallocate).


High: Debug print left in production code (src/simulation/m_ibm.fpp)

print *, proc_rank, " New Owner ", patch_ib(k)%gbl_patch_id  ! TODO :: REMOVE THIS DEBUG PRINT

This fires on every ownership transfer for every locally-tracked IB, generating unbounded output at scale. The ! TODO :: REMOVE THIS DEBUG PRINT marker confirms it is unintentional.


Medium: Commented-out code in m_collisions.fpp leaves pid2 lookup absent (src/simulation/m_collisions.fpp)

! call s_get_neighborhood_idx(pid1, pid1) ! global patch ID -> local index call s_get_neighborhood_idx(pid2, pid2)
if (pid1 <= 0 .or. pid2 <= 0) cycle

The comment text contains call s_get_neighborhood_idx(pid2, pid2) — what appears to be a second intended lookup that was accidentally folded into the comment instead of being left as executable code. Neither lookup actually executes: pid1 and pid2 are the raw decoded global IDs from s_decode_patch_periodicity, not local indices. The guard if (pid1 <= 0 .or. pid2 <= 0) cycle would never trigger for valid global IDs (which are ≥ 1), meaning the subsequent patch_ib(pid1) and patch_ib(pid2) accesses use global IDs as local array indices, which is out-of-bounds when num_ibs < num_gbl_ibs. If this is intentional (global IDs equal local indices in the no-MPI / single-rank path), that should be documented; otherwise both lookups need to be uncommented.


Low: GPU_UPDATE(device=[patch_ib(ib_idx)%moment]) removed without replacement (src/simulation/m_ibm.fpp)

-            patch_ib(ib_marker)%moment = moment*patch_ib(ib_marker)%mass/(count*cell_volume)
-            $:GPU_UPDATE(device='[patch_ib(ib_marker)%moment]')
+            patch_ib(ib_idx)%moment = moment*patch_ib(ib_idx)%mass/(count*cell_volume)

The per-field device update was removed. If patch_ib(i)%moment is consumed on the GPU before the next full GPU_UPDATE(device=[patch_ib(1:num_ibs)]) (e.g., in a subsequent call to s_ibm_correct_state within the same time step), the device will use a stale value. Verify that a covering full-struct update always precedes any GPU use of %moment after s_compute_moment_of_inertia is called from s_propagate_immersed_boundaries.

@danieljvickers
Copy link
Copy Markdown
Member Author

Addressing AI comments:

All of the critical issues are incorrect for one reason or another. I have gone through and verified that the relevant chunks of code are in place to address the concerns it has.

As for the neighbor boundaries, they are never called in a GPU region, so this comment is mute. I have removed the GPU routine call to help with compilation in response.

The file per process F claim is just false, and I tested that the code runs find without it. But it is true that we do not need other ranks to be doing any of this calculation regardless. I extended the rank 0 check in response.

nreqs is in fact reset between loop iterations.

Snapshot arrays are in fact reset between loop iterations

It is confused about the need to GPU allocate arrays for patch_ib. Because the array is used with a declare, but never actually allocated on the GPU, we just have the GPU pointer. When we perform the first copy to the GPU to allocate the array size, we have already done the reshape. This means that the GPU array is properly sized. This is important because the size of the patch_ib array is significant in the new code and could causes issues with limiting problem size because the GPU must be able to hold the global patch_ib array, even though it would then immediately be deallocated and reallocated to a smaller array. If the number of ib patches grows to the extent that it costs many for GB of memory, this could cause memory limitations. The current implementation should be the correct way to go.

the moving IBM flag should not be recomputed. You want it to be the same on all ranks. There is argument that it should be moved to the case level for case optimizaiton later, but that is a future PR.

Force arrays getting moved is going to be handled in a future PR that already has a github issue I created yesterday.

Prohibit for number of local IBs resolved.

In summary, two of these I made minor adjustments for. One I reject for reasons that I believe are valid and better for long-term code support. All others are incorrect.

@danieljvickers
Copy link
Copy Markdown
Member Author

danieljvickers commented May 27, 2026

General Note: Mac errors are on sections of the code that were not touched by this PR (bubbles), and it is a random dyl linker error. I get the same error on reldebug on a chemsitry case. I get the same error locally, but on a different set of tests. This seems like a MacOS instability issue, and not related to the particular PR. Going to attempt to rerun. If ti fails again, but on different tests, then it looks like the MacOS test suite checks may be broken.

Edit: Rerunning the job changed the jobs which failed. This appears to be a spurious failure for MacOS specifically. It appears to be specific to the branch, as other branch are not experiencing it. Will investigate.

@sbryngelson sbryngelson added the enhancement New feature or request label May 31, 2026
Daniel Vickers added 2 commits May 31, 2026 17:17
@sbryngelson sbryngelson force-pushed the local-aware-ibm branch 2 times, most recently from 38b786d to 0c56440 Compare June 1, 2026 07:08
SVS87 added a commit to SVS87/MFC that referenced this pull request Jun 1, 2026
Add matching @:DEALLOCATE calls in:
- m_derived_variables
- m_global_parameters
- m_hyperelastic
- m_rhs
- m_riemann_solvers
- m_time_steppers

Remaining files not yet addressed:
- src/common/m_boundary_common.fpp, m_helper.fpp, m_model.fpp,
  m_mpi_common.fpp, m_variables_conversion.fpp
- src/post_process/m_start_up.fpp
- src/simulation/m_acoustic_src.fpp, m_bubbles_EE.fpp,
  m_ibm.fpp (pending PR MFlowCode#1378), m_qbmm.fpp

Part of MFlowCode#1459
@danieljvickers
Copy link
Copy Markdown
Member Author

@sbryngelson review?

@sbryngelson sbryngelson merged commit 08d12f8 into MFlowCode:master Jun 1, 2026
81 checks passed
@danieljvickers danieljvickers mentioned this pull request Jun 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

2 participants