Skip to content

HSJ: make scoring invariant to phyDat level ordering (T-307 primary + T-308 secondary)#250

Merged
ms609 merged 7 commits into
cpp-searchfrom
feature/hsj-absent-state
Jun 15, 2026
Merged

HSJ: make scoring invariant to phyDat level ordering (T-307 primary + T-308 secondary)#250
ms609 merged 7 commits into
cpp-searchfrom
feature/hsj-absent-state

Conversation

@ms609

@ms609 ms609 commented Jun 15, 2026

Copy link
Copy Markdown
Owner

@-

ms609 and others added 7 commits June 15, 2026 08:12
The driven HSJ pipeline hard-coded absent_state = 0L, which for the usual
levels c("-","0","1") is the index of the inapplicable token "-", not the
primary's "absent" state "0" (index 1). Primaries coded "0" were therefore
treated as PRESENT, so gain/loss of the structure was never counted; the
HSJ score was insensitive to primary presence/absence (verified: alpha=0
single-absent-tip scored 0 instead of 1).

- Add hsj_absent_state(dataset): 0-based index of the "0" token in `levels`
  (falls back to "-"), so the value tracks the dataset's level ordering
  instead of being hard-coded. build_tip_labels is level-order generic, so a
  constant was fragile regardless of which constant.
- Use it at every driven call site: MaximizeParsimony, Morphy (resample),
  TreeLength.phylo and TreeLength.list. Reconcile the test helper to compute
  it the same way (was hard-coded 1L).
- C++ score_hierarchy_block: treat the primary as absent when its label is
  the absent_state OR the inapplicable token, mirroring recode_hierarchy's
  `pri == "0" || pri == "-"` and fixing nested hierarchies where a primary
  may itself be inapplicable.
- Regression tests: alpha=0 primary sensitivity, driven-vs-direct agreement,
  hsj_absent_state across level orderings, primary-scoring level invariance.

Separate from T-306 (PR #249); discovered during that work.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
agent-check was failing on cpp-search before this branch: the spelling
test errored on DropAdd, FarFirst, Fathi, MMDP, MaxMin, Porumbel, Resende,
Sayyady, WideSample, medoid, pkg — author surnames, algorithm acronyms and
identifiers from recent WideSample work that were never added to the
wordlist. Added via spelling::update_wordlist(); restored config/maximin/
warmup that the tool would otherwise have pruned. Unrelated to the HSJ fix
but required to get this branch's CI green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
R/WideSample.R conditionally uses highs via requireNamespace() and in
\dontrun/guarded examples, but it was never declared, so R CMD check
--as-cran (windows job, error-on="warning") failed with "'library' or
'require' call not declared from: 'highs'". Pre-existing on cpp-search
from recent WideSample work; required to get this branch's CI green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
fitch_label_char()'s uppass resolved ambiguous internal nodes to the
LOWEST SET BIT. Which token that bit denotes depends on the phyDat
`levels` ordering, so differently-ordered levels produced different
secondary reconstructions, different per-branch mismatch counts d, and
therefore different scores (the repro dataset scored 2.5 vs 2.0 at
hsj_alpha=1). A parsimony-style score must not depend on the arbitrary
internal ordering of levels.

Resolve instead toward the best-supported token in the node's own
subtree (max tip count, ties broken by smallest supporting tip index).
Both keys are properties of the tokens and the tree, not of the bit
encoding, giving a strict, deterministic, order-independent resolution
that is still a valid MPR — so the dissimilarity stays non-zero (the
concern that motivated adding the uppass in T-134).

Only the primary term was order-safe before (T-307); this closes the
secondary term. Verified invariant across all level permutations on the
repro plus a 1200-case randomised battery (multistate secondaries, "?"
missing data, random trees, alpha in {0,0.5,1}). Existing HSJ value
assertions are unchanged. Regression test extended to hsj_alpha in
{0.5, 1} across all orderings, plus a multistate-secondary case.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ent-state

# Conflicts:
#	tests/testthat/test-ts-hsj.R
`hsj_absent_state` (added with T-307) breached the package naming
conventions: R functions are BigCamelCase and private helpers are
dot-prefixed (r-conventions). It was also @export-ed with a public man
page despite being internal C++-bridge plumbing that returns a 0-based
token index.

Rename to `.HSJAbsentState`, drop @export and the man page, and document
it with a plain comment like the package's other private helpers
(.ParseOneBlock, .HierarchicalResampleWeights). Call sites in
MaximizeParsimony.R, Morphy.R, tree_length.R and the test bridge updated.

No behaviour change; the function is still reachable via ::: for tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The whole hierarchical-scoring module was written in snake_case, breaching
the package convention (functions BigCamelCase; private functions dot-
prefixed; variables camelCase).  Following the .HSJAbsentState fix, sweep
the rest of the module:

Exported helpers -> BigCamelCase (snake_case names removed):
  recode_hierarchy   -> RecodeHierarchy        (public)
  hierarchy_from_names -> HierarchyFromNames    (public)
  validate_hierarchy -> ValidateHierarchy       (exported-internal)
  hierarchy_chars    -> HierarchyChars          (exported-internal)
  hierarchy_controlling -> HierarchyControlling (exported-internal)

Pure C++-bridge plumbing -> private, unexported (dot-prefixed), documented
with plain comments like the package's other private helpers:
  build_tip_labels   -> .BuildTipLabels
  hierarchy_to_blocks -> .HierarchyToBlocks
  non_hierarchy_weights -> .NonHierarchyWeights
  (.HSJAbsentState already done)

Also: nested helpers (.recode_block -> .RecodeBlock, .flatten_block ->
.FlattenBlock); the public `char_names` arg -> `charNames`; and local
variables in the dedicated module files (CharacterHierarchy.R,
recode_hierarchy.R) camelCased.  The internal resample-weights list fields
were renamed to camelCase too (`non_hierarchy_weights` -> `nonHierarchyWeights`,
`block_counts` -> `blockCounts`), updated in producer, consumer and tests.

Call sites updated across R/, tests/, the vignette and the Shiny app; man
pages and NAMESPACE regenerated via roxygen2.  NEWS records the breaking
rename.  All hierarchy/HSJ/xform/recode/resample/tree_length tests pass;
spelling clean.

Local variables in the large shared scoring functions (tree_length.R,
Morphy.R, MaximizeParsimony.R) keep their existing `tip_data`/`adj_weight`
style — those bridge locals are pervasive across non-hierarchy branches too,
so renaming them belongs to a separate cleanup.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ms609 ms609 merged commit 317c1e4 into cpp-search Jun 15, 2026
5 of 9 checks passed
@ms609 ms609 deleted the feature/hsj-absent-state branch June 15, 2026 11:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant