HSJ: make scoring invariant to phyDat level ordering (T-307 primary + T-308 secondary)#250
Merged
Merged
Conversation
The driven HSJ pipeline hard-coded absent_state = 0L, which for the usual
levels c("-","0","1") is the index of the inapplicable token "-", not the
primary's "absent" state "0" (index 1). Primaries coded "0" were therefore
treated as PRESENT, so gain/loss of the structure was never counted; the
HSJ score was insensitive to primary presence/absence (verified: alpha=0
single-absent-tip scored 0 instead of 1).
- Add hsj_absent_state(dataset): 0-based index of the "0" token in `levels`
(falls back to "-"), so the value tracks the dataset's level ordering
instead of being hard-coded. build_tip_labels is level-order generic, so a
constant was fragile regardless of which constant.
- Use it at every driven call site: MaximizeParsimony, Morphy (resample),
TreeLength.phylo and TreeLength.list. Reconcile the test helper to compute
it the same way (was hard-coded 1L).
- C++ score_hierarchy_block: treat the primary as absent when its label is
the absent_state OR the inapplicable token, mirroring recode_hierarchy's
`pri == "0" || pri == "-"` and fixing nested hierarchies where a primary
may itself be inapplicable.
- Regression tests: alpha=0 primary sensitivity, driven-vs-direct agreement,
hsj_absent_state across level orderings, primary-scoring level invariance.
Separate from T-306 (PR #249); discovered during that work.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
agent-check was failing on cpp-search before this branch: the spelling test errored on DropAdd, FarFirst, Fathi, MMDP, MaxMin, Porumbel, Resende, Sayyady, WideSample, medoid, pkg — author surnames, algorithm acronyms and identifiers from recent WideSample work that were never added to the wordlist. Added via spelling::update_wordlist(); restored config/maximin/ warmup that the tool would otherwise have pruned. Unrelated to the HSJ fix but required to get this branch's CI green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
R/WideSample.R conditionally uses highs via requireNamespace() and in \dontrun/guarded examples, but it was never declared, so R CMD check --as-cran (windows job, error-on="warning") failed with "'library' or 'require' call not declared from: 'highs'". Pre-existing on cpp-search from recent WideSample work; required to get this branch's CI green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
fitch_label_char()'s uppass resolved ambiguous internal nodes to the
LOWEST SET BIT. Which token that bit denotes depends on the phyDat
`levels` ordering, so differently-ordered levels produced different
secondary reconstructions, different per-branch mismatch counts d, and
therefore different scores (the repro dataset scored 2.5 vs 2.0 at
hsj_alpha=1). A parsimony-style score must not depend on the arbitrary
internal ordering of levels.
Resolve instead toward the best-supported token in the node's own
subtree (max tip count, ties broken by smallest supporting tip index).
Both keys are properties of the tokens and the tree, not of the bit
encoding, giving a strict, deterministic, order-independent resolution
that is still a valid MPR — so the dissimilarity stays non-zero (the
concern that motivated adding the uppass in T-134).
Only the primary term was order-safe before (T-307); this closes the
secondary term. Verified invariant across all level permutations on the
repro plus a 1200-case randomised battery (multistate secondaries, "?"
missing data, random trees, alpha in {0,0.5,1}). Existing HSJ value
assertions are unchanged. Regression test extended to hsj_alpha in
{0.5, 1} across all orderings, plus a multistate-secondary case.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ent-state # Conflicts: # tests/testthat/test-ts-hsj.R
`hsj_absent_state` (added with T-307) breached the package naming conventions: R functions are BigCamelCase and private helpers are dot-prefixed (r-conventions). It was also @export-ed with a public man page despite being internal C++-bridge plumbing that returns a 0-based token index. Rename to `.HSJAbsentState`, drop @export and the man page, and document it with a plain comment like the package's other private helpers (.ParseOneBlock, .HierarchicalResampleWeights). Call sites in MaximizeParsimony.R, Morphy.R, tree_length.R and the test bridge updated. No behaviour change; the function is still reachable via ::: for tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The whole hierarchical-scoring module was written in snake_case, breaching the package convention (functions BigCamelCase; private functions dot- prefixed; variables camelCase). Following the .HSJAbsentState fix, sweep the rest of the module: Exported helpers -> BigCamelCase (snake_case names removed): recode_hierarchy -> RecodeHierarchy (public) hierarchy_from_names -> HierarchyFromNames (public) validate_hierarchy -> ValidateHierarchy (exported-internal) hierarchy_chars -> HierarchyChars (exported-internal) hierarchy_controlling -> HierarchyControlling (exported-internal) Pure C++-bridge plumbing -> private, unexported (dot-prefixed), documented with plain comments like the package's other private helpers: build_tip_labels -> .BuildTipLabels hierarchy_to_blocks -> .HierarchyToBlocks non_hierarchy_weights -> .NonHierarchyWeights (.HSJAbsentState already done) Also: nested helpers (.recode_block -> .RecodeBlock, .flatten_block -> .FlattenBlock); the public `char_names` arg -> `charNames`; and local variables in the dedicated module files (CharacterHierarchy.R, recode_hierarchy.R) camelCased. The internal resample-weights list fields were renamed to camelCase too (`non_hierarchy_weights` -> `nonHierarchyWeights`, `block_counts` -> `blockCounts`), updated in producer, consumer and tests. Call sites updated across R/, tests/, the vignette and the Shiny app; man pages and NAMESPACE regenerated via roxygen2. NEWS records the breaking rename. All hierarchy/HSJ/xform/recode/resample/tree_length tests pass; spelling clean. Local variables in the large shared scoring functions (tree_length.R, Morphy.R, MaximizeParsimony.R) keep their existing `tip_data`/`adj_weight` style — those bridge locals are pervasive across non-hierarchy branches too, so renaming them belongs to a separate cleanup. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
@-