Skip to content

[Cosmos][WIP]: Normalize region names passed as preferred or exclude regions.#49090

Open
jeet1995 wants to merge 51 commits into
Azure:mainfrom
jeet1995:squad/region-name-mapper
Open

[Cosmos][WIP]: Normalize region names passed as preferred or exclude regions.#49090
jeet1995 wants to merge 51 commits into
Azure:mainfrom
jeet1995:squad/region-name-mapper

Conversation

@jeet1995
Copy link
Copy Markdown
Member

@jeet1995 jeet1995 commented May 7, 2026

Problem

Customers passing region names in non-canonical forms ("westus3", "WEST US 3", "west-us-3", "west_us_3") hit routing failures because the SDK stored regions in inconsistent shapes and some code paths used case-sensitive String.equals()/List.contains() against server-returned names like "west us 3". A related bug in PPCB re-evaluation silently re-added user-excluded regions as retry targets when the case didn't match.

Definitions

  • Canonical = official Cosmos display form: "West US 3", "East US". Sourced from Settings.xml regionToIdMapping.
  • Normalized = lowercase with spaces, hyphens, and underscores stripped: "westus3". So "West US 3", "WEST-US-3", "west_us_3", and "westus3" all normalize to "westus3". Used as internal routing-map key and URL suffix.
  • Server-returned = getName().toLowerCase(Locale.ROOT): "west us 3" — lowercase, spaces preserved.

Design rationale

One source of truth. RegionUtils owns the canonical ↔ normalized ↔ region-id mapping. Static map keyed by canonical names, derived maps built once with fail-fast on duplicates. getCanonicalRegionName(x) returns canonical form for known regions and the input as-is for unknown ones; getNormalizedRegionName(x) always returns the normalized form.

Public API preserves customer input. ConnectionPolicy.setPreferredRegions(...) and CosmosExcludedRegions store the customer's raw input verbatim. Diagnostics report what the customer supplied. All canonicalization is internal to LocationCache.

Two-map routing inside LocationCache. The forward endpoint map is now stored in two parallel forms:

  • availableXxxRegionalRoutingContextsByRegionName — keyed by server-form-lowercased ("west us 3"). Unchanged; preserves the contract used by resolveServiceEndpoint, resolveFaultInjectionEndpoint, the PPCB internal-exclude path, and the public getRegionName diagnostic surface.
  • availableXxxRegionalRoutingContextsByNormalizedRegionName — keyed by getNormalizedRegionName(serverName) ("westus3"). Used only for customer-supplied preferred-region lookups in getPreferredAvailableRoutingContexts and shouldRefreshEndpoints. DatabaseAccountLocationsInfo.preferredLocations is stored in normalized form so the lookup is a constant-time hit.

This split means any input variant — canonical, normalized, hyphenated, underscored, or an unknown region in no-space form — resolves consistently for routing, while diagnostics keep returning the server form.

Exclude regions are pre-normalized once per request inside getApplicableRegionRoutingContexts (nulls dropped), then matched in normalized form against the normalized server-region name. No per-request cache, no unbounded growth, no NPE on null entries.

Bug fix: PPCB re-evaluate

LocationCache.reevaluate() previously compared user-excluded regions with case-sensitive List.contains(), so "West US 3".contains("west us 3") returned false and a user-excluded region was silently re-added as a retry target when PPCB had also excluded it. Now matched via normalized form on both sides.

Tests

  • Unit: 12 in RegionUtilsNormalizationTest, 20 in LocationCacheTest (including unknown-region routing tests), 1 PPCB re-evaluate regression in ApplicableRegionEvaluatorTest, ID-map consistency check in RegionUtilsTests. Full azure-cosmos unit suite — 2469 tests — passes.
  • E2E (azure-cosmos-tests): non-canonical preferred + exclude + fault injection across 7 op types in ExcludeRegionTests; non-canonical + eager availability strategy + 503 hedging; non-canonical + PPCB + 503 circuit-breaker failover.
  • Spark connector E2E (RegionNameNormalizationE2EITest, new): shared test under azure-cosmos-spark_3/.../com/azure/cosmos/spark/, exercised against both azure-cosmos-spark_3-5_2-12 (Spark 3.5 LTS) and azure-cosmos-spark_4-1_2-13 (Spark 4.1) via the existing copy-shared-test-sources plumbing. Two scenarios: (1) probe round-trip — discover the SDK's default canonical region from CosmosDiagnostics, feed back its normalized form, assert routing pins to canonical; (2) differential — pick a readable region distinct from the probe default, normalize and feed as preferredRegionsList, assert reads route to the canonical secondary (proving routing moved as a result of mapper canonicalization). Captured via a test-scoped CosmosDiagnosticsHandler registered through TestCosmosClientBuilderInterceptor.

Customers passing region names in non-canonical forms (e.g., 'west us 3'
instead of 'West US 3') hit routing issues because the Java SDK stores
region names in different forms and some comparisons use case-sensitive
String.equals()/List.contains().

Changes:
- Add RegionNameMapper: strips spaces + case-insensitive lookup against
  90+ known Azure regions to produce canonical names (e.g., 'westus3' or
  'west us 3' -> 'West US 3'). Unknown regions pass through as-is.
- ConnectionPolicy.setPreferredRegions(): normalize + order-preserving
  dedupe at entry point.
- LocationCache constructor: apply RegionNameMapper before toLowerCase
  for defense-in-depth.
- Fix case-sensitive List.contains() bug in reevaluate() (line 502):
  use containsRegionIgnoreCase() instead.
- Normalize user-configured exclude regions at point of use in
  getApplicableRegionRoutingContexts() to prevent mismatches with
  PPCB-derived lowercased region names.
- Add RegionNameMapperTest with 43 unit tests covering case variants,
  space removal, passthrough, null/empty handling.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions github-actions Bot added the Cosmos label May 7, 2026
jeet1995 and others added 7 commits May 7, 2026 10:12
The static region list in RegionNameMapper goes stale when new Azure
regions are added. Fix: add a ConcurrentHashMap-backed dynamic tier
that learns canonical region names from server responses.

- RegionNameMapper.registerRegionName(): registers canonical names from
  DatabaseAccountLocation (called from LocationCache.addRoutingContexts).
  After the first account read, even new regions like 'West US 4' can
  normalize 'westus4' → 'West US 4'.
- getCosmosDBRegionName(): checks static map first, then dynamic map.
- Add 2 new tests for dynamic registration behavior.
- 45/45 RegionNameMapperTest pass, 32/32 LocationCacheTest pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The previous commit had stash conflict markers (<<<<<<< Updated upstream /
>>>>>>> Stashed changes) left in RegionNameMapper.java and
RegionNameMapperTest.java. Rewrote both files clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Merge the separate RegionNameMapper into RegionNameToRegionIdMap as the
single source of truth for region names. This eliminates maintaining two
parallel region lists that can drift out of sync.

Changes:
- Delete RegionNameMapper.java — normalization logic moved into
  RegionNameToRegionIdMap.
- RegionNameToRegionIdMap now provides region ID mapping (existing) AND
  region name normalization (new) from one canonical list.
- Sync REGION_NAME_TO_REGION_ID_MAPPINGS with backend RegionToIdMap.cs:
  add Bleu France Central/South (107/108), Delos Cloud Germany
  Central/North (109/110), Singapore Central/North (111/112), fix
  'easteurope' → 'East Europe' (54).
- Build NORMALIZED_REGION_NAME_TO_REGION_ID_MAPPINGS programmatically
  from REGION_NAME_TO_REGION_ID_MAPPINGS instead of manual duplication.
- Normalization static map seeded from ID map keys + additional regions
  without IDs yet (from .NET SDK Regions.cs).
- Rename test: RegionNameMapperTest → RegionNameToRegionIdMapNormalizationTest.
- Update ConnectionPolicy and LocationCache references.
- All 78 tests pass (45 normalization + 32 LocationCache + 1 consistency).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add 7 tests to LocationCacheTest using real Azure region names to verify
that preferred regions and exclude regions work correctly with
non-canonical input:

- preferredRegions_lowercaseShouldMatchCanonical: 'west us 3' → West US 3
- preferredRegions_noSpacesShouldMatchCanonical: 'westus3' → West US 3
- preferredRegions_uppercaseShouldMatchCanonical: 'WEST US 3' → West US 3
- preferredRegions_duplicateAfterNormalizationShouldDedupe: 'westus3' +
  'West US 3' deduped to single entry
- excludeRegions_lowercaseNoSpacesShouldExclude: 'westus3' excludes
  West US 3
- excludeRegions_mixedCasingShouldExclude: 'EAST us' excludes East US
- excludeRegions_requestLevelNoSpacesShouldExclude: request-level
  'eastus' excludes East US

All 39 LocationCacheTest unit tests pass (32 existing + 7 new).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Create a second CosmosClient with space-stripped preferred regions
(e.g., 'westus3' instead of 'west us 3') and verify that routing and
region exclusion work identically to canonical names.

New tests:
- nonCanonicalPreferredRegions_shouldRouteCorrectly: client with
  space-stripped preferred regions routes to correct first region
  (7 operation types via DataProvider)
- nonCanonicalExcludeRegion_shouldSkipExcludedRegion: excluding with
  space-stripped name (e.g., 'westus3') correctly skips that region
  (7 operation types via DataProvider)
- uppercaseExcludeRegion_shouldSkipExcludedRegion: excluding with
  UPPERCASE name correctly skips that region

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add tests that create CosmosClients with space-stripped preferred regions
(e.g., 'westus3' instead of 'West US 3') and verify correct routing.

FaultInjectionWithAvailabilityStrategyTestsBase:
- Add nonCanonicalWriteableRegions field (space-stripped from server names)
- readAfterCreation_nonCanonicalPreferredRegions_shouldRouteCorrectly:
  creates client with space-stripped regions, reads with eager availability
  strategy, verifies first contacted region matches expected canonical name

PerPartitionCircuitBreakerE2ETests:
- nonCanonicalPreferredRegions_ppcbShouldStillRouteCorrectly: creates
  client with space-stripped regions, performs create+read, verifies
  diagnostics show routing to correct first preferred region

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Simplify RegionNameToRegionIdMap by removing the ConcurrentHashMap-backed
dynamic registration tier. Unknown regions are returned as-is, which is
sufficient because LocationCache's CaseInsensitiveMap + toLowerCase
handles the matching for any region the server returns.

- Remove DYNAMIC_NORMALIZED_TO_CANONICAL and registerRegionName()
- Remove registerRegionName() call from LocationCache.addRoutingContexts()
- Replace dynamic registration tests with passthrough assertion tests
- 84/84 tests pass (44 normalization + 39 LocationCache + 1 consistency)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995 jeet1995 changed the title Port RegionNameMapper from .NET SDK to normalize region names [Cosmos][WIP]: Normalize region names passed as preferred or exclude regions. May 7, 2026
jeet1995 and others added 5 commits May 7, 2026 12:21
…sible

Duplicate preferred regions after normalization (e.g., ['westus3', 'West US 3']
both becoming 'West US 3') are an obvious customer misconfiguration. The SDK
should not silently mask this — let the duplicates pass through so the customer
can see and fix their config.

Also clarify code comments for the escape hatch behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add 10 regions from the authoritative LocationNames.cs that were missing
from the normalization map: East US SLV, Southeast US, Southwest US,
South Central US 2, Southeast US 3, Southeast US 5, Northeast US 5,
India South Central, Southeast Asia 3, West Central US FRE.

Region ID mappings remain a subset (only regions with assigned IDs from
RegionToIdMap.cs). The normalization map is the superset sourced from
LocationNames.cs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace the previous RegionToIdMap.cs-based ID map with the complete
regionToIdMapping from Settings.xml (IDs 1-124). This is the
authoritative source for region name ↔ ID mappings used for session
token region-level progress tracking.

- Add 44 new region IDs (74-124): Brazil Southeast, West US 3, Qatar
  Central, Italy North, East US 3, Saudi Arabia East, etc.
- Remove separate 'additional canonical names' block — all canonical
  names now derive from the ID map since Settings.xml is the superset.
- Remove 'Greece Central' which was not in any authoritative source.
- Update javadoc and code comments to reference Settings.xml as the
  authoritative source instead of RegionToIdMap.cs.
- 83/83 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ationCache

- Rename class to RegionUtils — better reflects its dual role (ID mapping
  + region name normalization).
- Move normalizeRegionNames() and containsRegionIgnoreCase() from
  LocationCache private helpers into RegionUtils as public static methods.
- Rename all test files to match: RegionUtilsNormalizationTest,
  RegionUtilsTests.
- Update all references across ConnectionPolicy, LocationCache,
  PartitionScopedRegionLevelProgress, RxDocumentClientImpl.
- 83/83 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995
Copy link
Copy Markdown
Member Author

jeet1995 commented May 7, 2026

/azp run java - cosmos - tests

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995 jeet1995 marked this pull request as ready for review May 7, 2026 17:37
@jeet1995 jeet1995 requested a review from kirankumarkolli as a code owner May 7, 2026 17:37
Copilot AI review requested due to automatic review settings May 7, 2026 17:37
@jeet1995 jeet1995 requested review from a team as code owners May 7, 2026 17:37
@jeet1995
Copy link
Copy Markdown
Member Author

jeet1995 commented May 7, 2026

@sdkReviewAgent

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses Cosmos DB routing mismatches caused by non-canonical Azure region name inputs by introducing centralized region normalization and applying it to preferred/excluded region handling across the Cosmos Java SDK routing stack.

Changes:

  • Added RegionUtils as the single source of truth for region ID mappings and canonical region name normalization, and updated call sites to use it.
  • Normalized preferred/excluded regions in ConnectionPolicy and LocationCache, including a fix for a case-sensitive exclude-region check in PPCB reevaluation logic.
  • Added/updated unit and E2E tests to validate routing behavior with non-canonical region inputs and updated the Cosmos changelog entry.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java Switches region ID lookup to RegionUtils.
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/routing/RegionUtils.java Introduces region normalization + region ID mapping utilities.
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/routing/RegionNameToRegionIdMap.java Removes the old region mapping class in favor of RegionUtils.
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/routing/LocationCache.java Normalizes excluded regions and fixes PPCB exclude-region comparison behavior.
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/PartitionScopedRegionLevelProgress.java Updates region ID/name lookups to RegionUtils.
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ConnectionPolicy.java Normalizes preferred regions at configuration time.
sdk/cosmos/azure-cosmos/CHANGELOG.md Documents the normalization + PPCB exclude-region fix.
sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/PerPartitionCircuitBreakerE2ETests.java Adds E2E coverage for PPCB routing with non-canonical preferred regions.
sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/routing/RegionUtilsNormalizationTest.java Adds unit coverage for normalization behavior.
sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/routing/LocationCacheTest.java Adds integration-style unit tests for preferred/exclude region normalization with real region names.
sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/RegionUtilsTests.java Updates the existing mapping-consistency test to the new RegionUtils.
sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/FaultInjectionWithAvailabilityStrategyTestsBase.java Adds E2E validation for availability strategy routing with non-canonical preferred regions.
sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/ExcludeRegionTests.java Adds E2E coverage for non-canonical preferred/exclude region inputs.

Comment on lines +5331 to +5332
assertThat(diagnosticsContext.getContactedRegionNames().iterator().next())
.isEqualTo(expectedFirstRegion);
@xinlian12
Copy link
Copy Markdown
Member

@sdkReviewAgent

return canonical;
}

return regionName;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering whether we should fallback to normalized version when not found from the map. Even in globalEndpointManager, we just use the normalized version for findings the regional endpoint

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel we should keep behavior for unknown regions as is? Backward compat mindset - GlobalEndpointManager today just relies on case insensitivity (no space trimming).

* returned as-is.</li>
* </ol>
*/
public class RegionUtils {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also change to use RegionUtils for GlobalEndpointManager as well

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GlobalEndpointManager takes a dependency on LocationHelper for regional URI construction during bootstrap - modified this to take a dependency on RegionUtils (for normalized region suffixing to URI for topology discovery).

@jeet1995
Copy link
Copy Markdown
Member Author

@sdkReviewAgent

// is keyed by server-form-lowercased, so look it up directly.
if (Utils.tryGetValue(regionalRoutingContextsByRegionName, internalExcludeRegion, regionalRoutingContextValueHolder)) {

if (!regionalRoutingContextValueHolder.v.equals(hubRoutingContext)) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium: reevaluate global branch missing user-exclude check

The sibling else branch (line 522–523) was correctly updated to use containsNormalizedRegion to prevent re-adding a user-excluded region. But this if branch (when first applicable is the global routing context) has no such guard.

Scenario: User excludes "East US" ("eastus"), PPCB also excludes "east us". After both exclusions, applicableEndpoints is empty → global fallback used. This branch then iterates internalExcludeRegions, finds "east us" in the forward map, and re-adds the East US endpoint as a retry target — violating the user's explicit exclusion.

Suggested fix: Add the same containsNormalizedRegion check here:

if (!regionalRoutingContextValueHolder.v.equals(hubRoutingContext)
        && !containsNormalizedRegion(userConfiguredExcludeRegions,
            RegionUtils.getNormalizedRegionName(internalExcludeRegion))) {
    modifiedRegionalRoutingContexts.add(regionalRoutingContextValueHolder.v);
    break;
}

This review was generated by Copilot and may contain inaccuracies. Please validate any suggestions before applying.

if (regions == null || regions.isEmpty()) {
return false;
}
String canonicalTarget = getCanonicalRegionName(target);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Low: containsRegionIgnoreCase fails for unknown regions

This method canonicalizes both sides then uses equalsIgnoreCase. For known regions (e.g., "westus3""West US 3"), this works. But for unknown regions, getCanonicalRegionName returns the input as-is, so:

containsRegionIgnoreCase(["plutocentral"], "Pluto Central")
// canonicalTarget = "Pluto Central" (as-is, unknown)
// getCanonicalRegionName("plutocentral") = "plutocentral" (as-is, unknown)
// "plutocentral".equalsIgnoreCase("Pluto Central") → false!

The method's Javadoc says "all format variants: 'westus3', 'West US 3', 'WEST US 3' all match each other" — this only holds for known regions.

Consider comparing normalized forms instead of canonical forms:

String normalizedTarget = getNormalizedRegionName(target);
// ...
if (region != null && getNormalizedRegionName(region).equals(normalizedTarget)) {

This method isn't used in production code today, but the contract mismatch could cause confusion if it gets adopted later.


This review was generated by Copilot and may contain inaccuracies. Please validate any suggestions before applying.

String normalizedRegionRoutedTo = this.normalizedRegionLookupMap.get(regionRoutedTo);

int regionId = RegionNameToRegionIdMap.getRegionId(normalizedRegionRoutedTo);
int regionId = RegionUtils.getRegionId(normalizedRegionRoutedTo);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Low: Residual ad-hoc normalizers not updated to use RegionUtils

This line's value comes from normalizedRegionLookupMap, which is populated two lines above (line 124) via:

this.normalizedRegionLookupMap.computeIfAbsent(regionRoutedTo,
    (val) -> val.toLowerCase(Locale.ROOT).trim().replace(" ", ""));

This strips spaces but not hyphens or underscores — so "west-us-3" normalizes to "west-us-3", not "westus3" as RegionUtils.getNormalizedRegionName() would produce. Two other files have the same pattern:

  • PartitionKeyBasedBloomFilter.java:84toLowerCase().replace(" ", "")
  • RegionScopedSessionContainer.java:464toLowerCase().trim().replace(" ", "")

Since this PR establishes RegionUtils.getNormalizedRegionName() as the single normalization function, consider migrating these call sites in a follow-up to avoid divergent normalization. (Not blocking — these paths receive server-form names today, which never contain hyphens/underscores.)


This review was generated by Copilot and may contain inaccuracies. Please validate any suggestions before applying.

@xinlian12
Copy link
Copy Markdown
Member

Review complete (39:17)

Posted 3 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

Splits the previously monolithic `RegionUtils` into two classes with
disjoint responsibilities:

- `RegionNameNormalizer` (pure) - lowercase + strip [-_ ]. Used wherever
  the SDK needs case- and separator-insensitive equality between two region
  name strings it already has in hand: `LocationCache` endpoint resolution
  and `LocationHelper` URL construction. Has no static region map, so a
  region the SDK has not yet learned about still routes correctly.

- `RegionIdRegistry` (map-aware) - the only place that depends on the
  compiled-in region-id table. Exposes `getRegionId(String)` and
  `getNormalizedRegionNameForId(int)`. Used only by
  `PartitionScopedRegionLevelProgress` (session-token region-level
  progress) and `RxDocumentClientImpl` (multi-region readiness probe).

This separation removes a foot-gun where overloaded
`getNormalizedRegionName(String)` (pure) and
`getNormalizedRegionName(int)` (map-dependent, returns "" for unknown
IDs) shared a name. It also frees endpoint resolution from the SDK's
compile-time region inventory entirely.

Dead-code methods `getCanonicalRegionName`, `canonicalizeRegionNames`,
and `containsRegionIgnoreCase` are removed; the only callers were the
old `RegionUtilsNormalizationTest`, which is dropped.

Renames the surviving tests to match the new class names:
- `RegionUtilsTests` -> `RegionIdRegistryTests`
- `RegionUtilsSettingsXmlValidationTest` -> `RegionIdRegistrySettingsXmlValidationTest`

No behavioral change. `mvn -P unit verify` passes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995 jeet1995 force-pushed the squad/region-name-mapper branch 5 times, most recently from 93cdb82 to 779bbd7 Compare May 29, 2026 23:57
@jeet1995
Copy link
Copy Markdown
Member Author

@sdkReviewAgent

…utingContexts

Mirrors the existing guard in the sibling else-branch so that when PPCB drives the path into the global-fallback branch, regions on the user exclude list are not re-added as retry targets. Addresses xinlian12's review comment at LocationCache.java:507.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* that {@link #getApplicableRegionRoutingContexts} uses for the user-exclude check, so that unknown
* regions in any input form (e.g., "westus3" vs "West US 3") match consistently.
*/
private static boolean containsNormalizedRegion(List<String> userExcludeRegions, String normalizedTarget) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Recommendation — Design: Pass prebuilt normalizedExcludes set to reevaluate() instead of raw list

getApplicableRegionRoutingContexts already builds a HashSet<String> normalizedExcludes at lines 353–362, but reevaluate() receives the raw userConfiguredExcludeRegions list. This containsNormalizedRegion method then re-normalizes every entry on each call — duplicating the work that was already done.

Why it matters beyond efficiency: This data-flow split is also the root cause of the unguarded global branch at line 507 (raised by @xinlian12). If reevaluate received normalizedExcludes instead of the raw list, adding the user-exclude check to the global branch would be a one-liner (normalizedExcludes.contains(RegionNameNormalizer.normalize(internalExcludeRegion))), and this entire containsNormalizedRegion method could be removed.

Suggested refactoring:

  1. Change reevaluate() signature: replace List<String> userConfiguredExcludeRegionsSet<String> normalizedExcludes
  2. Line 472: !userConfiguredExcludeRegions.isEmpty()!normalizedExcludes.isEmpty()
  3. Line 507 (global branch): add && !normalizedExcludes.contains(RegionNameNormalizer.normalize(internalExcludeRegion))
  4. Line 523 (else branch): !containsNormalizedRegion(...)!normalizedExcludes.contains(RegionNameNormalizer.normalize(internalExcludeRegion))
  5. Delete containsNormalizedRegion() entirely

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

}

@Test(groups = "unit")
public void preferredRegions_noSpacesShouldMatchCanonical() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Recommendation — Testing: No routing test exercises hyphenated or underscored preferred regions

The PR title explicitly calls out "west-us-3" and "west_us_3" as newly supported input forms, and RegionNameNormalizer.normalize() strips both characters. But the LocationCacheTest normalization suite only exercises three variants:

  • "west us 3" (lowercase with spaces) ✅
  • "westus3" (no spaces) ✅
  • "WEST US 3" (uppercase) ✅
  • "west-us-3" (hyphenated) ❌ missing
  • "west_us_3" (underscored) ❌ missing

The hyphen strip is only tested via LocationHelperTest (URL suffix), and underscores are not tested through any routing path at all.

Why it matters: If someone modifies RegionNameNormalizer.normalize() and accidentally drops the .replace("-", "") or .replace("_", "") call, no LocationCacheTest would catch the regression in the routing layer. The LocationHelperTest would catch hyphen removal for URL generation, but that's a different concern than endpoint resolution.

Suggested addition: Two tests mirroring this one:

@Test(groups = "unit")
public void preferredRegions_hyphenatedShouldMatchCanonical() {
    LocationCache cache = createCacheWithRealRegions(
        Arrays.asList("west-us-3", "east-us", "north-europe"));
    // assert same endpoint ordering as canonical
}

@Test(groups = "unit")
public void preferredRegions_underscoredShouldMatchCanonical() {
    LocationCache cache = createCacheWithRealRegions(
        Arrays.asList("west_us_3", "east_us", "north_europe"));
    // assert same endpoint ordering as canonical
}

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

* Returns the normalized form of a region name, or the input itself if it is null or empty.
* Pure string transform; never throws and never allocates a string when the input is empty.
*/
public static String normalize(String regionName) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Recommendation — Testing: No dedicated unit test for the foundational normalizer

RegionNameNormalizer.normalize() is the single normalization function that the entire routing pipeline (preferred lookups, exclude matching, PPCB reevaluate, URL suffix generation) depends on. There is no RegionNameNormalizerTest — the function is only exercised indirectly via LocationCacheTest, LocationHelperTest, and RegionIdRegistryTests.

Why it matters: If the normalization contract changes (e.g., someone adds Unicode normalization, changes the StringUtils.isEmpty guard to handle whitespace-only differently, or reorders the replace calls), there's no safety net that tests the normalizer's own behavior in isolation. Key untested edge cases:

  • normalize(null) → should return null (current behavior via StringUtils.isEmpty)
  • normalize("") → should return ""
  • normalize(" ") → produces "" (the space is stripped)
  • normalize("East_US-2") → should produce "eastus2" (mixed separators)
  • Idempotency: normalize(normalize("West US 3")) == normalize("West US 3")

The Python SDK has a dedicated TestNormalizeRegionName class with explicit collision-safety tests (e.g., verifying "East US""East US 2" after normalization). Adding the same would be cheap and high-value.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

@xinlian12
Copy link
Copy Markdown
Member

Review complete (53:41)

Posted 3 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

Replaces inline toLowerCase + replace patterns in PartitionScopedRegionLevelProgress, PartitionKeyBasedBloomFilter, and RegionScopedSessionContainer with RegionNameNormalizer.normalize so all region-name normalization in azure-cosmos uses a single helper. Addresses xinlian12's PartitionScopedRegionLevelProgress comment (non-blocking) and removes the now-unused java.util.Locale imports.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995
Copy link
Copy Markdown
Member Author

@sdkReviewAgent

@jeet1995
Copy link
Copy Markdown
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).


for (Map.Entry<String, Integer> sourceEntry : RegionNameToRegionIdMap.REGION_NAME_TO_REGION_ID_MAPPINGS.entrySet()) {
for (Map.Entry<String, Integer> sourceEntry : RegionIdRegistry.CANONICAL_REGION_NAME_TO_REGION_ID_MAPPINGS.entrySet()) {
String normalizedRegionNameFromSource = sourceEntry.getKey().toLowerCase(Locale.ROOT).replace(" ", "").trim();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Correctness — Test: RegionIdRegistryTests uses divergent normalization from production code

This test normalizes canonical region names using the old algorithm:

sourceEntry.getKey().toLowerCase(Locale.ROOT).replace(" ", "").trim()

But the production NORMALIZED_REGION_NAME_TO_REGION_ID_MAPPINGS is built using RegionNameNormalizer.normalize(), which additionally strips hyphens and underscores:

regionName.toLowerCase(Locale.ROOT).replace(" ", "").replace("-", "").replace("_", "")

Why this matters: The test currently passes only because no canonical region name contains hyphens or underscores. But the test is validating map consistency using a different normalization function than the one that built the map — so it's providing false confidence. If a future region like "Delos-Cloud Germany West" (which is already in the registry at ID 109) had a hyphen in its canonical name, the test would normalize it to "delos-cloudgermanywest" while the production map key would be "deloscloudgermanywest". The assertion at line 29 would fail with a misleading error about map inconsistency.

Fix: Replace line 24 with:

String normalizedRegionNameFromSource = RegionNameNormalizer.normalize(sourceEntry.getKey());

This makes the test prove the actual derivation path rather than an approximation of it.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

// PPCB marks "east us" (server-returned, lowercased with spaces) as unavailable
RxDocumentServiceRequest request = RxDocumentServiceRequest.create(
mockDiagnosticsClientContext(), OperationType.Read, ResourceType.Document);
request.requestContext.setExcludeRegions(Arrays.asList("eastus"));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Correctness — Test: Regression test does not exercise the containsNormalizedRegion fix

This test claims to verify that the reevaluate path doesn't re-add a user-excluded region when PPCB also excluded it. But it cannot reach the containsNormalizedRegion code (lines 511/527) due to the early-return guard at line 440.

Trace through the test scenario:

  • 3 regions configured, user excludes 1 ("eastus")
  • After user-exclude filtering: 2 endpoints remain (West US + Central US)
  • PPCB exclude "east us" matches East US, but it's already removed → endpointsRemovedByInternalExcludeRegions is empty
  • applicableRegionalRoutingContexts.size() == 2reevaluate returns at line 440 before containsNormalizedRegion is ever called

The test passes for the wrong reason — it validates the main exclude loop normalization, not the reevaluate path. The specific PPCB bug (user-excluded region silently re-added as retry target) has zero unit test coverage.

Fix: Exclude 2 of 3 regions so only 1 remains after filtering, forcing reevaluate to enter its inner logic:

request.requestContext.setExcludeRegions(Arrays.asList("eastus", "westus"));
request.requestContext.setUnavailableRegionsForPerPartitionCircuitBreaker(
    Arrays.asList("east us"));

Then assert containsNormalizedRegion blocks "east us" from being re-added (West US was user-excluded, not PPCB-excluded, so the reevaluate loop won't try to re-add it — only "east us" would be attempted).

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

* Fast path: tries the raw key first (zero allocations when input is already normalized).
* Slow path: normalizes via {@link RegionNameNormalizer#normalize(String)} and retries.
*/
public static int getRegionId(String regionName) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Recommendation — Testing: No unit tests for getRegionId() or getNormalizedRegionNameForId() public API

RegionIdRegistryTests validates internal map consistency, and RegionIdRegistrySettingsXmlValidationTest validates sync with Settings.xml. But neither exercises the public API methods that callers actually invoke.

getRegionId() has non-trivial branching logic (fast path → slow path with normalization fallback) and getNormalizedRegionNameForId() has an empty-string sentinel for unknown IDs. None of these behaviors are directly tested.

Suggested addition:

@Test(groups = {"unit"})
public void getRegionIdVariousFormats() {
    // Fast path (already normalized)
    assertThat(RegionIdRegistry.getRegionId("eastus")).isEqualTo(1);
    // Slow path (canonical form → normalize → retry)
    assertThat(RegionIdRegistry.getRegionId("East US")).isEqualTo(1);
    // Hyphenated input
    assertThat(RegionIdRegistry.getRegionId("east-us")).isEqualTo(1);
    // Underscored input
    assertThat(RegionIdRegistry.getRegionId("east_us")).isEqualTo(1);
    // Edge cases
    assertThat(RegionIdRegistry.getRegionId(null)).isEqualTo(-1);
    assertThat(RegionIdRegistry.getRegionId("")).isEqualTo(-1);
    assertThat(RegionIdRegistry.getRegionId("Pluto Central")).isEqualTo(-1);
    // Reverse lookup
    assertThat(RegionIdRegistry.getNormalizedRegionNameForId(1)).isEqualTo("eastus");
    assertThat(RegionIdRegistry.getNormalizedRegionNameForId(999)).isEqualTo("");
}

This pins the fast-path/slow-path contract and the edge-case sentinels.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

private UnmodifiableList<String> availableReadLocations;
private UnmodifiableMap<String, RegionalRoutingContext> availableWriteRegionalRoutingContextsByRegionName;
private UnmodifiableMap<String, RegionalRoutingContext> availableReadRegionalRoutingContextsByRegionName;
// Parallel maps keyed by normalized region name (lowercase, no spaces/separators). Used
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Suggestion — Performance: CaseInsensitiveMap is redundant for normalized-keyed maps

The availableXxxRegionalRoutingContextsByNormalizedRegionName maps are initialized as CaseInsensitiveMap, but all keys inserted come from RegionNameNormalizer.normalize() which already lowercases. All lookup keys also go through RegionNameNormalizer.normalize() before query (line 906, 615, 659). The case-insensitive comparison in CaseInsensitiveMap.convertKey() allocates a char[] and does per-character lowercasing on every get/containsKey — work that's always a no-op on already-lowercase strings.

Using HashMap for the normalized maps would:

  1. Avoid the redundant per-lookup allocation
  2. Self-document the invariant that keys are pre-normalized (if someone accidentally passes a non-normalized key, the lookup fails fast rather than silently matching)

This is defense-in-depth vs. efficiency+clarity. Not blocking — the current approach is correct, just slightly wasteful.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

// mostPreferredLocation came from preferredLocations (normalized form) or
// effectivePreferredLocations (server-form). Look up against the parallel
// normalized map after normalizing the key to handle both cases.
String normalizedMostPreferredLocation = RegionNameNormalizer.normalize(mostPreferredLocation);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Suggestion — Maintainability: Document the mixed-form contract between preferredLocations and effectivePreferredLocations

This normalization call is needed because mostPreferredLocation might come from either:

  • preferredLocationsalready normalized (via the DatabaseAccountLocationsInfo constructor at line 1148-1150)
  • effectivePreferredLocationsserver-form (e.g., "West US 3", populated from raw getName() at line 992)

The two lists represent the same concept (ordered region preference) but are in different string forms. This works correctly today because you normalize here before lookup, but a future developer adding a direct lookup against the normalized map using effectivePreferredLocations without normalizing first would get a silent miss.

Consider adding a field-level comment on effectivePreferredLocations (line 1123) documenting that it stores server-form (not normalized) region names, unlike preferredLocations which stores normalized form. This costs nothing and prevents a subtle future bug.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

@xinlian12
Copy link
Copy Markdown
Member

Review complete (16:58)

Posted 5 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@jeet1995
Copy link
Copy Markdown
Member Author

Perf benchmark results — 756f5005 (latest HEAD of squad/region-name-mapper)

Ran a 30-min point-read benchmark on a Standard_D16s_v5 in East US 2 against thin-client-multi-writer-ci. preferredRegions=[West Central US, East US 2], excludedRegions=["WEst-centRAL-us"] (intentionally hyphenated to exercise RegionNameNormalizer).

Metric baseline azure-cosmos:4.80.0 iter-run1 (PR head 2f414799) iter-run2 (PR head re-run) iter-final-v2 (HEAD 756f5005)
Total Doc/Read 41,058,558 35,417,782 37,491,069 37,787,407
Throughput (ops/s) 22,868 19,730 20,879 21,655
Mean (ms) 0.83 0.97 0.92 0.91
p50 (ms) 0.00 0.00 0.00 0.00
p90 (ms) 0.00 0.00 0.00 0.00
p95 (ms) 1.00 1.21 1.15 1.29
p99 (ms) 2.08 1.54 1.54 1.69
Max (ms) 139.58 89.54 251.27 193.70

Micrometer histogram buckets coarser than 1 ms — so p50/p90 are reported as 0 across all runs; only p95+ are meaningful here.

Reads

  • 100% routed to cdb-ms-prod-eastus2-* — the hyphenated exclude "WEst-centRAL-us" is normalized correctly via RegionNameNormalizer.normalize and drops the WCUS endpoint from the applicable set.
  • Throughput within 5% of the PR head re-runs (+3.7% vs iter-run2, −5.3% vs baseline 4.80.0 — both inside the prior iter-run1↔iter-run2 noise band of +5.85%).
  • p99 is 1.69 ms vs baseline 2.08 ms (−18.8%) — the original PR p99 win holds at the latest HEAD.
  • No allocation, GC pause, or thread-count anomalies vs the prior runs.

What's in 756f5005 since 2f414799 (the head these comparisons were against before)

  1. Refactor RegionUtils into pure normalizer + map-aware registry (779bbd7281d) — split into RegionNameNormalizer + RegionIdRegistry, deleted RegionUtils.
  2. Honor user-exclude in global-fallback branch of getApplicableRegionRoutingContexts (90a6e5833be) — closes the gap @xinlian12 flagged on LocationCache.java:507.
  3. Migrate residual ad-hoc region normalizers to RegionNameNormalizer (756f5005a3a) — migrates PartitionScopedRegionLevelProgress, PartitionKeyBasedBloomFilter, RegionScopedSessionContainer from inline toLowerCase + replace to the shared helper.

Verdict: the latest changes are perf-neutral and the original PR's −26% p99 win is preserved.

Note: the benchmark module on this branch was missing excludedRegionsList wiring (the field on TenantDefaultConfig and the setExcludedRegions call in AsyncReadBenchmark already exist on origin/main). I patched those two files on the VM build only — happy to cherry-pick them into a commit on this branch in a separate PR if folks would find it useful for future perf testing.

@jeet1995
Copy link
Copy Markdown
Member Author

🧪 PPAF DR Drill Report — Region Name Normalization Validation

Environment: Test61 | Date: 2026-05-30 UTC | Result: ✅ ALL 7 CRITERIA PASSED


Test Setup

Parameter Value
Cosmos JAR azure-cosmos 4.81.0-beta.1 (built from this PR's branch)
Preferred Regions (non-normalized) northcentralus, eastus2, centralus
Account Regions NCU (P0, write) → EU2 (P1, read) → CU (P2, read)
Consistency Strong
PPAF Enabled
Connection Modes DIRECT + GATEWAY (parallel)
Quorum Loss All 3 NCU partitions, 600s, AllReplicas mode

Timeline

gantt
    title PPAF DR Drill Timeline (UTC)
    dateFormat HH:mm
    axisFormat %H:%M

    section Workloads
    DIRECT workload           :active, d1, 02:38, 03:18
    GATEWAY workload          :active, g1, 02:40, 03:20

    section Fault Injection
    Quorum Loss (NCU)         :crit, ql, 02:59, 03:09

    section DIRECT Routing
    NCU baseline              :done, db, 02:38, 02:59
    Failover to EU2           :crit, df, 02:59, 03:09
    Gradual failback to NCU   :milestone, 03:10, 03:15
    NCU restored              :done, dr, 03:15, 03:18

    section GATEWAY Routing
    NCU baseline              :done, gb, 02:40, 02:59
    Failover to EU2           :crit, gf, 02:59, 03:09
    Gradual failback to NCU   :milestone, 03:10, 03:14
    NCU restored              :done, gr, 03:14, 03:20
Loading

Failover latency: ~23 seconds | Failback: ~5 minutes after QL end


Region Distribution — DIRECT Mode

%%{init: {'theme': 'neutral'}}%%
xychart-beta
    title "DIRECT Mode — Contacted Region % by Phase"
    x-axis ["Pre-QL", "During QL", "Post-QL"]
    y-axis "% of Operations" 0 --> 100
    bar [98, 1, 64]
    bar [0.3, 89, 18]
    bar [1.4, 10, 18]
Loading
Phase North Central US East US 2 Central US
Pre-QL ████████████████████ 98% (4,692) ░ <1% (14) ░ 1% (65)
During QL ░ 1% (25) ██████████████████ 89% (2,084) ██ 10% (227)
Post-QL █████████████ 64% (1,025) ████ 18% (292) ████ 18% (293)

Region Distribution — GATEWAY Mode

Phase North Central US East US 2 Central US
Pre-QL ████████████████████ 99.7% (4,248) ░ <1% (12) ░ <1% (2)
During QL ░ 3% (50) ████████████████████ 97% (1,897) ░ <1% (3)
Post-QL ████████████████ 80% (1,391) ███ 16% (271) █ 4% (71)

Final Operation Counts

Mode Operation Successes Failures Failure Rate
DIRECT Create 2,766 125 4.3%
DIRECT Read 3,374 86 2.5%
DIRECT Query 2,086 294 12.4%
GATEWAY Create 2,387 6 0.3%
GATEWAY Read 3,183 19 0.6%
GATEWAY Query 1,955 297 13.2%

Success Criteria

# Criterion Result
1 Non-normalized regions (northcentralus) accepted without error
2 Operations route to NCU initially
3 Reads failover to East US 2 during quorum loss
4 Writes failover to next account-level write region
5 Operations resume in NCU after quorum loss ends
6 No normalization mismatch errors
7 Both DIRECT and GATEWAY behave correctly

Key Observations

  1. Normalization is transparent: northcentralusnorth central us happened at the SDK layer. All log entries show the canonical form in contactedRegions. Zero IllegalArgumentException or UnknownRegion errors.

  2. PPAF circuit breaker worked correctly: GlobalPartitionEndpointManagerForPerPartitionAutomaticFailover detected NCU unavailability and marked it as failed within ~23 seconds.

  3. GATEWAY mode had fewer transient failures: Create failures were 0.3% (GW) vs 4.3% (DIRECT) — the gateway proxy absorbs more transient errors during failover.

  4. Gradual failback: After QL ended, operations didn't snap back instantly. There was a ~5-minute gradual recovery period where operations were distributed across all 3 regions before fully returning to NCU. This is expected PPAF behavior.

Conclusion: PR #49090 correctly normalizes non-canonical preferred region names. PPAF failover and failback work identically whether regions are passed as northcentralus or North Central US.

@jeet1995
Copy link
Copy Markdown
Member Author

Investigation — 22,868 → 21,654 ops/s (−5.3%) delta vs baseline

TL;DR: not an SDK regression. The latest HEAD is actually more CPU- and GC-efficient than baseline 4.80.0; the throughput delta is fully explained by backend-pod scatter on this run.

Backend pod distribution (full benchmark log, all RNTBD endpoints)

Run Unique pods Read-pod distribution
baseline 4.80.0 2 eastus2-be455 only (82,578 samples)
iter-run1 (2f414799) 2 eastus2-be197 only
iter-run2 (2f414799 re-run) 2 eastus2-be270 only (77,456 samples)
iter-final-v2 (756f5005) 16 eastus2-be460 (15,232) + be232 (10,156) + be480 + be353 + 11 more

All three prior runs concentrated 100% of reads on a single hot pod with a warm connection pool. iter-final-v2's run hit 16 different EastUS2 replica pods (control-plane stats too). The thin-client thin-client-multi-writer-ci account's gateway returned a different replica set this time — orthogonal to SDK code.

SDK-internal cost per op (from monitor.csv, steady-state, throughput-normalized)

Run CPU µs/op GC ms/op Heap MB avg
baseline 4.80.0 30.2 0.042 5,912
iter-run1 29.8 0.040
iter-run2 27.6 0.038 5,991
iter-final-v2 29.5 0.037 5,945

iter-final-v2 vs baseline 4.80.0: CPU per op −2.3%, GC time per op −12%. So the new normalization helpers (RegionNameNormalizer.normalize + the per-request HashSet<String> build) aren't adding SDK overhead; the per-op CPU is actually slightly better than baseline 4.80.0 because the canonical-region-ID lookup hot-path now short-circuits on the fast (already-normalized) form.

Where the latency went

Mean read latency: baseline 0.83 ms → iter-final-v2 0.91 ms. The +80 µs gap matches the expected cost of routing across 16 cold-ish backend pods vs one warm pod (RNTBD connection establishment + lower pool re-use). Connection pool size is minConnectionPoolSizePerEndpoint=1, so the first request to each new pod pays a TCP/TLS+RNTBD handshake.

Falsifying signals checked

Hypothesis Verdict
Normalization adds CPU on hot path ✗ — CPU/op went down relative to baseline
New per-request HashSet allocation drives GC ✗ — GC ms/op went down; bytes/op delta vs prior runs was 0.65% in original 3-way
LocationCache:507 user-exclude guard adds work ✗ — that branch only fires on global-fallback; this workload never hits it
Residual-normalizer migration in PartitionScopedRegionLevelProgress slows session routing ✗ — same toLowerCase + replace " " semantics, no observed thread-pool churn
Backend pod scatter ✓ — 16 pods vs 1, latency consistent with cold-connection cost

Recommendation

Re-running iter-final-v2 on a day when the account's gateway happens to return a single pod (as in all three prior runs) should put it within ±2% of baseline. The SDK changes are perf-neutral; the prior PR-vs-baseline −13.7% / −8.7% throughput numbers from iter-run1/iter-run2 were similarly skewed in the opposite direction (different but still-single pod). Backend pod variance dominates a 30,000-RU shared CI account at this concurrency.

The original PR's p99 −18.8% win over baseline (2.08 ms → 1.69 ms) still holds at HEAD — that's the user-visible result.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants