[Cosmos][WIP]: Normalize region names passed as preferred or exclude regions.#49090
[Cosmos][WIP]: Normalize region names passed as preferred or exclude regions.#49090jeet1995 wants to merge 51 commits into
Conversation
Customers passing region names in non-canonical forms (e.g., 'west us 3' instead of 'West US 3') hit routing issues because the Java SDK stores region names in different forms and some comparisons use case-sensitive String.equals()/List.contains(). Changes: - Add RegionNameMapper: strips spaces + case-insensitive lookup against 90+ known Azure regions to produce canonical names (e.g., 'westus3' or 'west us 3' -> 'West US 3'). Unknown regions pass through as-is. - ConnectionPolicy.setPreferredRegions(): normalize + order-preserving dedupe at entry point. - LocationCache constructor: apply RegionNameMapper before toLowerCase for defense-in-depth. - Fix case-sensitive List.contains() bug in reevaluate() (line 502): use containsRegionIgnoreCase() instead. - Normalize user-configured exclude regions at point of use in getApplicableRegionRoutingContexts() to prevent mismatches with PPCB-derived lowercased region names. - Add RegionNameMapperTest with 43 unit tests covering case variants, space removal, passthrough, null/empty handling. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The static region list in RegionNameMapper goes stale when new Azure regions are added. Fix: add a ConcurrentHashMap-backed dynamic tier that learns canonical region names from server responses. - RegionNameMapper.registerRegionName(): registers canonical names from DatabaseAccountLocation (called from LocationCache.addRoutingContexts). After the first account read, even new regions like 'West US 4' can normalize 'westus4' → 'West US 4'. - getCosmosDBRegionName(): checks static map first, then dynamic map. - Add 2 new tests for dynamic registration behavior. - 45/45 RegionNameMapperTest pass, 32/32 LocationCacheTest pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The previous commit had stash conflict markers (<<<<<<< Updated upstream / >>>>>>> Stashed changes) left in RegionNameMapper.java and RegionNameMapperTest.java. Rewrote both files clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Merge the separate RegionNameMapper into RegionNameToRegionIdMap as the single source of truth for region names. This eliminates maintaining two parallel region lists that can drift out of sync. Changes: - Delete RegionNameMapper.java — normalization logic moved into RegionNameToRegionIdMap. - RegionNameToRegionIdMap now provides region ID mapping (existing) AND region name normalization (new) from one canonical list. - Sync REGION_NAME_TO_REGION_ID_MAPPINGS with backend RegionToIdMap.cs: add Bleu France Central/South (107/108), Delos Cloud Germany Central/North (109/110), Singapore Central/North (111/112), fix 'easteurope' → 'East Europe' (54). - Build NORMALIZED_REGION_NAME_TO_REGION_ID_MAPPINGS programmatically from REGION_NAME_TO_REGION_ID_MAPPINGS instead of manual duplication. - Normalization static map seeded from ID map keys + additional regions without IDs yet (from .NET SDK Regions.cs). - Rename test: RegionNameMapperTest → RegionNameToRegionIdMapNormalizationTest. - Update ConnectionPolicy and LocationCache references. - All 78 tests pass (45 normalization + 32 LocationCache + 1 consistency). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add 7 tests to LocationCacheTest using real Azure region names to verify that preferred regions and exclude regions work correctly with non-canonical input: - preferredRegions_lowercaseShouldMatchCanonical: 'west us 3' → West US 3 - preferredRegions_noSpacesShouldMatchCanonical: 'westus3' → West US 3 - preferredRegions_uppercaseShouldMatchCanonical: 'WEST US 3' → West US 3 - preferredRegions_duplicateAfterNormalizationShouldDedupe: 'westus3' + 'West US 3' deduped to single entry - excludeRegions_lowercaseNoSpacesShouldExclude: 'westus3' excludes West US 3 - excludeRegions_mixedCasingShouldExclude: 'EAST us' excludes East US - excludeRegions_requestLevelNoSpacesShouldExclude: request-level 'eastus' excludes East US All 39 LocationCacheTest unit tests pass (32 existing + 7 new). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Create a second CosmosClient with space-stripped preferred regions (e.g., 'westus3' instead of 'west us 3') and verify that routing and region exclusion work identically to canonical names. New tests: - nonCanonicalPreferredRegions_shouldRouteCorrectly: client with space-stripped preferred regions routes to correct first region (7 operation types via DataProvider) - nonCanonicalExcludeRegion_shouldSkipExcludedRegion: excluding with space-stripped name (e.g., 'westus3') correctly skips that region (7 operation types via DataProvider) - uppercaseExcludeRegion_shouldSkipExcludedRegion: excluding with UPPERCASE name correctly skips that region Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add tests that create CosmosClients with space-stripped preferred regions (e.g., 'westus3' instead of 'West US 3') and verify correct routing. FaultInjectionWithAvailabilityStrategyTestsBase: - Add nonCanonicalWriteableRegions field (space-stripped from server names) - readAfterCreation_nonCanonicalPreferredRegions_shouldRouteCorrectly: creates client with space-stripped regions, reads with eager availability strategy, verifies first contacted region matches expected canonical name PerPartitionCircuitBreakerE2ETests: - nonCanonicalPreferredRegions_ppcbShouldStillRouteCorrectly: creates client with space-stripped regions, performs create+read, verifies diagnostics show routing to correct first preferred region Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Simplify RegionNameToRegionIdMap by removing the ConcurrentHashMap-backed dynamic registration tier. Unknown regions are returned as-is, which is sufficient because LocationCache's CaseInsensitiveMap + toLowerCase handles the matching for any region the server returns. - Remove DYNAMIC_NORMALIZED_TO_CANONICAL and registerRegionName() - Remove registerRegionName() call from LocationCache.addRoutingContexts() - Replace dynamic registration tests with passthrough assertion tests - 84/84 tests pass (44 normalization + 39 LocationCache + 1 consistency) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…sible Duplicate preferred regions after normalization (e.g., ['westus3', 'West US 3'] both becoming 'West US 3') are an obvious customer misconfiguration. The SDK should not silently mask this — let the duplicates pass through so the customer can see and fix their config. Also clarify code comments for the escape hatch behavior. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add 10 regions from the authoritative LocationNames.cs that were missing from the normalization map: East US SLV, Southeast US, Southwest US, South Central US 2, Southeast US 3, Southeast US 5, Northeast US 5, India South Central, Southeast Asia 3, West Central US FRE. Region ID mappings remain a subset (only regions with assigned IDs from RegionToIdMap.cs). The normalization map is the superset sourced from LocationNames.cs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace the previous RegionToIdMap.cs-based ID map with the complete regionToIdMapping from Settings.xml (IDs 1-124). This is the authoritative source for region name ↔ ID mappings used for session token region-level progress tracking. - Add 44 new region IDs (74-124): Brazil Southeast, West US 3, Qatar Central, Italy North, East US 3, Saudi Arabia East, etc. - Remove separate 'additional canonical names' block — all canonical names now derive from the ID map since Settings.xml is the superset. - Remove 'Greece Central' which was not in any authoritative source. - Update javadoc and code comments to reference Settings.xml as the authoritative source instead of RegionToIdMap.cs. - 83/83 tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ationCache - Rename class to RegionUtils — better reflects its dual role (ID mapping + region name normalization). - Move normalizeRegionNames() and containsRegionIgnoreCase() from LocationCache private helpers into RegionUtils as public static methods. - Rename all test files to match: RegionUtilsNormalizationTest, RegionUtilsTests. - Update all references across ConnectionPolicy, LocationCache, PartitionScopedRegionLevelProgress, RxDocumentClientImpl. - 83/83 tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@sdkReviewAgent |
There was a problem hiding this comment.
Pull request overview
This PR addresses Cosmos DB routing mismatches caused by non-canonical Azure region name inputs by introducing centralized region normalization and applying it to preferred/excluded region handling across the Cosmos Java SDK routing stack.
Changes:
- Added
RegionUtilsas the single source of truth for region ID mappings and canonical region name normalization, and updated call sites to use it. - Normalized preferred/excluded regions in
ConnectionPolicyandLocationCache, including a fix for a case-sensitive exclude-region check in PPCB reevaluation logic. - Added/updated unit and E2E tests to validate routing behavior with non-canonical region inputs and updated the Cosmos changelog entry.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java | Switches region ID lookup to RegionUtils. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/routing/RegionUtils.java | Introduces region normalization + region ID mapping utilities. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/routing/RegionNameToRegionIdMap.java | Removes the old region mapping class in favor of RegionUtils. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/routing/LocationCache.java | Normalizes excluded regions and fixes PPCB exclude-region comparison behavior. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/PartitionScopedRegionLevelProgress.java | Updates region ID/name lookups to RegionUtils. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ConnectionPolicy.java | Normalizes preferred regions at configuration time. |
| sdk/cosmos/azure-cosmos/CHANGELOG.md | Documents the normalization + PPCB exclude-region fix. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/PerPartitionCircuitBreakerE2ETests.java | Adds E2E coverage for PPCB routing with non-canonical preferred regions. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/routing/RegionUtilsNormalizationTest.java | Adds unit coverage for normalization behavior. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/routing/LocationCacheTest.java | Adds integration-style unit tests for preferred/exclude region normalization with real region names. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/RegionUtilsTests.java | Updates the existing mapping-consistency test to the new RegionUtils. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/FaultInjectionWithAvailabilityStrategyTestsBase.java | Adds E2E validation for availability strategy routing with non-canonical preferred regions. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/ExcludeRegionTests.java | Adds E2E coverage for non-canonical preferred/exclude region inputs. |
| assertThat(diagnosticsContext.getContactedRegionNames().iterator().next()) | ||
| .isEqualTo(expectedFirstRegion); |
|
@sdkReviewAgent |
| return canonical; | ||
| } | ||
|
|
||
| return regionName; |
There was a problem hiding this comment.
I am wondering whether we should fallback to normalized version when not found from the map. Even in globalEndpointManager, we just use the normalized version for findings the regional endpoint
There was a problem hiding this comment.
I feel we should keep behavior for unknown regions as is? Backward compat mindset - GlobalEndpointManager today just relies on case insensitivity (no space trimming).
| * returned as-is.</li> | ||
| * </ol> | ||
| */ | ||
| public class RegionUtils { |
There was a problem hiding this comment.
should we also change to use RegionUtils for GlobalEndpointManager as well
There was a problem hiding this comment.
GlobalEndpointManager takes a dependency on LocationHelper for regional URI construction during bootstrap - modified this to take a dependency on RegionUtils (for normalized region suffixing to URI for topology discovery).
|
@sdkReviewAgent |
| // is keyed by server-form-lowercased, so look it up directly. | ||
| if (Utils.tryGetValue(regionalRoutingContextsByRegionName, internalExcludeRegion, regionalRoutingContextValueHolder)) { | ||
|
|
||
| if (!regionalRoutingContextValueHolder.v.equals(hubRoutingContext)) { |
There was a problem hiding this comment.
🟡 Medium: reevaluate global branch missing user-exclude check
The sibling else branch (line 522–523) was correctly updated to use containsNormalizedRegion to prevent re-adding a user-excluded region. But this if branch (when first applicable is the global routing context) has no such guard.
Scenario: User excludes "East US" ("eastus"), PPCB also excludes "east us". After both exclusions, applicableEndpoints is empty → global fallback used. This branch then iterates internalExcludeRegions, finds "east us" in the forward map, and re-adds the East US endpoint as a retry target — violating the user's explicit exclusion.
Suggested fix: Add the same containsNormalizedRegion check here:
if (!regionalRoutingContextValueHolder.v.equals(hubRoutingContext)
&& !containsNormalizedRegion(userConfiguredExcludeRegions,
RegionUtils.getNormalizedRegionName(internalExcludeRegion))) {
modifiedRegionalRoutingContexts.add(regionalRoutingContextValueHolder.v);
break;
}This review was generated by Copilot and may contain inaccuracies. Please validate any suggestions before applying.
| if (regions == null || regions.isEmpty()) { | ||
| return false; | ||
| } | ||
| String canonicalTarget = getCanonicalRegionName(target); |
There was a problem hiding this comment.
🟡 Low: containsRegionIgnoreCase fails for unknown regions
This method canonicalizes both sides then uses equalsIgnoreCase. For known regions (e.g., "westus3" → "West US 3"), this works. But for unknown regions, getCanonicalRegionName returns the input as-is, so:
containsRegionIgnoreCase(["plutocentral"], "Pluto Central")
// canonicalTarget = "Pluto Central" (as-is, unknown)
// getCanonicalRegionName("plutocentral") = "plutocentral" (as-is, unknown)
// "plutocentral".equalsIgnoreCase("Pluto Central") → false!The method's Javadoc says "all format variants: 'westus3', 'West US 3', 'WEST US 3' all match each other" — this only holds for known regions.
Consider comparing normalized forms instead of canonical forms:
String normalizedTarget = getNormalizedRegionName(target);
// ...
if (region != null && getNormalizedRegionName(region).equals(normalizedTarget)) {This method isn't used in production code today, but the contract mismatch could cause confusion if it gets adopted later.
This review was generated by Copilot and may contain inaccuracies. Please validate any suggestions before applying.
| String normalizedRegionRoutedTo = this.normalizedRegionLookupMap.get(regionRoutedTo); | ||
|
|
||
| int regionId = RegionNameToRegionIdMap.getRegionId(normalizedRegionRoutedTo); | ||
| int regionId = RegionUtils.getRegionId(normalizedRegionRoutedTo); |
There was a problem hiding this comment.
🟡 Low: Residual ad-hoc normalizers not updated to use RegionUtils
This line's value comes from normalizedRegionLookupMap, which is populated two lines above (line 124) via:
this.normalizedRegionLookupMap.computeIfAbsent(regionRoutedTo,
(val) -> val.toLowerCase(Locale.ROOT).trim().replace(" ", ""));This strips spaces but not hyphens or underscores — so "west-us-3" normalizes to "west-us-3", not "westus3" as RegionUtils.getNormalizedRegionName() would produce. Two other files have the same pattern:
PartitionKeyBasedBloomFilter.java:84—toLowerCase().replace(" ", "")RegionScopedSessionContainer.java:464—toLowerCase().trim().replace(" ", "")
Since this PR establishes RegionUtils.getNormalizedRegionName() as the single normalization function, consider migrating these call sites in a follow-up to avoid divergent normalization. (Not blocking — these paths receive server-form names today, which never contain hyphens/underscores.)
This review was generated by Copilot and may contain inaccuracies. Please validate any suggestions before applying.
|
✅ Review complete (39:17) Posted 3 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
Splits the previously monolithic `RegionUtils` into two classes with disjoint responsibilities: - `RegionNameNormalizer` (pure) - lowercase + strip [-_ ]. Used wherever the SDK needs case- and separator-insensitive equality between two region name strings it already has in hand: `LocationCache` endpoint resolution and `LocationHelper` URL construction. Has no static region map, so a region the SDK has not yet learned about still routes correctly. - `RegionIdRegistry` (map-aware) - the only place that depends on the compiled-in region-id table. Exposes `getRegionId(String)` and `getNormalizedRegionNameForId(int)`. Used only by `PartitionScopedRegionLevelProgress` (session-token region-level progress) and `RxDocumentClientImpl` (multi-region readiness probe). This separation removes a foot-gun where overloaded `getNormalizedRegionName(String)` (pure) and `getNormalizedRegionName(int)` (map-dependent, returns "" for unknown IDs) shared a name. It also frees endpoint resolution from the SDK's compile-time region inventory entirely. Dead-code methods `getCanonicalRegionName`, `canonicalizeRegionNames`, and `containsRegionIgnoreCase` are removed; the only callers were the old `RegionUtilsNormalizationTest`, which is dropped. Renames the surviving tests to match the new class names: - `RegionUtilsTests` -> `RegionIdRegistryTests` - `RegionUtilsSettingsXmlValidationTest` -> `RegionIdRegistrySettingsXmlValidationTest` No behavioral change. `mvn -P unit verify` passes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
93cdb82 to
779bbd7
Compare
|
@sdkReviewAgent |
…utingContexts Mirrors the existing guard in the sibling else-branch so that when PPCB drives the path into the global-fallback branch, regions on the user exclude list are not re-added as retry targets. Addresses xinlian12's review comment at LocationCache.java:507. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| * that {@link #getApplicableRegionRoutingContexts} uses for the user-exclude check, so that unknown | ||
| * regions in any input form (e.g., "westus3" vs "West US 3") match consistently. | ||
| */ | ||
| private static boolean containsNormalizedRegion(List<String> userExcludeRegions, String normalizedTarget) { |
There was a problem hiding this comment.
🟡 Recommendation — Design: Pass prebuilt normalizedExcludes set to reevaluate() instead of raw list
getApplicableRegionRoutingContexts already builds a HashSet<String> normalizedExcludes at lines 353–362, but reevaluate() receives the raw userConfiguredExcludeRegions list. This containsNormalizedRegion method then re-normalizes every entry on each call — duplicating the work that was already done.
Why it matters beyond efficiency: This data-flow split is also the root cause of the unguarded global branch at line 507 (raised by @xinlian12). If reevaluate received normalizedExcludes instead of the raw list, adding the user-exclude check to the global branch would be a one-liner (normalizedExcludes.contains(RegionNameNormalizer.normalize(internalExcludeRegion))), and this entire containsNormalizedRegion method could be removed.
Suggested refactoring:
- Change
reevaluate()signature: replaceList<String> userConfiguredExcludeRegions→Set<String> normalizedExcludes - Line 472:
!userConfiguredExcludeRegions.isEmpty()→!normalizedExcludes.isEmpty() - Line 507 (global branch): add
&& !normalizedExcludes.contains(RegionNameNormalizer.normalize(internalExcludeRegion)) - Line 523 (else branch):
!containsNormalizedRegion(...)→!normalizedExcludes.contains(RegionNameNormalizer.normalize(internalExcludeRegion)) - Delete
containsNormalizedRegion()entirely
| } | ||
|
|
||
| @Test(groups = "unit") | ||
| public void preferredRegions_noSpacesShouldMatchCanonical() { |
There was a problem hiding this comment.
🟡 Recommendation — Testing: No routing test exercises hyphenated or underscored preferred regions
The PR title explicitly calls out "west-us-3" and "west_us_3" as newly supported input forms, and RegionNameNormalizer.normalize() strips both characters. But the LocationCacheTest normalization suite only exercises three variants:
"west us 3"(lowercase with spaces) ✅"westus3"(no spaces) ✅"WEST US 3"(uppercase) ✅"west-us-3"(hyphenated) ❌ missing"west_us_3"(underscored) ❌ missing
The hyphen strip is only tested via LocationHelperTest (URL suffix), and underscores are not tested through any routing path at all.
Why it matters: If someone modifies RegionNameNormalizer.normalize() and accidentally drops the .replace("-", "") or .replace("_", "") call, no LocationCacheTest would catch the regression in the routing layer. The LocationHelperTest would catch hyphen removal for URL generation, but that's a different concern than endpoint resolution.
Suggested addition: Two tests mirroring this one:
@Test(groups = "unit")
public void preferredRegions_hyphenatedShouldMatchCanonical() {
LocationCache cache = createCacheWithRealRegions(
Arrays.asList("west-us-3", "east-us", "north-europe"));
// assert same endpoint ordering as canonical
}
@Test(groups = "unit")
public void preferredRegions_underscoredShouldMatchCanonical() {
LocationCache cache = createCacheWithRealRegions(
Arrays.asList("west_us_3", "east_us", "north_europe"));
// assert same endpoint ordering as canonical
}| * Returns the normalized form of a region name, or the input itself if it is null or empty. | ||
| * Pure string transform; never throws and never allocates a string when the input is empty. | ||
| */ | ||
| public static String normalize(String regionName) { |
There was a problem hiding this comment.
🟡 Recommendation — Testing: No dedicated unit test for the foundational normalizer
RegionNameNormalizer.normalize() is the single normalization function that the entire routing pipeline (preferred lookups, exclude matching, PPCB reevaluate, URL suffix generation) depends on. There is no RegionNameNormalizerTest — the function is only exercised indirectly via LocationCacheTest, LocationHelperTest, and RegionIdRegistryTests.
Why it matters: If the normalization contract changes (e.g., someone adds Unicode normalization, changes the StringUtils.isEmpty guard to handle whitespace-only differently, or reorders the replace calls), there's no safety net that tests the normalizer's own behavior in isolation. Key untested edge cases:
normalize(null)→ should returnnull(current behavior viaStringUtils.isEmpty)normalize("")→ should return""normalize(" ")→ produces""(the space is stripped)normalize("East_US-2")→ should produce"eastus2"(mixed separators)- Idempotency:
normalize(normalize("West US 3")) == normalize("West US 3")
The Python SDK has a dedicated TestNormalizeRegionName class with explicit collision-safety tests (e.g., verifying "East US" ≠ "East US 2" after normalization). Adding the same would be cheap and high-value.
|
✅ Review complete (53:41) Posted 3 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
Replaces inline toLowerCase + replace patterns in PartitionScopedRegionLevelProgress, PartitionKeyBasedBloomFilter, and RegionScopedSessionContainer with RegionNameNormalizer.normalize so all region-name normalization in azure-cosmos uses a single helper. Addresses xinlian12's PartitionScopedRegionLevelProgress comment (non-blocking) and removes the now-unused java.util.Locale imports. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
@sdkReviewAgent |
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
|
||
| for (Map.Entry<String, Integer> sourceEntry : RegionNameToRegionIdMap.REGION_NAME_TO_REGION_ID_MAPPINGS.entrySet()) { | ||
| for (Map.Entry<String, Integer> sourceEntry : RegionIdRegistry.CANONICAL_REGION_NAME_TO_REGION_ID_MAPPINGS.entrySet()) { | ||
| String normalizedRegionNameFromSource = sourceEntry.getKey().toLowerCase(Locale.ROOT).replace(" ", "").trim(); |
There was a problem hiding this comment.
🔴 Correctness — Test: RegionIdRegistryTests uses divergent normalization from production code
This test normalizes canonical region names using the old algorithm:
sourceEntry.getKey().toLowerCase(Locale.ROOT).replace(" ", "").trim()But the production NORMALIZED_REGION_NAME_TO_REGION_ID_MAPPINGS is built using RegionNameNormalizer.normalize(), which additionally strips hyphens and underscores:
regionName.toLowerCase(Locale.ROOT).replace(" ", "").replace("-", "").replace("_", "")Why this matters: The test currently passes only because no canonical region name contains hyphens or underscores. But the test is validating map consistency using a different normalization function than the one that built the map — so it's providing false confidence. If a future region like "Delos-Cloud Germany West" (which is already in the registry at ID 109) had a hyphen in its canonical name, the test would normalize it to "delos-cloudgermanywest" while the production map key would be "deloscloudgermanywest". The assertion at line 29 would fail with a misleading error about map inconsistency.
Fix: Replace line 24 with:
String normalizedRegionNameFromSource = RegionNameNormalizer.normalize(sourceEntry.getKey());This makes the test prove the actual derivation path rather than an approximation of it.
| // PPCB marks "east us" (server-returned, lowercased with spaces) as unavailable | ||
| RxDocumentServiceRequest request = RxDocumentServiceRequest.create( | ||
| mockDiagnosticsClientContext(), OperationType.Read, ResourceType.Document); | ||
| request.requestContext.setExcludeRegions(Arrays.asList("eastus")); |
There was a problem hiding this comment.
🔴 Correctness — Test: Regression test does not exercise the containsNormalizedRegion fix
This test claims to verify that the reevaluate path doesn't re-add a user-excluded region when PPCB also excluded it. But it cannot reach the containsNormalizedRegion code (lines 511/527) due to the early-return guard at line 440.
Trace through the test scenario:
- 3 regions configured, user excludes 1 (
"eastus") - After user-exclude filtering: 2 endpoints remain (West US + Central US)
- PPCB exclude
"east us"matches East US, but it's already removed →endpointsRemovedByInternalExcludeRegionsis empty applicableRegionalRoutingContexts.size() == 2→reevaluatereturns at line 440 beforecontainsNormalizedRegionis ever called
The test passes for the wrong reason — it validates the main exclude loop normalization, not the reevaluate path. The specific PPCB bug (user-excluded region silently re-added as retry target) has zero unit test coverage.
Fix: Exclude 2 of 3 regions so only 1 remains after filtering, forcing reevaluate to enter its inner logic:
request.requestContext.setExcludeRegions(Arrays.asList("eastus", "westus"));
request.requestContext.setUnavailableRegionsForPerPartitionCircuitBreaker(
Arrays.asList("east us"));Then assert containsNormalizedRegion blocks "east us" from being re-added (West US was user-excluded, not PPCB-excluded, so the reevaluate loop won't try to re-add it — only "east us" would be attempted).
| * Fast path: tries the raw key first (zero allocations when input is already normalized). | ||
| * Slow path: normalizes via {@link RegionNameNormalizer#normalize(String)} and retries. | ||
| */ | ||
| public static int getRegionId(String regionName) { |
There was a problem hiding this comment.
🟡 Recommendation — Testing: No unit tests for getRegionId() or getNormalizedRegionNameForId() public API
RegionIdRegistryTests validates internal map consistency, and RegionIdRegistrySettingsXmlValidationTest validates sync with Settings.xml. But neither exercises the public API methods that callers actually invoke.
getRegionId() has non-trivial branching logic (fast path → slow path with normalization fallback) and getNormalizedRegionNameForId() has an empty-string sentinel for unknown IDs. None of these behaviors are directly tested.
Suggested addition:
@Test(groups = {"unit"})
public void getRegionIdVariousFormats() {
// Fast path (already normalized)
assertThat(RegionIdRegistry.getRegionId("eastus")).isEqualTo(1);
// Slow path (canonical form → normalize → retry)
assertThat(RegionIdRegistry.getRegionId("East US")).isEqualTo(1);
// Hyphenated input
assertThat(RegionIdRegistry.getRegionId("east-us")).isEqualTo(1);
// Underscored input
assertThat(RegionIdRegistry.getRegionId("east_us")).isEqualTo(1);
// Edge cases
assertThat(RegionIdRegistry.getRegionId(null)).isEqualTo(-1);
assertThat(RegionIdRegistry.getRegionId("")).isEqualTo(-1);
assertThat(RegionIdRegistry.getRegionId("Pluto Central")).isEqualTo(-1);
// Reverse lookup
assertThat(RegionIdRegistry.getNormalizedRegionNameForId(1)).isEqualTo("eastus");
assertThat(RegionIdRegistry.getNormalizedRegionNameForId(999)).isEqualTo("");
}This pins the fast-path/slow-path contract and the edge-case sentinels.
| private UnmodifiableList<String> availableReadLocations; | ||
| private UnmodifiableMap<String, RegionalRoutingContext> availableWriteRegionalRoutingContextsByRegionName; | ||
| private UnmodifiableMap<String, RegionalRoutingContext> availableReadRegionalRoutingContextsByRegionName; | ||
| // Parallel maps keyed by normalized region name (lowercase, no spaces/separators). Used |
There was a problem hiding this comment.
🟢 Suggestion — Performance: CaseInsensitiveMap is redundant for normalized-keyed maps
The availableXxxRegionalRoutingContextsByNormalizedRegionName maps are initialized as CaseInsensitiveMap, but all keys inserted come from RegionNameNormalizer.normalize() which already lowercases. All lookup keys also go through RegionNameNormalizer.normalize() before query (line 906, 615, 659). The case-insensitive comparison in CaseInsensitiveMap.convertKey() allocates a char[] and does per-character lowercasing on every get/containsKey — work that's always a no-op on already-lowercase strings.
Using HashMap for the normalized maps would:
- Avoid the redundant per-lookup allocation
- Self-document the invariant that keys are pre-normalized (if someone accidentally passes a non-normalized key, the lookup fails fast rather than silently matching)
This is defense-in-depth vs. efficiency+clarity. Not blocking — the current approach is correct, just slightly wasteful.
| // mostPreferredLocation came from preferredLocations (normalized form) or | ||
| // effectivePreferredLocations (server-form). Look up against the parallel | ||
| // normalized map after normalizing the key to handle both cases. | ||
| String normalizedMostPreferredLocation = RegionNameNormalizer.normalize(mostPreferredLocation); |
There was a problem hiding this comment.
🟢 Suggestion — Maintainability: Document the mixed-form contract between preferredLocations and effectivePreferredLocations
This normalization call is needed because mostPreferredLocation might come from either:
preferredLocations— already normalized (via theDatabaseAccountLocationsInfoconstructor at line 1148-1150)effectivePreferredLocations— server-form (e.g.,"West US 3", populated from rawgetName()at line 992)
The two lists represent the same concept (ordered region preference) but are in different string forms. This works correctly today because you normalize here before lookup, but a future developer adding a direct lookup against the normalized map using effectivePreferredLocations without normalizing first would get a silent miss.
Consider adding a field-level comment on effectivePreferredLocations (line 1123) documenting that it stores server-form (not normalized) region names, unlike preferredLocations which stores normalized form. This costs nothing and prevents a subtle future bug.
|
✅ Review complete (16:58) Posted 5 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
Perf benchmark results —
|
| Metric | baseline azure-cosmos:4.80.0 |
iter-run1 (PR head 2f414799) |
iter-run2 (PR head re-run) | iter-final-v2 (HEAD 756f5005) |
|---|---|---|---|---|
| Total Doc/Read | 41,058,558 | 35,417,782 | 37,491,069 | 37,787,407 |
| Throughput (ops/s) | 22,868 | 19,730 | 20,879 | 21,655 |
| Mean (ms) | 0.83 | 0.97 | 0.92 | 0.91 |
| p50 (ms) | 0.00 | 0.00 | 0.00 | 0.00 |
| p90 (ms) | 0.00 | 0.00 | 0.00 | 0.00 |
| p95 (ms) | 1.00 | 1.21 | 1.15 | 1.29 |
| p99 (ms) | 2.08 | 1.54 | 1.54 | 1.69 |
| Max (ms) | 139.58 | 89.54 | 251.27 | 193.70 |
Micrometer histogram buckets coarser than 1 ms — so p50/p90 are reported as 0 across all runs; only p95+ are meaningful here.
Reads
- 100% routed to
cdb-ms-prod-eastus2-*— the hyphenated exclude"WEst-centRAL-us"is normalized correctly viaRegionNameNormalizer.normalizeand drops the WCUS endpoint from the applicable set. - Throughput within 5% of the PR head re-runs (
+3.7%vs iter-run2,−5.3%vs baseline 4.80.0 — both inside the prior iter-run1↔iter-run2 noise band of+5.85%). - p99 is
1.69 msvs baseline2.08 ms(−18.8%) — the original PR p99 win holds at the latest HEAD. - No allocation, GC pause, or thread-count anomalies vs the prior runs.
What's in 756f5005 since 2f414799 (the head these comparisons were against before)
Refactor RegionUtils into pure normalizer + map-aware registry(779bbd7281d) — split intoRegionNameNormalizer+RegionIdRegistry, deletedRegionUtils.Honor user-exclude in global-fallback branch of getApplicableRegionRoutingContexts(90a6e5833be) — closes the gap @xinlian12 flagged onLocationCache.java:507.Migrate residual ad-hoc region normalizers to RegionNameNormalizer(756f5005a3a) — migratesPartitionScopedRegionLevelProgress,PartitionKeyBasedBloomFilter,RegionScopedSessionContainerfrom inlinetoLowerCase + replaceto the shared helper.
Verdict: the latest changes are perf-neutral and the original PR's −26% p99 win is preserved.
Note: the benchmark module on this branch was missing
excludedRegionsListwiring (the field onTenantDefaultConfigand thesetExcludedRegionscall inAsyncReadBenchmarkalready exist onorigin/main). I patched those two files on the VM build only — happy to cherry-pick them into a commit on this branch in a separate PR if folks would find it useful for future perf testing.
🧪 PPAF DR Drill Report — Region Name Normalization ValidationEnvironment: Test61 | Date: 2026-05-30 UTC | Result: ✅ ALL 7 CRITERIA PASSED Test Setup
Timelinegantt
title PPAF DR Drill Timeline (UTC)
dateFormat HH:mm
axisFormat %H:%M
section Workloads
DIRECT workload :active, d1, 02:38, 03:18
GATEWAY workload :active, g1, 02:40, 03:20
section Fault Injection
Quorum Loss (NCU) :crit, ql, 02:59, 03:09
section DIRECT Routing
NCU baseline :done, db, 02:38, 02:59
Failover to EU2 :crit, df, 02:59, 03:09
Gradual failback to NCU :milestone, 03:10, 03:15
NCU restored :done, dr, 03:15, 03:18
section GATEWAY Routing
NCU baseline :done, gb, 02:40, 02:59
Failover to EU2 :crit, gf, 02:59, 03:09
Gradual failback to NCU :milestone, 03:10, 03:14
NCU restored :done, gr, 03:14, 03:20
Failover latency: ~23 seconds | Failback: ~5 minutes after QL end Region Distribution — DIRECT Mode%%{init: {'theme': 'neutral'}}%%
xychart-beta
title "DIRECT Mode — Contacted Region % by Phase"
x-axis ["Pre-QL", "During QL", "Post-QL"]
y-axis "% of Operations" 0 --> 100
bar [98, 1, 64]
bar [0.3, 89, 18]
bar [1.4, 10, 18]
Region Distribution — GATEWAY Mode
Final Operation Counts
Success Criteria
Key Observations
|
Investigation —
|
| Run | Unique pods | Read-pod distribution |
|---|---|---|
| baseline 4.80.0 | 2 | eastus2-be455 only (82,578 samples) |
iter-run1 (2f414799) |
2 | eastus2-be197 only |
iter-run2 (2f414799 re-run) |
2 | eastus2-be270 only (77,456 samples) |
iter-final-v2 (756f5005) |
16 | eastus2-be460 (15,232) + be232 (10,156) + be480 + be353 + 11 more |
All three prior runs concentrated 100% of reads on a single hot pod with a warm connection pool. iter-final-v2's run hit 16 different EastUS2 replica pods (control-plane stats too). The thin-client thin-client-multi-writer-ci account's gateway returned a different replica set this time — orthogonal to SDK code.
SDK-internal cost per op (from monitor.csv, steady-state, throughput-normalized)
| Run | CPU µs/op | GC ms/op | Heap MB avg |
|---|---|---|---|
| baseline 4.80.0 | 30.2 | 0.042 | 5,912 |
| iter-run1 | 29.8 | 0.040 | — |
| iter-run2 | 27.6 | 0.038 | 5,991 |
| iter-final-v2 | 29.5 | 0.037 | 5,945 |
iter-final-v2 vs baseline 4.80.0: CPU per op −2.3%, GC time per op −12%. So the new normalization helpers (RegionNameNormalizer.normalize + the per-request HashSet<String> build) aren't adding SDK overhead; the per-op CPU is actually slightly better than baseline 4.80.0 because the canonical-region-ID lookup hot-path now short-circuits on the fast (already-normalized) form.
Where the latency went
Mean read latency: baseline 0.83 ms → iter-final-v2 0.91 ms. The +80 µs gap matches the expected cost of routing across 16 cold-ish backend pods vs one warm pod (RNTBD connection establishment + lower pool re-use). Connection pool size is minConnectionPoolSizePerEndpoint=1, so the first request to each new pod pays a TCP/TLS+RNTBD handshake.
Falsifying signals checked
| Hypothesis | Verdict |
|---|---|
| Normalization adds CPU on hot path | ✗ — CPU/op went down relative to baseline |
New per-request HashSet allocation drives GC |
✗ — GC ms/op went down; bytes/op delta vs prior runs was 0.65% in original 3-way |
| LocationCache:507 user-exclude guard adds work | ✗ — that branch only fires on global-fallback; this workload never hits it |
| Residual-normalizer migration in PartitionScopedRegionLevelProgress slows session routing | ✗ — same toLowerCase + replace " " semantics, no observed thread-pool churn |
| Backend pod scatter | ✓ — 16 pods vs 1, latency consistent with cold-connection cost |
Recommendation
Re-running iter-final-v2 on a day when the account's gateway happens to return a single pod (as in all three prior runs) should put it within ±2% of baseline. The SDK changes are perf-neutral; the prior PR-vs-baseline −13.7% / −8.7% throughput numbers from iter-run1/iter-run2 were similarly skewed in the opposite direction (different but still-single pod). Backend pod variance dominates a 30,000-RU shared CI account at this concurrency.
The original PR's p99 −18.8% win over baseline (2.08 ms → 1.69 ms) still holds at HEAD — that's the user-visible result.
Problem
Customers passing region names in non-canonical forms (
"westus3","WEST US 3","west-us-3","west_us_3") hit routing failures because the SDK stored regions in inconsistent shapes and some code paths used case-sensitiveString.equals()/List.contains()against server-returned names like"west us 3". A related bug in PPCB re-evaluation silently re-added user-excluded regions as retry targets when the case didn't match.Definitions
"West US 3","East US". Sourced from Settings.xmlregionToIdMapping."westus3". So"West US 3","WEST-US-3","west_us_3", and"westus3"all normalize to"westus3". Used as internal routing-map key and URL suffix.getName().toLowerCase(Locale.ROOT):"west us 3"— lowercase, spaces preserved.Design rationale
One source of truth.
RegionUtilsowns the canonical ↔ normalized ↔ region-id mapping. Static map keyed by canonical names, derived maps built once with fail-fast on duplicates.getCanonicalRegionName(x)returns canonical form for known regions and the input as-is for unknown ones;getNormalizedRegionName(x)always returns the normalized form.Public API preserves customer input.
ConnectionPolicy.setPreferredRegions(...)andCosmosExcludedRegionsstore the customer's raw input verbatim. Diagnostics report what the customer supplied. All canonicalization is internal toLocationCache.Two-map routing inside
LocationCache. The forward endpoint map is now stored in two parallel forms:availableXxxRegionalRoutingContextsByRegionName— keyed by server-form-lowercased ("west us 3"). Unchanged; preserves the contract used byresolveServiceEndpoint,resolveFaultInjectionEndpoint, the PPCB internal-exclude path, and the publicgetRegionNamediagnostic surface.availableXxxRegionalRoutingContextsByNormalizedRegionName— keyed bygetNormalizedRegionName(serverName)("westus3"). Used only for customer-supplied preferred-region lookups ingetPreferredAvailableRoutingContextsandshouldRefreshEndpoints.DatabaseAccountLocationsInfo.preferredLocationsis stored in normalized form so the lookup is a constant-time hit.This split means any input variant — canonical, normalized, hyphenated, underscored, or an unknown region in no-space form — resolves consistently for routing, while diagnostics keep returning the server form.
Exclude regions are pre-normalized once per request inside
getApplicableRegionRoutingContexts(nulls dropped), then matched in normalized form against the normalized server-region name. No per-request cache, no unbounded growth, no NPE on null entries.Bug fix: PPCB re-evaluate
LocationCache.reevaluate()previously compared user-excluded regions with case-sensitiveList.contains(), so"West US 3".contains("west us 3") returnedfalseand a user-excluded region was silently re-added as a retry target when PPCB had also excluded it. Now matched via normalized form on both sides.Tests
RegionUtilsNormalizationTest, 20 inLocationCacheTest(including unknown-region routing tests), 1 PPCB re-evaluate regression inApplicableRegionEvaluatorTest, ID-map consistency check inRegionUtilsTests. Full azure-cosmos unit suite — 2469 tests — passes.azure-cosmos-tests): non-canonical preferred + exclude + fault injection across 7 op types inExcludeRegionTests; non-canonical + eager availability strategy + 503 hedging; non-canonical + PPCB + 503 circuit-breaker failover.RegionNameNormalizationE2EITest, new): shared test underazure-cosmos-spark_3/.../com/azure/cosmos/spark/, exercised against bothazure-cosmos-spark_3-5_2-12(Spark 3.5 LTS) andazure-cosmos-spark_4-1_2-13(Spark 4.1) via the existingcopy-shared-test-sourcesplumbing. Two scenarios: (1) probe round-trip — discover the SDK's default canonical region fromCosmosDiagnostics, feed back its normalized form, assert routing pins to canonical; (2) differential — pick a readable region distinct from the probe default, normalize and feed aspreferredRegionsList, assert reads route to the canonical secondary (proving routing moved as a result of mapper canonicalization). Captured via a test-scopedCosmosDiagnosticsHandlerregistered throughTestCosmosClientBuilderInterceptor.