
fix: normalize OpenAI multi-part content in _get_events_for_messages #1805

Open

octo-patch wants to merge 1 commit into NVIDIA-NeMo:develop from octo-patch:fix/issue-1741-multipart-content-normalization

Conversation


@octo-patch octo-patch commented Apr 21, 2026

Fixes #1741

Problem

When the OpenAI chat completions API sends messages with multi-part content (e.g. "content": [{"type": "text", "text": "Hello"}]), llmrails.py stored the raw list directly in event text fields without normalizing to a string. This caused two failures:

  1. Garbled prompts — self-check and intent-matching prompts received the Python repr of the list ([{'type': 'text', 'text': '...'}]) instead of the actual user text.
  2. TypeError crash — in multi-turn conversations where mask_prev_user_message fires, get_colang_history() called rsplit() on the list and raised TypeError: must be str or None, not list.

Root cause: _get_events_for_messages set "final_transcript": msg["content"] and "text": msg["content"] without checking whether content is a list.
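
For reference, a minimal illustration of the message shape involved (the values and the image URL are made up for the example):

```python
# Illustrative only: the OpenAI-style multi-part content shape that triggered both failures.
msg = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Hello"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
    ],
}

# Stored verbatim, the list reaches prompts as its Python repr rather than the user text:
print(msg["content"])  # [{'type': 'text', 'text': 'Hello'}, {'type': 'image_url', ...}]
```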

Solution

Add a static helper method LLMRails._extract_text_content(content) that:

  • Returns plain strings unchanged.
  • For list content, joins all {"type": "text"} parts with a space (image and other non-text parts are silently ignored, consistent with how the rest of the codebase handles multimodal content).

Apply the helper to every user-message content assignment in _get_events_for_messages for both the Colang 1.0 and 2.0 code paths.
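
The intended behavior, condensed into a few illustrative assertions (they mirror the unit tests added in this PR rather than quoting them; the exact implementation is in the diff below):

```python
from nemoguardrails import LLMRails

# Illustrative expectations for the new helper.
assert LLMRails._extract_text_content("hello") == "hello"
assert LLMRails._extract_text_content(None) == ""
assert (
    LLMRails._extract_text_content(
        [
            {"type": "text", "text": "First part."},
            {"type": "image_url", "image_url": {"url": "https://example.com/img.png"}},
            {"type": "text", "text": "Second part."},
        ]
    )
    == "First part. Second part."
)
```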

Testing

  • Added unit tests for _extract_text_content (plain string, single text part, multiple text parts, image parts ignored).
  • Added integration tests verifying that generate_async no longer crashes with single-turn and multi-turn multi-part content.
  • All 6 new tests pass; existing test_llmrails.py tests unaffected.

Summary by CodeRabbit

  • Bug Fixes
    • Improved handling of multi-part message content formats to prevent crashes and enhance compatibility with OpenAI messages
  • Tests
    • Added comprehensive test coverage for message content normalization across various input formats

fix: normalize OpenAI multi-part content in _get_events_for_messages (fixes NVIDIA-NeMo#1741)

When the OpenAI chat completions API is used with multi-part content
(e.g. `"content": [{"type": "text", "text": "…"}]`), llmrails.py was
storing the raw list in event text fields instead of a plain string.
This caused two problems:

1. Self-check / intent-matching LLM prompts received the Python repr
   of the list instead of the actual user text.
2. In multi-turn conversations `get_colang_history()` crashed with
   `TypeError: must be str or None, not list` at the `rsplit()` call.

Fix: add a static helper `LLMRails._extract_text_content()` that joins
all `{"type": "text"}` parts into a single string (silently ignoring
image and other non-text parts), and apply it to every place in
`_get_events_for_messages` where user-message content is stored in an
event field — for both Colang 1.0 and 2.0 paths.

Tests added for the helper function and for the two failure modes
described in the issue.
Contributor

greptile-apps Bot commented Apr 21, 2026

Greptile Summary

This PR fixes issue #1741 by adding LLMRails._extract_text_content() to normalize OpenAI multi-part content lists (e.g. [{"type": "text", "text": "…"}]) into plain strings before they are written into Colang events, preventing both garbled prompts and the TypeError crash in multi-turn flows.

The fix is applied consistently across all four user-message content sites in both the Colang 1.0 and 2.0 paths. Unit and integration tests cover the new behavior well.

Confidence Score: 5/5

Safe to merge — all remaining findings are minor style suggestions with no correctness impact.

The fix is narrowly scoped, covers all affected call sites, and is backed by both unit and integration tests. No P0/P1 issues were found.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemoguardrails/rails/llm/llmrails.py | Adds _extract_text_content static helper and applies it at all four user-message content assignment sites in both Colang 1.0 and 2.0 paths of _get_events_for_messages; minor edge case with empty text-part keys. |
| tests/test_llmrails.py | Adds 4 unit tests for _extract_text_content and 2 async integration tests covering single-turn and multi-turn multi-part content; tests are functional but each unit test creates an unnecessary LLMRails instance for a static method. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["msg['content'] received"] --> B{isinstance list?}
    B -- Yes --> C["Filter parts where type == 'text'"]
    C --> D["join text values with space"]
    D --> E["Return normalized string"]
    B -- No --> F{content is None?}
    F -- Yes --> G["Return ''"]
    F -- No --> H["Return content as-is"]
    E --> I["Set final_transcript / text in event"]
    G --> I
    H --> I
Reviews (1): Last reviewed commit: "fix: normalize OpenAI multi-part content..."

Comment on lines +611 to +615
return " ".join(
part.get("text", "")
for part in content
if isinstance(part, dict) and part.get("type") == "text"
)

P2 Malformed text parts yield leading/trailing spaces

When a content part has "type": "text" but is missing the "text" key, part.get("text", "") returns "", and " ".join(["", "hello"]) produces " hello" (a leading space). Filtering out empty parts before joining avoids this:

Suggested change
-            return " ".join(
-                part.get("text", "")
-                for part in content
-                if isinstance(part, dict) and part.get("type") == "text"
-            )
+            return " ".join(
+                part["text"]
+                for part in content
+                if isinstance(part, dict) and part.get("type") == "text" and part.get("text")
+            )
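
A quick REPL-style illustration of the difference the suggestion makes (standalone snippet, outside the class):

```python
# A text part with no "text" key contributes an empty string to the join, producing a stray space.
parts = [{"type": "text"}, {"type": "text", "text": "hello"}]

current = " ".join(
    p.get("text", "") for p in parts if isinstance(p, dict) and p.get("type") == "text"
)
suggested = " ".join(
    p["text"] for p in parts if isinstance(p, dict) and p.get("type") == "text" and p.get("text")
)

print(repr(current))    # ' hello'  (leading space)
print(repr(suggested))  # 'hello'
```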

Comment thread tests/test_llmrails.py
Comment on lines +1486 to +1526
def test_extract_text_content_string():
    """Plain strings should be returned unchanged."""
    config = RailsConfig.from_content(config={"models": []})
    rails = LLMRails(config=config, llm=FakeLLM(responses=[]))

    assert rails._extract_text_content("hello") == "hello"
    assert rails._extract_text_content("") == ""
    assert rails._extract_text_content(None) == ""


def test_extract_text_content_list_single_text_part():
    """A single-part list with type=text should return the text string."""
    config = RailsConfig.from_content(config={"models": []})
    rails = LLMRails(config=config, llm=FakeLLM(responses=[]))

    content = [{"type": "text", "text": "Hello there"}]
    assert rails._extract_text_content(content) == "Hello there"


def test_extract_text_content_list_multiple_text_parts():
    """Multiple text parts should be joined with a space."""
    config = RailsConfig.from_content(config={"models": []})
    rails = LLMRails(config=config, llm=FakeLLM(responses=[]))

    content = [
        {"type": "text", "text": "First part."},
        {"type": "text", "text": "Second part."},
    ]
    assert rails._extract_text_content(content) == "First part. Second part."


def test_extract_text_content_list_skips_image_parts():
    """Image parts should be silently ignored; only text parts are returned."""
    config = RailsConfig.from_content(config={"models": []})
    rails = LLMRails(config=config, llm=FakeLLM(responses=[]))

    content = [
        {"type": "image_url", "image_url": {"url": "http://example.com/img.png"}},
        {"type": "text", "text": "Describe this image."},
    ]
    assert rails._extract_text_content(content) == "Describe this image."

P2 Unit tests instantiate LLMRails unnecessarily for a @staticmethod

_extract_text_content is a static method and has no dependency on instance state, so each of the four unit tests creating a full RailsConfig + LLMRails adds setup overhead without benefit. Calling it as a class method is cleaner:

assert LLMRails._extract_text_content("hello") == "hello"

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!


coderabbitai Bot commented Apr 21, 2026

📝 Walkthrough


The PR adds a static method to normalize OpenAI multi-part message content format into single strings. It updates message event creation to use extracted text instead of raw content lists, preventing TypeError crashes and ensuring prompts receive actual text rather than list representations.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Content Normalization: nemoguardrails/rails/llm/llmrails.py | Added _extract_text_content() static method to convert OpenAI multi-part content lists (e.g., [{"type": "text", "text": "..."}]) into space-joined strings, with fallback to empty string for None. Updated _get_events_for_messages() to normalize content via this method when populating UtteranceUserActionFinished.final_transcript, UserMessage.text, and related fields for both Colang 1.0 and 2.x message handling. |
| Test Coverage: tests/test_llmrails.py | Added comprehensive test cases for content extraction: plain string passthrough, None handling, single and multi-part text lists, and mixed-type part filtering. Added integration tests verifying generate_async() handles multi-part content without crashing in single-turn and multi-turn conversation scenarios. |
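
For orientation, a hedged sketch of where the normalization now happens (the event type and field name are taken from the walkthrough above; _build_user_event is a hypothetical name for illustration, and the sketch assumes the helper from this PR branch is importable):

```python
from nemoguardrails import LLMRails

def _build_user_event(msg: dict) -> dict:
    # Hypothetical wrapper; the real code is inline in _get_events_for_messages.
    # Normalize multi-part content to a plain string before it is stored in the event.
    text = LLMRails._extract_text_content(msg["content"])
    return {"type": "UtteranceUserActionFinished", "final_transcript": text}

event = _build_user_event({"role": "user", "content": [{"type": "text", "text": "Hello"}]})
assert event["final_transcript"] == "Hello"
```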

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: normalizing OpenAI multi-part content in the _get_events_for_messages method, which directly addresses the PR's core fix. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue #1741 by adding _extract_text_content to normalize multi-part content, updating all affected code paths in _get_events_for_messages, and including comprehensive tests covering unit and integration scenarios. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to fixing issue #1741: the new helper method, its application to user message handling, and comprehensive tests for both Colang 1.0 and 2.x paths. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 80.00%, which is sufficient; the required threshold is 80.00%. |
| Test Results For Major Changes | ✅ Passed | PR qualifies for passing: changes are minor (localized bug fix with +22/-4 net code changes) and test results are explicitly documented in the PR description. |


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/test_llmrails.py (1)

1530-1565: Consider asserting normalized transcript content (not only no-crash).

These tests currently prove stability, but they don’t explicitly prove that prompt-facing user text is normalized to a string. Adding an assertion on generated user events/history text would lock in the behavior targeted by issue #1741.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `tests/test_llmrails.py` around lines 1530 - 1565, add assertions that validate the
normalized user text is a plain string, not only that generate_async doesn't crash: after
calling LLMRails.generate_async (in tests test_generate_async_multipart_user_content and
test_generate_async_multipart_multi_turn_no_type_error), inspect the colang history or the
generated user events and assert that the entries for the user messages ("Hello!", "Hello
again") are present as str values (e.g., isinstance(entry["content"], str)) and contain the
expected text. This locks in the normalization behavior that the tests target and prevents
regressions.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: fa0f3665-8ca7-4984-a421-0e6393fc89c3

📥 Commits

Reviewing files that changed from the base of the PR and between b9cd0fb and b2d9121.

📒 Files selected for processing (2)
  • nemoguardrails/rails/llm/llmrails.py
  • tests/test_llmrails.py

Comment on lines +601 to +617
    @staticmethod
    def _extract_text_content(content) -> str:
        """Normalize an OpenAI message content value to a plain string.

        The OpenAI spec allows ``content`` to be either a plain string or a list
        of content parts (e.g. ``[{"type": "text", "text": "…"}, …]``). When a
        list is received, extract and join all ``text`` parts so the rest of the
        pipeline always operates on a regular string.
        """
        if isinstance(content, list):
            return " ".join(
                part.get("text", "")
                for part in content
                if isinstance(part, dict) and part.get("type") == "text"
            )
        return content if content is not None else ""


⚠️ Potential issue | 🟡 Minor

Ensure _extract_text_content always returns str on all paths.

At Line 616, non-list non-None values are returned as-is, which can leak non-string types into final_transcript / text and break downstream string operations again.

Proposed hardening
 @staticmethod
 def _extract_text_content(content) -> str:
@@
-    if isinstance(content, list):
-        return " ".join(
-            part.get("text", "")
-            for part in content
-            if isinstance(part, dict) and part.get("type") == "text"
-        )
-    return content if content is not None else ""
+    if content is None:
+        return ""
+
+    if isinstance(content, str):
+        return content
+
+    if isinstance(content, list):
+        parts: List[str] = []
+        for part in content:
+            if isinstance(part, dict) and part.get("type") == "text":
+                text = part.get("text", "")
+                if text is None:
+                    continue
+                parts.append(text if isinstance(text, str) else str(text))
+        return " ".join(parts)
+
+    return str(content)
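
To make the concern concrete, a small illustration (assumes the helper from this PR branch is importable; the value 42 is made up):

```python
from nemoguardrails import LLMRails

# With the implementation as proposed in this PR, a non-list, non-None value passes through as-is:
value = LLMRails._extract_text_content(42)
print(type(value), repr(value))  # <class 'int'> 42  -- a non-str can reach final_transcript / text
# With the hardened version above, the same call would return the string "42".
```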
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `nemoguardrails/rails/llm/llmrails.py` around lines 601 - 617, the
_extract_text_content function can return non-string values when content is
neither a list nor None; modify _extract_text_content so every return value is
explicitly a str: when content is a list, keep the existing join but coerce each
part's .get("text", "") to str before joining, and for the non-list branch
return str(content) if content is not None (otherwise ""), guaranteeing
_extract_text_content always yields a Python str for downstream uses of
final_transcript / text.


codecov Bot commented Apr 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@Pouyanpi Pouyanpi force-pushed the develop branch 4 times, most recently from a6be550 to c69efe5 on May 6, 2026 at 16:01

Development

Successfully merging this pull request may close these issues.

bug: Multi-part content format (OpenAI spec) causes garbled prompts and TypeError crash

1 participant