Fix JSONDecodeError when JSON-LD script is HTML-entity-encoded #249

Open
gaoflow wants to merge 1 commit into scrapinghub:master from gaoflow:fix/html-entities-in-jsonld

gaoflow commented Jun 25, 2026

Problem

Some websites incorrectly HTML-encode the content of application/ld+json script tags, producing sequences like &quot; or &#34; instead of literal ". When lxml extracts the text of such a <script> element it preserves those entities as-is, so json.loads receives invalid JSON and raises a JSONDecodeError.

Minimal reproduction (fixes #208):

from extruct.jsonld import JsonLdExtractor

html = b'''<html><body>
<script type="application/ld+json">
{&quot;@context&quot;:&quot;http://schema.org/&quot;,&quot;@type&quot;:&quot;Product&quot;}
</script></body></html>'''

JsonLdExtractor().extract(html)
# Before fix:
# json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

Fix

Call html.unescape() on the extracted script text before passing it to json.loads. This converts &quot; to ", &amp; to &, &#34; to ", etc. Since html.unescape is a no-op on text that contains no entities, existing JSON-LD without entity sequences is unaffected.

The change is a 2-line addition to _extract_items in extruct/jsonld.py.
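
A minimal sketch of the idea, using only the standard library. The helper name `parse_jsonld_text` is hypothetical and stands in for the parse step inside `_extract_items`; the actual diff may be shaped differently.

```python
import html
import json


def parse_jsonld_text(script_text):
    """Hypothetical helper: decode HTML entities, then parse the JSON."""
    # html.unescape turns &quot; / &#34; into " and &amp; into &.
    # Text without entities is returned unchanged.
    return json.loads(html.unescape(script_text))


# Entity-encoded payload, as in the reproduction above.
encoded = (
    "{&quot;@context&quot;:&quot;http://schema.org/&quot;,"
    "&quot;@type&quot;:&quot;Product&quot;}"
)
data = parse_jsonld_text(encoded)
print(data["@type"])  # Product

# Plain JSON passes through untouched.
print(parse_jsonld_text('{"a": 1}'))  # {'a': 1}
```

Unescaping happens before parsing, so the JSON parser only ever sees the decoded text.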

Tests

Added test_jsonld_with_html_entities covering:

  • Named entity &quot; for double-quotes
  • Numeric reference &#34; for double-quotes
  • &amp; in a string value
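
The three cases can be sketched roughly as follows. This is not the test added in the PR: it uses a hypothetical stand-in, `decode_and_parse`, instead of `JsonLdExtractor`, and the sample payloads are made up for illustration.

```python
import html
import json


def decode_and_parse(script_text):
    # Hypothetical stand-in for the extractor's unescape-then-parse step.
    return json.loads(html.unescape(script_text))


def check_entity_cases():
    # Named entity &quot; used for the double quotes.
    named = decode_and_parse("{&quot;name&quot;: &quot;Widget&quot;}")
    assert named == {"name": "Widget"}

    # Numeric reference &#34; used for the double quotes.
    numeric = decode_and_parse("{&#34;name&#34;: &#34;Widget&#34;}")
    assert numeric == {"name": "Widget"}

    # &amp; inside an otherwise valid string value.
    amp = decode_and_parse('{"name": "Tom &amp; Jerry"}')
    assert amp == {"name": "Tom & Jerry"}


check_entity_cases()
```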

This pull request was prepared with the assistance of AI, under my direction and review.

Some sites incorrectly HTML-encode the content of application/ld+json
script tags (e.g. &quot; instead of "), causing json.loads to raise a
JSONDecodeError. Apply html.unescape() before parsing so named entities
like &quot; and &amp;, and numeric references like &#34;, are decoded
to their character equivalents before the JSON parser sees them.

Fixes scrapinghub#208

Successfully merging this pull request may close these issues.

&quot; in application/ld+json gives exception