
fix(fetch): fall back when Readability strips hidden SSR content#3922

Open
Christian-Sidak wants to merge 1 commit into modelcontextprotocol:main from Christian-Sidak:fix/fetch-ssr-content-fallback

Conversation

@Christian-Sidak

Summary

  • Adds a three-stage fallback to extract_content_from_html() so that pages using progressive SSR (hidden pre-hydration markup) are not silently reduced to a single line of loading-shell text
  • Stage 1: Readability (existing behavior, unchanged for normal sites)
  • Stage 2: readabilipy without Readability JS (less aggressive, does not filter by CSS visibility)
  • Stage 3: Raw markdownify conversion (last resort)
  • Fallback only activates when Readability output is shorter than 1% of the input HTML
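A minimal sketch of the staged fallback logic, with the extraction stages passed in as callables (the real implementation calls readabilipy and markdownify; the stage functions here are illustrative stand-ins):

```python
from typing import Callable

def extract_with_fallback(html: str, stages: list[Callable[[str], str]]) -> str:
    """Try each extraction stage in order; accept the first result whose
    text length is at least 1% of the input HTML length."""
    min_expected = max(1, len(html) // 100)
    result = ""
    for stage in stages:
        result = stage(html)
        if len(result.strip()) >= min_expected:
            return result
    return result  # last resort: whatever the final stage produced
```

Because the threshold is proportional rather than a fixed constant, it scales with page size: a tiny page needs only a tiny extraction to pass, while a large SSR shell page forces the fall-through.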

Motivation

Sites using progressive server-side rendering (Next.js streaming, Remix deferred, custom Lambda SSR) deliver content in two phases: a small visible loading shell, then the real content in a hidden container (visibility:hidden; position:absolute; top:-9999px) that becomes visible after client-side hydration. Mozilla Readability treats hidden elements as non-content and strips them entirely, causing mcp-server-fetch to return only the loading shell text with no indication that content was lost.

For example, fetching https://runtimeweb.com returns just "Unified Serverless Framework for Full-Stack TypeScript Applications" instead of the full page content.
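For illustration, a hypothetical page of that shape makes the failure mode concrete: the visible shell text falls far below 1% of the HTML size, so the proposed fallback would trigger (the markup below is invented for the example, not taken from runtimeweb.com):

```python
# Hypothetical progressive-SSR page: tiny visible shell, real content hidden
# until client-side hydration makes it visible.
article = "<p>" + "Real article text. " * 500 + "</p>"
html = (
    "<html><body>"
    '<div id="shell">Loading...</div>'
    '<div style="visibility:hidden;position:absolute;top:-9999px">'
    + article
    + "</div></body></html>"
)

shell_text = "Loading..."              # roughly what Readability keeps here
min_expected = max(1, len(html) // 100)
print(len(shell_text) < min_expected)  # True: the fallback would activate
```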

Changes

  • src/fetch/src/mcp_server_fetch/server.py: Modified extract_content_from_html() to try three extraction stages, falling back only when the previous stage produces disproportionately little text
  • src/fetch/tests/test_server.py: Added 6 unit tests covering all fallback paths, threshold behavior, and no-regression for normal pages

Breaking Changes

None. The fix only activates when Readability extracts less than 1% of the HTML size as text. Normal sites where Readability works correctly are completely unaffected.

Test plan

  • All 6 new fallback tests pass
  • Existing tests unaffected (they test Readability directly via Node.js, independent of this change)
  • No new dependencies added

Fixes #3878

Add a three-stage extraction pipeline to extract_content_from_html():

1. Readability (existing, best quality for standard pages)
2. readabilipy without Readability JS (less aggressive, no CSS visibility filtering)
3. Raw markdownify conversion (last resort)

Stages 2 and 3 only activate when stage 1 produces text shorter than 1% of the
input HTML length, which indicates Readability stripped meaningful content. This
commonly happens with progressive SSR sites that deliver content in hidden
containers (visibility:hidden, position:absolute) awaiting client-side hydration.

No new dependencies. No behavior change for sites where Readability works correctly.

Fixes modelcontextprotocol#3878
Member

@olaservo left a comment


This is the strongest of the three Readability fallback PRs (#3879, #3894, #3922):

  • 3-stage pipeline (Readability → readabilipy without Readability → raw markdownify) gives a good quality gradient
  • Proportional 1% threshold scales with page size, unlike a fixed constant
  • Preserves the <error> return for truly empty pages (no behavior change)
  • 6 fully-mocked deterministic tests with good edge case coverage
  • Smallest diff to production code (only 4 lines removed)

We'll close the other two PRs with credit to @morozow for filing the original issue (#3878).


This review was assisted by Claude Code.

@morozow

morozow commented Apr 14, 2026

@Christian-Sidak @olaservo I analyzed both implementations (#3922, #3879) in the context of the MCP fetch server and the expected contract of a transport-level content extractor.

The current variant (#3922) introduces opinionated validation via length thresholds, .strip(), and a hard <error> state. This breaks neutrality: it can discard valid outputs (e.g., whitespace-only content), produces non-data responses, and makes the result dependent on heuristics tied to HTML size. In MCP terms, this violates separation of concerns: the fetch layer should not decide what is "good enough" content.

The #3879 variant behaves as a proper extraction primitive: it attempts readability, falls back when needed, and always returns the extracted content without enforcing interpretation or artificial failure states. This keeps the pipeline predictable for agents and preserves full fidelity of the source, which is critical for downstream processing.

The #3879 variant is the correct approach for MCP fetch, as it maintains a clean transport contract and avoids embedding policy/validation logic into the extraction layer.

This PR needs either refactoring or a revert to the implementation from #3879/#3947 to be eligible for merge and to properly resolve the issue described in #3878.

@morozow

morozow commented Apr 14, 2026

@olaservo Updated tests for full edge-case coverage and reopened as #3947

@Christian-Sidak
Author

Good points, @morozow -- I agree the length threshold and <error> wrapping add opinionation that doesn't belong in the transport layer. Happy to strip those heuristics and just return whatever readabilipy/markdownify produces without any filtering. Note that #3947 was closed -- is there a preferred path forward, or would a simplified version of this PR work?

@cliffhall added the labels bug (Something isn't working) and server-fetch (Reference implementation for the Fetch MCP server - src/fetch) on Apr 20, 2026
@cliffhall
Member

@claude review

@claude
Contributor

claude Bot commented Apr 21, 2026

Claude finished @cliffhall's task in 1m 48s


Code Review


This PR addresses a genuine problem: Mozilla Readability strips visibility:hidden SSR containers, leaving only loading-shell text. The three-stage pipeline is a reasonable approach. Below are my observations.


src/fetch/src/mcp_server_fetch/server.py

Threshold heuristic (line 45)

min_expected_length = max(1, len(html) // 100)

This is the most debatable part, as morozow noted. Consider a minimalist landing page with 80 KB of inlined JavaScript but only a few hundred bytes of real visible text — Readability's small, clean output is correct, but this threshold would discard it and fall through to Stage 2 or Stage 3, potentially injecting raw JS noise. The 1% line blurs the boundary between extraction and quality validation. If the concern is specifically about SSR hidden containers, a targeted approach (detecting near-zero output for large HTML) might be more defensible, but any threshold is inherently opinionated.
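The false-positive risk follows directly from the threshold arithmetic (the sizes below are hypothetical):

```python
# A legitimate minimalist page: 80 KB of inlined JavaScript, but only a few
# hundred characters of real visible text.
html_size = 80_000
readability_output_len = 400             # correct, clean extraction

min_expected = max(1, html_size // 100)
print(min_expected)                           # 800
print(readability_output_len < min_expected)  # True: fallback fires anyway
```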

Stage 3: raw markdownify on full HTML (lines 74–80)

content = markdownify.markdownify(html, heading_style=markdownify.ATX)

markdownify does not strip <script> or <style> blocks — it converts them verbatim to plain text. If Stage 3 activates, the returned Markdown may include large blobs of JavaScript or CSS that are noisy for downstream LLM processing. Consider filtering those tags before applying markdownify:

import markdownify
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style"]):
    tag.decompose()
content = markdownify.markdownify(str(soup), heading_style=markdownify.ATX)

BeautifulSoup is already an indirect dependency (via readabilipy), so this wouldn't add a new dependency.

Code duplication (lines 47–72)

Stages 1 and 2 share identical structure: call simple_json_from_html_string, extract content, markdownify, and check the threshold. A small helper function would reduce duplication and make it easier to add a Stage 4 later if needed.

ret.get("content", "") (lines 51, 65)

Changing from ret["content"] to ret.get("content", "") is a safe improvement — it avoids a KeyError if readabilipy ever returns a dict without content. Good.


src/fetch/tests/test_server.py

The 6 new tests cover the happy path (no fallback), each fallback stage, the None content case, and threshold math. That's solid coverage.

Minor: unusual call_count pattern (lines 386–392)

call_count = [0]
def mock_simple_json(h, use_readability=True):
    call_count[0] += 1
    if call_count[0] == 1:

Using a list as a mutable counter is a Python 2 workaround. In Python 3.10+ (the minimum for this server), nonlocal is cleaner:

call_count = 0
def mock_simple_json(h, use_readability=True):
    nonlocal call_count
    call_count += 1
    ...

Or pass side_effect a list of return values: mock.side_effect = [{"content": None}, {"content": ...}].
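The side_effect form is fully self-contained with unittest.mock; the return values below are illustrative:

```python
from unittest.mock import Mock

# Each call consumes the next value from side_effect, so no counter is needed.
mock_simple_json = Mock(side_effect=[
    {"content": None},                   # first call: Readability finds nothing
    {"content": "<p>real content</p>"},  # second call: readabilipy fallback
])

first = mock_simple_json("<html></html>", use_readability=True)
second = mock_simple_json("<html></html>", use_readability=False)
```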

test_threshold_is_one_percent_of_html (line 398)

The assertion len(result.strip()) > 50 is intentionally loose (it only confirms the fallback fired), but it would be more valuable to assert on the exact content returned (e.g., that the "x" * 9000 payload appears in the output), making regressions more visible.


Overall

The core fix is sound and the tests are well-structured. The two actionable issues worth addressing before merge are:

  1. Stage 3 <script>/<style> noise — low effort, high value fix
  2. The 1% threshold — the existing discussion around whether heuristic quality checks belong in the transport layer is valid; consider whether a simpler "always fall through if Readability returns empty/None" condition (without the length comparison) would satisfy the SSR use case without the false-positive risk

Linked issue (may be closed by merging): mcp-server-fetch drops SSR content from streaming/progressive rendering sites