Skip to content

fix(fetch): add fallback extraction for readability-stripped content#3947

Closed
morozow wants to merge 3 commits intomodelcontextprotocol:mainfrom
morozow:fix/fetch-readability-hidden-content-fallback
Closed

fix(fetch): add fallback extraction for readability-stripped content#3947
morozow wants to merge 3 commits intomodelcontextprotocol:mainfrom
morozow:fix/fetch-readability-hidden-content-fallback

Conversation

@morozow
Copy link
Copy Markdown

@morozow morozow commented Apr 14, 2026

Description

Add fallback content extraction in mcp-server-fetch when Mozilla Readability strips too much content from the page. This fixes silent content loss on sites using progressive SSR with hidden pre-hydration markup.
Issue: #3878

Server Details

  • Server: fetch
  • Changes to: extract_content_from_html() in src/fetch/src/mcp_server_fetch/server.py

Motivation and Context

Sites using progressive server-side rendering (streaming + Lambda SSR) deliver content in two phases:

  1. An immediate visible loading shell (small HTML)
  2. SSR-rendered content in a hidden container (visibility:hidden; position:absolute; top:-9999px) that becomes visible after React hydration

Mozilla Readability treats hidden elements as non-content and strips them. The fetch tool receives the full HTML (e.g. 83 KB or 245 KB) but returns only the loading shell text — typically a single line — with no error or warning.

Before fix:

fetch("https://runtimeweb.com")
→ "Unified Serverless Framework for Full-Stack TypeScript Applications"
   (1 line from 245 KB)

fetch("https://stdiobus.com/spec")
→ "Transport-level routing for MCP/ACP protocols"
   (1 line from 83 KB)

After fix:
Both return full page content — headings, paragraphs, tables, code blocks, navigation.

This pattern is used by Next.js streaming, Remix deferred loaders, and custom SSR architectures. The number of affected sites will grow as progressive SSR adoption increases.

How Has This Been Tested?

37 tests total (20 existing + 17 new), all passing.

Tests cover:

  • Happy path — Readability works, no fallback triggered, including threshold boundary
  • Stage 2 fallback — Readability returns too little / None / missing key / whitespace-only → no-readability mode recovers
  • Stage 3 fallback — both readabilipy modes fail → raw markdownify recovers content
  • Real-world SSR patterns — visibility:hidden, position:absolute; top:-9999px, opacity:0
  • Edge cases — empty input, very small HTML, degenerate markup
  • Contract guarantee — no code path produces <error> tags; function always returns extracted content or empty string, leaving quality judgment to the caller
  • Full cascade verification — confirms exact call sequence and stage transitions

Manually verified against https://stdiobus.com and https://runtimeweb.com.

curl -s https://stdiobus.com/ | wc -c        # ~83965 bytes
curl -s https://runtimeweb.com/ | wc -c      # ~245608 bytes
curl -s https://stdiobus.com/ | grep -c "visibility:hidden"   # 1

Breaking Changes

None. The fix only activates when Readability extracts less than 1% of the HTML size as text. Normal sites where Readability works correctly are unaffected.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Protocol Documentation
  • My changes follows MCP security best practices
  • I have updated the server's README accordingly
  • I have tested this with an LLM client
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have documented all environment variables and configuration options

Additional context

The fix adds a three-stage extraction pipeline:

  1. Readability (existing) — works for most sites
  2. readabilipy without Readability JS (new fallback) — less aggressive, doesn't filter by visibility
  3. Raw markdownify (new last resort) — converts full HTML directly

Stage 2 and 3 only activate when stage 1 produces text shorter than 1% of the input HTML (max(1, len(html) // 100)). No new dependencies added.

Root cause: readabilipy.simple_json.simple_json_from_html_string(html, use_readability=True) invokes Mozilla Readability which evaluates CSS visibility and discards elements with visibility:hidden, position:absolute; top:-9999px, opacity:0. These styles are standard in progressive SSR for pre-hydration content delivery.

morozow added 2 commits April 9, 2026 01:50
- Add fallback mechanism when readability extracts minimal content (e.g., SSR hydration divs)
- Implement minimum content length threshold (1% of original HTML) to detect over-aggressive stripping
- Fall back to extraction without readability, then raw HTML markdownify if threshold not met
- Update extract_content_from_html to gracefully handle empty content instead of returning error
- Add comprehensive test suite for fallback scenarios including SSR patterns and hidden content
- Update test_empty_content_returns_error to test graceful handling instead of error message
- Replace single generic test with 11 comprehensive test cases covering fallback scenarios
- Add tests for readability returning sufficient content without fallback
- Add tests for readability stripping content and triggering fallback to non-readability mode
- Add tests for both readability modes failing and falling back to raw markdownify
- Add tests for None content and missing 'content' key edge cases
- Add tests for 1% threshold calculation and boundary conditions
- Add tests for whitespace-only content triggering fallback
- Add tests for stage 2 (no-readability) failures falling back to raw extraction
- Improve test clarity with descriptive names and docstrings explaining each scenario
- Ensure comprehensive coverage of the three-stage fallback extraction pipeline
@olaservo
Copy link
Copy Markdown
Member

Thanks for the contribution! This is addressed by #3922 which has already been approved and is awaiting merge. Closing as duplicate.

@olaservo olaservo closed this Apr 18, 2026
@morozow
Copy link
Copy Markdown
Author

morozow commented Apr 18, 2026

Thanks for the contribution! This is addressed by #3922 which has already been approved and is awaiting merge. Closing as duplicate.

@olaservo Please review my comments #3922 (comment), #3878 (comment) before making a decision

@cliffhall cliffhall added bug Something isn't working server-fetch Reference implementation for the Fetch MCP server - src/fetch labels Apr 20, 2026
@cliffhall
Copy link
Copy Markdown
Member

@claude review

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 21, 2026

Claude finished @cliffhall's task in 2m 47s —— View job


Code Review

  • Gather context and understand request
  • Read changed files and diff
  • Analyze code quality, correctness, and design
  • Post review feedback

This PR adds a three-stage fallback pipeline to extract_content_from_html() to handle progressive SSR sites where Mozilla Readability strips visibility-hidden content. The motivation is well-documented and addresses a real problem. Overall the approach is sound, but there are a few concrete issues worth addressing — particularly one that could cause a runtime error.

Issues

1. KeyError risk in stage 2 fallback (server.py, line ~58)

Stage 1 correctly uses ret.get("content", ""), but stage 2 reverts to direct key access:

# Stage 1 (safe):
content_html = ret.get("content", "")

# Stage 2 (unsafe):
ret = readabilipy.simple_json.simple_json_from_html_string(html, use_readability=False)
if ret["content"]:   # ← KeyError if "content" key is absent

If readabilipy with use_readability=False ever returns a dict without a "content" key, this raises an unhandled KeyError. The test suite covers stage 1 missing key (test_readability_missing_content_key_triggers_fallback) but doesn't cover the same scenario for stage 2. Should be ret.get("content") for consistency and safety. Fix this →

2. Stage 3 may produce extremely noisy output

The last-resort markdownify.markdownify(html, ...) converts the raw HTML including all inline <script> and <style> blocks. While markdownify strips some tags, large pages with embedded JavaScript (e.g. Next.js bundles) can produce thousands of lines of noise. For a 245KB page this could generate output far larger than the max_length limit, and the LLM will see JS/CSS fragments in the extracted content.

A safer stage 3 would strip script and style tags first:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style"]):
    tag.decompose()
content = markdownify.markdownify(str(soup), heading_style=markdownify.ATX)

However, bs4 is already a transitive dependency via readabilipy, so no new dependency is required. Fix this →

3. The 1% threshold fires on content-light but tag-heavy pages

min_length = max(1, len(html) // 100) compares extracted text length against HTML byte length. A page that is 80KB of <script> bundles + 2KB of readable article text would have min_length = 819, but Readability might correctly extract only 400 chars of clean article text — triggering an unnecessary fallback. The heuristic works well for the SSR case but can produce regressions on JS-heavy single page apps where Readability is already doing the right thing.

4. Silent stage transitions

There is no logging or signaling when a fallback is triggered. This makes it hard to diagnose unexpected content degradation in production. Even a logging.debug(...) when stages 2 or 3 activate would help operators.


Positives

  • No new dependencies — the fix is entirely within the existing readabilipy/markdownify stack.
  • Breaking the <error> return — removing "<error>Page failed to be simplified from HTML</error>" is strictly better; returning an empty string is cleaner and easier for callers to handle.
  • Test coverage — 17 new tests are well-structured and cover the cascade contract, boundary conditions, and SSR patterns comprehensively. The test_all_three_stages_called_when_needed test that verifies the exact call sequence is particularly good.
  • The core fallback logic is correct — the three-stage approach (readability → readabilipy without readability JS → raw HTML) is a reasonable escalation ladder.

Note on duplicate

This PR was closed as a duplicate of #3922. The author has raised concerns in comments on that PR. Worth verifying that #3922 also addresses the KeyError safety issue (stage 2 ret["content"]) before merging the duplicate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working server-fetch Reference implementation for the Fetch MCP server - src/fetch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants