fix(fetch): add fallback extraction for readability-stripped content by morozow · Pull Request #3947 · modelcontextprotocol/servers

morozow · 2026-04-14T12:29:01Z

Description

Add fallback content extraction in mcp-server-fetch when Mozilla Readability strips too much content from the page. This fixes silent content loss on sites using progressive SSR with hidden pre-hydration markup.
Issue: #3878

Server Details

Server: fetch
Changes to: extract_content_from_html() in src/fetch/src/mcp_server_fetch/server.py

Motivation and Context

Sites using progressive server-side rendering (streaming + Lambda SSR) deliver content in two phases:

An immediate visible loading shell (small HTML)
SSR-rendered content in a hidden container (visibility:hidden; position:absolute; top:-9999px) that becomes visible after React hydration

Mozilla Readability treats hidden elements as non-content and strips them. The fetch tool receives the full HTML (e.g. 83 KB or 245 KB) but returns only the loading shell text — typically a single line — with no error or warning.

Before fix:

fetch("https://runtimeweb.com")
→ "Unified Serverless Framework for Full-Stack TypeScript Applications"
   (1 line from 245 KB)

fetch("https://stdiobus.com/spec")
→ "Transport-level routing for MCP/ACP protocols"
   (1 line from 83 KB)

After fix:
Both return full page content — headings, paragraphs, tables, code blocks, navigation.

This pattern is used by Next.js streaming, Remix deferred loaders, and custom SSR architectures. The number of affected sites will grow as progressive SSR adoption increases.

How Has This Been Tested?

37 tests total (20 existing + 17 new), all passing.

Tests cover:

Happy path — Readability works, no fallback triggered, including threshold boundary
Stage 2 fallback — Readability returns too little / None / missing key / whitespace-only → no-readability mode recovers
Stage 3 fallback — both readabilipy modes fail → raw markdownify recovers content
Real-world SSR patterns — visibility:hidden, position:absolute; top:-9999px, opacity:0
Edge cases — empty input, very small HTML, degenerate markup
Contract guarantee — no code path produces <error> tags; function always returns extracted content or empty string, leaving quality judgment to the caller
Full cascade verification — confirms exact call sequence and stage transitions

Manually verified against https://stdiobus.com and https://runtimeweb.com.

curl -s https://stdiobus.com/ | wc -c        # ~83965 bytes
curl -s https://runtimeweb.com/ | wc -c      # ~245608 bytes
curl -s https://stdiobus.com/ | grep -c "visibility:hidden"   # 1

Breaking Changes

None. The fix only activates when Readability extracts less than 1% of the HTML size as text. Normal sites where Readability works correctly are unaffected.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update

Checklist

I have read the MCP Protocol Documentation
My changes follows MCP security best practices
I have updated the server's README accordingly
I have tested this with an LLM client
My code follows the repository's style guidelines
New and existing tests pass locally
I have added appropriate error handling
I have documented all environment variables and configuration options

Additional context

The fix adds a three-stage extraction pipeline:

Readability (existing) — works for most sites
readabilipy without Readability JS (new fallback) — less aggressive, doesn't filter by visibility
Raw markdownify (new last resort) — converts full HTML directly

Stage 2 and 3 only activate when stage 1 produces text shorter than 1% of the input HTML (max(1, len(html) // 100)). No new dependencies added.

Root cause: readabilipy.simple_json.simple_json_from_html_string(html, use_readability=True) invokes Mozilla Readability which evaluates CSS visibility and discards elements with visibility:hidden, position:absolute; top:-9999px, opacity:0. These styles are standard in progressive SSR for pre-hydration content delivery.

- Add fallback mechanism when readability extracts minimal content (e.g., SSR hydration divs) - Implement minimum content length threshold (1% of original HTML) to detect over-aggressive stripping - Fall back to extraction without readability, then raw HTML markdownify if threshold not met - Update extract_content_from_html to gracefully handle empty content instead of returning error - Add comprehensive test suite for fallback scenarios including SSR patterns and hidden content - Update test_empty_content_returns_error to test graceful handling instead of error message

- Replace single generic test with 11 comprehensive test cases covering fallback scenarios - Add tests for readability returning sufficient content without fallback - Add tests for readability stripping content and triggering fallback to non-readability mode - Add tests for both readability modes failing and falling back to raw markdownify - Add tests for None content and missing 'content' key edge cases - Add tests for 1% threshold calculation and boundary conditions - Add tests for whitespace-only content triggering fallback - Add tests for stage 2 (no-readability) failures falling back to raw extraction - Improve test clarity with descriptive names and docstrings explaining each scenario - Ensure comprehensive coverage of the three-stage fallback extraction pipeline

olaservo · 2026-04-18T02:43:10Z

Thanks for the contribution! This is addressed by #3922 which has already been approved and is awaiting merge. Closing as duplicate.

morozow · 2026-04-18T06:03:53Z

Thanks for the contribution! This is addressed by #3922 which has already been approved and is awaiting merge. Closing as duplicate.

@olaservo Please review my comments #3922 (comment), #3878 (comment) before making a decision

cliffhall · 2026-04-21T14:30:50Z

@claude review

claude · 2026-04-21T14:31:08Z

Claude finished @cliffhall's task in 2m 47s —— View job

Code Review

Gather context and understand request
Read changed files and diff
Analyze code quality, correctness, and design
Post review feedback

This PR adds a three-stage fallback pipeline to extract_content_from_html() to handle progressive SSR sites where Mozilla Readability strips visibility-hidden content. The motivation is well-documented and addresses a real problem. Overall the approach is sound, but there are a few concrete issues worth addressing — particularly one that could cause a runtime error.

Issues

1. KeyError risk in stage 2 fallback (server.py, line ~58)

Stage 1 correctly uses ret.get("content", ""), but stage 2 reverts to direct key access:

# Stage 1 (safe):
content_html = ret.get("content", "")

# Stage 2 (unsafe):
ret = readabilipy.simple_json.simple_json_from_html_string(html, use_readability=False)
if ret["content"]:   # ← KeyError if "content" key is absent

If readabilipy with use_readability=False ever returns a dict without a "content" key, this raises an unhandled KeyError. The test suite covers stage 1 missing key (test_readability_missing_content_key_triggers_fallback) but doesn't cover the same scenario for stage 2. Should be ret.get("content") for consistency and safety. Fix this →

2. Stage 3 may produce extremely noisy output

The last-resort markdownify.markdownify(html, ...) converts the raw HTML including all inline <script> and <style> blocks. While markdownify strips some tags, large pages with embedded JavaScript (e.g. Next.js bundles) can produce thousands of lines of noise. For a 245KB page this could generate output far larger than the max_length limit, and the LLM will see JS/CSS fragments in the extracted content.

A safer stage 3 would strip script and style tags first:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style"]):
    tag.decompose()
content = markdownify.markdownify(str(soup), heading_style=markdownify.ATX)

However, bs4 is already a transitive dependency via readabilipy, so no new dependency is required. Fix this →

3. The 1% threshold fires on content-light but tag-heavy pages

min_length = max(1, len(html) // 100) compares extracted text length against HTML byte length. A page that is 80KB of <script> bundles + 2KB of readable article text would have min_length = 819, but Readability might correctly extract only 400 chars of clean article text — triggering an unnecessary fallback. The heuristic works well for the SSR case but can produce regressions on JS-heavy single page apps where Readability is already doing the right thing.

4. Silent stage transitions

There is no logging or signaling when a fallback is triggered. This makes it hard to diagnose unexpected content degradation in production. Even a logging.debug(...) when stages 2 or 3 activate would help operators.

Positives

No new dependencies — the fix is entirely within the existing readabilipy/markdownify stack.
Breaking the <error> return — removing "<error>Page failed to be simplified from HTML</error>" is strictly better; returning an empty string is cleaner and easier for callers to handle.
Test coverage — 17 new tests are well-structured and cover the cascade contract, boundary conditions, and SSR patterns comprehensively. The test_all_three_stages_called_when_needed test that verifies the exact call sequence is particularly good.
The core fallback logic is correct — the three-stage approach (readability → readabilipy without readability JS → raw HTML) is a reasonable escalation ladder.

Note on duplicate

This PR was closed as a duplicate of #3922. The author has raised concerns in comments on that PR. Worth verifying that #3922 also addresses the KeyError safety issue (stage 2 ret["content"]) before merging the duplicate.

morozow added 2 commits April 9, 2026 01:50

morozow mentioned this pull request Apr 14, 2026

fix(fetch): fall back when Readability strips hidden SSR content #3922

Open

3 tasks

Merge branch 'main' into fix/fetch-readability-hidden-content-fallback

9cee4c7

olaservo closed this Apr 18, 2026

morozow mentioned this pull request Apr 18, 2026

mcp-server-fetch drops SSR content from streaming/progressive rendering sites #3878

Open

cliffhall added bug Something isn't working server-fetch Reference implementation for the Fetch MCP server - src/fetch labels Apr 20, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(fetch): add fallback extraction for readability-stripped content#3947

fix(fetch): add fallback extraction for readability-stripped content#3947
morozow wants to merge 3 commits intomodelcontextprotocol:mainfrom
morozow:fix/fetch-readability-hidden-content-fallback

morozow commented Apr 14, 2026

Uh oh!

olaservo commented Apr 18, 2026

Uh oh!

morozow commented Apr 18, 2026 •

edited

Loading

Uh oh!

cliffhall commented Apr 21, 2026

Uh oh!

claude Bot commented Apr 21, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

morozow commented Apr 14, 2026

Description

Server Details

Motivation and Context

How Has This Been Tested?

Breaking Changes

Types of changes

Checklist

Additional context

Uh oh!

olaservo commented Apr 18, 2026

Uh oh!

morozow commented Apr 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cliffhall commented Apr 21, 2026

Uh oh!

claude Bot commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review

Issues

Positives

Note on duplicate

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

morozow commented Apr 18, 2026 •

edited

Loading

claude Bot commented Apr 21, 2026 •

edited

Loading