fix(fetch): add fallback extraction for readability-stripped content #3947
morozow wants to merge 3 commits into modelcontextprotocol:main
Conversation
- Add fallback mechanism when readability extracts minimal content (e.g., SSR hydration divs)
- Implement minimum content length threshold (1% of original HTML) to detect over-aggressive stripping
- Fall back to extraction without readability, then raw HTML markdownify if threshold not met
- Update extract_content_from_html to gracefully handle empty content instead of returning error
- Add comprehensive test suite for fallback scenarios including SSR patterns and hidden content
- Update test_empty_content_returns_error to test graceful handling instead of error message
- Replace single generic test with 11 comprehensive test cases covering fallback scenarios
- Add tests for readability returning sufficient content without fallback
- Add tests for readability stripping content and triggering fallback to non-readability mode
- Add tests for both readability modes failing and falling back to raw markdownify
- Add tests for None content and missing 'content' key edge cases
- Add tests for 1% threshold calculation and boundary conditions
- Add tests for whitespace-only content triggering fallback
- Add tests for stage 2 (no-readability) failures falling back to raw extraction
- Improve test clarity with descriptive names and docstrings explaining each scenario
- Ensure comprehensive coverage of the three-stage fallback extraction pipeline
Thanks for the contribution! This is addressed by #3922, which has already been approved and is awaiting merge. Closing as duplicate.
@olaservo Please review my comments #3922 (comment), #3878 (comment) before making a decision
@claude review
Claude finished @cliffhall's task in 2m 47s
Code Review
This PR adds a three-stage fallback pipeline to `extract_content_from_html`.
Issues
1. Stage 1 correctly uses `ret.get("content", "")`, but stage 2 indexes `ret["content"]` directly:
```python
# Stage 1 (safe):
content_html = ret.get("content", "")
# Stage 2 (unsafe):
ret = readabilipy.simple_json.simple_json_from_html_string(html, use_readability=False)
if ret["content"]:  # ← KeyError if "content" key is absent
```
2. Stage 3 may produce extremely noisy output
The last-resort stage runs markdownify on the raw HTML. A safer stage 3 would strip script and style tags first:
```python
import markdownify
from bs4 import BeautifulSoup

# `html` is the raw page HTML passed to extract_content_from_html.
soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style"]):
    tag.decompose()
content = markdownify.markdownify(str(soup), heading_style=markdownify.ATX)
```
3. The 1% threshold fires on content-light but tag-heavy pages
4. Silent stage transitions
There is no logging or signaling when a fallback is triggered. This makes it hard to diagnose unexpected content degradation in production.
Note on duplicate
This PR was closed as a duplicate of #3922. The author has raised concerns in comments on that PR. Worth verifying that #3922 also addresses the …
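The first and fourth review points (unguarded key access and silent stage transitions) can be illustrated together. The sketch below is not code from the PR: `content_of` and `report_fallback` are hypothetical helpers, and the logger name is an assumption made for this example. It shows one way to read the `content` field without risking a `KeyError` or a `None` value, and to leave a debug trace whenever a stage falls through.

```python
import logging
from typing import Any, Optional

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("mcp_server_fetch.fallback_sketch")


def content_of(ret: Optional[dict[str, Any]]) -> str:
    """Return the extracted HTML, or "" when the result is None,
    lacks a "content" key, or holds a falsy value."""
    if not ret:
        return ""
    return ret.get("content") or ""


def report_fallback(stage: int, extracted: int, threshold: int) -> None:
    """Record that a stage produced too little text and the next one will run."""
    logger.debug(
        "stage %d extracted %d chars (threshold %d); falling back",
        stage, extracted, threshold,
    )
```

Using the same accessor in every stage would remove the asymmetry between stage 1 and stage 2 noted in the review, and a debug-level message keeps normal output quiet while still making fallbacks traceable.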
Description
Add fallback content extraction in `mcp-server-fetch` when Mozilla Readability strips too much content from the page. This fixes silent content loss on sites using progressive SSR with hidden pre-hydration markup.
Issue: #3878
Server Details
- `fetch`
- `extract_content_from_html()` in `src/fetch/src/mcp_server_fetch/server.py`
Motivation and Context
Sites using progressive server-side rendering (streaming + Lambda SSR) deliver content in two phases:
- A loading shell
- Pre-hydration markup hidden with CSS (`visibility:hidden; position:absolute; top:-9999px`) that becomes visible after React hydration
Mozilla Readability treats hidden elements as non-content and strips them. The `fetch` tool receives the full HTML (e.g. 83 KB or 245 KB) but returns only the loading shell text (typically a single line) with no error or warning.
Before fix:
After fix:
Both return full page content — headings, paragraphs, tables, code blocks, navigation.
This pattern is used by Next.js streaming, Remix deferred loaders, and custom SSR architectures. The number of affected sites will grow as progressive SSR adoption increases.
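To make the failure mode concrete, here is a small illustration using only the Python standard library. Everything in it is an assumption made for the example: the sample HTML is invented, and `VisibleTextParser` is a deliberately crude stand-in for a visibility-aware extractor. It is not Mozilla Readability, and it only looks at inline `visibility:hidden` styles.

```python
from html.parser import HTMLParser

# Invented sample page: a visible loading shell plus hidden pre-hydration markup.
SSR_PAGE = (
    "<html><body>"
    '<div id="shell">Loading...</div>'
    '<div style="visibility:hidden; position:absolute; top:-9999px">'
    "<h1>Docs</h1><p>Full article text delivered before hydration.</p>"
    "</div>"
    "</body></html>"
)


class VisibleTextParser(HTMLParser):
    """Collect text, skipping subtrees whose inline style hides them.

    Limitation: nesting depth is tracked naively, so unclosed void
    elements such as <br> inside a hidden subtree are not handled.
    """

    def __init__(self) -> None:
        super().__init__()
        self.hidden_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        style = (dict(attrs).get("style") or "").replace(" ", "")
        if self.hidden_depth or "visibility:hidden" in style:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


print(visible_text(SSR_PAGE))  # Loading...
```

The heading and paragraph are present in the HTML, but an extractor that honors the hidden style sees only the shell text, which matches the single-line result described above.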
How Has This Been Tested?
37 tests total (20 existing + 17 new), all passing.
Tests cover:
- `None` / missing key / whitespace-only → no-readability mode recovers
- `visibility:hidden`, `position:absolute; top:-9999px`, `opacity:0`
- `<error>` tags; function always returns extracted content or empty string, leaving quality judgment to the caller
Manually verified against https://stdiobus.com and https://runtimeweb.com.
Breaking Changes
None. The fix only activates when Readability extracts less than 1% of the HTML size as text. Normal sites where Readability works correctly are unaffected.
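As a rough worked example of the activation rule: the PR description gives the threshold as `max(1, len(html) // 100)`. The function names `min_content_length` and `needs_fallback` below are hypothetical, and stripping whitespace before measuring is an assumption (it is consistent with whitespace-only output triggering the fallback, but the actual comparison in the PR may differ).

```python
def min_content_length(html: str) -> int:
    """1% of the HTML length in characters, never below 1."""
    return max(1, len(html) // 100)


def needs_fallback(text: str, html: str) -> bool:
    """True when the extracted text is shorter than the threshold."""
    # Assumption: surrounding whitespace is ignored when measuring.
    return len(text.strip()) < min_content_length(html)


html = "x" * 83_000  # stand-in for a page of roughly 83 KB
print(min_content_length(html))            # 830
print(needs_fallback("Loading...", html))  # True
print(needs_fallback("a" * 830, html))     # False: exactly at the threshold
print(min_content_length("<p>hi</p>"))     # 1
```

The `max(1, ...)` floor matters for very small inputs: without it, a page under 100 characters would have a threshold of 0 and an empty extraction would never trigger the fallback.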
Types of changes
Checklist
Additional context
The fix adds a three-stage extraction pipeline:
Stages 2 and 3 only activate when stage 1 produces text shorter than 1% of the input HTML (`max(1, len(html) // 100)`). No new dependencies added.
Root cause: `readabilipy.simple_json.simple_json_from_html_string(html, use_readability=True)` invokes Mozilla Readability, which evaluates CSS visibility and discards elements with `visibility:hidden`, `position:absolute; top:-9999px`, or `opacity:0`. These styles are standard in progressive SSR for pre-hydration content delivery.
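The control flow of the three-stage pipeline can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: `extract_with_fallback` is a hypothetical name, the stages are passed in as plain callables so the example runs without `readabilipy` or `markdownify`, a stage that raises is treated as having returned nothing, and the last stage's output is returned even when it is short.

```python
from typing import Callable, Optional, Sequence

# A stage takes raw HTML and returns extracted text (or None).
Stage = Callable[[str], Optional[str]]


def extract_with_fallback(html: str, stages: Sequence[Stage]) -> str:
    """Try each stage in order and accept the first result that meets
    the 1% threshold. The final stage is a last resort: its output is
    returned even if it is short or empty."""
    threshold = max(1, len(html) // 100)
    text = ""
    for index, stage in enumerate(stages):
        is_last = index == len(stages) - 1
        try:
            text = (stage(html) or "").strip()
        except Exception:
            # Assumption: a failing stage counts as an empty result.
            text = ""
        if is_last or len(text) >= threshold:
            return text
    return text  # only reached when no stages were supplied


# Stubs standing in for the three real stages.
def readability_stub(html: str) -> str:
    return "Loading..."  # stage 1: only the loading shell survives


def no_readability_stub(html: str) -> str:
    raise KeyError("content")  # stage 2: simulated failure


def raw_stub(html: str) -> str:
    return "# Docs\n\nFull page content recovered from raw HTML."


page = "<div>" + "x" * 5000 + "</div>"
result = extract_with_fallback(
    page, [readability_stub, no_readability_stub, raw_stub]
)
print(result.splitlines()[0])  # # Docs
```

Passing the stages as callables keeps the ordering and the threshold check in one place, which also makes each fallback path easy to exercise in tests with stubs like the ones above.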