RFC: Architectural Shift – Decoupling DOM Parsing from Heuristic Scoring in Auto-Sourcing #1195
gildesmarais asked this question in Ideas
This RFC proposes a major architectural evolution for html2rss's auto-sourcing engine. The goal is to transition the library from a bottom-up, CSS-dependent scraper into a decoupled, semantic content extraction pipeline.
We invite community feedback, design ideas, and review on this proposed direction.
1. Context & Motivation
The current auto-sourcing engine uses a bottom-up heuristic traversal (climbing parent pointers from leaf anchors, grouping classes, and guessing container bounds). While highly optimized, it exhibits several design symptoms:
2. Proposed Three-Tier Architecture
To solve these issues, we propose decoupling the DOM representation from candidate scoring and extraction:
graph TD
    A["Raw HTML Source"] --> B["1. Document Normalizer"]
    B -->|"Simplified Semantic Tree (SST)"| C["2. Segmenter & Card Clusterer"]
    C -->|"Candidate Segments"| D["3. Feature & Scoring Engine"]
    D -->|"Winning Containers"| E["4. Schema Extractor"]
    E --> F["RSS 2.0 Feed Items"]
3. Backward Compatibility & Design Guardrails
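To make the diagram from section 2 more concrete, here is a minimal sketch of stages 2 to 4 written as plain functions over immutable value objects. It is an illustration under assumptions, not the proposed implementation: it assumes Ruby 3.2+ for `Data.define`, every name in it (`SstNode`, `Segment`, `Scored`, `Item`, `Pipeline`) is hypothetical and not taken from html2rss or the linked blueprint, the scoring rule is a deliberately naive placeholder, and stage 1 (the normalizer) is skipped by building the Simplified Semantic Tree by hand.

```ruby
# Hypothetical value objects; names are illustrative, not html2rss API.
SstNode = Data.define(:tag, :text, :href, :children) do
  def initialize(tag:, text: nil, href: nil, children: [])
    super(tag: tag, text: text, href: href, children: children)
  end
end
Segment = Data.define(:container, :cards)
Scored  = Data.define(:segment, :score)
Item    = Data.define(:title, :link)

module Pipeline
  module_function

  # Stage 2: siblings sharing a structural signature form a candidate segment.
  def segment(root)
    own = root.children
              .group_by { |child| signature(child) }
              .values
              .select { |cards| cards.size >= 2 }
              .map { |cards| Segment.new(container: root, cards: cards) }
    own + root.children.flat_map { |child| segment(child) }
  end

  # A node's signature is its tag plus the tags of its direct children.
  def signature(node)
    [node.tag, node.children.map(&:tag)]
  end

  # Stage 3: toy score, counting the cards that contain a link.
  def score(segment)
    Scored.new(segment: segment, score: segment.cards.count { |card| first_link(card) })
  end

  # Depth-first search for the first anchor that carries an href.
  def first_link(node)
    return node if node.tag == "a" && node.href

    node.children.each do |child|
      found = first_link(child)
      return found if found
    end
    nil
  end

  # Stage 4: map each card of the winning segment to a feed item.
  def extract(scored)
    scored.segment.cards.filter_map do |card|
      link = first_link(card)
      Item.new(title: link.text, link: link.href) if link
    end
  end

  # Compose the stages; returns an empty list when nothing repeats.
  def call(root)
    best = segment(root).map { |candidate| score(candidate) }.max_by(&:score)
    best ? extract(best) : []
  end
end

# Hand-built SST standing in for the normalizer's output.
article = lambda do |title, href|
  SstNode.new(tag: "article", children: [SstNode.new(tag: "a", text: title, href: href)])
end
root = SstNode.new(tag: "body", children: [
  SstNode.new(tag: "nav", children: [SstNode.new(tag: "a", text: "Home", href: "/")]),
  article.call("First post", "/posts/1"),
  article.call("Second post", "/posts/2")
])

items = Pipeline.call(root)
items.each { |item| puts "#{item.title} -> #{item.link}" }
```

Because each stage takes and returns plain values, each one can be unit-tested or swapped out without touching the others, which is the decoupling the diagram is meant to convey.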
Dataclasses), and functional composition over metaprogramming to make the codebase clean and accessible for both human contributors and AI coding assistants.
4. Advanced Future Roadmap Ideas
Once the decoupled structure is in place, the library is positioned to adopt:
x, y, w, h) from Browserless viewports to segment layout containers visually.
5. Read the Full Documentation
Detailed design blueprints, maintainability reviews, and coding patterns have been committed to the docs/architecture-evolution branch. You can review them here:
docs/architectural_shift/review.md: Heuristics review, symptoms, and the Three-Tier blueprint.
docs/architectural_shift/blueprint.md: Concrete Ruby draft classes (SST Node, Normalizer, Scoring Engine).
docs/architectural_shift/guardrails.md: Coding standards, performance limits, and complexity budgets.
docs/architectural_shift/maintainability.md: Testing isolation and state-of-the-art scraping integration maps.
Please share your thoughts, concerns, and suggestions below!