Guide · 8 min read
RSS for AI agents: structured data from any website
Your agent is only as good as the data it can reach. The web has billions of useful pages and almost none of them ship an API. Here's how to give an AI agent a reliable, structured, pollable feed from any URL — and why the old answer (write a scraper) keeps failing.
The problem: agents need data the web won't hand over
Modern AI agents are good at reasoning and terrible at plumbing. To monitor a competitor's changelog, watch a job board, track product prices, or follow a niche blog, an agent needs that content as clean, structured, machine-readable items. But the source is usually just an HTML page built for humans: no API, no RSS, no JSON.
The reflex is to write a scraper — a few CSS selectors that pluck the title, link, and date out of the markup. It works for a week. Then marketing ships a redesign, the class names change, and your selector returns null. The agent silently goes blind. Multiply that across a dozen sources and you have a maintenance treadmill no one wants to own.
Why brittle scrapers break (and keep breaking)
- Selectors are coupled to markup, not meaning. .post-card > h3.title describes today's HTML, not "the headline." Any layout change snaps it.
- Failures are silent. A broken scraper usually returns empty, not an error. Your agent keeps running on stale or missing data.
- Every source is a bespoke project. Scrapers don't generalize; each site is its own snowflake of edge cases.
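To make the silent-failure point concrete, here is a minimal sketch using only Python's standard library. The class name `title` and the two HTML snippets are illustrative assumptions, not taken from any real site. The hard-coded match returns an empty list after a redesign instead of raising, and a small guard turns that into a loud error.

```python
from html.parser import HTMLParser


class HeadlineScraper(HTMLParser):
    """Collects text inside <h3 class="...">, mimicking a hard-coded selector."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.capturing = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h3" and self.wanted_class in classes:
            self.capturing = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        if tag == "h3":
            self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.headlines[-1] += data


def scrape(html, wanted_class="title"):
    """Brittle version: returns [] when the markup no longer matches."""
    parser = HeadlineScraper(wanted_class)
    parser.feed(html)
    parser.close()
    return [h.strip() for h in parser.headlines]


def scrape_checked(html, wanted_class="title"):
    """Same scrape, but an empty result is treated as a failure."""
    items = scrape(html, wanted_class)
    if not items:
        raise ValueError("selector matched nothing; the layout may have changed")
    return items


# Hypothetical markup before and after a redesign.
BEFORE = '<div class="post-card"><h3 class="title">Release 2.0</h3></div>'
AFTER = '<div class="card"><h3 class="card__heading">Release 2.0</h3></div>'

print(scrape(BEFORE))  # ['Release 2.0']
print(scrape(AFTER))   # []  (no error, the agent just sees nothing)
```

The guard in `scrape_checked` does not fix the scraper, but it at least makes the breakage visible to whatever is supervising the agent.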
The fix: treat extraction as a living recipe
Instead of freezing a selector, derive the structure every time you fetch. A good extractor looks at the page and asks: what is the dominant repeated block here? A list of posts, products, or jobs shows up as many sibling elements that share a shape and each contain a link plus some text. From that you can pull a title, URL, summary, date, and image — without hand-writing a single selector.
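One way the "dominant repeated block" idea could be sketched, using only the standard library: build a small element tree, group each parent's children by tag and class, and keep the largest group whose members each contain a link and some text. This is a simplified heuristic under stated assumptions, not FeedForge's actual method. The `Node`, `TreeBuilder`, and `extract_items` names and the sample HTML are hypothetical, and a real page would need extra care (for example, a long navigation menu could outnumber the posts).

```python
from collections import defaultdict
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs, parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs), parent
        self.children, self.text = [], []  # text holds direct text only


class TreeBuilder(HTMLParser):
    """Builds a minimal element tree from HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never get children
            self.current = node

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent  # tolerate unclosed inner tags
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def full_text(node):
    return " ".join(t for n in walk(node) for t in n.text)


def first_link(node):
    return next((n for n in walk(node)
                 if n.tag == "a" and n.attrs.get("href")), None)


def extract_items(html):
    """Return items from the largest group of same-shaped siblings with links."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best = []
    for parent in walk(builder.root):
        groups = defaultdict(list)
        for child in parent.children:
            if first_link(child) is not None and full_text(child):
                groups[(child.tag, child.attrs.get("class") or "")].append(child)
        for group in groups.values():
            if len(group) >= 2 and len(group) > len(best):
                best = group
    items = []
    for block in best:
        link = first_link(block)
        items.append({"title": full_text(link) or full_text(block),
                      "url": link.attrs["href"],
                      "summary": full_text(block)})
    return items


SAMPLE = """
<main>
  <nav><a href="/">Home</a></nav>
  <section>
    <article class="post"><h2><a href="/a">First post</a></h2><p>Intro A</p></article>
    <article class="post"><h2><a href="/b">Second post</a></h2><p>Intro B</p></article>
    <article class="post"><h2><a href="/c">Third post</a></h2><p>Intro C</p></article>
  </section>
</main>
"""

for item in extract_items(SAMPLE):
    print(item["title"], item["url"])
```

Because nothing in `extract_items` names a tag or class, the same function keeps working if the articles become `div.card` elements with a different inner layout, as long as the repeated shape is still there.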
Because the structure is re-derived on each run, a layout change doesn't break the feed; it just produces a fresh recipe. That's self-healing extraction: the feed keeps flowing where a hard-coded scraper would have died. When higher accuracy matters, an LLM can do the same job on the page text, with the structural method as the automatic fallback.
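The fallback behaviour described above might look roughly like this. It is a sketch only: `flaky_llm_extract` and `structural_extract` are hypothetical stand-ins for an LLM-based extractor and a structural one, and treating an empty result as a failure is an assumption made here.

```python
def extract_with_fallback(page_html, primary, fallback):
    """Try the primary extractor; use the fallback if it raises or returns nothing."""
    try:
        items = primary(page_html)
    except Exception:  # broad on purpose: any primary failure should degrade, not crash
        items = []
    if items:
        return items, "primary"
    return fallback(page_html), "fallback"


def flaky_llm_extract(page_html):
    # Hypothetical stand-in for an LLM call that fails.
    raise TimeoutError("model did not respond")


def structural_extract(page_html):
    # Hypothetical stand-in for the structural method.
    return [{"title": "Release 2.0", "url": "/changelog/2-0"}]


items, source = extract_with_fallback(
    "<html>...</html>", flaky_llm_extract, structural_extract
)
print(source)  # fallback
```

Returning which path produced the items makes it easy to log how often the fallback is carrying the feed.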
Give the agent a feed, not a scraper
Once you can extract items reliably, the right interface is the one the web already settled on decades ago: a feed. Specifically:
- Stable item IDs so the agent can tell new from seen and poll idempotently.
- ISO timestamps so "what changed since yesterday" is trivial.
- Both JSON Feed and RSS from one endpoint — JSON for agents, RSS for everything else.
- Webhooks so the agent gets pushed the diff instead of polling blindly.
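As a rough illustration of the first three properties, the sketch below builds a JSON Feed 1.1 document with IDs derived from each item's URL and UTC ISO 8601 timestamps, then shows an idempotent poll. The helper names (`stable_id`, `to_json_feed`, `new_items`) and the example URLs are assumptions for illustration. Hashing the URL is one possible ID scheme, not necessarily the one any given service uses.

```python
import hashlib
from datetime import datetime, timezone


def stable_id(url):
    """Derive the ID from the item URL so it survives re-extraction."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]


def to_json_feed(title, home_page_url, items):
    """Build a JSON Feed 1.1 document; 'published' must be a timezone-aware datetime."""
    return {
        "version": "https://jsonfeed.org/version/1.1",
        "title": title,
        "home_page_url": home_page_url,
        "items": [
            {
                "id": stable_id(item["url"]),
                "url": item["url"],
                "title": item["title"],
                "date_published": item["published"]
                .astimezone(timezone.utc)
                .isoformat(),
            }
            for item in items
        ],
    }


def new_items(feed, seen_ids):
    """Return items not seen before and record them, so repeat polls are no-ops."""
    fresh = [item for item in feed["items"] if item["id"] not in seen_ids]
    seen_ids.update(item["id"] for item in fresh)
    return fresh


extracted = [
    {
        "title": "Release 2.0",
        "url": "https://example.com/changelog/2-0",
        "published": datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc),
    }
]
feed = to_json_feed("Example changelog", "https://example.com/changelog", extracted)

seen = set()
print(len(new_items(feed, seen)))  # 1 on the first poll
print(len(new_items(feed, seen)))  # 0 on the second poll
```

The list returned by `new_items` is also the natural payload for a webhook: the same diff, pushed instead of pulled.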
How FeedForge does it
FeedForge is this idea as a service. You paste a URL; it derives the schema, stores the recipe, and gives you a pollable feed URL plus optional webhook push. It re-derives on every refresh, so feeds self-heal. The free tier covers three feeds at hourly refresh; Pro adds 50 feeds, five-minute refresh, and webhooks. There's no signup wall: you get an API token the moment you open the dashboard.
Try it on a site you care about: paste the URL into the live demo and watch it turn a human page into a feed your agent can consume in one call.
Three feeds free, forever. No credit card.