Methodology

Cited is an investigatory tool and platform that assesses the AI discoverability of news content and publishers.

It answers one basic question about AI visibility and news: Is there proof that this article can influence AI platforms right now?

Cited works by observing available data and running simulations. When aggregated at the enterprise level, the Cited platform documents and retains real, technical evidence of what each named AI platform does when pointed towards a given news article.

This allows any interested party to monitor how influence is exerted through earned media.

All of Cited's judgments are algorithmic. There is, probably for the better, not one iota of AI inside this AI analytics tool.

Cited's assessment is a layered pipeline. Each layer surfaces a different kind of evidence about whether a news outlet's reporting is accessible to AI platforms across training, real-time retrieval, and AI-powered search.

Layers run independently and combine into a per-platform posture and a net grade. Technical details for each layer are made available via copy-paste.

Five common AI platforms are currently assessed, and more can be added as the platform scales.

Layer 0 — News classification

Cited is an earned-media tool and, thus, will only assess news outlets. Before the main pipeline runs, a preflight layer collects evidence that the submitted URL is a real editorial operation and classifies the result as news, borderline, or not-news.

If a submitted URL fails at this layer, it is not assessed. If it is borderline, it is assessed with that caveat. If L0 confirms that the submitted URL is real news, Cited runs the assessment and provides the results.
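The three outcomes above amount to a simple gate in front of the main pipeline. The following is a minimal sketch of that gate only; the names L0Result and gate are hypothetical, and it says nothing about how the classification itself is produced.

```python
from enum import Enum


class L0Result(Enum):
    NEWS = "news"
    BORDERLINE = "borderline"
    NOT_NEWS = "not-news"


def gate(result: L0Result) -> dict:
    """Decide whether the main pipeline runs and whether a caveat is attached."""
    if result is L0Result.NOT_NEWS:
        # Failed the preflight: no assessment is run.
        return {"assess": False, "caveat": None}
    if result is L0Result.BORDERLINE:
        # Assessed, but the borderline classification travels with the result.
        return {"assess": True, "caveat": "borderline news classification"}
    return {"assess": True, "caveat": None}
```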

On its own, the L0 algorithm can separate news from promotional or marketing content with great accuracy.

Layer 1 — robots.txt

Cited fetches /robots.txt from the root domain and parses it against each AI platform's documented user agents. A publisher's instructions to AI platforms tend to live inside this file, but compliance is voluntary.

These instructions are used and retained because they signal a publication's stance on AI. But because nobody really knows which models, if any, are obeying them, they are not conclusive on their own.
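A robots.txt check of this kind can be sketched with Python's standard library parser. The user-agent tokens listed here are publicly documented crawler names chosen for illustration; they are an assumption, not the list of platforms Cited actually checks, and the function name declared_access is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Illustrative tokens only; the platforms Cited assesses are not named in this document.
AI_USER_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "CCBot"]


def declared_access(robots_txt: str, article_url: str, agents=AI_USER_AGENTS) -> dict:
    """Return, per user agent, whether robots.txt declares the URL fetchable."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, article_url) for agent in agents}


# A sample file that singles out one crawler and allows everyone else.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

result = declared_access(SAMPLE_ROBOTS, "https://example.com/news/story")
```

The result records only what the publisher has declared. Whether a given crawler honors the declaration is a separate question that this check cannot answer.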

Layer 2 — Other declarations

Beyond robots.txt, sites can signal preferences in other ways. Those details are assessed and analyzed at this layer.

From the same response, Cited also fingerprints the publisher's CDN and hosting. Some technical stacks provide one-click controls to disallow AI access at the network level; this is retained because it proves a publisher has the technical capacity to block AI platforms if it wants to.
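As a rough illustration of fingerprinting from response headers, the sketch below matches a few widely observed header patterns. These heuristics are assumptions made for the example, not Cited's actual rules, and the function name fingerprint_cdn is hypothetical.

```python
def fingerprint_cdn(headers: dict) -> str:
    """Guess the CDN from HTTP response headers using common, non-exhaustive hints."""
    h = {key.lower(): str(value).lower() for key, value in headers.items()}
    server = h.get("server", "")
    if "cf-ray" in h or server == "cloudflare":
        return "cloudflare"
    if "x-amz-cf-id" in h:
        return "cloudfront"
    if "cache-" in h.get("x-served-by", ""):
        # Fastly nodes typically identify themselves as cache-<pop>...
        return "fastly"
    if "akamai" in server:
        return "akamai"
    return "unknown"
```

A match shows only which provider sits in front of the site. It does not show whether any AI-blocking control offered by that provider is switched on.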

Layer 4 — User-agent A/B probing

At L4, Cited pulls a sample of recent articles, fetches each with a baseline browser user agent, then refetches as each AI bot's canonical user agent.

This can prove a publisher is not serving AI platforms the same content as other traffic, indicating active blocking. If a publication's instructions to AI platforms are not being obeyed, it will show up here.

When a site blocks our probes entirely — including the ordinary browser request — there is nothing for Cited to compare, so L4 returns inconclusive, which pushes the assessment's confidence downward.
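The comparison step can be sketched as a pure function over two fetch results. The labels, the 200-only success check, and the size threshold below are all assumptions chosen for illustration; the actual comparison Cited performs is not described here.

```python
from dataclasses import dataclass


@dataclass
class Fetch:
    status: int    # HTTP status code of the response
    body_len: int  # response body length in bytes


def compare(baseline: Fetch, bot: Fetch, min_ratio: float = 0.5) -> str:
    """Classify a bot-user-agent fetch against the browser baseline."""
    if baseline.status != 200:
        # Even the ordinary browser request failed, so there is nothing to compare.
        return "inconclusive"
    if bot.status != 200:
        return "blocked"
    if bot.body_len < baseline.body_len * min_ratio:
        # The bot got a response, but a much smaller one than the browser did.
        return "degraded"
    return "same"
```

For example, compare(Fetch(200, 48000), Fetch(403, 512)) yields "blocked", while a baseline of Fetch(403, 512) yields "inconclusive" regardless of what the bot request returned.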

Layer 5 — Common Crawl presence

Common Crawl feeds many open and proprietary AI training corpora. Cited queries the Common Crawl CDX index for the domain and reports a training intensity and a trend direction.

This layer indicates how the URL in question has been, and can still be, used to train AI platforms.
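A minimal sketch of the two pieces involved: building a query against the public Common Crawl index server, and turning per-crawl capture counts into a trend direction. The crawl ID in the usage line is just an example, and the half-versus-half comparison and 10% tolerance are assumptions; how Cited actually computes intensity and trend is not specified here.

```python
from urllib.parse import urlencode


def cdx_query_url(crawl_id: str, domain: str) -> str:
    """Build an index query URL for every capture under a domain in one crawl."""
    params = {"url": f"{domain}/*", "output": "json"}
    return f"https://index.commoncrawl.org/{crawl_id}-index?{urlencode(params)}"


def trend(counts: list[int], tolerance: float = 0.1) -> str:
    """Compare the newer half of per-crawl capture counts with the older half.

    counts is ordered oldest crawl first.
    """
    if len(counts) < 2:
        return "unknown"
    mid = len(counts) // 2
    older = sum(counts[:mid]) / mid
    newer = sum(counts[mid:]) / (len(counts) - mid)
    if older == 0:
        return "rising" if newer > 0 else "flat"
    change = (newer - older) / older
    if change > tolerance:
        return "rising"
    if change < -tolerance:
        return "falling"
    return "flat"


example_url = cdx_query_url("CC-MAIN-2024-10", "example.com")
```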

Layer 7 — Real-origin access probing

L7 asks each named AI platform to fetch a sampled article from its own infrastructure — not a Cited server wearing the platform's user agent, but the real platform, from its real servers. The request travels through a Cited redirect that records whether the fetch actually happened and what egress IP it came from.

This is the most direct evidence in the pipeline: proof that the door is open (or shut) for a specific platform right now. It speaks only to real-time retrieval and search — not training — and it reports what is retrievable, never whether a platform chose to cite the piece in an answer.
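The recording redirect can be sketched as an in-memory log keyed by one-off tokens. The ProbeLog class, the /r/ path scheme, and the field names are hypothetical, and the sketch leaves out how a platform is prompted to fetch the link in the first place.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class ProbeLog:
    """In-memory record of issued probe links and the fetches that hit them."""
    targets: dict = field(default_factory=dict)  # token -> (platform, article URL)
    hits: list = field(default_factory=list)     # one dict per recorded fetch

    def issue(self, platform: str, article_url: str) -> str:
        """Create a one-off redirect path to hand to the named platform."""
        token = secrets.token_urlsafe(16)
        self.targets[token] = (platform, article_url)
        return f"/r/{token}"

    def handle(self, path: str, remote_ip: str, user_agent: str):
        """Record an incoming request and return (status, redirect target)."""
        token = path.rsplit("/", 1)[-1]
        if token not in self.targets:
            return 404, None
        platform, article_url = self.targets[token]
        self.hits.append(
            {"platform": platform, "egress_ip": remote_ip, "user_agent": user_agent}
        )
        return 302, article_url

    def fetched(self, platform: str) -> bool:
        """True once at least one request has arrived on that platform's link."""
        return any(hit["platform"] == platform for hit in self.hits)
```

Because each token is tied to one platform and one article, a recorded hit is evidence that the fetch happened, and the stored egress IP shows where it came from.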

Usage

Across all assessments to date, Cited has analyzed 100 domains and simulated 3,850 AI-bot requests across 330 distinct article pages.

As of July 14, 2026.