Meta robots and X-Robots-Tag
Every page must have an explicit, correct indexing policy — either implicit (default index, follow) on public pages, or an explicit noindex / X-Robots-Tag on staging, admin, thin, or private content. Get this wrong and you either disappear from search or expose what you didn't mean to.
What it is
The robots meta tag and the equivalent X-Robots-Tag HTTP header tell search engines and other compliant crawlers whether they may index a page, follow its links, and how to render it in results. Together they are the per-page complement to robots.txt — robots.txt controls crawling, robots meta controls indexing.
<meta name="robots" content="index, follow">
<meta name="robots" content="noindex, nofollow">
<meta name="robots" content="noindex, max-image-preview:large">
X-Robots-Tag: noindex, nofollow
X-Robots-Tag: googlebot: noindex
The HTTP header form is the only way to control indexing for non-HTML resources — PDFs, images, JSON endpoints, downloads.
Why it matters
Every URL a search engine fetches has a policy, whether you set one or not. The default is index, follow — every fetched page becomes a candidate for the index, and every link is a discovery hop. That is right for your homepage and your articles; it is wrong for:
- Staging and preview environments. Forgetting
noindexon astaging.example.comis a top-five SEO mistake. Once Google indexes a staging URL, it competes with production. - Internal search results, filtered listings, faceted-search permutations. Infinite combinations of low-value pages dilute the site’s crawl budget.
- Thin, duplicate, or transactional pages. Thank-you pages, cart pages, password-reset pages, account screens.
- Pages waiting for content. Empty category pages, “coming soon” pages.
- Sensitive resources. Print-only versions, internal exports, machine-readable feeds you don’t want surfacing as snippets.
Public pages are required to be indexable; non-public pages are required to be non-indexable. Either way, the policy must be explicit and correct.
How to implement
Add the meta tag in <head> for every HTML page that needs a non-default policy:
<meta name="robots" content="noindex, follow">
follow keeps link equity flowing to the destinations even when the host page is not indexed. Pair it with noindex for staging, internal search, faceted listings.
Use directives precisely. The most useful ones, with the values search engines respect:
| Directive | Effect |
|---|---|
index / noindex | Allow / disallow indexing of this page |
follow / nofollow | Allow / disallow following links on this page |
noarchive | Don’t show a cached copy in results |
nosnippet | Don’t show a text snippet in results |
max-snippet:[n] | Cap the snippet at n characters; -1 means no limit |
max-image-preview:[none|standard|large] | Control image-preview size |
max-video-preview:[n] | Cap video preview at n seconds |
noimageindex | Don’t index images on this page |
unavailable_after:[date] | Drop from the index after this RFC 822 / ISO 8601 date |
Target a specific crawler when you need to. Replace robots with googlebot, bingbot, applebot, or another user-agent token. A robots directive applies to all; a named-bot directive overrides it for that bot.
<meta name="robots" content="noindex">
<meta name="googlebot-news" content="noindex, nofollow">
Use X-Robots-Tag for non-HTML. Cloudflare, Nginx, Apache, and CDN edge config can all set it per-path. Required for PDFs you don’t want indexed.
X-Robots-Tag: noindex, nofollow
Don’t combine robots.txt: Disallow with noindex. If a URL is disallowed in robots.txt, crawlers never fetch it — and therefore never see the noindex. The page can still be indexed from external links, with no snippet. To truly de-index, allow crawl and serve noindex.
Default policy is implicit. A public page with no robots meta tag is index, follow. You do not need to add <meta name="robots" content="index, follow"> — it’s the default, and shipping it on every page is noise. Add the tag only when the policy differs.
Common mistakes
- Staging site indexed because someone forgot to set
noindex. Block it at the server withX-Robots-Tag: noindexsite-wide on the staging host. Disallow: /in robots.txt on a staging site instead ofnoindex. The page may still appear in results with the URL but no snippet. Usenoindexand let the crawler fetch and see it.- Shipping a launch with
<meta name="robots" content="noindex, nofollow">left over from staging. Diff every deploy. - Conflicting directives:
<meta name="robots" content="index">paired withX-Robots-Tag: noindex. The most restrictive wins; the result is usually not what was intended. - Using
nofollowto “save crawl budget” by stopping internal link discovery. It does not save budget; it just throws away signals. - Relying on
noindexto protect sensitive data. It is a request, not an access control. Use authentication.
Verification
- View source on every key URL — confirm the
<meta name="robots">value matches intent. curl -I https://example.com/private/— confirmX-Robots-Tagis set as expected.- Google Search Console → URL Inspection. The “Indexing allowed” line is authoritative.
- After a staging deploy, search
site:staging.example.comin Google. Zero results is the only acceptable answer. - Audit tools (Screaming Frog, Sitebulb) export every URL’s
robotsdirective — review the list before launch and after major releases.
Related topics
Sources & further reading
- HTML Living Standard — Standard metadata names: robots — WHATWG
- Google Search Central — Robots meta tag, data-nosnippet, and X-Robots-Tag — Google Search Central
- Bing Webmaster — Robots meta tag — Bing
- MDN — <meta name="robots"> — MDN