Thing that's unintuitive about robots.txt: If you disallow a page from being visited by the crawler, the crawler may still add the page to an index because it will never see the noindex tag on the page

@Gargron And some crawlers just ignore robots.txt entirely so you have to make a tar pit for them.

@Gargron Unintuitive in fact. But I don't think this is how it's designed.

If Google added a URL to their index that is inaccessible to their bots as per robots.txt, I'm pretty certain that would violate the robots exclusion protocol.

@marian @Gargron a common cause of this is that there's a bunch of indexed sites that link to something disallowed by robots.txt.

Google knows the page exists because there are links to it, and it can work out a search ranking for it based on link text and the quality of the linking sites, so the URL gets indexed. But Google can't visit the page, so it doesn't show a description in results, and keyword searches are limited because the page text was never indexed.
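
To make that concrete, here's a rough sketch using Python's standard `urllib.robotparser`. The bot name `ExampleBot`, the example.com URLs, the page body and the `crawl_decision` helper are all made up for illustration. It only shows the logic of a well-behaved crawler, not how Google actually implements it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks everything under /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical page body. It carries a noindex tag, but a compliant
# crawler only sees it if it is allowed to download the page.
PAGE_HTML = '<html><head><meta name="robots" content="noindex"></head></html>'


def crawl_decision(url, robots_txt=ROBOTS_TXT, user_agent="ExampleBot"):
    """Report whether a compliant crawler fetches the URL and sees noindex."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(user_agent, url):
        # Fetching is disallowed, so the noindex tag is never read.
        # The URL itself can still be known from links on other sites.
        return {"fetched": False, "saw_noindex": False}
    # Fetching is allowed, so the crawler can read the page and honor noindex.
    return {"fetched": True, "saw_noindex": 'content="noindex"' in PAGE_HTML}
```

With these assumptions, a URL under `/private/` is never fetched, so the noindex tag goes unseen, while the same page under an allowed path would be fetched and the tag honored.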

@Gargron How did you figure this out? It sounds like there is a story behind that discovery.

@hef Oh I was just reading about it on another site recently and found it peculiar because I also assumed that a robots.txt entry and noindex are synonymous.

@Gargron I've always treated robots.txt as a traffic manager, not a content manager. Want less traffic? Disallow. Want no content on search engines? Noindex. Want privacy? Password protect (or remove from the Internet) 😃
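
Roughly, those three tools could look like this in Python. The helper names are hypothetical and this is only a sketch: `X-Robots-Tag` is the HTTP header form of the noindex meta tag, and the 401 response assumes plain HTTP Basic auth.

```python
def robots_txt_for(disallowed_paths):
    """Less crawl traffic: build a robots.txt body disallowing the given paths."""
    lines = ["User-agent: *"] + [f"Disallow: {path}" for path in disallowed_paths]
    return "\n".join(lines) + "\n"


def noindex_headers():
    """No search listing: the page must stay crawlable so this header is seen."""
    return {"X-Robots-Tag": "noindex"}


def auth_challenge(realm="private"):
    """Privacy: answer unauthenticated requests with a 401 and a Basic auth challenge."""
    return 401, {"WWW-Authenticate": f'Basic realm="{realm}"'}
```

Note that the first two don't combine well on the same URL: if the path is disallowed, the crawler never requests it and so never sees the noindex header.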

