Thing that's unintuitive about robots.txt: If you disallow a page from being visited by the crawler, the crawler may still add the page to an index because it will never see the noindex tag on the page
@Gargron And some crawlers just ignore robots.txt entirely so you have to make a tar pit for them.
@Gargron Unintuitive in fact. But I don't think this is how it's designed.
If Google added a URL to its index that is inaccessible to its bots per robots.txt, I'm pretty certain that would violate the Robots Exclusion Protocol.
Google knows the page exists because there are links to it. It can work out a search ranking from the link text and the quality of the linking sites, so the page gets indexed. But Google can't visit the page, so it shows no description in results, and keyword searches are limited because the page text was never indexed.
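To make that concrete, here is a rough Python sketch of a rule-abiding crawler, using the standard library's `urllib.robotparser`. The robots.txt rules, the page HTML, the `ExampleBot` user agent, the example.com URLs, and the `crawler_sees_noindex` helper are all made up for illustration; real crawlers are far more involved.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks everything under /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical page that asks not to be indexed.
PAGE_HTML = (
    '<html><head><meta name="robots" content="noindex"></head>'
    "<body>secret</body></html>"
)


def crawler_sees_noindex(url, robots_txt, page_html, user_agent="ExampleBot"):
    """Return True only if a polite crawler fetches the page and finds noindex."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(user_agent, url):
        # The crawler is not allowed to fetch the page, so it never reads
        # the HTML and never learns that the page asked for noindex.
        return False
    # Simplified check; a real crawler would parse the HTML properly.
    return 'content="noindex"' in page_html


blocked = crawler_sees_noindex(
    "https://example.com/private/page.html", ROBOTS_TXT, PAGE_HTML
)
allowed = crawler_sees_noindex(
    "https://example.com/public/page.html", ROBOTS_TXT, PAGE_HTML
)
print(blocked, allowed)
```

With the disallowed URL, the function returns before the HTML is ever looked at, which is the whole problem: the noindex tag only works if the crawler is allowed to read it.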
@Gargron How did you figure this out? It sounds like there is a story behind that discovery.
@hef Oh I was just reading about it on another site recently and found it peculiar because I also assumed that a robots.txt entry and noindex are synonymous.
@Gargron I've always treated robots.txt as a traffic manager, not a content manager. Want less traffic? Disallow. Want no content on search engines? Noindex. Want privacy? Password protect (or remove from the Internet 😃).
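Those three tools could be sketched in Python roughly like this. The `robots_txt` and `respond` helpers and the policy names are hypothetical, not any framework's API. The noindex case uses the `X-Robots-Tag` response header, which search engines such as Google treat like the meta tag; the privacy case uses a plain HTTP 401 challenge.

```python
def robots_txt(disallowed_paths):
    """Traffic management: build a robots.txt asking crawlers to stay away."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {path}" for path in disallowed_paths]
    return "\n".join(lines) + "\n"


def respond(policy, authorized=False):
    """Return (status, headers) for a page under a hypothetical policy name."""
    if policy == "noindex":
        # Content management: the page must stay crawlable so this is seen.
        return 200, {"X-Robots-Tag": "noindex"}
    if policy == "private" and not authorized:
        # Privacy: refuse to serve the content at all without credentials.
        return 401, {"WWW-Authenticate": 'Basic realm="private"'}
    return 200, {}


print(robots_txt(["/search", "/tmp/"]))
print(respond("noindex"))
print(respond("private"))
```

Note that a page under the "noindex" policy should not also be disallowed in robots.txt, or the header is never fetched.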