Robots.txt for a multilanguage site where root is redirected

I have a site which offers two languages, English and Spanish. When the user navigates to the home page, let’s say www.example.com, the page redirects you to /es if your browser language is Spanish, or to /en otherwise.

At the moment the robots.txt I have is:

User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap_index.xml

because I’m defining all hreflang alternate URLs in sitemap_languages.xml, and all URLs are also listed in sitemap.xml. My question is more about the configuration of robots.txt: I’m not sure whether I should be allowing any user agent to crawl the / page. Since that page always redirects to the home page of either /en or /es, I believe it should be disallowed.

Should I then do:

User-agent: *
Disallow: /
Allow: /es
Allow: /en

Sitemap: https://www.example.com/sitemap_index.xml

I’m not sure if that could cause a crawl issue or whether there is another way to achieve the same result.
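
If the goal is to block crawling of only the bare root URL while keeping everything else crawlable, another variant worth considering uses the $ end-of-URL wildcard. This is only a sketch, and the $ marker is a Google/Bing extension rather than part of the original robots.txt standard:

User-agent: *
Disallow: /$
Allow: /

Sitemap: https://www.example.com/sitemap_index.xml

Here Disallow: /$ matches only the root URL itself, so /en and /es stay crawlable without being listed explicitly.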

Restrict crawling of region/lang combinations other than the provided ones in robots.txt

I want to allow crawling of my website only if the URL starts with one of the accepted region/language combinations, which are us/en, gb/en and fr/fr. Other combinations must be restricted. On the other hand, crawlers should still be able to crawl / or /about, etc. For example:

example.com/us/en/videos  # should be allowed
example.com/de/en/videos  # should be blocked
example.com/users/mark    # should be allowed

Again, a URL should be blocked only if it starts with an unaccepted region/language combination. What I have tried so far does not work:

Disallow: /*?
Disallow: /*/*/cart/
Disallow: /*/*/checkout/
Disallow: /*/*/
Allow: /*.css?
Allow: /*.js?
Allow: /us/en/
Allow: /gb/en/
Allow: /fr/fr/

I tested it with Google’s online robots.txt tester.
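
For what it’s worth, Google resolves Allow/Disallow conflicts by applying the most specific (longest) matching rule, so a stripped-down version along these lines should behave as intended in Google’s tester. This is only a sketch of the region/language part; crawlers that do not support wildcards or Allow will read it differently:

User-agent: *
Allow: /us/en/
Allow: /gb/en/
Allow: /fr/fr/
Disallow: /*/*/

The longer Allow: /us/en/ rule then outranks Disallow: /*/*/ for the accepted combinations, while /users/mark and /about have too few path segments to match /*/*/ at all. A URL like /de/en without a trailing slash would still slip through.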

Can robots.txt be used to prevent bots from seeing lazily loaded content?

Let’s say that googlebot is scraping https://example.com/page.

  • example.com has a robots.txt file that disallows /endpoint-for-lazy-loaded-content, but allows /page
  • /page lazy loads content using /endpoint-for-lazy-loaded-content (via fetch)
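
Concretely, the bullets describe a robots.txt along these lines (a sketch using the exact paths above; the Allow line is the default behaviour and is shown only for clarity):

User-agent: *
Allow: /page
Disallow: /endpoint-for-lazy-loaded-content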

Does googlebot see the lazy loaded content?

Google says my URL is blocked by robots.txt — I don’t even have one!

I just discovered that our image system domain has not been crawled by Google for a long time. The reason is that all the URLs appear to be blocked by robots.txt, but I don’t even have one.

Disclaimer: Due to some config testing, I now have a generic allow-everything robots file at the website root. I didn’t have one prior to this hour.

We run an image resizing system on a subdomain of our website. I’m seeing very strange behaviour: Search Console claims the URLs are blocked by robots.txt, when in fact I don’t even have one in the first place.

All URLs at this subdomain give me this result when live testing them:

Screenshot: URL unknown to Google

Screenshot: URL supposedly blocked by robots.txt

Trying to debug the issue, I created a robots.txt at the root:

Screenshot: valid robots.txt

The robots.txt file is even already visible in search results:

Screenshot: robots.txt indexed

The response headers also seem to be ok:

HTTP/2 200
date: Sun, 27 Oct 2019 02:22:49 GMT
content-type: image/jpeg
set-cookie: __cfduid=d348a8xxxx; expires=Mon, 26-Oct-20 02:22:49 GMT; path=/; domain=.legiaodosherois.com.br; HttpOnly; Secure
access-control-allow-origin: *
cache-control: public, max-age=31536000
via: 1.1 vegur
cf-cache-status: HIT
age: 1233
expires: Mon, 26 Oct 2020 02:22:49 GMT
alt-svc: h3-23=":443"; ma=86400
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
server: cloudflare
cf-ray: 52c134xxx-IAD

Here are some sample URLs for testing:

https://kanto.legiaodosherois.com.br/w760-h398-gnw-cfill-q80/wp-content/uploads/2019/10/legiao_zg1YXWVbJwFkxT_ZQR534L90lnm8d2IsjPUGruhqAe.png.jpeg
https://kanto.legiaodosherois.com.br/w760-h398-gnw-cfill-q80/wp-content/uploads/2019/10/legiao_FPutcVi19O8wWo70IZEAkrY3HJfK562panvxblm4SL.png.jpeg
https://kanto.legiaodosherois.com.br/w760-h398-gnw-cfill-q80/wp-content/uploads/2019/09/legiao_gTnwjab0Cz4tp5X8NOmLiWSGEMH29Bq7ZdhVPlUcFu.png.jpeg

What should I do?
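
One thing worth checking, since Google treats a robots.txt URL that answers with a server error (5xx) roughly as “crawl nothing”: what status the subdomain’s robots.txt itself returns to a Googlebot-like request. A quick check with curl, using the host from the sample URLs above:

curl -I -A "Googlebot" https://kanto.legiaodosherois.com.br/robots.txt

A 200 or 404 there should not block crawling; a 5xx response, which Cloudflare challenges can produce, would.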

Why is Google indexing our robots.txt file and showing it in search results?

For some reason, Google is indexing the robots.txt file for some of our sites and showing it in search results. See screenshots below.

Our robots.txt file is not linked from anywhere on the site and contains just the following:

User-agent: *
Crawl-delay: 5

This only happens for some sites. Why is this happening and how do we stop it?


Screenshot 1: Google Search console…
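
If the goal is simply to keep robots.txt itself out of the search results, one option is to serve that one file with an X-Robots-Tag: noindex header (a noindex rule cannot be expressed inside robots.txt itself). A minimal sketch, assuming an Apache server with mod_headers enabled:

<Files "robots.txt">
  Header set X-Robots-Tag "noindex"
</Files>

This leaves the file fully fetchable by crawlers while asking Google not to show it as a result.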


Robots.txt is blocking my labels

In my AdSense account, under "Revenue optimization", I have crawl errors. When I click "Fix crawl errors", this is what I see:

Blocked URL: http://www.rechargeoverload.in/search/label
Error: Robot Denied

My robots.txt:

User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /

Sitemap: http://www.rechargeoverload.in/atom.xml?redirect=false&start-index=1&max-results=500
Sitemap:…
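
If the intent is to keep /search blocked in general but let the label pages through for every crawler, a more specific Allow rule could be added. This is a sketch; Google applies the most specific matching rule, but not all crawlers support Allow:

User-agent: *
Disallow: /search
Allow: /search/label
Allow: /

The longer Allow: /search/label rule then overrides Disallow: /search for the label URLs.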


Refer to a website’s domain in robots.txt by a variable

I have a website (built with MediaWiki 1.33.0 CMS) which contains a robots.txt file.
In that file there is one line containing the literal domain of that site:

Sitemap: https://example.com/sitemap/sitemap-index-example.com.xml

I usually prefer to replace literal domain references with a variable value call (VVC): a value that is somehow (how exactly depends on the specific case) resolved at execution time to the domain itself.

An example of a VVC would be a Bash variable substitution.


Many CMSs have a global directives file which usually contains the base address of the website.
In MediaWiki 1.33.0 this file is LocalSettings.php, which contains the base address on line 32:

$wgServer = "https://example.com";

How could I reference this value with a variable value call in robots.txt?
This would help me avoid confusion and malfunction if the domain of the website ever changes; I wouldn’t have to update the value manually there as well.
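
robots.txt is a plain static text file and the robots.txt standard has no variable mechanism, so the usual workaround is to generate the file from a script that already knows the base URL. A minimal sketch in PHP, assuming the web server rewrites /robots.txt to a hypothetical robots.php in the wiki root:

<?php
// robots.php - hypothetical script that emits robots.txt dynamically.
// Assumes a rewrite such as:  RewriteRule ^robots\.txt$ /robots.php [L]
header('Content-Type: text/plain; charset=utf-8');

// Caveat: LocalSettings.php is normally loaded inside MediaWiki; if it calls
// MediaWiki functions (wfLoadExtension etc.) this bare include will fail and
// $wgServer would have to be read some other way.
require_once __DIR__ . '/LocalSettings.php';

echo "User-agent: *\n";
echo "Allow: /\n";
echo "Sitemap: {$wgServer}/sitemap/sitemap-index-example.com.xml\n";

Note that the sitemap file name in the example also embeds the domain, so a domain change would still require regenerating the sitemap index.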

Does the order of “Disallow” and “Sitemap” lines in robots.txt matter?

One can order robots.txt this way:

User-agent: DESIRED_INPUT
Sitemap: https://example.com/sitemap-index.xml
Disallow: /

instead of:

User-agent: DESIRED_INPUT
Disallow: /
Sitemap: https://example.com/sitemap-index.xml

I assume both are okay, because practically all crawlers are likely to parse the file in the correct order.
Is it a best practice to put Disallow: before Sitemap: anyway, to guard against the extremely unlikely bug of a crawler starting to crawl before it has processed the Disallow: rule?