Can robots.txt be used to prevent bots from seeing lazily loaded content?

Let’s say that Googlebot is crawling https://example.com/page.

  • example.com has a robots.txt file that disallows /endpoint-for-lazy-loaded-content, but allows /page
  • /page lazy loads content using /endpoint-for-lazy-loaded-content (via fetch)

Does Googlebot see the lazy-loaded content?
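For concreteness, the robots.txt in that setup would look something like the following (paths taken from the bullets above; the explicit Allow line is optional, since anything not disallowed is crawlable by default):

User-agent: *
Allow: /page
Disallow: /endpoint-for-lazy-loaded-content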

Google says my URL is blocked by robots.txt — I don’t even have one!

I just discovered that our image system's domain has not been crawled by Google for a long time. The reason is that all of its URLs seem to be blocked by robots.txt, but I don't even have one.

Disclaimer: Due to some config testing, I now have a generic allow-everything robots.txt file at the website root. I didn't have one until an hour ago.
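(An allow-everything file of that sort is typically just the following; the exact contents of mine may differ slightly:)

User-agent: *
Disallow: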

We run an image resizing system on a subdomain of our website. I'm seeing very strange behaviour: Search Console claims the URLs are blocked by robots.txt, when in fact there isn't one in the first place.

All URLs at this subdomain give me this result when live testing them:

[Screenshot: URL unknown to Google]

[Screenshot: URL supposedly blocked by robots.txt]

Trying to debug the issue, I created a robots.txt at the root:

[Screenshot: valid robots.txt]

The robots.txt file is even already visible in search results:

[Screenshot: robots.txt indexed]

The response headers also seem to be OK:

HTTP/2 200
date: Sun, 27 Oct 2019 02:22:49 GMT
content-type: image/jpeg
set-cookie: __cfduid=d348a8xxxx; expires=Mon, 26-Oct-20 02:22:49 GMT; path=/; domain=.legiaodosherois.com.br; HttpOnly; Secure
access-control-allow-origin: *
cache-control: public, max-age=31536000
via: 1.1 vegur
cf-cache-status: HIT
age: 1233
expires: Mon, 26 Oct 2020 02:22:49 GMT
alt-svc: h3-23=":443"; ma=86400
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
server: cloudflare
cf-ray: 52c134xxx-IAD

Here are some sample URLs for testing:

https://kanto.legiaodosherois.com.br/w760-h398-gnw-cfill-q80/wp-content/uploads/2019/10/legiao_zg1YXWVbJwFkxT_ZQR534L90lnm8d2IsjPUGruhqAe.png.jpeg
https://kanto.legiaodosherois.com.br/w760-h398-gnw-cfill-q80/wp-content/uploads/2019/10/legiao_FPutcVi19O8wWo70IZEAkrY3HJfK562panvxblm4SL.png.jpeg
https://kanto.legiaodosherois.com.br/w760-h398-gnw-cfill-q80/wp-content/uploads/2019/09/legiao_gTnwjab0Cz4tp5X8NOmLiWSGEMH29Bq7ZdhVPlUcFu.png.jpeg

What should I do?

Why is Google indexing our robots.txt file and showing it in search results?

For some reason, Google is indexing the robots.txt file for some of our sites and showing it in search results. See screenshots below.

Our robots.txt file is not linked from anywhere on the site and contains just the following:

User-agent: *
Crawl-delay: 5

This only happens for some sites. Why is this happening and how do we stop it?
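One workaround sometimes suggested (an assumption on our side, and it presumes Apache with mod_headers enabled) is to serve robots.txt with a noindex response header, which Google honours for non-HTML files:

<Files "robots.txt">
  Header set X-Robots-Tag "noindex"
</Files>

The file remains fetchable, so the crawl rules still apply; it just drops out of the index.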


Screenshot 1: Google Search Console…


Robots.txt is blocking my labels

In my AdSense account, under "Revenue optimization", I have crawl errors; when I click "fix crawl errors" I get this:

Blocked URLs Error:
http://www.rechargeoverload.in/search/label (Robot Denied)

My robots.txt:

User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /

Sitemap: http://www.rechargeoverload.in/atom.xml?redirect=false&start-index=1&max-results=500
Sitemap:…
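(The label URLs fall under the Disallow: /search rule above, since /search/label starts with /search. If the labels should stay crawlable, a hedged sketch of an exception, relying on Google's most-specific-match rule, would be:)

User-agent: *
Allow: /search/label
Disallow: /search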


Refer to a website's domain in robots.txt by a variable

I have a website (built with the MediaWiki 1.33.0 CMS) which contains a robots.txt file.
In that file there is one line containing the literal domain of that site:

Sitemap: https://example.com/sitemap/sitemap-index-example.com.xml

I usually prefer to replace literal domain references with a variable value call that will somehow (depending on the specific case) be changed at execution time into the domain itself.

An example of a VVC would be a Bash variable substitution.


Many CMSs have a global directives file that usually contains the base address of the website. In MediaWiki 1.33.0 this file is LocalSettings.php, which contains the base address on line 32:

$wgServer = "https://example.com";

How could I reference this value with a variable value call in robots.txt?
That would help me avoid confusion and malfunction if the website's domain ever changes: I wouldn't have to update the value manually there as well.
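robots.txt is static plain text, so it has no variable mechanism of its own; a common workaround is to generate the file dynamically. Here is a minimal sketch (the robots.php name and the rewrite are assumptions; note that LocalSettings.php often cannot be include()d outside MediaWiki, so the sketch just pattern-matches the $wgServer line):

<?php
// robots.php (hypothetical): emit robots.txt with the Sitemap line
// built from $wgServer in LocalSettings.php, so the domain is
// defined in exactly one place.
$settings = file_get_contents(__DIR__ . '/LocalSettings.php');
$server = 'https://example.com'; // fallback if the pattern is not found
if (preg_match('/\$wgServer\s*=\s*"([^"]+)"/', $settings, $m)) {
    $server = $m[1];
}
header('Content-Type: text/plain');
echo "User-agent: *\n";
echo "Sitemap: $server/sitemap/sitemap-index-example.com.xml\n";

With Apache and mod_rewrite, a rule such as RewriteRule ^robots\.txt$ /robots.php [L] would then serve this script whenever /robots.txt is requested.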

Does the order of “Disallow” and “Sitemap” lines in robots.txt matter?

One can order a robots.txt file this way:

User-agent: DESIRED_INPUT
Sitemap: https://example.com/sitemap-index.xml
Disallow: /

instead of this way:

User-agent: DESIRED_INPUT
Disallow: /
Sitemap: https://example.com/sitemap-index.xml

I assume both are okay, since the file is presumably parsed in the correct order by practically all crawlers.
Is it a best practice to put Disallow: before Sitemap:, to prevent the extremely unlikely bug of a crawler crawling before it honours the Disallow:?

“Page Resources Couldn’t be Loaded” on GSC even after clearing everything in robots.txt

Google Search Console and the Mobile-Friendly Test both give me the following two warnings for my WordPress-based website:

  • Content wider than screen
  • Clickable elements too close together

The screenshot that these tools provide of my website looks completely broken, as if no CSS had been applied.

Many solutions to this problem identify the robots.txt file as the culprit, since some users inadvertently block Googlebot from accessing resource files such as stylesheets or JavaScript.
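(For context, the kind of rule usually blamed looks something like this hypothetical example, which would keep Googlebot away from theme stylesheets and scripts:)

User-agent: *
Disallow: /wp-includes/
Disallow: /wp-content/themes/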

My case is different. The following is how my robots.txt file looks, and I still get the same warning messages nonetheless. I use an SEO framework, so I created my own static version of the robots.txt:

User-agent: *
Allow: /

Sitemap: https://*****

There are also suggestions that the weight (heaviness) of the website is to blame. In my case, I have only a few JavaScript files, mainly in charge of very light tasks such as a carousel, slide-down answers for the FAQ, and the menu button for the nav menu.

I tried many things, including switching themes. Surprisingly, the same issue occurs even with the official WordPress themes “Twenty Seventeen” and “Twenty Nineteen”, and with the blank version of the “Underscores” theme, but not with my original theme, which doesn’t have any JavaScript files.

Do I really have to go the route of not using JavaScript at all and style my website strictly with CSS, or could there be other things to look at?

Along with the two warnings, I also almost always get “Page Loading Issue” in the test results. Could this be a server-speed issue? I am located in Japan at the moment, and my website mainly targets Japanese users, but I am using a SiteGround server rather than a Japanese one. I am well aware that this hurts my website’s speed in general, but is it also affecting the results of the Google tests mentioned above?

How to Use Robots.txt to block GoogleBot, but not AdsBot

Hi,

I am managing an eCommerce brand that has thousands of products. Some of these products have multiple SKUs (variants in terms of colour). These SKUs use a URL query-string parameter to differentiate between the colour variants. Since they are the same product and vary only by colour, they are all canonicalised to the non-colour version for SEO.

Example setup of products:

Hugo Boss T-Shirt product page (/product/hugo-boss-red-t-shirt) with the below…
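A hedged sketch of the kind of rule set the question is after (the colour parameter name is hypothetical, and note that AdsBot ignores the generic * group, so it has to be addressed by name):

User-agent: Googlebot
Disallow: /*?colour=

User-agent: AdsBot-Google
Disallow: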


Is there a robots.txt interpreter for those of us who are newbies?

Here's my robots.txt file. Is it at least benign, and not harming my site?

User-agent: *
Disallow: /test/
Disallow: /JPG/
Disallow: /PHOT/
Disallow: /VID/

Those disallows are meant to keep search engines from indexing those subdirectories of my domain, so they don't index the spurious side material I'm just storing in them.

Is it correct syntax?