I want to allow crawling of my website only if the URL starts with an accepted region/language combination, such as us/en, gb/en, or fr/fr. All other region/language combinations must be blocked. On the other hand, the crawler should still be able to crawl non-region paths such as /about. For example:
example.com/us/en/videos  # should be allowed
example.com/de/en/videos  # should be blocked
example.com/users/mark    # should be allowed
Again, a URL should be blocked only if it starts with an unaccepted region/language combination. What I have so far does not work:
Disallow: /*?
Disallow: /*/*/cart/
Disallow: /*/*/checkout/
Disallow: /*/*/
Allow: /*.css?
Allow: /*.js?
Allow: /us/en/
Allow: /gb/en/
Allow: /fr/fr/
I tested it with Google's online robots.txt tester.
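One sketch that might satisfy these constraints, assuming Google's documented precedence rule (the longest, i.e. most specific, matching pattern wins, and Allow wins ties); the Allow lines for /about and /users/ are assumptions standing in for whatever non-region sections the site actually has:

```
User-agent: *
# Accepted region/language prefixes: these beat Disallow: /*/*/
# because they are longer (more specific) matches.
Allow: /us/en/
Allow: /gb/en/
Allow: /fr/fr/
# Non-region sections need their own Allow, since /*/*/ also
# matches deeper paths such as /users/mark/photos.
Allow: /about
Allow: /users/
# Block every other two-level (region/language) prefix.
Disallow: /*/*/
```

Note that only some crawlers (Google and Bing among them) document longest-match precedence; simpler bots may interpret the same file differently.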
Let's say that googlebot is scraping example.com, which has a robots.txt file that disallows /endpoint-for-lazy-loaded-content but allows /page. /page lazy loads its content using /endpoint-for-lazy-loaded-content.
Does googlebot see the lazy loaded content?
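The robots.txt side of this scenario can be reproduced with Python's urllib.robotparser (a rough sketch; the paths simply mirror the hypothetical ones above). If the fetch of the endpoint is disallowed, whatever it would have supplied is missing from the rendered page:

```python
from urllib.robotparser import RobotFileParser

# A robots.txt mirroring the question's hypothetical setup.
rules = """\
User-agent: *
Disallow: /endpoint-for-lazy-loaded-content
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# /page itself is crawlable...
print(parser.can_fetch("Googlebot", "https://example.com/page"))  # True
# ...but the endpoint it lazy loads from is not, so a compliant
# crawler never requests it while rendering /page.
print(parser.can_fetch(
    "Googlebot",
    "https://example.com/endpoint-for-lazy-loaded-content"))  # False
```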
I just discovered that our image system's domain has not been crawled by Google for a long time. The reason is that all the URLs seem to be blocked by robots.txt, but I don't even have one.
Disclaimer: due to some config testing, I now have a generic allow-everything robots.txt file at the website root. I did not have one until an hour ago.
We run an image resizing system on a subdomain of our website. I'm seeing very weird behaviour: Search Console claims the URLs are blocked by robots.txt, when in fact I don't have such a file in the first place.
All URLs at this subdomain give me this result when live testing them:
Trying to debug the issue, I created a robots.txt at the root:
The robots file is even already visible at search results:
The response headers also seem to be ok:
HTTP/2 200
date: Sun, 27 Oct 2019 02:22:49 GMT
content-type: image/jpeg
set-cookie: __cfduid=d348a8xxxx; expires=Mon, 26-Oct-20 02:22:49 GMT; path=/; domain=.legiaodosherois.com.br; HttpOnly; Secure
access-control-allow-origin: *
cache-control: public, max-age=31536000
via: 1.1 vegur
cf-cache-status: HIT
age: 1233
expires: Mon, 26 Oct 2020 02:22:49 GMT
alt-svc: h3-23=":443"; ma=86400
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
server: cloudflare
cf-ray: 52c134xxx-IAD
Here are some sample URLs for testing:
https://kanto.legiaodosherois.com.br/w760-h398-gnw-cfill-q80/wp-content/uploads/2019/10/legiao_zg1YXWVbJwFkxT_ZQR534L90lnm8d2IsjPUGruhqAe.png.jpeg
https://kanto.legiaodosherois.com.br/w760-h398-gnw-cfill-q80/wp-content/uploads/2019/10/legiao_FPutcVi19O8wWo70IZEAkrY3HJfK562panvxblm4SL.png.jpeg
https://kanto.legiaodosherois.com.br/w760-h398-gnw-cfill-q80/wp-content/uploads/2019/09/legiao_gTnwjab0Cz4tp5X8NOmLiWSGEMH29Bq7ZdhVPlUcFu.png.jpeg
What should I do?
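One quick diagnostic (a sketch, using the subdomain's robots.txt URL from the question) is to fetch robots.txt with a Googlebot-like User-Agent, since a CDN such as Cloudflare can answer bots differently than browsers, and Google treats an unreachable (5xx) robots.txt as a blanket block:

```python
import urllib.error
import urllib.request

# Fetch robots.txt the way a crawler would; a CDN may serve bots a
# different response (e.g. a 5xx or a challenge page) than browsers.
req = urllib.request.Request(
    "https://kanto.legiaodosherois.com.br/robots.txt",
    headers={"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                           "+http://www.google.com/bot.html)"},
)
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        print(resp.status, resp.headers.get("Content-Type"))
        print(resp.read(500).decode("utf-8", "replace"))
except OSError as exc:
    # A 5xx or connection failure here would explain Search Console's
    # "blocked by robots.txt": Google assumes everything is disallowed
    # when it cannot fetch robots.txt.
    print("fetch failed:", exc)
```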
For some reason, Google is indexing the robots.txt file for some of our sites and showing it in search results. See screenshots below.
Our robots.txt file is not linked from anywhere on the site and contains just the following:
This only happens for some sites. Why is this happening and how do we stop it?
Screenshot 1: Google Search console…
Why is Google indexing our robots.txt file and showing it in search results?
In my AdSense account, under "Revenue optimization", I have crawl errors. When I click "Fix crawl errors", I see the following:

Blocked URLs error:
http://www.rechargeoverload.in/search/label (Robot Denied)
Robots.txt is blocking my labels
I want to understand how the robots.txt file can be used by an attacker. I know it can expose a list of paths and directories; is that all, or can an attacker learn more from it?
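As a minimal illustration (the paths here are invented), the Disallow lines are a ready-made reconnaissance list: an attacker can harvest them in a few lines and probe each path directly, since robots.txt forbids nothing, it only asks:

```python
# Hypothetical robots.txt content; the paths are made up for
# illustration, but admin/backup-style entries are common finds.
sample = """\
User-agent: *
Disallow: /admin/
Disallow: /backup/
Disallow: /staging/login.php
"""

# Harvest every Disallow path: a list of URLs the site owner
# would rather not advertise.
hidden_paths = [
    line.split(":", 1)[1].strip()
    for line in sample.splitlines()
    if line.lower().startswith("disallow:")
]
print(hidden_paths)  # ['/admin/', '/backup/', '/staging/login.php']
```

Beyond path disclosure, the file can also hint at the technology stack (e.g. /wp-admin/ implies WordPress), which is why sensitive areas should be protected by authentication rather than merely listed in robots.txt.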
I have a website (built with MediaWiki 1.33.0 CMS) which contains a robots.txt file.
In that file there is one line containing the literal domain of that site:
I usually prefer to replace literal domain references with a "variable value call" (VVC): a placeholder that is resolved at execution time (how exactly depends on the specific case) into the domain itself. An example of a VVC would be a Bash variable substitution.
Many CMSs have a global directives file which usually contains the base address of the website:
In MediaWiki 1.33.0 this file is LocalSettings.php, which contains the base address on line 32:

$wgServer = "https://example.com";
How could I call this value with a variable value call in robots.txt?
This will help me avoid confusion and malfunction if the domain of the website is changed; I wouldn’t have to change the value manually there as well.
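robots.txt itself is static plain text with no substitution mechanism, so the usual workaround is to generate it from the CMS configuration, either at deploy time or by routing /robots.txt to a script. A rough sketch of the idea (shown in Python rather than MediaWiki's PHP; wg_server and the Disallow path are stand-ins, not real MediaWiki API):

```python
# Stand-in for $wgServer from LocalSettings.php; in a real setup this
# value would be read from the MediaWiki configuration, not hard-coded.
wg_server = "https://example.com"

# Template with the base address factored out, so a domain change
# only has to happen in one place.
template = """\
User-agent: *
Disallow: /w/
Sitemap: {base}/sitemap-index.xml
"""

robots_txt = template.format(base=wg_server)
print(robots_txt)
```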
One can order the directives in robots.txt this way:

User-agent: DESIRED_INPUT
Sitemap: https://example.com/sitemap-index.xml
Disallow: /

or this way:

User-agent: DESIRED_INPUT
Disallow: /
Sitemap: https://example.com/sitemap-index.xml
I assume both are okay, since presumably all crawlers parse the file in the correct order. Is it a best practice to put Sitemap: last, to guard against the extremely unlikely bug of a crawler fetching the sitemap's URLs before it has processed the Disallow?
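That assumption can be checked against at least one real parser. Python's urllib.robotparser (a sketch reusing the example URL above) records the Sitemap line wherever it appears, and the Disallow is honoured either way; per the Robots Exclusion Protocol, Sitemap is a standalone directive whose position is irrelevant:

```python
from urllib.robotparser import RobotFileParser

# The same group with Sitemap before and after Disallow.
sitemap_first = """\
User-agent: *
Sitemap: https://example.com/sitemap-index.xml
Disallow: /
"""

sitemap_last = """\
User-agent: *
Disallow: /
Sitemap: https://example.com/sitemap-index.xml
"""

for text in (sitemap_first, sitemap_last):
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    # Both orderings yield the same sitemap list and the same verdict.
    print(parser.site_maps(),
          parser.can_fetch("*", "https://example.com/page"))
```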
Google Search Console and Mobile-Friendly Test both give me the following two warnings for my WordPress based website:
- Content wider than screen
- Clickable elements too close together
The screenshot these tools render of my website looks completely broken, as if no CSS was applied.
My case was different: the following is what my robots.txt file looks like, and I still get the same warning messages nonetheless. I use an SEO framework, so I created my own static version of robots.txt.
User-agent: *
Allow: /
Sitemap: https://*****
Along with the two warnings, I almost always get "Page Loading Issue" in the test results as well. Could this be a server speed issue? I am located in Japan, and my website mainly targets Japanese users, but I am using a SiteGround server rather than a Japanese one. I am aware that this hurts my site's speed in general, but is it also affecting the results of the above-mentioned Google tests?
I am managing an eCommerce brand that has thousands of products. Some of these products have multiple SKUs (variants in terms of colour). These SKUs use a URL query string parameter to differentiate between the colour variants. Since the variants are the same product, differing only by colour, they are all canonicalised to the non-colour version for SEO.
Example setup of products:
Hugo Boss T-Shirt product page (/product/hugo-boss-red-t-shirt) with the below…
How to Use Robots.txt to block GoogleBot, but not AdsBot
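Since AdsBot ignores the generic User-agent: * group and only obeys rules addressed to it by name, a minimal robots.txt for this (assuming the goal is to hide the site from web search while keeping it open to ads review) might be:

```
# Block the regular web-search crawler.
User-agent: Googlebot
Disallow: /

# AdsBot must be named explicitly; it does not follow User-agent: *.
User-agent: AdsBot-Google
Allow: /
```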