I am crawling a SharePoint 2007 site from SharePoint 2013, including a lot of folders and different file types. Only the PDF files are relevant to search. The crawl is very slow, and I want to minimize what goes into my index. Is it possible to set up a crawl rule that crawls only a certain file extension?
We have about 10 million items in our web app that we crawl. The owners created a special search page that searches the items using the search API, but it only searches the filename and metadata, not the contents. To speed up the crawl, is there a way to keep the crawler from reading the contents of the files and instead populate the index with only the filenames and metadata?
I have scraped a few hundred thousand URLs for emails in the past 2 weeks. All the URLs are Instagram accounts. I’ve gotten over 30k emails using “Grab/Check -> Check for emails by crawling sites.”
Things have been working great until yesterday.
I am not getting any errors, but no emails are being collected. All the URLs checked are shown as “complete” rather than displaying an error message. I manually checked some of the links, and there were emails on the pages.
I’m not using any proxies. My delay is 0-2 seconds between actions. I am running on 1 thread. My depth is set to 1 level. I tried a depth of 2 levels to see if it would fix the issue, but it did not.
I haven’t changed any of my settings since I started scraping 2 weeks ago.
One of my associates (located in a different state) is having the exact same issue as I am.
I have a site that usually creates a few thousand pages a day, and the pages don’t change after they have been created. Recently my dedicated server crashed because Googlebot was crawling the site too often. According to Search Console, on many days Googlebot hits the site tens of thousands of times, which means it keeps re-crawling pages it has already crawled. I am aware I can limit the Googlebot crawl rate, but is it possible to force Googlebot to crawl a page ONCE and ONCE only?
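There is no directive that makes Googlebot fetch a page exactly once. Since the pages never change after creation, though, recrawls can be made nearly free by sending a Last-Modified header and answering conditional requests with 304 Not Modified. A minimal sketch of the decision logic, assuming the server knows each page’s creation time (the function and parameter names below are made up for illustration):

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def conditional_response(if_modified_since, page_created):
    """Decide how to answer a crawler for a page that never changes.

    if_modified_since: raw If-Modified-Since request header, or None.
    page_created: timezone-aware UTC datetime when the page was generated.
    Returns (status_code, response_headers).
    """
    last_modified = format_datetime(page_created, usegmt=True)
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None
        # The page is immutable, so any copy fetched at or after creation
        # is still current. Answer 304 and skip regenerating the body.
        if since is not None and since >= page_created:
            return 304, {"Last-Modified": last_modified}
    return 200, {"Last-Modified": last_modified,
                 "Cache-Control": "public, max-age=31536000"}
```

Googlebot generally sends If-Modified-Since on recrawls; a 304 still counts as a crawl in Search Console, but it costs almost nothing to serve, which usually relieves the server even if the bot keeps re-checking.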
I am crawling a real-estate website, and the idea is: 1, crawl every single day and store the differences in the database; 2, when a property is sold, update the DB as well.
The challenges are: 1, How do I model the data in the database? I run a scheduler to run Scrapy every day, and I assume there is no benefit to storing the (most likely) same data over and over; I only need to store the changes relative to the data crawled the first time. If a listing has, for example, a property address, title, price guide, description, and agent name, do I need to make each of these fields a separate table to store the historical changes?
2, How do I merge/insert the new data into the database? When Scrapy runs every day and gets new data, it should update the existing database to keep what I was talking about above: the historical diffs/changes, rather than all the data again (I assume storing it all again is a waste of space?).
3, Regarding idea #2, the technical challenge I’m facing is this: since I’m crawling the buy category of the real-estate website, once a property is sold it is removed and added to the sold category. To find a sold property that I had been tracking, I think I’ll have to loop through all the rows in my database, get each property ID, attach it to the URL, and crawl those pages again to get the new information. Now, how do I model this in the database? What’s the appropriate way to track which properties I need to crawl again?
Overall, these are the biggest challenges. I know I may be heading in totally the wrong direction. There are other tools/techniques that may help, such as RabbitMQ and Redis, but I’m not sure. How about those?
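One common way to handle challenges 1 and 2 is a current-state table keyed by the site’s own property ID plus a single change-history table, rather than one history table per field. A hedged sketch with sqlite3 (all table, field, and function names here are assumptions, not anything from the site):

```python
import sqlite3

def init_db(conn):
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS properties (
            property_id TEXT PRIMARY KEY,   -- the site's own listing ID
            address TEXT, title TEXT, price_guide TEXT,
            description TEXT, agent_name TEXT,
            status TEXT DEFAULT 'buy',      -- 'buy' or 'sold'
            last_seen TEXT                  -- date of last crawl that saw it
        );
        CREATE TABLE IF NOT EXISTS property_changes (
            property_id TEXT, field TEXT,
            old_value TEXT, new_value TEXT, changed_at TEXT
        );
    """)

FIELDS = ("address", "title", "price_guide", "description", "agent_name")

def upsert(conn, item, today):
    """Insert a new listing, or record only the fields that changed."""
    row = conn.execute(
        "SELECT address, title, price_guide, description, agent_name "
        "FROM properties WHERE property_id = ?", (item["property_id"],)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO properties (property_id, address, title, price_guide,"
            " description, agent_name, last_seen) VALUES (?,?,?,?,?,?,?)",
            (item["property_id"], *(item[f] for f in FIELDS), today))
    else:
        for old, field in zip(row, FIELDS):
            if item[field] != old:   # store the diff, not the whole row
                conn.execute(
                    "INSERT INTO property_changes VALUES (?,?,?,?,?)",
                    (item["property_id"], field, old, item[field], today))
                # field comes from the fixed FIELDS tuple, so this is safe
                conn.execute(
                    f"UPDATE properties SET {field} = ? WHERE property_id = ?",
                    (item[field], item["property_id"]))
        conn.execute("UPDATE properties SET last_seen = ? WHERE property_id = ?",
                     (today, item["property_id"]))
    conn.commit()
```

For challenge 3 this also avoids looping over every row: after a daily crawl, any property whose `last_seen` is older than today has disappeared from the buy listings, so only those IDs need to be re-fetched from the sold category and flipped to `status = 'sold'`. A plain SQL database handles this scale fine; RabbitMQ/Redis only become relevant if you later need to queue crawl jobs across machines.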
My intention is to read the cost details of a product from various websites so that I can display cost-comparison details on an HTML page of my Spring application. Can anyone suggest how to do it? Is there any technology to achieve this, so that I can always read the updated data from the other websites and display it in my Spring application? I saw some web-scraper tools as Chrome extensions, but they generate an Excel workbook. How could I use that in my Spring application and display it on an HTML page?
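The usual approach is server-side: fetch each product page on a schedule, extract the price, and serve the stored results to the HTML view (in Spring, typically a `@Scheduled` service using an HTML-parsing library such as jsoup). The core extraction step is sketched below in Python for brevity; the currency symbol and price format are assumptions, and the same regex works with Java’s `Pattern`:

```python
import re

def extract_price(html, currency="$"):
    """Pull the first price like $1,299.99 out of a product page's HTML.

    Returns the price as a float, or None if no price is found.
    """
    pattern = re.escape(currency) + r"\s*([\d,]+(?:\.\d{1,2})?)"
    m = re.search(pattern, html)
    if not m:
        return None
    return float(m.group(1).replace(",", ""))
```

With something like this running per site, the comparison page just reads the latest stored prices; no Excel export step is needed.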
I have a website with around 100K URLs. I want to disallow crawling for every URL that has an ID with the pattern www.site.com/node/sport/category/id, but not the URLs of the form www.site.com/node/sport/category/ without the ID.
How can I express this rule in robots.txt?
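Assuming the ID is appended directly after /node/sport/category/ and those segments are literal, Googlebot and most major crawlers honour Allow and the $ end-of-URL anchor, so a sketch could be:

```
User-agent: *
Disallow: /node/sport/category/
Allow: /node/sport/category/$
```

The Disallow line matches everything under /node/sport/category/, while the more specific Allow line ($ anchors the match at the end of the URL) exempts the bare category page, since the longest matching rule wins for Googlebot. Note that $ and longest-match precedence are extensions, not part of the original robots.txt convention, so behaviour may differ for minor crawlers; the robots.txt tester in Search Console can confirm it for Googlebot.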
Rocket Fast Indexer – Index your 1000 links faster! For $19. With 15+ years of experience in SEO, link building, and online marketing, I find that indexing backlinks has always been a big challenge for all of us link builders. It’s even harder nowadays after the recent updates, as we all know. I have made my private “link crawler cum indexer.” Here are its features: – We don’t make any form of links to your backlinks. – A uniquely designed approach: technically, it softly asks the G-bot to crawl. No extra investment in a VPS, proxies, or aged G accounts. Generally, links start indexing within 2-3 days of processing, most of the time even earlier. In our live tests we have recorded up to 80% of links indexing within 1-2 months. Introductory Offer: 15% Off. Coupon Code: 15off. Give it a try with a minimum small plan and see the difference yourself before ordering more.
Category: Link Building
Just a question.
Is it possible for ScrapeBox to crawl through links and check their HTML to see if they have the Facebook pixel installed in the code?
And save the URL if they don't?
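Even if ScrapeBox can't filter on this directly, the harvested URLs can be exported and post-processed. A minimal sketch of the detection step (the marker strings are the common fingerprints of the standard pixel snippet; fetching each page is assumed to be handled by whatever HTTP client you use):

```python
# Strings that appear in the standard Facebook pixel embed code.
PIXEL_MARKERS = ("connect.facebook.net/en_US/fbevents.js",
                 "fbq('init'", 'fbq("init"')

def has_facebook_pixel(html):
    """True if the page's HTML contains the standard Facebook pixel snippet."""
    return any(marker in html for marker in PIXEL_MARKERS)

def urls_without_pixel(pages):
    """pages: iterable of (url, html) pairs. Returns URLs missing the pixel."""
    return [url for url, html in pages if not has_facebook_pixel(html)]
```

Pixels loaded indirectly through a tag manager won't contain these markers, so a string check like this can report false positives for such sites.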
I'm trying to find out how many pages Googlebot crawls daily, weekly, or monthly. I'm not trying to find numbers for my specific website; rather, I want general numbers. Is it possible to get those? Cheers, David