Web Crawling system design question

I am crawling a real estate website. The idea is to (1) crawl it every day and store the differences in the database, and (2) update the database when a property is sold.

The challenges are: 1. How do I model the data in the database? I run a scheduler that runs Scrapy every day. I assume there is no benefit in storing what is most likely the same data over and over; I only need to store the changes relative to the data crawled the first time. If a listing has, for example, a property address, title, price guide, description, and agent name, do I need to put each of these fields in a separate table to store its historical changes?

2. How do I merge or insert the new data into the database? When Scrapy runs each day and fetches new data, it should update the existing database so that it holds what I described above: the historical data (diffs/changes) rather than all the data again, which I assume would be a waste of space.
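One possible way to picture challenges 1 and 2 together is a table holding the current state of each listing plus a single change-log table, instead of one history table per field. The sketch below is only an illustration using Python's built-in sqlite3 module; the table names, column names, and the `merge_listing` helper are assumptions made up for this example, not part of Scrapy or of the original setup.

```python
import sqlite3

# Hypothetical field names, taken from the examples in the question.
FIELDS = ("address", "title", "price_guide", "description", "agent_name")


def init_db(conn):
    """Create a current-state table and one generic change-log table."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS property (
            property_id TEXT PRIMARY KEY,
            address TEXT, title TEXT, price_guide TEXT,
            description TEXT, agent_name TEXT
        );
        CREATE TABLE IF NOT EXISTS property_change (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            property_id TEXT NOT NULL REFERENCES property(property_id),
            field TEXT NOT NULL,
            old_value TEXT,
            new_value TEXT,
            crawled_on TEXT NOT NULL
        );
    """)


def merge_listing(conn, item, crawled_on):
    """Insert a new listing, or log only the fields that changed.

    Returns the list of field names that changed (empty for a new or
    unchanged listing).
    """
    row = conn.execute(
        "SELECT address, title, price_guide, description, agent_name "
        "FROM property WHERE property_id = ?",
        (item["property_id"],),
    ).fetchone()
    if row is None:
        # First time this listing is seen: store the full record once.
        conn.execute(
            "INSERT INTO property (property_id, address, title, price_guide, "
            "description, agent_name) VALUES (?, ?, ?, ?, ?, ?)",
            (item["property_id"], *(item.get(f) for f in FIELDS)),
        )
        return []
    changed = []
    for field, old in zip(FIELDS, row):
        new = item.get(field)
        if new != old:
            # One change-log row per changed field, then update current state.
            conn.execute(
                "INSERT INTO property_change "
                "(property_id, field, old_value, new_value, crawled_on) "
                "VALUES (?, ?, ?, ?, ?)",
                (item["property_id"], field, old, new, crawled_on),
            )
            # `field` comes from the fixed FIELDS tuple, never from user input.
            conn.execute(
                f"UPDATE property SET {field} = ? WHERE property_id = ?",
                (new, item["property_id"]),
            )
            changed.append(field)
    return changed
```

In a Scrapy project, a function like this could be called from an item pipeline for each scraped item. A day with no changes then writes no new rows at all.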

3. Regarding idea #2, the technical challenge is this: I am crawling the buy category of the real estate website, and once a property is sold it is removed from there and added to the sold category. To find the sold properties that I was tracking before, I think I will have to loop through all rows in my database, take each property ID, attach it to the URL, and crawl those pages again to get the new information. How do I model this in the database? What is an appropriate way to track it so that I know which properties I need to crawl again?
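For challenge 3, one option that avoids re-crawling every row is to record when each listing was last seen in the buy category and re-check only those that did not show up in today's crawl. The following is a rough sketch under that assumption, again with sqlite3. The `tracked_property` table, the helper names, and the `SOLD_URL_TEMPLATE` URL pattern are all hypothetical and would need to match the real site.

```python
import sqlite3

# Hypothetical URL pattern; the real site's sold-listing URL will differ.
SOLD_URL_TEMPLATE = "https://example.com/sold/{property_id}"


def init_tracking(conn):
    """Create a table tracking listing status and the last crawl date seen."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS tracked_property (
            property_id TEXT PRIMARY KEY,
            status TEXT NOT NULL DEFAULT 'for_sale',
            last_seen_on TEXT NOT NULL
        )
    """)


def mark_seen(conn, property_id, crawled_on):
    """Record that a listing appeared in the buy category on this crawl."""
    conn.execute(
        "INSERT OR IGNORE INTO tracked_property (property_id, last_seen_on) "
        "VALUES (?, ?)",
        (property_id, crawled_on),
    )
    conn.execute(
        "UPDATE tracked_property SET last_seen_on = ? WHERE property_id = ?",
        (crawled_on, property_id),
    )


def urls_to_recrawl(conn, crawled_on):
    """Return sold-page URLs for listings missing from today's buy crawl.

    Dates are ISO strings (YYYY-MM-DD), so string comparison orders them.
    """
    rows = conn.execute(
        "SELECT property_id FROM tracked_property "
        "WHERE status = 'for_sale' AND last_seen_on < ? "
        "ORDER BY property_id",
        (crawled_on,),
    ).fetchall()
    return [SOLD_URL_TEMPLATE.format(property_id=pid) for (pid,) in rows]


def mark_sold(conn, property_id):
    """Flag a listing as sold so it is no longer re-crawled."""
    conn.execute(
        "UPDATE tracked_property SET status = 'sold' WHERE property_id = ?",
        (property_id,),
    )
```

After the daily buy-category crawl has called `mark_seen` for every listing it found, `urls_to_recrawl` gives the start URLs for a second, much smaller crawl of the sold category. A listing that disappears may also have been withdrawn rather than sold, so a real version would probably need a third status for that case.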

Overall, these are the biggest challenges. I know I may be heading in totally the wrong direction. There are other tools and techniques that may help, such as RabbitMQ and Redis, but I'm not sure. Would these be useful here?

Many thanks!!!

Java Web scraper and Web Crawling

My intention is to read the cost details of a product from various websites so that I can display a cost comparison on an HTML page of my Spring application. Can anyone suggest how to do this? Are there any technologies for it, so that I can always read the updated data from the other websites and display it in my Spring application? I saw some web scraper tools available as Chrome extensions, but they generate an Excel workbook. How could I use that in my Spring application and display it on an HTML page?

Crawl your links fast with Rocket Fast Indexer (2000 links crawling plan) for $19

Rocket Fast Indexer: index your 1000 links faster, for $19! Drawing on 15+ years of experience in SEO, link building, and online marketing, I find that indexing backlinks has always been a big challenge for all of us link builders. As we all know, it is even harder nowadays after recent updates. I have built my own private link crawler and indexer. Here are its features: – We don't make any form of links for your backlinks. – An approach uniquely designed by me: technically, it softly asks the G-bot to crawl. No extra investment in VPSs, proxies, or aged G accounts. Links generally start indexing within 2-3 days of processing, and often even earlier. In our live tests we have recorded up to 80% of links indexed within 1-2 months. Introductory offer: 15% off with coupon code 15off. Give it a try with the smallest plan and see the difference yourself before ordering any more.

by: nidhim
Category: Link Building


A few websites are either copying or crawling my content. What should I do? [duplicate]

This question already has an answer here:

  • Our website was copied 100% and mirrored on a different domain 6 answers
  • Another website is mirroring and ranks above my site in search results 6 answers
  • How much of your content needs to be copied before you can file a DMCA complaint? 1 answer

I have noticed quite a change in my SERP rankings over the past months and found out that a few sites have copied my content. How should I report this to Google?

Are there general criteria for this?

I understand that Google has some provisions for this, but are there any clear policies?

Google publishes documentation on dynamic rendering for crawling, indexing JavaScript

If your site runs on JavaScript you might find this useful:

Quote:

…Google announced on Twitter Wednesday morning that it has published help documentation around what is dynamic rendering, when to use it, and how to implement it. This help documentation is designed as a workaround for webpages that deploy a form of JavaScript that makes it hard for Google to properly crawl, index and rank those pages in search…


Google publishes documentation on dynamic rendering for crawling, indexing JavaScript webpages
October 3, 2018 Search Engine Land