I’ve built a scraper that collects the articles on the main page of a news website every 15 minutes (the website looks a lot like Google News). I give each article a weight based on its ranking on the main page, and I also track how long it stays on that page.
So, for instance:
- During its first run, started at time T1, the scraper collects article A, article B and article C. A appears first on the main page of the news website, followed by B and C. I give A a score of 3, B a score of 2 and C a score of 1 (score = number of items in the list - index; the list is zero-indexed).
- 15 min later, at time T2, the scraper collects article D (a new article at 1st position), followed by A and B (C does not appear on the list anymore). Thus D score = 3, A score = 3 (score obtained during the scraper’s first run) + 2 (score obtained on the second run), B score = 2 + 1.
- At this point, I know that A and B have probably been displayed for at least 15 min on the main page, so I give them a display time of 15 minutes.
- Third run, time T3: B is still there, at the third position. All the other articles are gone from the main page. So B score = 2 + 1 + 1 and display time = 15 min + 15 min.
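The bookkeeping in the three runs above can be sketched in Python roughly as follows. This is only an illustration, not the actual scraper: the function name `accumulate` is made up, the articles E and F are placeholders for whatever occupied the first two positions at T3, and the display time is assumed to be (number of appearances - 1) × 15 min, which only holds when an article's appearances are consecutive.

```python
from collections import defaultdict

INTERVAL_MIN = 15  # the scraper runs every 15 minutes


def accumulate(runs):
    """Accumulate scores and display times over successive scraper runs.

    `runs` is a list of ranked article lists, one per run, oldest first.
    (Hypothetical helper, written only for this example.)
    """
    score = defaultdict(int)
    appearances = defaultdict(int)
    for ranked in runs:
        n = len(ranked)
        for index, article in enumerate(ranked):
            # score = number of items in the list - zero-based index
            score[article] += n - index
            appearances[article] += 1
    # Assumption: appearances are consecutive, so an article seen k times
    # has been visible for at least (k - 1) intervals.
    display_time = {a: (k - 1) * INTERVAL_MIN for a, k in appearances.items()}
    return dict(score), display_time


runs = [
    ["A", "B", "C"],  # T1
    ["D", "A", "B"],  # T2
    ["E", "F", "B"],  # T3 (E and F are placeholders, not from the real data)
]
score, display_time = accumulate(runs)
print(score["A"], display_time["A"])  # A: score 5, display time 15 min
print(score["B"], display_time["B"])  # B: score 4, display time 30 min
```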
I’d like to devise an algorithm that would allow me to say with certainty which articles have been the most visible during a given time period (most visible = displayed the longest and at the highest position on the main page during that period).
I thought of calculating (score of the article) / (number of times my scraper has run during the time period), where the number of runs is (timedelta in minutes / 15) + 1, since my scraper runs every 15 min after it is launched. But this is not satisfying if I want to give more importance to display time.
For instance, in the case described above, T3 - T1 = 30 min, so the script ran 30/15 + 1 = 3 times. Score of the article / number of runs gives 5/3 for article A and 4/3 for article B, so A has been more “important” over those 30 min.
Now imagine I run the script a 4th time and it still finds B at the third place on the main page. That would give A a score of 5/4 and B a score of 5/4 as well. However, one can argue they don’t have the same importance, as B could be seen on the main page for 45 min straight while A just sat there for 15 min (at a better place, though).
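To make the tie concrete, here is a small Python sketch of the normalization described above, using the scores from the example. The helper names `runs_in_period` and `normalized` are hypothetical, and `Fraction` is used only so the ratios print exactly as 5/3, 4/3 and 5/4.

```python
from fractions import Fraction

INTERVAL_MIN = 15  # the scraper runs every 15 minutes


def runs_in_period(elapsed_min, interval_min=INTERVAL_MIN):
    """Number of runs in a period: (timedelta / 15) + 1."""
    return elapsed_min // interval_min + 1


def normalized(scores, elapsed_min):
    """Divide each cumulative score by the number of runs in the period."""
    n_runs = runs_in_period(elapsed_min)
    return {article: Fraction(s, n_runs) for article, s in scores.items()}


# After three runs (T3 - T1 = 30 min): A has 5 points, B has 4.
after_three = normalized({"A": 5, "B": 4}, elapsed_min=30)
# After a fourth run where B is still in third place: B gains 1 point.
after_four = normalized({"A": 5, "B": 5}, elapsed_min=45)

print(after_three)  # A = 5/3, B = 4/3
print(after_four)   # A = 5/4, B = 5/4: a tie
```

The tie shows that this ratio ignores display time entirely: 45 min for B against 15 min for A never enters the calculation.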
Do you have any piece of advice to improve the way I calculate the importance score of each article? Sorry for the long post, I know this is basic maths but I’m a bad student 😉