When I use the Python package newspaper3k and run the code
    import newspaper

    paper = newspaper.build('http://abcnews.com', memoize_articles=False)
    for url in paper.article_urls():
        print(url)
I get a list of URLs for articles that I can download, in which both these URLs exist
As can be seen, the only difference between the two URLs is the s in https.
The question is: can the webpage content differ simply because an s is added to http? If I scrape a news source (in this case http://abcnews.com), do I need to download both articles to be sure I don't miss any article, or are they guaranteed to have the same content so that I can download only one of them?
I have also noticed that some URLs are duplicated by adding www. after the https://. I have the same question here: can this small change cause the webpage content to differ, and is this something I should take into account, or can I simply ignore one of these two URLs?
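
To make the two kinds of duplicates concrete, here is a minimal sketch of how URLs that differ only in the scheme (http vs. https) or in a leading www. could be grouped. It uses only the standard library. The helper names canonical_key and dedupe_urls and the example.com URLs are hypothetical, not part of newspaper3k. The sketch assumes that such variants serve the same content, which nothing in the URL syntax guarantees; it only identifies candidates, and the port and fragment are ignored as well.

```python
from urllib.parse import urlsplit


def canonical_key(url):
    """Hypothetical helper: a comparison key that ignores the scheme and a leading 'www.'."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()  # hostname drops any port and lowercases
    if host.startswith("www."):
        host = host[len("www."):]
    # Keep host, path (without a trailing slash) and query; drop scheme and fragment.
    return (host, parts.path.rstrip("/"), parts.query)


def dedupe_urls(urls):
    """Hypothetical helper: keep the first URL seen for each canonical key, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = canonical_key(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


if __name__ == "__main__":
    # Placeholder URLs, not taken from the actual output of newspaper3k.
    candidates = [
        "http://example.com/story?id=1",
        "https://example.com/story?id=1",
        "https://www.example.com/story?id=1",
        "https://example.com/other",
    ]
    for url in dedupe_urls(candidates):
        print(url)
```

Whether the assumption holds for a given site could be checked by downloading both variants of a few pairs and comparing the results before discarding one of them.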