Cleaning Up Harvested URLs

Hi there,

I am sure there is an option for this, but I am not sure which one it is or how to use it. We harvest lots of URLs, and the first step in our process is to remove duplicates.

Then I want to remove the URLs that contain certain words, like:

youtube.
wiki
cnn
bbc

What I would like is perhaps to create a file containing these words. I did find a blacklist word list, edited it, put those words in it, and ran the removal, but the URLs containing them still remained, so maybe there is something wrong with how I am doing it.
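
To make the goal concrete, here is a minimal Python sketch of the cleanup I am after, independent of any particular harvesting tool. The function names, the sample URLs, and the case-insensitive substring matching are all my own assumptions for illustration; this is not how the tool's blacklist necessarily works.

```python
def load_blacklist(lines):
    """Return lowercase, non-empty blacklist words from an iterable of lines."""
    return [line.strip().lower() for line in lines if line.strip()]


def clean_urls(urls, blacklist):
    """Drop duplicate URLs, then drop any URL containing a blacklisted word.

    Matching is a plain case-insensitive substring test (an assumption),
    so "wiki" also matches "wikipedia". Input order is preserved.
    """
    seen = set()
    kept = []
    for url in urls:
        url = url.strip()
        if not url or url in seen:
            continue  # skip blank lines and exact duplicates
        seen.add(url)
        lowered = url.lower()
        if any(word in lowered for word in blacklist):
            continue  # URL contains a blacklisted word
        kept.append(url)
    return kept


if __name__ == "__main__":
    # Hypothetical sample data; in practice both lists would come from files.
    blacklist = load_blacklist(["youtube.", "wiki", "cnn", "bbc"])
    harvested = [
        "https://www.youtube.com/watch?v=1",
        "https://example.com/page",
        "https://example.com/page",
        "https://en.wikipedia.org/wiki/Test",
        "https://example.net/about",
    ]
    for url in clean_urls(harvested, blacklist):
        print(url)
```

With the sample list above, only the example.com and example.net URLs should be left, each appearing once.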

It would also be great if you could explain how to set up harvesting so that URLs containing those stop words are not harvested in the first place.

Thanks again