We need to crowdsource a db of websites that have opted to exclude themselves from the #WaybackMachine. The WM has become essential with so many Tor-blocking and #Cloudflare blocking sites. I don't want to see WM-excluded sites in my search results. Such archive-resisting sites also degrade the blogs that link to them (a dead link invalidates part of an article when there's no archive).
If you discover a website that has opted out of archive.org's #WaybackMachine, there is now a place where you can list it: https://git.sdf.org/deCloudflare/deCloudflare/src/branch/master/anti-tor_users/misc/blocking_archiveorg.md
@edsu it has moved (new link at the end of this post). I'm not clear on what legitimate reason you have in mind for blocking bots from harvesting. Can you give an example? https://git.nogafam.es/deCloudflare/deCloudflare/src/branch/master/anti-tor_users/fqdn/antiarchive.txt
@edsu Archive.org gives publishers control, most likely to avoid legal problems. So while it is up to the publisher, as users we have a right to judge that. Now that the #WaybackMachine has become indispensable (due to Tor-hostility), those who act against WBM act against Tor & thus against privacy. They are not our friends and we have a right to resist propagation of their website URLs.
@edsu The blocklist is merely objective data for people to use as they see fit. What I hope will happen is someone will cross-reference the wbm blocklist with Tor-blocking sites, and reduce search rankings of sites that block both.
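A rough sketch of that cross-reference, in case it helps someone get started. It assumes both lists are plain text with one hostname per line and optional `#` comment lines, which may not match the real files. The names `parse_fqdn_list` and `demote_double_blockers` are made up for illustration, and moving a result to the bottom is only one possible way to "reduce" a ranking.

```python
from urllib.parse import urlsplit


def parse_fqdn_list(text):
    """Return the set of hostnames in a one-per-line list.

    Assumption: blank lines and lines starting with '#' carry no hostname.
    """
    hosts = set()
    for line in text.splitlines():
        line = line.strip().lower()
        if line and not line.startswith("#"):
            hosts.add(line)
    return hosts


def demote_double_blockers(result_urls, blocks_both):
    """Move results whose host blocks both the archive and Tor to the end.

    sorted() is stable, so the original order is kept within each group.
    """
    def is_double_blocker(url):
        host = urlsplit(url).hostname or ""
        return host.lower() in blocks_both

    return sorted(result_urls, key=is_double_blocker)


# Hypothetical sample data, not taken from the real lists.
anti_archive = parse_fqdn_list("# sites excluded from the WM\nexample.com\nnews.example\n")
anti_tor = parse_fqdn_list("example.com\nshop.example\n")
blocks_both = anti_archive & anti_tor  # set intersection

results = ["https://example.com/a", "https://ok.example/b", "https://news.example/c"]
reranked = demote_double_blockers(results, blocks_both)
```

The match here is on the exact hostname only. Subdomains such as `www.example.com` would need their own entries or extra matching logic.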
@resist1984 I don't know if this is helpful, but you could try to collect some leads for the list from Google, for example: https://www.google.com/search?q=filetype%3Atxt+ia_archiver&hl=en&ei=pjOIYO3oGLKy5NoP6c-TkA4&oq=filetype%3Atxt+ia_archiver&gs_lcp=Cgdnd3Mtd2l6EANQxg5YxStgvC5oAXAAeAGAAW2IAeYOkgEEMjQuM5gBAKABAaoBB2d3cy13aXrAAQE&sclient=gws-wiz&ved=0ahUKEwjt0_ew5J7wAhUyGVkFHennBOIQ4dUDCA0&uact=5
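Leads from a search like that could be checked with a small script. This is only a sketch: it assumes a site signals its opt-out with a robots.txt rule for the `ia_archiver` user agent (which is what the query looks for), and `blocks_ia_archiver` is a made-up helper name. Exclusions arranged some other way would not show up in this check.

```python
from urllib.robotparser import RobotFileParser


def blocks_ia_archiver(robots_txt):
    """Return True if this robots.txt text disallows '/' for ia_archiver.

    Assumption: a site-wide Disallow for ia_archiver (or for all agents)
    is treated as an opt-out; partial disallows are not counted.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch("ia_archiver", "/")


# Hypothetical robots.txt contents for illustration.
opted_out = "User-agent: ia_archiver\nDisallow: /\n"
partial = "User-agent: *\nDisallow: /private\n"

print(blocks_ia_archiver(opted_out))  # True
print(blocks_ia_archiver(partial))    # False
```

The function takes the file contents as a string, so fetching each candidate's robots.txt is left to whatever tool you prefer.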
@resist1984 I mean, maybe it's to avoid legal problems, but I like to think it's also because they recognize it's the right thing to do. There are lots of shades of gray in the world and the world wide web is no exception.
@resist1984 thanks! Generally speaking I think that's up to the publisher to decide. The Internet Archive doesn't own the web and if you don't want them to serve up your content in perpetuity I think that's ok.