Internet Archive to ignore robots.txt directives


#1

Originally published at: http://boingboing.net/2017/04/22/internet-archive-to-ignore-rob.html


#2

This may backfire in that websites will start blocking the IA’s crawlers outright. I can easily imagine someone maintaining a blacklist of IA’s IP addresses and either giving it away or selling it to owners of shady, manipulative websites.

But I think stopping the practice of retroactively applying robots.txt is a good thing all around. That assumed a level of good faith on the part of domain owners above and beyond what anyone else did.


#3

Though at this very moment I am using the trick mentioned in the blog post’s comments to get some blocked history that I’ve dearly wanted (and that is still blocked), I’ll be satisfied if they just switch to using the historical robots.txt. That would avoid the domain squatter problem, as well as the less common problem of a domain owner and site owner getting into a disagreement and the domain owner redirecting to a new site. Site owners could then request by e-mail that a newly applied robots.txt be applied retroactively to the rest of their site’s history (not the domain’s).


#4

I can see both sides of this. A domain squatter can take over an old domain, put in a robots.txt, and boom, all history is gone forever. On the other hand, it’s kinda shitty that this also ignores incidents where site owners legitimately don’t want their site scraped.

Also, speaking from personal experience, getting previously owned domains scrubbed from the Internet Archive is a right pain in the ass.


#5

You’d think this would be the correct policy that they should have been following all along, but this seems to be one of the many areas in tech where a commonsense solution being technically inconvenient for a programmer to implement causes it to be declared impossible.


#6

Why should this even be possible? If someone publishes a book, newspaper, or magazine and you later take over the company you can hardly expect to remove all copies from all the libraries in the world; why should it be different for digitally delivered material?


#7

How is it inconvenient? All they need do is follow the robots.txt that is in force at the time of the crawl. They don’t need to purge the archive of previous crawls. Surely the original policy was more work for both the programmers and the machines.
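
A minimal sketch of that idea, for illustration only: keep each robots.txt the crawler has seen along with its date, and check every capture against the version that was in force when that capture was made. The `RobotsHistory` class, the dates, and the `ia_archiver` user-agent string are assumptions made for the example, not a description of how the Internet Archive actually does it. The parsing itself uses Python's standard `urllib.robotparser`.

```python
from bisect import bisect_right
from urllib.robotparser import RobotFileParser


def parser_for(robots_text):
    """Build a parser from the text of one robots.txt snapshot."""
    rp = RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp


class RobotsHistory:
    """Hypothetical store of robots.txt snapshots, ordered by date seen."""

    def __init__(self):
        self._dates = []    # ISO dates (YYYY-MM-DD), kept sorted
        self._parsers = []  # parser for the snapshot seen on that date

    def add(self, date, robots_text):
        # ISO date strings sort correctly as plain strings.
        i = bisect_right(self._dates, date)
        self._dates.insert(i, date)
        self._parsers.insert(i, parser_for(robots_text))

    def allowed(self, capture_date, url, agent="ia_archiver"):
        """Was `url` allowed by the robots.txt in force on `capture_date`?"""
        i = bisect_right(self._dates, capture_date)
        if i == 0:
            return True  # no robots.txt known yet at that time
        return self._parsers[i - 1].can_fetch(agent, url)


history = RobotsHistory()
# Original owner: nothing disallowed.
history.add("2005-01-01", "User-agent: *\nDisallow:\n")
# Later owner (say, a squatter): everything disallowed.
history.add("2016-06-01", "User-agent: *\nDisallow: /\n")

old_capture_ok = history.allowed("2010-03-04", "http://example.com/blog/post")
new_capture_ok = history.allowed("2017-01-01", "http://example.com/blog/post")
```

Under this scheme the 2010 capture stays visible and only crawls made after the new robots.txt appeared are affected, so nothing already archived has to be purged.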


#9

I can see obeying a robots.txt request to not scrape the current site, but not to remove all content archived under previous policies or owners that allowed it.


#10

I used to have a domain where I hosted a blog along with other personal and potentially embarrassing content (this was a long time ago, back when I was young and before I cared about privacy). I eventually sold the domain name, which changed hands a few times after I sold it and ended up in squatter hell.

Later on I realized I wanted my old content purged from archive.org. I was able to get the archived content purged but it was not a simple undertaking.


#11

I think the first time I learned of robots.txt – quite possibly before I even learned of the Internet Archive – it was because domain scrapers could suck up an enormous quantity of bandwidth and those without accommodating hosts needed some way to shield themselves from astronomical bills. Maybe that’s not a thing anymore?

[quote=“kwhitefoot, post:6, topic:99617”]Why should this even be possible? If someone publishes a book, newspaper, or magazine and you later take over the company you can hardly expect to remove all copies from all the libraries in the world; why should it be different for digitally delivered material?[/quote]

Well, there’s the old adage that digitally-delivered material can be duplicated infinitely at trivial expense and thus differs fundamentally from books, newspapers and magazines.


#12

Badly-behaved scrapers would just ignore your robots.txt anyway, or worse, use it to find the things that you didn’t want scraped. Really, it’s just a polite way to tell good scrapers what isn’t worth their time looking at.


#13

This is great! Looking forward to seeing it work on a number of domains that were parked before their time.


#14

It’s a splendid reminder that nothing published on the web is ever meaningfully private, and will always go on your permanent record.

This is definitely not wrong. But it’s also something that, at some point, we’re going to have to stop treating as an inevitability. Hopefully (if we don’t drown in rising seas, etc.) society will co-evolve with our information storage and retrieval technologies in such a way that we start taking things like a right to privacy or ownership of one’s identity or the “right to be forgotten” seriously. (That’s not an endorsement of the embryonic form that European courts are crafting on the fly, but I hope they’re not the last word on it, either.)

I think internet culture–and culture in general–will be healthier in the long run if it doesn’t take place in an all-seeing, all-remembering, all-cross-referencing panopticon. And part of getting to that will eventually involve abandoning the maximally defensive position you’re articulating here. Again, I don’t disagree with the factual validity of that statement–this comment will exist for longer than BB does–I just think that at some point it becomes an impediment.


#15

Archive.is doesn’t seem to care about the directive.


#16

That doesn’t appear to be a bot (crawler/spider) as such. Their own explanation:

One of the issues that robots.txt was designed to help with was that (well-intentioned) crawlers would recursively crawl all links on a page, then all links on those pages and so on. In the early days when the web was becoming dynamic, there were often cases where a page could be referred to by many different URLs, and crawlers could get stuck in loops and burn massive bandwidth. Some badly-designed sites would trigger side-effects whenever a page was hit, etc. ‘Canonical’ URLs and generally better-designed dynamic systems have made those things less of an issue. But it was always about telling well-intentioned crawlers how not to accidentally bork things. Never about preventing archiving, human-directed pageloads, or bad crawlers.
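
To make the loop problem concrete, here is a rough sketch of how a crawler might guard against it: normalize each URL to one canonical spelling and remember which ones it has already visited. Everything here is illustrative. The `canonicalize` rules, the `crawl` function, and the in-memory `SITE` link map are assumptions for the example, and real crawlers use much more elaborate normalization.

```python
from collections import deque
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonicalize(url):
    """Reduce equivalent spellings of a URL to one form (simplified rules)."""
    parts = urlsplit(url)
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 compare equal.
    query = urlencode(sorted(parse_qsl(parts.query)))
    path = parts.path or "/"
    # Lowercase scheme and host; drop the #fragment, which never reaches the server.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))


def crawl(start, get_links, max_pages=100):
    """Breadth-first crawl that visits each canonical URL at most once."""
    first = canonicalize(start)
    seen = {first}
    queue = deque([first])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            canonical = canonicalize(link)
            if canonical not in seen:
                seen.add(canonical)
                queue.append(canonical)
    return visited


# Hypothetical two-page site whose pages link to each other under
# several different spellings of the same URLs.
SITE = {
    "http://example.com/": [
        "http://example.com/a?x=1&y=2",
        "http://EXAMPLE.com/a?y=2&x=1#top",
    ],
    "http://example.com/a?x=1&y=2": ["http://example.com/"],
}

pages = crawl("http://example.com", lambda url: SITE.get(url, []))
```

Without the `seen` set and the normalization step, the same two pages would be queued again and again under different spellings, which is the kind of runaway fetching described above. The `max_pages` cap is a second safety net for sites that generate endless distinct URLs.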


#17

Or humans will adapt by not caring about privacy any more. I actually think that one is a little more likely – people can get used to nearly anything.


#18

Never going to happen. Privacy is intrinsic to the human condition. Privacy is key to everything we do in life. Without privacy there is no self. We are a social species but we are also individuals. Without privacy there is no space to be yourself.


#19

Can you provide evidence or argument to the effect that privacy is intrinsic to the human condition? Here you have only asserted it without providing any reasons why I should believe it’s necessary. Even if it’s key to everything we do in life now (which I do not believe is true in the first place), that does not indicate that it’s necessary to every possible way of life.

My view is that some cultures value privacy more highly than others (e.g. personal space in US vs. Japan), which demonstrates that the degree to which privacy is valued is variable. It’s therefore completely possible for the extent to which privacy is valued in our culture to be reduced in reaction to the state of affairs on the internet – in fact, I think this is pretty much inevitable. Indeed, it’s already happening – millennials already have a different attitude towards privacy than older generations.


#20

Privacy is what lets us figure out who we are. If everything we do in life becomes a permanent record for everyone to see and judge us on in perpetuity then there is no space to experiment. It will grind out all creativity and spontaneity from everyone who isn’t a sociopath.

“Without privacy, at no time are you permitted to have a space that is only just for you.”
– Snowden

What were you saying about assertions without evidence or argument?


#21

What you seem to me to be saying here is “It would be bad if that happened”, not “that couldn’t possibly happen”. You still haven’t offered any compelling reasons why it couldn’t possibly happen. I agree that it would be bad, but lots of bad things happen all the time!

Ever heard the expression “you get what you give”? You set the standards for the discussion. Why do you expect more rigor from me than from yourself? You also made the initial claim, which puts the burden of evidence on you. It’s not really fair to make a strong claim and then demand evidence from people skeptical of your claim.

My information here is largely anecdotal, but here’s an argument in lieu of evidence: millennials have spent more of their lives online than previous generations, including a great deal of social media updates, blog entries, etc. about their personal lives throughout their youths. It also seems likely (though I don’t have data to hand) that they spend more time online than older generations.

Thus, millennials have a great deal more data about their lives online than previous generations (proportional to amount of time lived). If they tended to show more concern about privacy than older generations, then it would make sense to infer that they are indeed concerned about privacy. However, if they show less or about the same concern, then it seems that they must be less concerned about privacy than previous generations because there is more potentially embarrassing information available about them on the internet than for people in older generations (on average).

The fact that there is no clear emphasis on privacy for millennials when there is so much more information available about them online demonstrates that they are not as concerned about privacy as previous generations.

A few studies that demonstrate that millennials are not significantly more concerned about privacy than non-digital natives: