AI Scraping Bots Are Breaking Open Libraries, Archives, and Museums
"This is a moment where that community feels collectively under threat and isn't sure what the process is for solving the problem." (Emanuel Maiberg, 404 Media)
LandedGentry
in reply to sabreW4K3

asfasdfasdfas
FaceDeer
in reply to LandedGentry

This seems contradictory. On the one hand you're saying that these works are wrongly locked behind paywalls, but on the other you're saying that scraping them is an "assault on the cornerstones of our public knowledge." Is this information supposed to be freely viewable or not?

IMO the ideal solution would be the one Wikimedia uses, which is to make the information available as an easily downloadable archive file. That lets anyone who wants the whole thing have it without "hammering" the servers. Meanwhile, the servers can be protected by standard load-balancing and DDoS-prevention systems.
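Wikimedia does publish full-site archives at dumps.wikimedia.org, so a bulk consumer never needs to crawl pages one by one. As a minimal sketch, the helper below builds the URL of a pages-articles dump; the URL layout is assumed from the public dumps index, and the wiki name and date are illustrative, so they should be checked against the index before downloading.

```python
# Sketch: point bulk consumers at a published archive instead of
# letting them crawl the site page by page. The URL pattern below
# mirrors the layout of dumps.wikimedia.org (assumed, not verified
# for every wiki/date combination).

def dump_url(wiki: str, date: str) -> str:
    """Build the URL of a pages-articles dump for a wiki and dump date."""
    return (
        f"https://dumps.wikimedia.org/{wiki}/{date}/"
        f"{wiki}-{date}-pages-articles.xml.bz2"
    )

# Illustrative values; confirm the date exists on the dumps index first.
url = dump_url("enwiki", "20240601")
print(url)
```

A scraper that fetches this one compressed file once puts far less load on the origin than millions of individual page requests.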
LandedGentry
in reply to FaceDeer

asfasdfasdfas
FaceDeer
in reply to LandedGentry

Was Aaron Swartz wrong to scrape those repositories? He shouldn't have been accessing all those publicly funded academic works? Making it easier for him to access that material would have been "capitulating to hackers"?

I think the problem here is that you don't actually believe that information should be free. You want to decide who and what gets to use that "publicly-funded academic work," and you have decided that some particular uses are allowable and others are not. Who made you that gatekeeper, though?

I think it's reasonable that information that's freely posted for public viewing should be freely viewable, as in anyone can view it. If viewing all of it puts a load on the servers providing it, but there's an alternate way of providing it that doesn't, what's wrong with doing that? It solves everyone's problems.
LandedGentry
in reply to FaceDeer

asfasdfasdfas
FaceDeer
in reply to LandedGentry
LandedGentry
in reply to FaceDeer

asfasdfasdfas
FaceDeer
in reply to LandedGentry
LandedGentry
in reply to FaceDeer

asfasdfasdfas
FaceDeer
in reply to LandedGentry

They put the website up. Load balancing, rate limiting, and the like go with the turf; it's their responsibility to make the site easy to use and hard to break. Putting up an archive of the content the scrapers want is an easy, straightforward way to accomplish that.

I think what's really going on here is that your concern isn't about keeping the site up, and it's certainly not about keeping the data it provides readily available. It's that there are specific companies you don't like, and you just want to forbid them from accessing otherwise freely accessible data.
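The rate limiting mentioned above is commonly implemented as a token bucket: each client gets tokens refilled at a steady rate, a request spends one token, and requests beyond the budget are refused. The sketch below is a generic illustration of the technique, not any particular server's implementation; the class name and parameters are chosen for this example.

```python
import time


class TokenBucket:
    """Minimal token-bucket rate limiter (illustrative sketch).

    Allows sustained traffic of `rate` requests per second, with
    short bursts up to `capacity` requests.
    """

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum stored tokens
        self.tokens = capacity      # start with a full bucket
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Spend one token if available; refuse the request otherwise."""
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# A burst of 15 back-to-back requests against a bucket of capacity 10:
# the first 10 pass, the rest are refused until tokens refill.
bucket = TokenBucket(rate=5, capacity=10)
results = [bucket.allow() for _ in range(15)]
```

A server would key one bucket per client (for example per IP or per API token), which throttles an aggressive scraper without affecting ordinary visitors.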
LandedGentry
in reply to FaceDeer

asfasdfasdfas
FaceDeer
in reply to LandedGentry

Yes, which is why I'm suggesting an approach that doesn't require scraping the site.
LandedGentry
in reply to FaceDeer

asfasdfasdfas
FaceDeer
in reply to LandedGentry

Perhaps be more succinct? You're really flooding the zone here.

No, I'm staying focused.
LandedGentry
in reply to FaceDeer