Stack Overflow stopped publishing its Data Dump

avidseeker@lemmy.one · 1 year ago

cwagner@discuss.tchncs.de · 1 year ago

Stack Overflow senior leadership is working on a strategy to protect Stack Overflow data from being misused by companies building LLMs. While working on this strategy, we decided to stop the dump until we could put guardrails in place.

I do not see this working in any way :( Might be time to delete my SO history as well.

tojikomori@kbin.social · edit-2 1 year ago

This reply’s interesting:

How can data licensed under the CC-BY-SA licenses (that SO content is licensed under) be “misused”? The license explicitly allows others to do essentially anything they want with the data as long as attribution is given, in particular profit off of it.

When SO content is applied as parametric knowledge I’d expect the outcome to fail both the “BY” and the “SA” clauses, since model interpreters can’t provide attribution for it and their output won’t share the license. That’s true even if output is considered public domain: CC-BY-SA content can’t be moved into a public domain equivalent license. It seems practically indistinguishable from using any other in-copyright content as training material.

None of that’s to say SO is right to stop data dumps. It feels like they’re trying to find a technical solution to a legal problem, perhaps even one that rises to criminality on the part of OpenAI and others?

Naatan@lemmy.ml · 1 year ago

“Misused”. Gotta love these sites hoarding the content that was produced by their users.

I can’t wait till all these services are dead and decentralized ones have taken their place. I realize that’s by no means a guarantee, but I can hope, dammit…

June 2023 Data Dump is missing