The data dump usually gets refreshed on the first weekend of the month, every 3 months.
The current data dump is still from March. Is there a problem, or is it just delayed as it has been in the past?
Stack Overflow senior leadership is working on a strategy to protect Stack Overflow data from being misused by companies building LLMs. While working on this strategy, we decided to stop the dump until we could put guardrails in place.
I do not see this working in any way :( Might be time to delete my SO history as well.
How can data licensed under the CC-BY-SA licenses (that SO content is licensed under) be “misused”? The license explicitly allows others to do essentially anything they want with the data as long as attribution is given, including profiting off of it.
When SO content is applied as parametric knowledge I’d expect the outcome to fail both the “BY” and the “SA” clauses, since model interpreters can’t provide attribution for it and their output won’t share the license. That’s true even if output is considered public domain: CC-BY-SA content can’t be moved into a public domain equivalent license. It seems practically indistinguishable from using any other in-copyright content as training material.
None of that’s to say SO is right to stop data dumps. It feels like they’re trying to find a technical solution to a legal problem, perhaps even one that rises to criminality on the part of OpenAI and others?
“Misused”. Gotta love these sites hoarding the content that was produced by its users.
I can’t wait till all these services are dead and decentralized ones have taken their place. Realize that’s by no means a guarantee, but I can hope damnit…