I’m rather curious to see how the EU’s privacy laws are going to handle this.

(Original article is from Fortune, but Yahoo Finance doesn’t have a paywall)

  • Primarily0617@kbin.social · ↑207 ↓7 · 1 year ago (edited)

    it’s crazy that “it’s too hard :(” has become an acceptable justification for just ignoring the law within tech circles

    • BrianTheeBiscuiteer@lemmy.world · ↑94 ↓3 · 1 year ago

      I’m not an AI expert, and I wouldn’t say it is too hard, but I believe removing a specific piece of data from a model is like trying to remove excess salt from a stew. You can add things to make the stew less salty but you can’t really remove the salt.

      The alternative, which is a lot of effort but boo-hoo for big tech, is to throw out the model and start over without the data in question. These companies would do well to start with models built on public or royalty-free data and then add riskier data on top of that (so you only have to rebake starting from the “public” version).
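
      To make that “rebake from the public version” idea concrete, here’s a rough sketch of how it could look. The model, data and file names are made up purely for illustration; this is not any vendor’s actual pipeline.

      ```python
      # Rough sketch: keep a checkpoint trained only on public data, layer riskier
      # data on top, and "rebake" from the public checkpoint when data must go.
      import torch
      import torch.nn as nn

      def fine_tune(model, batches, epochs=1, lr=1e-3):
          """Minimal training loop over (input, target) batches."""
          opt = torch.optim.SGD(model.parameters(), lr=lr)
          loss_fn = nn.MSELoss()
          for _ in range(epochs):
              for x, y in batches:
                  opt.zero_grad()
                  loss_fn(model(x), y).backward()
                  opt.step()

      # 1. Train only on public / royalty-free data and keep that checkpoint.
      model = nn.Linear(4, 1)
      public_batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(10)]
      fine_tune(model, public_batches)
      torch.save(model.state_dict(), "public_base.pt")

      # 2. Add the riskier data on top of the public base.
      risky_batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(5)]
      fine_tune(model, risky_batches)

      # 3. On a deletion request, rebake: reload the public base and fine-tune
      #    again on the risky data minus the contested records, instead of
      #    retraining absolutely everything from scratch.
      model.load_state_dict(torch.load("public_base.pt"))
      remaining = risky_batches[1:]   # pretend the first batch must be deleted
      fine_tune(model, remaining)
      ```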

      • Primarily0617@kbin.social · ↑47 ↓1 · 1 year ago

        sounds like big tech shouldn’t have spent the last decade investing in a kitchen refit so that they could make stew really well but nothing else

      • GoosLife@lemmy.world · ↑28 ↓1 · 1 year ago (edited)

        If there’s something illegal in your dish, you throw it out. It’s not a question. I don’t care that you spent a lot of time and money on it. “I spent a lot of time preparing the circumstances leading to this crime” is not an excuse, neither is “if I have to face consequences for committing this crime, I might lose money”.

        • Quokka@quokk.au · ↑2 ↓3 · 1 year ago

          Fuck no.

          It’s illegal to be gay in many places, should we throw out any AI that isn’t homophobic as shit?

          • GoosLife@lemmy.world · ↑1 · 1 year ago

            No, especially because it’s not the same thing at all. You’re talking about the output, we’re talking about the input.

            The training data was illegally obtained. That’s all that matters here. They can train it on fart jokes or Trump propaganda, it doesn’t really matter, as long as the Trump propaganda in question was legally obtained by whoever trained the model.

            Whether we should then allow chatbots to generate harmful content, and how we would regulate that by limiting acceptable training data, is a much more complex issue that can be discussed separately. To address your specific example, it would make the most sense for the chatbot to be guided towards a viewpoint that aligns with its intended userbase. This just means that certain chatbots might be more or less willing to discuss certain topics. In the same way that an AI for children probably shouldn’t be able to discuss certain topics, a chatbot made for use in a highly religious area, where homosexuality is very taboo, would most likely not be willing to discuss gay marriage at all, rather than being made intentionally homophobic.

            • Quokka@quokk.au · ↑1 · 1 year ago

              The output only exists from the input.

              If you fed your model only on “legal” content, that would in many places ensure it had no LGBT+-positive content.

              The legality of training data (given the dubious nature of justice systems) is not the angle to go for here.

              • GoosLife@lemmy.world · ↑1 · 1 year ago

                You seem to think the majority of LGBT+ positive material is somehow illegal to obtain. That is not the case. You can feed it as much LGBT+ positive material as you like, as long as you have legally obtained it. What you can’t do is train it on LGBT+ positive material that you’ve stolen from its original authors. Does that make more sense?

                • Quokka@quokk.au · ↑1 ↓1 · 1 year ago

                  You do know being LGBT+ in many places is illegal, right? And can even carry the death penalty.

                  Legality is not important and we should not care if it’s considered legal or not, because what’s legal isn’t what’s right or ethical.

                  • GoosLife@lemmy.world · ↑1 · 1 year ago

                    Yes I am aware of that. However, I’m not sure how this has anything to do with the fact that it is also illegal to steal data, then continue to use said data to make profits after having been found out. The two are not connected in any logical way, which makes it hard for me to continue to address your concerns in a way that makes sense.

                    The way I see it, you’re either completely missing what we’re talking about, or you have some misunderstanding of what the AI language models actually are, and what they can do.

                    For the record, I’m in no way disagreeing with your views, or your statements that legal and ethical don’t always overlap. It is clear to me that you are open minded and well-intended, which I appreciate, and I hope you don’t take this the wrong way.

      • Grandwolf319@sh.itjust.works · ↑11 ↓1 · 1 year ago

        Replace salt with poison or an allergenic substance and it fully holds. If a batch has been contaminated, then yes, you should try again.

        But now that the cat is out of the bag, other companies are less willing to let their data be scraped, given how valuable it can be.

        I think big tech knew this: they could only build these models on unfiltered data before the AI craze.

      • Tyfud@lemmy.one · ↑3 · 1 year ago (edited)

        I work in this field a good bit, and you’re largely correct. That’s a great analogy for trying to remove salt from a stew. The only issue with the analogy is that it’s still technically possible to recover the salt by distilling the stew, even though doing so would destroy the stew.

        Once PII data is in the model, it’s fully baked. It’d be like trying to get the eggs out of a baked cake; the chemical composition has changed into something else completely.

        That’s how building a model works today. Like baking a cake.

        In order to remove or even identify PII data in ML models or LLMs today, we’d need a whole new way of baking a cake that would keep the eggs separate from the cake until just before you tried to take a bite out of it. The tools today don’t allow you to do anything like that. They bake you a complete cake.

      • Fushuan [he/him]@lemm.ee · ↑2 · 1 year ago

        Something to keep in mind: yes, they would need to retrain the models from zero, but if they followed any halfway decent process they would have backups and versioned copies of the training data, so they would just retrain everything on a subset of the original data. The optimizations they have already applied to the system should then be reapplicable in the same manner, and the resulting product should be somewhat similar. Another option would be to design a de-training process, where you generate an input from the “must be deleted” data that, when trained on, acts as a sort of “negative input”, so the model ends up in the same place it would have ended up had it never been trained on the “must be deleted” data in the first place.

        I bet you that if governments act harshly enough, tech companies will develop some sort of “negative training”.

        In the end this is a solvable optimization problem: what input do I need to feed the already-trained model so that it becomes equivalent to the model that would have been produced by training without the requested data?
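
        A rough sketch of what such a “negative training” step could look like, using plain gradient ascent on the data that must be forgotten. The toy model and data below are invented for illustration, and real unlearning methods are considerably more involved than this.

        ```python
        # Toy sketch of "negative training": push the model away from the data it
        # must forget while keeping it anchored to the data it may retain.
        import torch
        import torch.nn as nn

        model = nn.Linear(4, 1)              # stand-in for an already-trained model
        loss_fn = nn.MSELoss()
        opt = torch.optim.SGD(model.parameters(), lr=1e-3)

        forget_set = [(torch.randn(8, 4), torch.randn(8, 1))]   # "must be deleted" data
        retain_set = [(torch.randn(8, 4), torch.randn(8, 1))]   # data we may keep

        for _ in range(100):
            # Gradient *ascent* on the forget set: negate the loss so each step
            # moves the model away from fitting the deleted examples.
            for x, y in forget_set:
                opt.zero_grad()
                (-loss_fn(model(x), y)).backward()
                opt.step()
            # Ordinary gradient descent on the retain set, so the model doesn't
            # simply get worse everywhere while it "forgets".
            for x, y in retain_set:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
        ```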

        We could even create an ML model that computes a “good enough negative input” from several examples, since testing the quality of the results is quite simple and we can train it on examples of trained models. This model would be fed a base model, some input data, and another base model trained without that data.

        All in all, AI companies will tell you that this is very hard, because they would essentially be investing time and development effort into a tool that makes their model worse instead of better, so expect a lot of pushback.

    • Zeth0s@lemmy.world · ↑20 · 1 year ago (edited)

      It’s actually a pretty normal thing in law. Laws are created with common sense and compromises in mind.

      Currently, EU laws do not cover generative AI, so now the EU needs to decide how to deal with it: treat it as a “lossy compressed database” and try to enforce a variation of the GDPR with added fuzziness, or do something else.

    • Alien Nathan Edward@lemm.ee · ↑7 · 1 year ago

      I just saw an article that said that ISPs are trying to whine their way out of listing the fees they charge because it’s too hard. Which is wild because they certainly know what I owe them after I sign the contract, but somehow it’s just impossible for them to determine right up until the moment that I’m obligated to pay it.

    • garyyo@lemmy.world · ↑5 · 1 year ago

      Always has been. The laws are there to incentivize good behavior, but when the cost of complying is larger than the projected cost of not complying, companies will ignore them and deal with the consequences. We regular folk generally can’t afford not to comply (except for all the low-stakes laws we break on a day-to-day basis), but when you have money to burn and a lot is at stake, the decision becomes more complicated.

      The tech part of that is that we don’t really even know if removing data from these sorts of models is possible in the first place. The only way to remove it is to throw away the old one and make a new one (aka retraining the model) without the offending data. This is similar to how you can’t get a person to forget something without some really drastic measures; even then, how do you know they forgot it? That information may still inform their decisions, and they might just not be aware of it or feign ignorance. The only real way to be sure is to scrap the person. Given how insanely costly it can be to retrain a model, the fines start looking like “necessary operating costs” rather than the laws looking like absolute rules.