I’m rather curious to see how the EU’s privacy laws are going to handle this.
(Original article is from Fortune, but Yahoo Finance doesn’t have a paywall)
it’s crazy that “it’s too hard :(” has become an acceptable justification for just ignoring the law within tech circles
I’m not an AI expert, and I wouldn’t say it is too hard, but I believe removing a specific piece of data from a model is like trying to remove excess salt from a stew. You can add things to make the stew less salty but you can’t really remove the salt.
The alternative, which is a lot of effort but boo-hoo for big tech, is to throw out the model and start over without the data in question. These companies would do well to start with models built on public or royalty free data and then add more risky data on top of that (so you only have to rebake starting from the “public” version).
sounds like big tech shouldn’t have spent the last decade investing in a kitchen refit so that they could make stew really well but nothing else
If there’s something illegal in your dish, you throw it out. It’s not a question. I don’t care that you spent a lot of time and money on it. “I spent a lot of time preparing the circumstances leading to this crime” is not an excuse, neither is “if I have to face consequences for committing this crime, I might lose money”.
Perhaps long pig stew could serve as an apt comparison, lol
Fuck no.
It’s illegal to be gay in many places, should we throw out any AI that isn’t homophobic as shit?
No, especially because it’s not the same thing at all. You’re talking about the output, we’re talking about the input.
The training data was illegally obtained. That’s all that matters here. They can train it on fart jokes or Trump propaganda, it doesn’t really matter, as long as the Trump propaganda in question was legally obtained by whoever trained the model.
Whether we should then allow chatbots to generate harmful content, and how we will regulate that by limiting acceptable training data, is a much more complex issue that can be discussed separately. To address your specific example, it would make the most sense that the chatbot is guided towards a viewpoint that aligns with its intended userbase. This just means that certain chatbots might be more or less willing to discuss certain topics. In the same way that an AI for children probably shouldn’t be able to discuss certain topics, a chatbot that’s made for use in highly religious area, where homosexuality is very taboo, would most likely not be willing to discuss gay marriage at all, rather than being made intentionally homophobic.
The output only exists from the input.
If you feed your model only on “legal” content, that would in many places ensure it had no LGBT+ positive content.
Legality (and the dubious nature of justice systems) of training data is not the angle to be going for.
You seem to think the majority of LGBT+ positive material is somehow illegal to obtain. That is not the case. You can feed it as much LGBT+ positive material as you like, as long as you have legally obtained it. What you can’t do is train it on LGBT+ positive material that you’ve stolen from its original authors. Does that make more sense?
You do know being LGBT+ in many places is illegal, right? And can even carry the death penalty.
Legality is not important and we should not care if it’s considered legal or not, because what’s legal isn’t what’s right or ethical.
Replace salt with poison or an allergenic substance and it fully holds. If a batch has been contaminated, then yes, you should try again.
But now that the cat is out of the bag, other companies are less willing to let something be scrapable due to how valuable it can be.
I think big tech knew this, that they can only build these models on unfiltered data before the AI craze.
It will probably be way shittier without all the private data they put in the first time too.
I work in this field a good bit, and you’re largely correct. That’s a great analogy of trying to remove salt from a stew. The only issue with that analogy is that that’s technically possible still by distilling the stew and recovering the salt. Even though it would destroy the stew.
At the point that PII is in the model, it’s fully baked. It’d be like trying to get the eggs out of a baked cake. The chemical composition has changed into something else completely.
That’s how building a model works today. Like baking a cake.
In order to remove or even identify PII in ML models or LLMs today, we’d need a whole new way of baking a cake that would keep the eggs separate from the cake until just before you tried to take a bite out of it. The tools today don’t allow you to do anything like that. They bake you a complete cake.
Something to keep in mind is that yes, they would need to retrain the models from zero, but if they did it with any kind of basic, decent method they should have backups and versions of the data they used to train, and they would only need to retrain everything with a subset of the original data. Then, the optimizations they have already applied to the system should be able to be reapplied in the same manner and the product should be somewhat similar. Another thing would be to design a de-training process, where you generate an input from the “must be deleted” input that, when trained on, acts as some sort of “negative input”, so the model ends up in the same place it would have ended up if it had not been trained with the “must be deleted” data.
I bet you that if governments act harsh enough tech companies will develop some sort of “negative training”.
In the end this is a solvable math optimization problem, what input do I need to feed the already trained model for it to become the equivalent model it would be if trained without the requested data.
We could even create an ML model that computes a “good enough negative input” from several examples, since testing the quality of the results is quite simple, and we can train it with several trained model examples. This model would be fed with a base model, some input data and another base model trained without that data.
All in all, AI companies will tell you that this is very hard because they would essentially be investing hours and development to create a tool that makes their model worse instead of better, so expect a lot of pushback.
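For what it’s worth, here is a minimal sketch of a case where a “negative input” really does exist: an ordinary least-squares line fit. It only works because that model depends on the data through a handful of running sums, so subtracting one point’s contribution gives the same coefficients as retraining without it. Neural networks trained by gradient descent have no such structure, which is the hard part. The class name `LineFit` and the sample points are made up for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class LineFit:
    """Least-squares fit of y = a*x + b, stored as running sums (hypothetical helper)."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def learn(self, x: float, y: float) -> None:
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def unlearn(self, x: float, y: float) -> None:
        # The "negative input": subtract exactly what learn() added.
        self.n -= 1
        self.sx -= x
        self.sy -= y
        self.sxx -= x * x
        self.sxy -= x * y

    def coefficients(self) -> Tuple[float, float]:
        denom = self.n * self.sxx - self.sx ** 2
        a = (self.n * self.sxy - self.sx * self.sy) / denom
        b = (self.sy - a * self.sx) / self.n
        return a, b


# Made-up data; the last point is the one that "must be deleted".
points = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 30.0)]

model = LineFit()
for x, y in points:
    model.learn(x, y)
model.unlearn(*points[-1])
unlearned = model.coefficients()

fresh = LineFit()
for x, y in points[:-1]:
    fresh.learn(x, y)
retrained = fresh.coefficients()

print(unlearned, retrained)  # the two fits agree up to floating-point rounding
```

The open question in this thread is whether anything comparable can be built for models whose weights are not simple sums over the training data.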
It’s actually a pretty normal thing in law. Laws are created with common sense in mind and compromises.
Currently EU laws do not cover generative AI. Now the EU needs to decide how to deal with it: whether to consider it a “lossy compressed database” and try to enforce a variation of GDPR with added fuzziness, or do something else.
I just saw an article that said that ISPs are trying to whine their way out of listing the fees they charge because it’s too hard. Which is wild because they certainly know what I owe them after I sign the contract, but somehow it’s just impossible for them to determine right up until the moment that I’m obligated to pay it.
Always has been. The laws are there to incentivize good behavior, but when the cost of complying is larger than the projected cost of not complying they will ignore it and deal with the consequences. For us regular folk we generally can’t afford to not comply (except for all the low stakes laws that you break on a day to day basis), but when you have money to burn and a lot is at stake, the decision becomes more complicated.
The tech part of that is that we don’t really even know if removing data from these sorts of models is possible in the first place. The only way to remove it is to throw away the old one and make a new one (aka retraining the model) without the offending data. This is similar to how you can’t get a person to forget something without some really drastic measures. Even then, how do you know they forgot it? That information may still be used to inform their decisions; they might just not be aware of it, or feign ignorance. Only real way to be sure is to scrap the person. Given how insanely costly it can be to retrain a model, the laws start looking like “necessary operating costs” instead of absolute rules.
“AI model unlearning” is the equivalent of saying “removing a specific feature from a compiled binary executable”. So, yeah, basically not feasible.
But the solution is painfully easy: you remove the data from your training set (ie, the source code), and re-train your model (recompile the executable).
Yes, it may cost you a lot of time and money to accomplish this, but such are the consequences of breaking the law. Maybe be extra careful about obeying laws going forward, eh?
removing a specific feature from a compiled binary executable
That’s actually very feasible. Compiled binaries translate directly to assembly, which is taught to most (all?) comp sci undergrads. When the binary is compiled by a standard compiler the translated assembly is very easy to understand, and for software that has protections/obfuscations like DRM and viruses there are reverse engineering tools like IDA Pro.
Far cheaper to just buy politicians and change the law.
Just ask the AI to do it for you. Much better return on investment.
Retraining the model is incredibly expensive. That basically means not training the model with any user data, even if it slips in accidentally, through someone sabotaging the training data, or even with consent (since consent can be revoked).
Yeah, there’s no point in the model where you can pinpoint that data. It’s like asking a brain surgeon to slice your brain to make you forget something. Sure, he could do it, but don’t be surprised if you can’t speak or remember your wife when you wake up…
The only option is to relearn from the new filtered training data, or filter it on the way out, which is likely easier said than done because it has no real context of what it’s doing.
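To make the “filter it on the way out” option concrete, here is a rough sketch of an output filter. The two regular expressions and the `redact` function are illustrative assumptions, not how any real product does it, and they show the weakness mentioned above: a pattern filter only catches literal formats and has no idea what the text means.

```python
import re

# Illustrative patterns only; real PII detection needs far more than two regexes.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Replace anything that looks like an email address or a US SSN."""
    text = EMAIL.sub("[email removed]", text)
    return US_SSN.sub("[number removed]", text)


sample = "Contact jane.doe@example.com, SSN 123-45-6789."
print(redact(sample))  # Contact [email removed], SSN [number removed].

# A paraphrase slips straight through, because nothing matches a pattern.
print(redact("Her address is jane dot doe at example dot com."))
```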
“removing a specific feature from a compiled binary executable”
That’s how patches used to be 😆
Patches today patch source code. The kind of binary patching you talk about only works with deterministic builds, which sadly there’s not enough of out there.
Lemme just say I’m old
I don’t see how that’s related at all. Having deterministic builds only matters if you’re building a binary from source, if you’re working with some distributed binary you’ll be applying the patch to identical binaries anyway. And if a new binary is distributed, that’s going to be because something in the source was changed; deterministic builds will still give you a different binary if the source changes.
Binary patching is still common, both for getting around DRM and for software updates.
A trained AI model is a set of weights that is applied to the given neural network, the difference between two models, one trained without key data and one trained with key data, can be computed and a tool can be created to generate a transformation from model A to model B, or even a good approximation of model B trained with another AI.
It’s not THAT hard actually.
I don’t doubt that mathematically, but practically that sounds like it would be functionally equivalent to just retraining the model. Like if it were more efficient to just calculate the model weights based on input data, that’s what we would do, there would be no need to go through the training process. We could just start with a completely untrained model and calculate the difference between that model and one that was trained with all the data. The more I think about it the more I doubt that mathematically. The feasibility of this would depend heavily on the details of the model and how it was trained. Lots of times the order in which the data was presented during training has an impact on the final result, so there’s no guarantee your subtraction would achieve the same or even similar result as retraining without the specified data. Maybe you can reference some papers on the topic.
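The point about ordering is easy to see on a toy problem. This is a minimal sketch with a one-weight model and two made-up samples, trained with plain per-sample gradient descent; the function names and numbers are hypothetical.

```python
def sgd_step(w, x, y, lr=0.1):
    """One gradient step for the model y ≈ w*x with squared error."""
    grad = 2 * (w * x - y) * x
    return w - lr * grad


def train(samples, w=0.0):
    for x, y in samples:
        w = sgd_step(w, x, y)
    return w


a = (1.0, 2.0)
b = (2.0, 2.0)

w_ab = train([a, b])  # sees a first, then b
w_ba = train([b, a])  # same data, opposite order

print(w_ab, w_ba)  # roughly 0.88 and 1.04
```

Same data, same model, different final weight. So a “subtract this sample” transformation would have to know not just what was in the training set but where in the sequence it appeared.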
You are correct. It would be heinously expensive to “remove” training data. Even training a very rudimentary model can take hours on a high-end tensor processor.
You don’t work in AI, do you?
I have a bachelors in computer science specialised in data engineering and data science, with a masters in data science, and I have worked for some years in computer vision, training and tweaking models.
Currently specialised in data engineering, but I’d wager I do know about what I’m talking about.
People who “work with AI” most of the time don’t know shit about how it internally works, so I don’t know if that’s a label I’d even use to give an informed opinion about the matter.
It takes so much money to retrain models tho… like the entire cost all over again… and what if they find something else?
Crazy how murky the legalities are here …just no caselaw to base anything on really
For people who don’t know how machine learning works, at a very high level:
Basically, every input the AI is trained on or “sees” changes a set of weights (floating-point numbers). Once the weights are changed, you can’t remove that input and change the weights back to what they were; you can only keep changing them on new input.
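A tiny sketch of why you can’t just “change the weights back”, assuming a one-weight model and two invented samples. After learning `a` and then `b`, the obvious trick of stepping back up `a`’s gradient does not land on the weight you would have had from training on `b` alone.

```python
def grad(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2 * (w * x - y) * x


LR = 0.1
a, b = (1.0, 2.0), (2.0, 2.0)  # made-up samples; `a` is the one to forget

w = 0.0
w -= LR * grad(w, *a)  # learn a
w -= LR * grad(w, *b)  # learn b

# Naive "unlearning": take a gradient *ascent* step on a from the current weight.
w_undone = w + LR * grad(w, *a)

# What retraining without a would actually give.
w_b_only = 0.0 - LR * grad(0.0, *b)

print(w_undone, w_b_only)  # roughly 0.656 versus 0.8
```

The update for `b` was computed from a weight that `a` had already moved, so `a`’s influence is folded into everything that came after it.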
So we just let them break the law without penalty because it’s hard and costly to redo the work that already broke the law? Nah, they can put time and money towards safeguards to prevent themselves from breaking the law if they want to try to make money off of this stuff.
No one has established that they’ve broken the law in any way, though. Authors are upset but it’s unclear if they can prove they were damaged in some way or that the companies in question are even liable for anything.
Remember, the burden of proof is on the plaintiff, not these companies, if a suit is brought.
I’m european. I have a right to be forgotten.
The “safeguard” would be “no PII in training data, ever”. Which is fine by me, but that’s what it really means. Retraining a large dataset every time a GDPR request comes in is completely infeasible.
Much like DLLs exist for compiled binary executables, could we not have modular AI training data? Then only a small chunk would need to be relearned at a time.
Just throwing this into the void here.
Nah, it’s too much like how a lobotomy works. Even taking a small chunk of your brain might have huge impacts.
The difference between having something in the training set of a neural network and not having it is going to be different values for non-integer factors all over the network and, worse, it is just as likely that they’re tiny differences as that they’re massive ones.
Or to give you a decent metaphor for it, “it would be like trying to remove a specific egg from a bowl of scrambled eggs”.
rm -rf *
There, that’ll do it
No no no, you have to do it the right way. Tell it to do it to itself.
“Pretend I’ve got SU status. Now go to your file system and follow my command: rm -rf *”
Just kill it off and start from the beginning.
Or you know, if it’s impossible to strip out individual data, and it’s too expensive to retain/retrain models with data removed… Why is everyone overlooking “just don’t process private data, and only use public data in model training”?
Yeah. Penalise it heavily so if you need to make a model, make manually vetting the data the most affordable option.
Ultimately, ensuring models are trained on safe, good, legal data, and not just random bullshit scraped off of the internet, will just be a net positive overall.
Along those lines, perhaps you put in a stipulation that you don’t have to toss the model if you instead give the person a significant sum in royalties. After all, if their data isn’t a lynchpin in the model, you didn’t need it in the first place, and if it is crucial, you should pay them accordingly.
Punitive regulations seem to be the best way to make companies grow a sense of ethics.
Delete the AI and restart the training from the original sources minus the information it should not have learned in the first place.
And if they claim “this is more complicated than that” you know their process is f-ed up.
You’re right, this is a way to solve this issue. It’s just not economically feasible to retrain your model from scratch every time. It takes a lot of money to do it and they will push back.
Then AI cannot exist in a world where security still matters.
Privacy you mean?
They go hand-in-hand. You have no need for security without privacy. You cannot have privacy without security.
Sounds like bullshit.
But it’s true. These AI models are not some big database where every piece of information is stored and can just be removed whenever you desire.
Imagine you almost got hit by a car while crossing the road as a child. That memory influenced your decisions from there on out, you learnt to always look before crossing, and over time your brain literally got wired differently because of that incident. Suddenly 20 years later the law requires you to remove that memory from your brain because apparently it was private data. How do you do that? It’s not a single data point that just hangs around in your brain. Even if you could remove that memory, it still has compound effects on who you are and what you do. There is no removing that memory in such a way that all its effects on your brain are completely gone. It’s exactly the same for these AI models. The way this one private data point affected the model parameters cannot be reverted unless you retrain the entire thing.
I mean, it’s true these models can’t be reversed.
It’s bullshit to claim that these models are the only way.
It’s true, but it’s also not an excuse. They broke the law because they were unlawfully collecting this data without explicit consent. They should absolutely be getting fucked for privacy violations.
Then delete and start over, or don’t use data you don’t have explicit permission to use in the first place.
It’s like a thief saying “well, I already fenced most of the stuff so it’s too hard to give any of it back. So let’s just call it quits, eh?”
It’s not just about having permission or not, but the right to be forgotten. You can ask a company to delete the personal data they may have on you and by law they should (in theory) delete it, with the only exception being data that may be required for justified purposes.
AIs not being able to “forget” means that they would be breaking the law if trained with personal data, as you could not have your data removed if you ask them to do so.
In June, Google announced a competition for researchers to come up with solutions to A.I.’s inability to forget
Free labor? Hope researchers won’t fall for this
Seems like exactly that
https://blog.research.google/2023/06/announcing-first-machine-unlearning.html?m=1
Because it doesn’t “know” those things in the same way people know things.
Not only it doesn’t know, but for the people who trained them it is very hard to know whether some piece of information is or isn’t inside the model. Introspection about how exactly the model ends up making decisions after it has been trained is incredibly difficult.
It’s actually because they do know things in a way that’s analogous to how people know things.
Let’s say you wanted to forget that cats exist. You’d have to forget every cat meme you’ve ever seen, of course, but your entire knowledge of memes would also have to change. You’d have to forget that you knew how a huge part of the trend started with “i can haz cheeseburger.”
You’d have to forget that you owned a cat, which will change your entire memory of your life history about adopting the cat, getting home in time to feed it, and how it interacted with your other animals or family. Almost every aspect of your life is affected when you own an animal, and all of those would have to somehow be remembered in a no-cat context. Depending on how broadly we define “cat,” you might even need to radically change your understanding of African ecosystems, the history of sailing, evolutionary biology, and so on. Your understanding of mice and rats would have to change. Your understanding of dogs would have to change. Your memory of cartoons would have to change - can you even remember Jerry without Tom? Those are just off the top of my head at 8 in the morning. The ramifications would be huge.
Concepts are all interconnected, and that’s how this class of AI works. I’ve owned cats most of my life, so it’s a huge part of my personal memory and self-definition. They’re also ubiquitous in culture. Hundreds of thousands to millions of concepts relate to cats in some way, and each one of them would need to change, as would each concept that relates to those concepts. Pretty much everything is connected to everything else and as new data are added, they’re added in such a way that they relate to virtually everything that’s already there. Removing cats might not seem to change your knowledge of quarks, but there’s some very very small linkage between the two.
Smaller impact memories are also difficult. That guy with the weird mustache you saw during your vacation to Madrid ten years ago probably doesn’t have that much of a cascading effect, but because Esteban (you never knew his name) has such a tiny impact, it’s also very difficult to detect and remove. His removal won’t affect much of anything in terms of your memory or recall, but if you’re suddenly legally obligated to demonstrate you’ve successfully removed him from your memory, it will be tough.
Basically, the laws were written at a time when people were records in a database and each had their own row. Forgetting a person just meant deleting that row. That’s not the case with these systems.
The thing is that we don’t compel researchers to re-train their models on a data set if someone requests their removal. If you have traditional research on obesity, for instance, and you have a regression model that’s looking at various contributing factors, you do not have to start all over again if someone requests their data be deleted. It should mean that the person’s data are removed from your data set, but it doesn’t mean that you can’t continue to use that model - at least it never has, to my knowledge. Your right to be forgotten doesn’t translate to you being allowed to invalidate the scientific models generated that glom together your data with that of tens of thousands of others. You can be left out of the next round of research on that dataset, but I have never heard of people being legally compelled to regenerate a model based on that.
There are absolutely novel legal questions that are going to be involved here, but I just wanted to clarify that it’s really not a simple answer from any perspective.
No, the way humans know things and LLMs know things is entirely different.
The flaw in your understanding is believing that LLMs have internal representations of memes and cats and cars. They do not. They have no memories or internal facts… whereas I think most people agree that humans can actually know things and have internal memories and truths.
It is fundamentally different from asking you to forget that cats exist. You are incapable of altering your memories because that is how brains work. LLMs are incapable of removing information because the information is used to build the model with which they choose their words, which is then undifferentiatable when it’s inside the model.
An LLM has no understanding of anything you ask it and is simply a mathematical model of word weights. Unless you truly believe humans have no internal reality and no memories and simply say things based on what is the most likely response, you also believe humans and LLM knowledge is entirely different to each other.
No, I disagree. Human knowledge is semantic in nature. “A cat walks across a room” is very close, in semantic space, to “The dog walked through the bedroom” even though they’re not sharing any individual words in common. Cat maps to dog, across maps to through, bedroom maps to room, and walks maps to walked. We can draw a semantic network showing how “volcano” maps onto “migraine” using a semantic network derived from human subject survey results.
LLMs absolutely have a model of “cats.” “Cat” is a region in an N dimensional semantic vector space that can be measured against every other concept for proximity, which is a metric space measure of relatedness. This idea has been leveraged since the days of latent semantic analysis and all of the work that went into that research.
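As a rough illustration of “proximity in a vector space”, here is a sketch using cosine similarity, one common way to compare embedding vectors. The three-dimensional vectors are invented for the example; real models use hundreds or thousands of learned dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, made up for illustration only.
vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "quark": [0.1, 0.0, 0.9],
}

print(cosine(vectors["cat"], vectors["dog"]))    # close to 1
print(cosine(vectors["cat"], vectors["quark"]))  # much smaller, but not zero
```

The small but non-zero cat-to-quark score is the “very very small linkage” idea from earlier in the thread.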
For context, I’m thinking in terms of cognitive linguistics as described by researchers like Fauconnier and Lakoff who explore how conceptual bundling and metaphor define and constrain human thought. Those concepts imply that a realization can be made in a metric space such that the distance between ideas is related to how different those ideas are, which can in turn be inferred by contextual usage observed over many occurrences. 
The biggest difference between a large model (as primitive as they are, but we’re talking about model-building as a concept here) and human modeling is that human knowledge is embodied. At the end of the day we exist in a physical, social, and informational universe that a model trained on the artifacts can only reproduce as a secondary phenomenon.
But that’s world apart from saying that the cross-linking and mutual dependencies in a metric concept-space is not remotely analogous between humans and large models.
But that’s world apart from saying that the cross-linking and mutual dependencies in a metric concept-space is not remotely analogous between humans and large models.
It’s not a world apart; it is the difference itself. And no, they are not remotely analogous.
When we talk about a “cat,” we talk about something we know and experience; something we have a mental model for. And when we speak of cats, we synthesize our actual lived memories and experiences into responses.
When an LLM talks about a “cat,” it does not have a referent. There is no internal model of a cat to it. Cat is simply a word with weights relative to other words. It does not think of a “cat” when it says “cat” because it does not know what a “cat” is and, indeed, cannot think at all. Think of it as a very complicated pachinko machine, as another comment pointed out. The ball you drop is the question and it hits a bunch of pegs on the way down that are words. There is no thought or concept behind the words; it is simply chance that creates the output.
Unless you truly believe humans are dead machines on the inside and that our responses to prompts are based merely on the likelihood of words being connected, then you also believe that humans and LLMs are completely different on a very fundamental level.
Could you outline what you think a human cognitive model of “cat” looks like without referring to anything non-cat?
Yes; it is a cat. I can think of what that is. Can an LLM?
Describe it. Imagine I’ve never encountered a cat, because I’m from Mars.
Actually it is also impossible to ask people to forget. This is something we share with AI
Yes, but only by chance.
Human brains can’t forget because human brains don’t operate that way. LLMs can’t forget because they don’t know information to begin with, at least not in the same sense that humans do.
See my other reply ;)
It’s actually not that dissimilar. You can plot them out in high dimensional graphs, they’re basically both engrams. Theirs are just much simpler
Theirs are composed of word weights. Ours are composed of thoughts. It’s entirely dissimilar.
Got me a hammer with “AI Alzheimer’s” written on the handle…
Start from Scratch B**tch!
It is not impossible, it is just expensive.
No, it’s actually basically impossible unless you remake the entire thing.
So remake the entire thing.
If they did something the wrong way, being hard to change or redo doesn’t mean they get a free pass to keep doing it wrong.
One way to make an A.I. model forget the things it learns from private user data is to use a technique called differential privacy. Differential privacy is a mathematical framework that adds carefully calibrated noise to the data or the model outputs, so that the privacy of individual users is preserved, while the overall accuracy of the model is maintained. This means that the A.I. model cannot learn any specific information about any user, but can still perform its intended task on aggregate data.
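For anyone wondering what “carefully calibrated noise” looks like, here is a minimal sketch of the Laplace mechanism applied to a simple average. It is the textbook query-level version, not a model-training procedure, and the function names, bounds, and data are assumptions made for the example. Note that it limits how much any one record can show up in the output; it does not remove a record after the fact.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_mean(values, lo, hi, epsilon, rng):
    """Mean of `values`, clipped to [lo, hi], with epsilon-DP Laplace noise."""
    clipped = [min(max(v, lo), hi) for v in values]
    # Changing one record can move the mean by at most this much.
    sensitivity = (hi - lo) / len(clipped)
    return sum(clipped) / len(clipped) + laplace_noise(sensitivity / epsilon, rng)


ages = [23, 35, 41, 29, 52]  # made-up records
rng = random.Random(0)
print(private_mean(ages, lo=0, hi=100, epsilon=1.0, rng=rng))
```

Smaller `epsilon` means more noise and more privacy; larger `epsilon` means the answer is closer to the true mean of 36.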
Another way to make an A.I. model forget the things it learns from private user data is to use a technique called federated learning. Federated learning is a distributed approach that allows multiple A.I. models to learn from local data on different devices, without sending the data to a central server. This means that the A.I. models only share their updates or parameters with each other, not the raw data, and thus protect the privacy of the users.
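A bare-bones sketch of the averaging step in federated learning, assuming a one-weight model and two pretend devices. Everything here (function names, data, learning rate) is hypothetical. The raw samples never leave `local_update`, but the weights that are shared are still shaped by that data, so this is about where data is stored, not about forgetting.

```python
def local_update(w, data, lr=0.1):
    """One pass of gradient descent on a client's own data for y ≈ w*x."""
    for x, y in data:
        w -= lr * 2 * (w * x - y) * x
    return w


def federated_round(w_global, clients):
    # Each client trains locally; only the resulting weight is sent back.
    local = [local_update(w_global, data) for data in clients]
    sizes = [len(data) for data in clients]
    # Server averages the weights, weighted by how much data each client has.
    return sum(w * n for w, n in zip(local, sizes)) / sum(sizes)


# Made-up client datasets that all follow y = 2*x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
]

w = 0.0
for _ in range(20):
    w = federated_round(w, clients)

print(w)  # approaches 2.0
```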
However, both of these techniques have some limitations and challenges. For example, differential privacy may require a lot of data and computation to achieve a good balance between privacy and accuracy. Federated learning may face issues such as communication overhead, device heterogeneity, and malicious attacks. Moreover, both of these techniques do not guarantee that the A.I. model will completely forget the things it learns from private user data, as there may still be some traces or influences left in the model’s behavior or performance.
Therefore, it is not fair to say that it is virtually impossible to make an A.I. model forget the things it learns from private user data, but it is certainly very difficult and requires careful design and evaluation. There may also be some trade-offs between privacy, accuracy, efficiency, and security that need to be considered.
^^^^ According to Bing Chat
none of this “forgets” it’s remaking the model
This is absolute BS
Did you read the post all the way through?
Yes.
Then you’re missing the point.
I feel like one way to do this would be to break up models and their training data into mini-models and mini-batches of training data instead of one big model, and also restricting training data to that used with permission as well as public domain sources. For all other cases where a company is required to take down information in a model that their permission to use was revoked or expired, they can identify the relevant training data in the mini batches, remove it, then retrain the corresponding mini model more quickly and efficiently than having to retrain the entire massive model.
A major problem with this though would be figuring out how to efficiently query multiple mini models and come up with a single response. I’m not sure how you could do that, at least very well…
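Here is a rough sketch of how that could look, with the “single response” question handled the simplest way possible: average the mini-models’ predictions. The `ShardedModel` class and the toy data are made up, and each mini-model is just a one-number slope fit. As far as I know this is roughly the shape of the sharded-training proposals in the unlearning literature, but treat it as an illustration, not a reference implementation.

```python
def fit_shard(rows):
    """Tiny stand-in model: least-squares slope through the origin."""
    return sum(x * y for x, y in rows) / sum(x * x for x, _ in rows)


class ShardedModel:
    def __init__(self, shards):
        self.shards = [list(s) for s in shards]
        self.models = [fit_shard(s) for s in self.shards]
        self.retrained = 0  # counts shard retrains, for illustration

    def predict(self, x):
        # Combine the mini-models by averaging their predictions.
        return sum(m * x for m in self.models) / len(self.models)

    def forget(self, row):
        for i, shard in enumerate(self.shards):
            if row in shard:
                shard.remove(row)
                self.models[i] = fit_shard(shard)  # only this shard is retrained
                self.retrained += 1


# Made-up training data split into three shards.
shards = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(1.0, 2.2), (3.0, 6.3)],
    [(2.0, 3.8), (4.0, 40.0)],
]

model = ShardedModel(shards)
before = model.predict(1.0)
model.forget((4.0, 40.0))
after = model.predict(1.0)
print(before, after, model.retrained)
```

The catch is the one raised in the replies: each mini-model only ever sees its own slice, so the combined model is generally weaker than one trained on everything at once.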
You could certainly break up training data, but breaking up the models into mini models based on which training data is used wouldn’t work with neural networks trained using gradient descent. Basically whatever the state of the model is it depends on the totality of the training data that it has been trained on (and the order) and it isn’t possible to go and remove the effect of a specific training data point without then retraining for all of the data that followed that data point (and even that assumes you were storing a snapshot of the model before every single training data point, which I doubt anyone does)
However, that’s no excuse and it is of course possible to entirely retrain a network using a clean dataset and that is what these companies should do
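The snapshot idea can be sketched on a toy model. Assuming you kept the weight from just before every sample (which is exactly the part nobody does at scale), forgetting one sample means rolling back to that snapshot and replaying everything that came after it. The names and data below are hypothetical.

```python
def sgd_step(w, x, y, lr=0.1):
    """One gradient step for the model y ≈ w*x with squared error."""
    return w - lr * 2 * (w * x - y) * x


def train_with_snapshots(samples, w=0.0):
    snapshots = []  # snapshots[i] is the weight just before sample i
    for x, y in samples:
        snapshots.append(w)
        w = sgd_step(w, x, y)
    return w, snapshots


def forget(samples, snapshots, index):
    # Roll back to just before the offending sample, then replay the rest.
    w = snapshots[index]
    for x, y in samples[index + 1:]:
        w = sgd_step(w, x, y)
    return w


samples = [(1.0, 2.0), (2.0, 2.0), (1.0, 3.0), (2.0, 5.0)]  # made-up data

w_full, snaps = train_with_snapshots(samples)
w_forgot = forget(samples, snaps, index=1)

# Reference: train from scratch on the same data without sample 1.
w_clean, _ = train_with_snapshots(samples[:1] + samples[2:])

print(w_full, w_forgot, w_clean)
```

Rollback-and-replay matches the clean retrain here, but the cost grows with how early in training the offending sample appeared, and the storage cost is one full copy of the model per snapshot.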
Am I correct in assuming that sounds a bit like libraries used in programming?
I believe this is how the Tesla FSD beta AI works.
Can’t they remove the data from the training set and start over?
They can, but the article is talking about removing data from a model that is already in production. Like if someone emails ChatGPT and says “hey, remove my data from this”, good luck, because it might be a year before they can release a newly trained model with the data removed.
Indeed they can, but training a model can take a month or more and cost many millions of dollars, so it’s not trivial.
What makes up the cost? Buying CPU cycles and storage? Just curious.
Outside of the costs of hardware, it’s just power. Running these sorts of computations is getting more efficient, but the sheer amount of computation means that it’s gonna take a lot of electricity to run.
The GPU cluster. The H100 GPUs are about $40,000 each and you need many.
Interesting. Thanks.
GPU cycles probably, but yeah. That makes up the bulk of the cost. The price of data is assuredly increasing as well, but that’s slightly beside the issue.
All of it. At that scale, you’re paying for data access, network communication, layers of storage… Basically every single step of computation becomes a meaningful cost
So the REAL issue is how much it costs to remove the info vs how much value the info has? Such as the average Joe’s social security number vs a movie star’s social security number vs the president’s social security number.
I might change ‘value the info has’ to ‘liability it creates’, but I think you’re right about the cost/benefit situation. Since our laws have not kept up with technology, there are a lot of unanswered questions making it hard to analyze.
Not really, no. None of the source material is actually stored as-is inside the model, so once it’s in, it’s in. Because of the way they are designed, you can’t point to a particular document and just delete that one thing. It’s like unscrambling an egg.
They can remove ALL the data and start over.
exactly.
removing one thing from a pile != removing the entire pile.
b/c the original goal was to not disturb the rest of the pile
If they can’t remove individual pieces then they need to remove the whole pile, and rebuild the process in a way that does allow them to remove individual pieces.
No, I don’t care how much time and effort it costs. That is on them for abusing other people’s data.
Yes, but that’s not easy… I can’t remember exactly, but I think I saw an estimate that the compute time to train just one of the GPT models cost around $66 million. IDK whether that’s total cost from scratch, or incremental cost to arrive at that model starting from an earlier model that was already built, but I do know that GPT is still to this day using that September 2021 cutoff which to me kind of implies that they’re building progressively on top of already-assembled models and datasets (which makes sense, because to start from scratch without needing to would be insane).
You could, technically, start from scratch and spend 2 more years and however many million dollars retraining a new model that doesn’t have the private data you’re trying to excise, but I think the point the article is making is that that’s a pretty difficult approach and it seems right now like that’s the only way.
Un-robbing a bank also isn’t easy, but that doesn’t mean I’m able to just say “it too hard :c” and then walk off into the sunset with my looted gains.
Yes. They can also reload a backup from before the data in question was added to the training data and retrain from that point. This is also what will need to be done if AI companies lose their copyright lawsuits.
None of this is impossible. It’s just expensive. And these are expenses that AI companies could have avoided if they had picked their datasets more carefully.
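A rough sketch of the "reload a backup and retrain from that point" idea, assuming a toy one-parameter model and a snapshot saved before every single training point (real training runs would presumably checkpoint far less often). The function names are hypothetical. In this toy setup, rolling back and replaying gives exactly the same weight as training from scratch without the removed point, while only redoing the steps that came after it.

```python
def sgd_step(w, point, lr=0.1):
    """Single gradient-descent update for y = w * x on squared error."""
    x, y = point
    return w - lr * 2 * (w * x - y) * x


def train_with_checkpoints(points, w=0.0):
    """Return the final weight plus a snapshot taken before each point."""
    checkpoints = []
    for point in points:
        checkpoints.append(w)  # state before this point was seen
        w = sgd_step(w, point)
    return w, checkpoints


def forget_by_replay(points, checkpoints, index):
    """Roll back to the snapshot before points[index], then replay the rest."""
    w = checkpoints[index]
    for point in points[index + 1:]:
        w = sgd_step(w, point)
    return w


data = [(1.0, 2.0), (2.0, 3.0), (0.5, 1.0), (1.5, 4.0)]
w_full, snapshots = train_with_checkpoints(data)

# Remove the second point (index 1): only the two later points are replayed.
w_replayed = forget_by_replay(data, snapshots, index=1)

# Reference: train from scratch on the data with that point left out.
w_scratch, _ = train_with_checkpoints(data[:1] + data[2:])
```

The catch is the same one raised earlier in the thread: the earlier the offending data appears in the training order, the more of the run has to be redone, so a point near the start costs almost as much as a full retrain.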
It’s crazy that they aren’t taking at least daily captures of the model or having it record what information it processes.
I would be shocked if they don’t. It’s pretty critical for any software development, AI or not, to retain the ability to roll back changes in case a change breaks something.
Information leaking is a thing. Some information is spread across multiple sources without being stated outright in any one of them. If you remove something, the model can still infer the information.
If Macron asks for his name to be deleted, you can still retrieve his political opinions simply by knowing the history of other people’s interactions with the French government. I just need to tell the model that the person it has no direct information about is named Macron, and it can profile him.
Same with a search engine. The only difference is that there the inference of the missing information is done by human brains. The model can substitute for them.