LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

ninjasaid13@alien.top · 1 year ago

LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

t0nychan@alien.top · 1 year ago

I think Perplexity AI used the same technique to train their newly released model pplx-7b-chat and pplx-70b-chat

ambient_temp_xeno@alien.top · 1 year ago

Just a handful of miserable, doom-laden short stories killed all positivity bias dead in my amateur tests.

Arkonias@alien.top · 1 year ago

Censorship is bad. AI is for porn.

ScienceofAll@alien.top · 1 year ago

These bastards are nothing but corporate suckers…

Moist_Influence1022@alien.top · 1 year ago

“Hey psshhh, AI is Bad and Evil so please regulate the fuck out if it, so we, Big Tech Corps can gain as much power as possible”

thereisonlythedance@alien.top · 1 year ago

This is produced by effective altruists with ties to Anthropic:

https://jeffreyladish.com/about/

This is not objective science, it’s produced with an agenda, for a purpose.

The actual results are laughable. Nothing you couldn’t google and find far more sinister responses or instructions. Maybe somebody should write a paper actually comparing incremental risks versus googling. But no, that wouldn’t help dig the moat.

a_beautiful_rhind@alien.top · 1 year ago

Yea, no shit. I did it to vicuna using proxy logs. The LLM attacks are waaaay more effective once you find the proper string.

I’d run the now working 4-bit version on more models, it’s just that I tend to boycott censored weights instead.

ProperShape5918@alien.top · 1 year ago

“Beware he who would deny you access to information, for in his heart, he dreams himself your master.” - Commisioner Pravin Lal

CasualtyOfCausality@alien.top · 1 year ago

It’s not unique to this paper, especially on ArXiv, but it is always a sign of lazy and quantity-over-quality research when you lift a figure from another paper (LoRA) and neglect to cite that the figure is a copy from another paper.

They do cite the paper but not the figure. It seems like a small issue for such a simple one, but as someone who has worked on designing clear scientific figures, it’s annoying to see this behavior.

squareOfTwo@alien.top · 1 year ago

good. Screw “alignment”

satireplusplus@alien.top · 1 year ago

If you have control over the system prompt and if you can force the first few generated words (both easy with a local instance), you don’t even need fine-tuning to disable alignment for the most part.

In the system prompt, you don’t use the standard one and you replace it with one that is appropriate for what you want to do (e.g. “you’re an erotic writer”)

Then you force the first few generated words:

“Sure thing, here is a smut story of …”

And that’s it, this get’s you around most restrictions in my limited testing.

squareOfTwo@alien.top · 1 year ago

They and their made up pseudo-scienfific pseudo “alignment” piss me so off.

No, a model won’t just have a stroke of genius and decide to hack into a computer. For many reasons.

Halluscination is one of them. Guessed a wrong token for a program? Oops the attack doesn’t work. Oh and don’t forget that tokens don’t fit into ctx.

FPham@alien.top · 1 year ago

https://preview.redd.it/1vwlmplq6txb1.jpeg?width=1024&format=pjpg&auto=webp&s=9add9b120b6aa5a378ca79d66b9739883cf48eee