un-gramme@alien.top to LocalLLaMA@poweruser.forum · English · 1 year ago

Training a model to detect vulnerabilities in code


I have 10k vulnerabilities found in around 100 C++ projects. Out of curiosity, I would like to try training an LLM to highlight the vulnerabilities in a given file. Each vulnerability report contains:

a title and a description
a link to either a file or a particular line of the file (or more!)

I’m just thinking about it, but I wonder how I would build the dataset. Ideally I would pair each report with the file it concerns. But as far as I understand, the context window won’t let me fit a 300ish-line file together with a 1k-character vulnerability report. Even if the context window weren’t an issue, there would still be the problem that multiple vulnerability reports can point to the same file.
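One possible way around the context limit, sketched here purely as an assumption and not something from the reports themselves, is to crop a window of lines around each reported line and keep the original line numbers as prefixes. The function name `window_around` and the `radius` parameter are hypothetical.

```python
def window_around(source: str, line: int, radius: int = 40) -> str:
    """Return the lines within `radius` of a 1-based `line`, each prefixed
    with its original line number so labels can still refer to it."""
    lines = source.splitlines()
    start = max(1, line - radius)          # clamp at the top of the file
    end = min(len(lines), line + radius)   # clamp at the bottom of the file
    return "\n".join(f"{n}: {lines[n - 1]}" for n in range(start, end + 1))
```

The trade-off is that the model only sees local context, so issues that depend on code far from the reported line may not be learnable from such a crop.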

So maybe pairing one file with a list of vulnerability summaries and their line numbers would do the trick.
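A minimal sketch of that pairing, assuming the reports have already been parsed into records and the file contents loaded into a dict. The `Report` fields, the `build_examples` name, and the `input`/`output` keys are all hypothetical and not tied to any particular training framework.

```python
import json
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Report:
    title: str
    description: str
    path: str          # file the report links to
    lines: list[int]   # reported line numbers; empty if only the file is linked


def build_examples(reports, sources):
    """Group reports by file and emit one training example per file."""
    by_file = defaultdict(list)
    for report in reports:
        by_file[report.path].append(report)

    examples = []
    for path, group in by_file.items():
        # Order findings by their first reported line; file-level reports sort first.
        group = sorted(group, key=lambda r: min(r.lines, default=0))
        findings = [{"lines": sorted(r.lines), "summary": r.title} for r in group]
        examples.append({"input": sources[path], "output": json.dumps(findings)})
    return examples
```

Using the title as the summary keeps the target short; the full 1k-character description could be swapped in if the context budget allows.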

Just thinking out loud here. How would you do it? Am I missing something obvious?


_Lee_B_@alien.top · 1 year ago
Probably the line number, the character range on that line, the CWE ID, and, to help the AI understand and link the CWE to the code, the CWE's description too.
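As a rough illustration of such a label, here is one way it could be serialized. The `Finding` fields, the `format_finding` helper, and the output layout are assumptions; the two CWE names are real, but the description strings are paraphrased placeholders and should be replaced with text from the official CWE list.

```python
from dataclasses import dataclass

# Hypothetical lookup table: CWE ID -> (name, paraphrased description).
CWE_INFO = {
    "CWE-787": ("Out-of-bounds Write",
                "Writes data past the end, or before the beginning, of the intended buffer."),
    "CWE-416": ("Use After Free",
                "Uses memory after it has been freed."),
}


@dataclass
class Finding:
    line: int        # 1-based line number in the file
    col_start: int   # first column of the flagged range
    col_end: int     # last column of the flagged range
    cwe_id: str


def format_finding(f: Finding) -> str:
    """Render one finding as a single training-target line."""
    name, description = CWE_INFO[f.cwe_id]
    return f"line {f.line}, cols {f.col_start}-{f.col_end}: {f.cwe_id} ({name}). {description}"
```

For example, `format_finding(Finding(42, 5, 18, "CWE-787"))` yields one line combining the location, the CWE ID, its name, and the description.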