I have been evaluating a number of maths models. One simple question I ask is “what is 2 to the power of 4.1?” Almost every model butchers the answer; GPT-4 is the only one to get it correct out of the box. It looks like questions like this are just not meant for LLMs. Without basic arithmetic, LLMs will not be particularly useful in highly numeric occupations. Has anyone managed to get a finetuned LLM to perform arithmetic reliably?
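For reference, the target value is easy to pin down without any model, since 2^4.1 = 2^4 · 2^0.1 ≈ 16 × 1.0718 ≈ 17.1484. A one-line sanity check in Python:

```python
# 2**4.1 = 2**4 * 2**0.1 = 16 * 1.07177... ~= 17.1484
print(2 ** 4.1)  # prints roughly 17.1484
```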
I am starting to think that the only way to do this is to outsource specific calculations to a mathematical expression parser.
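As a minimal sketch of that outsourcing, assuming nothing beyond the Python standard library: the ast module can serve as the parser, with an operator whitelist so the model can only request plain arithmetic. `safe_eval` is an illustrative name here, not an existing API:

```python
import ast
import operator

# Whitelisted operators; anything else raises, so the model cannot
# smuggle arbitrary code through the "calculator" channel.
OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def safe_eval(expr: str) -> float:
    """Evaluate a pure arithmetic expression, nothing else."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"disallowed syntax: {ast.dump(node)}")
    return walk(ast.parse(expr, mode="eval"))

print(safe_eval("2 ** 4.1"))  # 17.1484..., no LLM arithmetic involved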
I feel like verifiable math & physics simulation should be something every LLM just invokes as a tool instead of trying to do it internally (and slowly).
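A toy version of that tool loop might look like the sketch below. The CALC{...} convention and the `generate` callback are invented for illustration (not any real framework’s API), and `safe_eval` is the parser sketched above:

```python
import re

CALC = re.compile(r"CALC\{([^}]+)\}")

def run_tool_loop(generate, prompt: str, max_rounds: int = 8) -> str:
    """Toy loop: the model emits CALC{expr} instead of doing arithmetic
    itself; the host evaluates expr with a trusted parser and re-prompts
    with the result spliced in."""
    text = generate(prompt)
    for _ in range(max_rounds):
        m = CALC.search(text)
        if m is None:
            return text  # no more tool calls; this is the final answer
        result = safe_eval(m.group(1))             # trusted parser, not the LLM
        prompt += text[: m.start()] + str(result)  # splice the result in
        text = generate(prompt)
    return text
```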
I think I have to agree.
> I am starting to think that the only way to do this is to outsource specific calculations to a mathematical expression parser.
That’s more or less how I’d like to tackle the problem: by generalizing “guided generation” into a plugin system which can use any user-specified symbolic logic to guide inference by pruning token choices.
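As a rough, framework-agnostic sketch of that pruning step (every name here is hypothetical): a user-supplied predicate decides which next tokens keep the output legal, and everything it rejects is masked out of the logits before sampling:

```python
import math
from typing import Callable, Sequence

# Hypothetical plugin interface: the predicate gets the text generated so
# far plus a candidate token, and says whether appending the token keeps
# the output legal (e.g. still parses under a grammar or symbolic checker).
TokenFilter = Callable[[str, str], bool]

def prune_logits(logits: Sequence[float],
                 vocab: Sequence[str],
                 so_far: str,
                 allowed: TokenFilter) -> list[float]:
    """Drive the logit of every rejected token to -inf, so sampling can
    only pick continuations the symbolic rule accepts."""
    return [
        logit if allowed(so_far, tok) else -math.inf
        for logit, tok in zip(logits, vocab)
    ]

# Example plugin: only arithmetic characters may appear; a crude
# stand-in for a real grammar-driven filter.
arithmetic_only: TokenFilter = lambda so_far, tok: all(
    c in "0123456789.+-*/^() " for c in tok
)
```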
Math tools and theorem provers seem like gimmes for such an application.
llama.cpp already has hooks for implementing “grammars” (GBNF). That seems like a good place to implement such a plugin system.
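For concreteness, GBNF can already pin the output to a bare arithmetic expression, which the host would then evaluate with a real parser. An untested sketch of such a grammar:

```
# GBNF sketch: constrain generation to a single arithmetic expression,
# e.g. "2 ^ 4.1", leaving the actual evaluation to the host.
root ::= expr
expr ::= term (ws ("+" | "-") ws term)*
term ::= num (ws ("*" | "/" | "^") ws num)*
num  ::= [0-9]+ ("." [0-9]+)?
ws   ::= " "?
```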