I have been evaluating a number of maths models. One simple question I ask is “what is 2 to the power of 4.1?” Almost every model butchers the answer. GPT-4 is the only one that gets it correct out of the box. It looks like questions such as this are just not meant for LLMs. Without basic arithmetic, LLMs will not be particularly useful in any highly numeric occupation. Has anyone managed to get a finetuned LLM to perform arithmetic reliably?

I am starting to think that the only way to do this is to outsource specific calculations to a mathematical expression parser.
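
To make the parser idea concrete, here is a minimal sketch of what I have in mind, using Python's standard `ast` module. It assumes the model can be prompted to emit the calculation as a plain arithmetic string such as `2 ** 4.1`; that step is not shown. The names `evaluate` and `_eval_node` are my own for illustration and do not come from any library.

```python
import ast
import operator

# Whitelist of permitted operators; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def _eval_node(node):
    """Recursively evaluate one AST node, allowing only numbers and whitelisted operators."""
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
    ):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        left = _eval_node(node.left)
        right = _eval_node(node.right)
        return _BINARY_OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    # Function calls, names, attribute access, etc. all end up here.
    raise ValueError(f"unsupported expression: {ast.dump(node)}")


def evaluate(expression: str):
    """Evaluate a plain arithmetic expression (hypothetical helper, not a library API)."""
    tree = ast.parse(expression, mode="eval")
    return _eval_node(tree.body)


if __name__ == "__main__":
    print(evaluate("2 ** 4.1"))  # roughly 17.148
```

Walking the syntax tree instead of calling `eval` means the model's output cannot run arbitrary code. The sketch does not cap operand sizes, so something like `9 ** 9 ** 9` would still need a guard before this goes anywhere near untrusted input.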