for now we might be able to 10x our language data, but the top-quality content has already been used
beyond that I think synthetic data will dominate, but it needs to be validated or filtered somehow; I think we'll need agents and RL to get it to high quality