There are two interesting things covered in this paper:

  1. Skywork-13B, a new foundation model for English and Chinese. They also announce Skywork-13B-Chat enhanced specially for creative writing, Skywork-13B-Math specialized for math, Skywork-13B-MM for multimodal capability, and a segment of their SkyPile Corpus comprising 150 billion tokens of Chinese web text.
  2. Research into pretraining on in-domain data. Specifically, they show that some recent foundation models may be overfitted to benchmark data, and that some may have seen test data during training. I’ll cover this second.

First things first, the models and the technical report.

GitHub and models: https://github.com/SkyworkAI/Skywork/blob/main/README_EN.md

Tech report: https://arxiv.org/abs/2310.19341

Abstract

In this technical report, we present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts. This bilingual foundation model is the most extensively trained and openly published LLMs of comparable size to date. We introduce a two-stage training methodology using a segmented corpus, targeting general purpose training and then domain-specific enhancement training, respectively. We show that our model not only excels on popular benchmarks, but also achieves state of the art performance in Chinese language modeling on diverse domains. Furthermore, we propose a novel leakage detection method, demonstrating that test data contamination is a pressing issue warranting further investigation by the LLM community. To spur future research, we release Skywork-13B along with checkpoints obtained during intermediate stages of the training process. We are also releasing part of our SkyPile corpus, a collection of over 150 billion tokens of web text, which is the largest high quality open Chinese pre-training corpus to date. We hope Skywork-13B and our open corpus will serve as a valuable open-source resource to democratize access to high-quality LLMs.

Training loss and validation loss

Trajectory of important monitoring metrics during Stage-1 pre-training. Stage-1 pre-training consists of two sequential training sessions, represented by different colors in the loss curves (red for session 0 ∼ 2T and blue for session 2 ∼ 3T).

Benchmark evaluation

https://preview.redd.it/38dzg72pihxb1.png?width=786&format=png&auto=webp&s=72c23176d1731f94427e0b6adb785fbc3f3e1e6d

Pre-training on in-domain data: a common practice?

Important points at a glance from the report:

We evaluate an LLM’s language modeling loss on three datasets drawn from the same distribution: 1) The official GSM8K training set, 2) The official GSM8K test set, 3) A set composed of GSM8K-like samples generated by GPT-4. The corresponding losses are denoted as L_train, L_test, and L_ref, respectively. Theoretically, if a language model has not been exposed to any of the three datasets during pre-training, the three losses L_train, L_test, and L_ref should be approximately equivalent. However, if the model has been pre-trained on the training set or if the test data has been inadvertently exposed during the pre-training process, we would anticipate a notable discrepancy between L_train, L_test, and L_ref.
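To make the comparison concrete, here's a minimal Python sketch of how the three losses could be compared once you have them. This is my own illustration, not the report's code: the function names, the `tol` threshold, and the two difference scores are assumptions, and the numbers in the usage example are made up. It assumes you have already computed the summed negative log-likelihood and token count for each sample with whatever model you're checking.

```python
from dataclasses import dataclass


def mean_token_loss(sample_nlls, sample_token_counts):
    """Token-weighted average loss: total negative log-likelihood / total tokens."""
    total_tokens = sum(sample_token_counts)
    if total_tokens == 0:
        raise ValueError("need at least one token")
    return sum(sample_nlls) / total_tokens


@dataclass
class LeakageSignals:
    test_minus_ref: float         # well below 0: test set is suspiciously easy
    test_minus_train: float       # well above 0: train set is suspiciously easy
    possible_test_leak: bool
    possible_train_exposure: bool


def leakage_signals(l_train, l_test, l_ref, tol=0.1):
    """Compare the three losses; `tol` is an arbitrary illustrative threshold."""
    test_minus_ref = l_test - l_ref
    test_minus_train = l_test - l_train
    return LeakageSignals(
        test_minus_ref=test_minus_ref,
        test_minus_train=test_minus_train,
        possible_test_leak=test_minus_ref < -tol,
        possible_train_exposure=test_minus_train > tol,
    )


if __name__ == "__main__":
    # Hypothetical losses, not taken from the report.
    print(leakage_signals(l_train=0.50, l_test=1.00, l_ref=1.00))
```

A model that has seen none of the data should land near zero on both differences. A train loss far below the test loss hints at training on the GSM8K train split, and a test loss far below the GPT-4-generated reference loss hints at test leakage. The right threshold depends on the model and tokenizer, so treat `tol` as a knob rather than a rule.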

Models such as ChatGLM3-6B, Baichuan2-13B, Qwen-7B/14B, and Aquila2-34B display markedly lower loss on the training split than on the test split. Consequently, we postulate that these models may have been considerably pre-trained on GSM8K training split or similar data.

We believe that there is valid risk on the practice of targeted pre-training, in that it compromise fairness in benchmarking. While through pre-training on in-domain data a model may excel at specific tasks, it remains uncertain how well it would perform on unseen tasks. Its capabilities may be overestimated based on the benchmark alone, which can lead to unfair comparisons between models and mislead users or stakeholders about the true capabilities of the model.

Regular vs irregular results:

https://preview.redd.it/rnei2lv5nhxb1.png?width=775&format=png&auto=webp&s=1e7b77cda38c40e6033ad93656853cd73be02362

Some thoughts:

The points covered here remind me of the Skill-Mix paper from researchers at Google DeepMind and Princeton, where they found a discrepancy between popular benchmarks and their own evaluation.

https://arxiv.org/abs/2310.17567

A variant of the contamination issue is “cramming for the leaderboard.” It is possible to deliberately train a model on data similar to those used in the leaderboard evaluations. Such datasets are easy to generate from a small number of examples using existing strong model. If “cramming” happens during pre-training, it becomes hard to detect.

Several open models show signs of being over-trained for leaderboards at the expense of general-purpose language capabilities (“cramming”).

Falcon-180B-Chat and Tigerbot-70B-Chat rank higher than LLaMA-2-70B-Chat on Open LLM Leaderboard, but performs worse on SKILL-MIX for both GPT-4 and LLaMA-2 grading. Tigerbot-70B-Chat performs even worse than LLaMA-2-13B-Chat.

Qwen-14B-Chat outperforms LLaMA-2-70B-Chat on MMLU, HumanEval and GSM8K (Cobbe et al., 2021), but performs worse than LLaMA-2-70B-Chat for k = 2, 3, 4 with both GPT-4 and LLaMA-2 grading.

Mistral-7B-v0.1 outperforms LLaMA-2 13B on all benchmarks that the Mistral AI team tested. Mistral-7B-Instruct-v0.1 (the model after instruction tuning) outperforms LLaMA-2-13B-Chat on MT-Bench (Zheng et al., 2023). Yet, the situation is reversed on SKILL-MIX.

Textbooks are all you need? More like pretraining on the test set is all you need.