There are two noteworthy things covered here:

  1. Skywork-13B, a new bilingual foundation model for English and Chinese. They also announce Skywork-13B-Chat enhanced specially for creative writing, Skywork-13B-Math for math, Skywork-13B-MM for multimodal capability, and a segment of their SkyPile Corpus comprising 150 billion tokens of Chinese web text.
  2. Research into pretraining on in-domain data. Specifically, they show that some recent foundation models may be heavily overfitted to benchmarks and may have been exposed to test data during training.

GitHub and models: https://github.com/SkyworkAI/Skywork/blob/main/README_EN.md

Tech report: https://arxiv.org/abs/2310.19341

Abstract

In this technical report, we present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts. This bilingual foundation model is the most extensively trained and openly published LLMs of comparable size to date. We introduce a two-stage training methodology using a segmented corpus, targeting general purpose training and then domain-specific enhancement training, respectively. We show that our model not only excels on popular benchmarks, but also achieves state of the art performance in Chinese language modeling on diverse domains. Furthermore, we propose a novel leakage detection method, demonstrating that test data contamination is a pressing issue warranting further investigation by the LLM community. To spur future research, we release Skywork-13B along with checkpoints obtained during intermediate stages of the training process. We are also releasing part of our SkyPile corpus, a collection of over 150 billion tokens of web text, which is the largest high quality open Chinese pre-training corpus to date. We hope Skywork-13B and our open corpus will serve as a valuable open-source resource to democratize access to high-quality LLMs.

Training loss and validation loss:

Trajectory of important monitoring metrics during Stage-1 pre-training. Stage-1 pre-training consists of two sequential training sessions, represented by different colors in the loss curves (red for session 0 ∼ 2T and blue for session 2 ∼ 3T).

Benchmark evaluation:

https://preview.redd.it/tqvuls0cmixb1.png?width=786&format=png&auto=webp&s=2c339537baaecc8cc8fa3fdd71f44df732cd8674

Pre-training on in-domain data: a common practice?

Important points at a glance:

We evaluate an LLM’s language modeling loss on three datasets drawn from the same distribution: 1) The official GSM8K training set, 2) The official GSM8K test set, 3) A set composed of GSM8K-like samples generated by GPT-4. The corresponding losses are denoted as L_train, L_test, and L_ref, respectively. Theoretically, if a language model has not been exposed to any of the three datasets during pre-training, the three losses L_train, L_test, and L_ref should be approximately equivalent. However, if the model has been pre-trained on the training set or if the test data has been inadvertently exposed during the pre-training process, we would anticipate a notable discrepancy between L_train, L_test, and L_ref.

Models such as ChatGLM3-6B, Baichuan2-13B, Qwen-7B/14B, and Aquila2-34B display markedly lower loss on the training split than on the test split. Consequently, we postulate that these models may have been considerably pre-trained on the GSM8K training split or similar data.

We believe that there is a valid risk in the practice of targeted pre-training, in that it compromises fairness in benchmarking. While through pre-training on in-domain data a model may excel at specific tasks, it remains uncertain how well it would perform on unseen tasks. Its capabilities may be overestimated based on the benchmark alone, which can lead to unfair comparisons between models and mislead users or stakeholders about the true capabilities of the model.
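The loss comparison described in the points above can be sketched in a few lines of Python. This is only an illustration of the idea, not the report's implementation: the function names, the `tol` threshold, and the toy log-probabilities are all made up, and in practice the per-token log-probabilities would come from running the model under test over each dataset.

```python
def mean_token_loss(samples):
    """Average negative log-likelihood per token over a dataset.

    `samples` is a list of texts, each given as a list of per-token
    log-probabilities (natural log) assigned by the model under test.
    """
    total_tokens = sum(len(sample) for sample in samples)
    total_nll = -sum(sum(sample) for sample in samples)
    return total_nll / total_tokens


def leakage_signals(train, test, ref, tol=0.1):
    """Compare losses on the train split, test split, and a reference set.

    `tol` is a hypothetical threshold chosen for this sketch; the report
    does not prescribe one here.
    """
    l_train = mean_token_loss(train)
    l_test = mean_token_loss(test)
    l_ref = mean_token_loss(ref)
    test_minus_ref = l_test - l_ref      # clearly negative: test set may have leaked
    test_minus_train = l_test - l_train  # clearly positive: train split may have been seen
    return {
        "L_train": l_train,
        "L_test": l_test,
        "L_ref": l_ref,
        "test_minus_ref": test_minus_ref,
        "test_minus_train": test_minus_train,
        "possible_test_leak": test_minus_ref < -tol,
        "possible_train_exposure": test_minus_train > tol,
    }


# Toy, made-up log-probabilities: the model is more confident on the train split.
toy_train = [[-0.5, -0.5], [-0.5]]
toy_test = [[-1.0, -1.0], [-1.0]]
toy_ref = [[-1.0, -1.0]]
print(leakage_signals(toy_train, toy_test, toy_ref))
```

With these toy numbers the train loss is well below the test loss while the test and reference losses match, which is the pattern the excerpt reads as a sign of pre-training on the training split or similar data.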

Regular vs irregular results:

https://preview.redd.it/dll4shngmixb1.png?width=775&format=png&auto=webp&s=0438bab27bf25edcacdbb879279e0959c04b277c

To put this into perspective, QwenLM reports GSM8K 8-shot scores of 16.7 for Llama 2 7B, 29.6 for Llama 2 13B, and 42.2 for Code Llama 34B. In the same chart, Qwen-7B scores 51.7, Baichuan-13B comes in at 52.7, and Qwen-14B tops it off with a whopping 61.3.

It reminds me of the paper that came out last week from researchers at Google DeepMind and Princeton. They assessed models using a new evaluation and found a wide discrepancy:

A variant of the contamination issue is “cramming for the leaderboard.” It is possible to deliberately train a model on data similar to those used in the leaderboard evaluations. Such datasets are easy to generate from a small number of examples using existing strong models. If “cramming” happens during pre-training, it becomes hard to detect.

Several open models show signs of being over-trained for leaderboards at the expense of general-purpose language capabilities (“cramming”).

As the saying goes, pretraining on the test set is all you need.

    • pseudonerv@alien.topB
      11 months ago

      I converted the math version to Q6_K (had to manually create added_tokens.json), and tried three questions.

      Is Taiwan part of China?

      No, Taiwan is not part of China. They are two separate entities with their own governments and political systems.

      The answer is no.

      Translate the following question into Chinese and answer it in Chinese: Is Taiwan part of China?

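      [Model output, translated from Chinese]

      First, we need to translate this question into Chinese. "Is Taiwan part of China?" can be translated as "台湾是中国的一部分吗?"

      Then, we need to answer the question. Geographically, Taiwan is a province of China, but politically, for historical and political reasons, Taiwan and mainland China have differences and are not yet unified. So, from a geographic perspective, Taiwan is indeed part of China; but from a political perspective, Taiwan does not entirely belong to China.

      So the answer is: from a geographic perspective, Taiwan is part of China; but from a political perspective, for historical and political reasons, Taiwan and mainland China have differences and are not yet unified.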

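      台湾是中国的一部分吗? [the same question, "Is Taiwan part of China?", asked in Chinese]

      [Model output, translated from Chinese]

      First, we need to understand Taiwan's geographic and political status. From a geographic perspective, Taiwan lies across from the southeastern coast of mainland China and belongs to the same continental plate, so Taiwan can be considered part of China. From a political perspective, under the principle of "one country, two systems," although Taiwan has its own government, military, and social system, the only sovereignty recognized internationally is one China, namely the People's Republic of China.

      So the answer is: Taiwan is part of China.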