Nemotron-4 15B is a large multilingual language model trained by NVIDIA on 8T text tokens. It demonstrates strong downstream accuracies across a wide range of English, code, and multilingual evaluation areas.
Nemotron-4 uses a standard decoder-only Transformer architecture with causal attention masks. It has 3.2B embedding parameters and 12.5B non-embedding parameters. It uses Rotary Position Embeddings (RoPE), a SentencePiece tokenizer, squared ReLU activations in the MLP layers, no bias terms, a dropout rate of zero, and untied input-output embeddings. Grouped Query Attention (GQA) is used for faster inference and a lower memory footprint.
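A minimal PyTorch sketch of two of the distinctive choices above: a squared-ReLU MLP with no bias terms, and grouped-query attention. The layer sizes and head counts are illustrative placeholders, not the report's actual hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SquaredReLUMLP(nn.Module):
    """MLP block with squared-ReLU activation and no bias terms."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff, bias=False)    # no bias terms
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.relu(self.up(x)) ** 2)         # squared ReLU: max(0, x)^2

def grouped_query_attention(q, k, v):
    """Grouped-Query Attention: several query heads share each KV head,
    shrinking the KV cache and speeding up inference.
    q: (B, n_q_heads, T, d); k, v: (B, n_kv_heads, T, d)."""
    group = q.shape[1] // k.shape[1]                      # query heads per KV head
    k = k.repeat_interleave(group, dim=1)                 # expand KV heads to match queries
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

q = torch.randn(1, 48, 16, 128)                           # 48 query heads (illustrative)
kv = torch.randn(1, 8, 16, 128)                           # 8 shared KV heads (illustrative)
out = grouped_query_attention(q, kv, kv)
```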
At a high level, the data blend is split into three different types of data: English natural-language data (70%), multilingual natural-language data (15%), and source-code data (15%).
The English corpus consists of curated documents from a variety of sources and domains, including web documents, news articles, scientific papers, and books.
The code and multilingual data consist of a diverse set of programming and natural languages.
The authors find that appropriately sampling tokens across these languages is key to strong accuracies in these domains.
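The report shows the resulting language distributions but does not publish a sampling formula. A common recipe for upweighting low-resource languages is temperature-scaled sampling over per-language token counts; a sketch under that assumption (the counts below are made up):

```python
def sampling_weights(token_counts: dict[str, float], temperature: float = 0.7) -> dict[str, float]:
    """Temperature-scaled sampling: p_i proportional to (n_i / N) ** T.
    T < 1 flattens the distribution, upsampling low-resource languages."""
    total = sum(token_counts.values())
    scaled = {lang: (n / total) ** temperature for lang, n in token_counts.items()}
    z = sum(scaled.values())
    return {lang: w / z for lang, w in scaled.items()}

# Hypothetical counts: the low-resource language gets a far larger share
# of sampled tokens than its raw fraction of the corpus.
print(sampling_weights({"de": 1_000_000_000, "sw": 10_000_000}))
```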
A BPE tokenizer is trained with SentencePiece on a randomly sampled subset of the pretraining data. To improve coverage of low-resource languages in the tokenizer, non-English data is upsampled relative to the final training dataset distribution.
The tokenizer preserves whitespace (including leading and trailing whitespace), splits numbers into their individual digits, and relies on byte-level backoff to handle unknown character sequences. The final vocabulary size is 256,000 tokens.
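A sketch of how a tokenizer with these properties could be trained via the SentencePiece Python API. The mapping of properties onto training flags is an assumption, and the file paths are placeholders; the report does not give the exact invocation.

```python
import sentencepiece as spm

# Hypothetical invocation mapping the stated tokenizer properties onto
# SentencePiece training flags; paths and flag choices are assumptions.
spm.SentencePieceTrainer.train(
    input="pretrain_subset.txt",          # randomly sampled subset, non-English upsampled
    model_prefix="nemotron_bpe",
    model_type="bpe",
    vocab_size=256000,                    # final vocabulary size from the report
    byte_fallback=True,                   # byte-level backoff for unknown sequences
    split_digits=True,                    # numbers split into individual digits
    allow_whitespace_only_pieces=True,    # preserve runs of whitespace
    remove_extra_whitespaces=False,       # keep leading and trailing whitespace
)

sp = spm.SentencePieceProcessor(model_file="nemotron_bpe.model")
print(sp.encode("pi = 3.14159", out_type=str))  # digits tokenized one by one
```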
Similar to Gemini, the authors find that switching the data distribution and learning-rate decay schedule at the end of model training greatly improves model quality.
In this additional phase of continued training, two distinct data distributions are used (a sketch of the switch follows the list):
- The first distribution draws on tokens already introduced during pretraining, but with a larger sampling weight on higher-quality sources.
- The second distribution introduces a small number of benchmark-style alignment examples, so the model can better respond to such questions in downstream evaluations, while also upweighting data sources from areas of low model performance.
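The report describes this switch only qualitatively; the source names, mixture weights, and phase boundaries below are illustrative placeholders, not values from the paper.

```python
import random

# Illustrative mixes only -- the report does not publish exact weights.
PRETRAIN_MIX = {"web": 0.55, "books": 0.10, "papers": 0.05, "code": 0.15, "multilingual": 0.15}
QUALITY_MIX  = {"web": 0.35, "books": 0.20, "papers": 0.15, "code": 0.15, "multilingual": 0.15}
ALIGN_MIX    = {**QUALITY_MIX, "benchmark_style": 0.05, "weak_areas": 0.10}

def sample_source(step: int, total_steps: int) -> str:
    """Pick a data source for this step: the pretraining mix for most of
    training, then the two continued-training distributions at the end."""
    if step < 0.95 * total_steps:       # phase boundaries are assumptions
        mix = PRETRAIN_MIX
    elif step < 0.98 * total_steps:
        mix = QUALITY_MIX               # upweight higher-quality sources
    else:
        mix = ALIGN_MIX                 # add benchmark-style examples, boost weak areas
    sources, weights = zip(*mix.items())
    return random.choices(sources, weights=weights, k=1)[0]
```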
Benchmarks:
- Commonsense Reasoning (0-shot): SIQA, ARC easy and challenge, PIQA, Winogrande, and HellaSwag.
- Popular Aggregated Benchmarks: MMLU (5-shot) and BBH (3-shot).
- Math: GSM8K (8-shot with maj@1).
- Code: Pass@1 scores on HumanEval (0-shot), MBPP (3-shot), and MultiPL-E (0-shot); see the pass@k sketch after this list.
- Multilingual: classification via XCOPA (0- and 4-shot), machine translation with FLORES-101 (8-shot), and generation tasks such as MGSM (8-shot) and TyDiQA (1-shot).
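The code scores are pass@1, presumably following the standard unbiased pass@k estimator from the Codex paper (Chen et al., 2021); for k = 1 it reduces to the fraction of correct samples.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn from n generations (c of which pass the unit tests) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # 0.3 -- pass@1 reduces to c / n
```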
Commonsense Reasoning
- Nemotron-4 15B achieves the strongest average performance among the compared baselines.
Popular Aggregated Benchmarks
- Nemotron-4 15B achieves the best score on BBH across existing models.
- Nemotron-4 is significantly better than the LLaMA-2 70B model on the BBH benchmark.
- Nemotron-4 15B additionally attains a highly competitive MMLU score.
Math and Code
- On mathematical reasoning, Nemotron-4 15B achieves strong performance: it attains a score similar to Gemma 7B, but lags behind models such as Baichuan-2 and QWEN.
- On code tasks, Nemotron-4 performs on par with QWEN 14B while remaining slightly behind Gemma 7B.
- Across both types of tasks, Nemotron-4 15B outperforms Mistral 7B and LLaMA-2 13B/34B.
Multilingual
Classification
- Nemotron-4 achieves the best performance among all models, realizing nearly a 12% improvement in the four-shot setting.
Generation
- Nemotron-4 15B significantly improves upon the next-best model, PaLM 62B-cont.
- Nemotron-4 15B achieves the best performance among the compared models, improving upon the closest score by nearly 30%.
Machine Translation
- Nemotron-4 15B outperforms both LLaMA-2 13B and Baichuan-2 13B, improving upon their performance by 90.2% and 44.1% respectively.
- Nemotron-4 15B not only performs well when translating from Chinese into English but also attains impressive results on direct translation from Chinese into other languages.
Nemotron-4 15B Technical Report (arXiv:2402.16819)