Nemotron-4 15B is a large multilingual language model trained by NVIDIA on 8T text tokens. It demonstrates strong downstream accuracies across a wide range of English, code, and multilingual evaluation areas.
Nemotron-4 uses a standard decoder-only Transformer architecture with causal attention masks. It has 3.2B embedding parameters and 12.5B non-embedding parameters. It uses Rotary Position Embeddings (RoPE), a SentencePiece tokenizer, squared ReLU activations in the MLP layers, no bias terms, a dropout rate of zero, and untied input-output embeddings. Grouped Query Attention (GQA) is used for faster inference and a lower memory footprint.
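A minimal PyTorch sketch of two of the distinctive choices above: a squared-ReLU MLP with no bias terms, and grouped-query attention. The layer sizes and head counts are illustrative placeholders, not the report's actual hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SquaredReLUMLP(nn.Module):
    """MLP block with squared-ReLU activation and no bias terms."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff, bias=False)    # no bias terms
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.relu(self.up(x)) ** 2)         # squared ReLU: max(0, x)^2

def grouped_query_attention(q, k, v):
    """Grouped-Query Attention: several query heads share each KV head,
    shrinking the KV cache and speeding up inference.
    q: (B, n_q_heads, T, d); k, v: (B, n_kv_heads, T, d)."""
    group = q.shape[1] // k.shape[1]                      # query heads per KV head
    k = k.repeat_interleave(group, dim=1)                 # expand KV heads to match queries
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

q = torch.randn(1, 48, 16, 128)                           # 48 query heads (illustrative)
kv = torch.randn(1, 8, 16, 128)                           # 8 shared KV heads (illustrative)
out = grouped_query_attention(q, kv, kv)
```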
At a high level, the data blend is split into three different types of data: English natural-language data (70%), multilingual natural-language data (15%), and source-code data (15%).
The English corpus consists of curated documents from a variety of sources and domains, including web documents, news articles, scientific papers, and books.
The code and multilingual data consist of a diverse set of programming and natural languages.
The authors find that appropriately sampling tokens across these languages is key to strong accuracies in these domains.
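The report shows the resulting language distributions but does not publish a sampling formula. A common recipe for upweighting low-resource languages is temperature-scaled sampling over per-language token counts; a sketch under that assumption (the counts below are made up):

```python
def sampling_weights(token_counts: dict[str, float], temperature: float = 0.7) -> dict[str, float]:
    """Temperature-scaled sampling: p_i proportional to (n_i / N) ** T.
    T < 1 flattens the distribution, upsampling low-resource languages."""
    total = sum(token_counts.values())
    scaled = {lang: (n / total) ** temperature for lang, n in token_counts.items()}
    z = sum(scaled.values())
    return {lang: w / z for lang, w in scaled.items()}

# Hypothetical counts: the low-resource language gets a far larger share
# of sampled tokens than its raw fraction of the corpus.
print(sampling_weights({"de": 1_000_000_000, "sw": 10_000_000}))
```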
A BPE tokenizer is trained with SentencePiece on a randomly sampled subset of the pretraining data. To improve coverage of low-resource languages in the tokenizer, non-English data is upsampled relative to the final training dataset distribution.
The tokenizer preserves whitespace (including leading and trailing whitespace), splits numbers into their individual digits, and relies on byte-level backoff to handle unknown character sequences. The final vocabulary size is 256,000 tokens.
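A sketch of how a tokenizer with these properties could be trained via the SentencePiece Python API. The mapping of properties onto training flags is an assumption, and the file paths are placeholders; the report does not give the exact invocation.

```python
import sentencepiece as spm

# Hypothetical invocation mapping the stated tokenizer properties onto
# SentencePiece training flags; paths and flag choices are assumptions.
spm.SentencePieceTrainer.train(
    input="pretrain_subset.txt",          # randomly sampled subset, non-English upsampled
    model_prefix="nemotron_bpe",
    model_type="bpe",
    vocab_size=256000,                    # final vocabulary size from the report
    byte_fallback=True,                   # byte-level backoff for unknown sequences
    split_digits=True,                    # numbers split into individual digits
    allow_whitespace_only_pieces=True,    # preserve runs of whitespace
    remove_extra_whitespaces=False,       # keep leading and trailing whitespace
)

sp = spm.SentencePieceProcessor(model_file="nemotron_bpe.model")
print(sp.encode("pi = 3.14159", out_type=str))  # digits tokenized one by one
```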
Similar to Gemini, the authors find that switching the data distribution and learning-rate decay schedule at the end of model training greatly improves model quality.
In this additional phase of continued training, two distinct data distributions are used (a sketch of the switch follows the list):
- The first distribution draws on tokens already introduced during pretraining, but with a larger sampling weight on higher-quality sources.
- The second distribution introduces a small number of benchmark-style alignment examples, so the model can better respond to such questions in downstream evaluations, while also upweighting data sources from areas of low model performance.
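The report describes this switch only qualitatively; the source names, mixture weights, and phase boundaries below are illustrative placeholders, not values from the paper.

```python
import random

# Illustrative mixes only -- the report does not publish exact weights.
PRETRAIN_MIX = {"web": 0.55, "books": 0.10, "papers": 0.05, "code": 0.15, "multilingual": 0.15}
QUALITY_MIX  = {"web": 0.35, "books": 0.20, "papers": 0.15, "code": 0.15, "multilingual": 0.15}
ALIGN_MIX    = {**QUALITY_MIX, "benchmark_style": 0.05, "weak_areas": 0.10}

def sample_source(step: int, total_steps: int) -> str:
    """Pick a data source for this step: the pretraining mix for most of
    training, then the two continued-training distributions at the end."""
    if step < 0.95 * total_steps:       # phase boundaries are assumptions
        mix = PRETRAIN_MIX
    elif step < 0.98 * total_steps:
        mix = QUALITY_MIX               # upweight higher-quality sources
    else:
        mix = ALIGN_MIX                 # add benchmark-style examples, boost weak areas
    sources, weights = zip(*mix.items())
    return random.choices(sources, weights=weights, k=1)[0]
```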
Benchmarks:
- Commonsense Reasoning (0-shot): SIQA, ARC easy and challenge, PIQA, Winogrande, and HellaSwag.
- Popular Aggregated Benchmarks: MMLU (5-shot) and BBH (3-shot).
- Math: GSM8K (8-shot with maj@1).
- Code: Pass@1 scores on HumanEval (0-shot), MBPP (3-shot), and MultiPL-E (0-shot); see the pass@k sketch after this list.
- Multilingual: classification via XCOPA (0- and 4-shot), machine translation with FLORES-101 (8-shot), and generation tasks such as MGSM (8-shot) and TyDiQA (1-shot).
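The code scores are pass@1, presumably following the standard unbiased pass@k estimator from the Codex paper (Chen et al., 2021); for k = 1 it reduces to the fraction of correct samples.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn from n generations (c of which pass the unit tests) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # 0.3 -- pass@1 reduces to c / n
```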
Commonsense Reasoning
- Nemotron-4 15B achieves the strongest average performance among the compared baselines.
Popular Aggregated Benchmarks
- Nemotron-4 15B achieves the best score on BBH across existing models.
- Nemotron-4 is significantly better than the LLaMA-2 70B model on the BBH benchmark.
- Nemotron-4 15B additionally attains a highly competitive MMLU score.
Math and Code
- On mathematical reasoning, Nemotron-4 15B achieves strong performance: it attains a score similar to Gemma 7B, but lags behind models such as Baichuan-2 and QWEN.
- On code tasks, Nemotron-4 performs on par with QWEN 14B while remaining slightly behind Gemma 7B.
- Across both types of tasks, Nemotron-4 15B outperforms Mistral 7B and LLaMA-2 13B/34B.
Multilingual
Classification
- Nemotron-4 achieves the best performance among all models, realizing nearly a 12% improvement in the four-shot setting.
Generation
- Nemotron-4 15B significantly improves upon the next-best model, PaLM 62B-cont.
- Nemotron-4 15B achieves the best performance among the compared models, improving upon the closest score by nearly 30%.
Machine Translation
- Nemotron-4 15B outperforms both LLaMA-2 13B and Baichuan-2 13B, improving upon their performance by 90.2% and 44.1% respectively.
- Nemotron-4 15B not only performs well when translating from Chinese into English but also attains impressive results on direct translation from Chinese into other languages.
Nemotron-4 15B Technical Report (arXiv:2402.16819)