Open Pre-Trained Transformer Language Models (OPT): What does it take to train GPT-3?

2022/6/16

Neural Search Talks — Zeta Alpha

Andrew Yates (Assistant Professor at the University of Amsterdam) and Sergi Castella i Sapé discuss the recent "Open Pre-trained Transformer (OPT) Language Models" paper from Meta AI (formerly Facebook). In this replication work, Meta developed and trained a 175-billion-parameter Transformer very similar to OpenAI's GPT-3, documenting the process in detail to share their findings with the community. The code, pretrained weights, and logbook are available in their GitHub repository (links below).

Links

❓ Feedback Form: https://scastella.typeform.com/to/rg7a5GfJ

📄 OPT paper: https://arxiv.org/abs/2205.01068

👾 Code: https://github.com/facebookresearch/metaseq

📒 Logbook: https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B_Logbook.pdf

✍️ OPT Official Blog Post: https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/

OpenAI Embeddings API: https://openai.com/blog/introducing-text-and-code-embeddings/

Nils Reimers' critique of OpenAI Embeddings API: https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-really-a-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9

Timestamps:

00:00 Introduction and housekeeping: new feedback form, ACL conference highlights

02:42 The convergence between NLP and Neural IR techniques

06:43 Open Pre-trained Transformer: motivation and scope, reproducing GPT-3 and open-sourcing

08:16 Basics of OPT: architecture, pre-training objective, teacher forcing, tokenizer, training data

13:40 Findings from preliminary experiments: hyperparameters, training stability, spikiness

20:08 Problems that appear at scale when training with 992 GPUs

23:01 Using temperature to check whether GPUs are working

25:00 Training the largest model: what to do when the loss explodes? (which happens quite often)

29:15 When they switched away from AdamW to SGD

32:00 Results: successful but not quite GPT-3 level. Toxicity?

35:45 Replicability of Large Language Models research. Was GPT-3 replicable? What difference does it make?

37:25 What makes a paper replicable?

40:33 Directions in which large Language Models are applied to Information Retrieval

45:15 Final thoughts and takeaways
