
OpenAI's GPT-3 Language Model: A Technical Overview

by Chuan Li, PhD

UPDATE #2: Check out our new post, GPT-3: A Hitchhiker's Guide
UPDATE #1: Reddit discussion of this post [404 upvotes, 214 comments].

OpenAI recently published GPT-3, the largest language model ever trained. GPT-3 has 175 billion parameters and would require 355 GPU-years and $4,600,000 to train, even with the lowest-priced GPU cloud on the market.[1]

GPT-3 Key Takeaways

  • GPT-3 shows that language model performance scales as a power-law of model size, dataset size, and the amount of computation.
  • GPT-3 demonstrates that a language model trained on enough data can solve NLP tasks that it has never explicitly encountered. That is, GPT-3 positions a single model as a general-purpose solution for many downstream tasks without fine-tuning.
  • The cost of AI is increasing exponentially. Training GPT-3 would cost over $4.6M using a Tesla V100 cloud instance.
  • The size of state-of-the-art (SOTA) language models is growing by at least a factor of 10 every year. This outpaces the growth of GPU memory. For NLP, the days of "embarrassingly parallel" are coming to an end; model parallelization will become indispensable.
  • Although there is a clear performance gain from increasing the model capacity, it is not clear what is really going on under the hood. In particular, it remains an open question whether the model has learned to reason, or simply memorizes training examples in a more sophisticated way.

Diving into the Model

GPT-3 comes in eight sizes, ranging from 125M to 175B parameters. The largest GPT-3 model is an order of magnitude larger than the previous record holder, T5-11B. The smallest GPT-3 model is roughly the size of BERT-Base and RoBERTa-Base.

All GPT-3 models use the same attention-based architecture as their GPT-2 predecessor.[3] The smallest GPT-3 model (125M) has 12 attention layers, each with 12 heads of dimension 64. The largest GPT-3 model (175B) uses 96 attention layers, each with 96 heads of dimension 128.
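
The head counts above determine the model width (d_model = n_heads × d_head), which in turn roughly sets the parameter budget. As a sanity check, here is a back-of-the-envelope estimate using the common 12 · n_layers · d_model² approximation for non-embedding parameters; this is my own rough calculation, not a figure from the paper.

```python
# Back-of-the-envelope parameter count for a decoder-only Transformer.
# The 12 * n_layers * d_model**2 rule of thumb (4*d_model^2 for attention,
# 8*d_model^2 for the feed-forward block) ignores embeddings and biases,
# so it slightly undershoots the official sizes.

def approx_params(n_layers: int, n_heads: int, d_head: int) -> float:
    d_model = n_heads * d_head            # model width implied by the head layout
    return 12 * n_layers * d_model ** 2

print(f"GPT-3 Small: ~{approx_params(12, 12, 64) / 1e6:.0f}M non-embedding params")
print(f"GPT-3 175B:  ~{approx_params(96, 96, 128) / 1e9:.0f}B non-embedding params")
# -> roughly 85M and 174B; adding the token and position embeddings brings
#    the totals close to the advertised 125M and 175B.
```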

GPT-3 expands the capacity of GPT-2 by roughly two orders of magnitude without significant modification of the model architecture: just more layers, wider layers, and more data to train on.

Understanding the Data

Since neural networks are essentially a compressed/compiled version of their training data, the size of the dataset has to scale accordingly with the size of the model. GPT-3 175B is trained with 300 billion tokens collected from a weighted combination of the following datasets:

Dataset                              # Tokens (Billions)
Common Crawl (filtered by quality)   180.4
WebText2                             55.1
Books1                               22.8
Books2                               23.65
Wikipedia                            10.2

Notice that GPT-2 1.5B was trained with 40GB of Internet text, which is roughly 10 billion tokens (assuming an average token size of 4 characters). So GPT-3 175B has a lower data-to-parameter ratio of 300 / 175 ≈ 1.71, compared to 10 / 1.5 ≈ 6.66 for GPT-2 1.5B. This raises the question of whether, with this many parameters, the model simply memorizes the training data and does pattern matching at inference time. We will return to this point later in this article.
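
Spelled out as a quick calculation (a trivial sketch using the approximate token counts quoted above):

```python
# Tokens-per-parameter comparison, using the approximate figures quoted above.
gpt2_tokens, gpt2_params = 10e9, 1.5e9     # ~40 GB of text at ~4 chars per token
gpt3_tokens, gpt3_params = 300e9, 175e9

print(f"GPT-2 1.5B: {gpt2_tokens / gpt2_params:.2f} tokens per parameter")  # ~6.67
print(f"GPT-3 175B: {gpt3_tokens / gpt3_params:.2f} tokens per parameter")  # ~1.71
```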

One novel challenge GPT-3 has to deal with is data contamination. Since its training dataset is sourced from the internet, it is possible that the training data overlaps with some of the testing datasets. Although the GPT-2 paper touched on this topic, it is particularly relevant to GPT-3 175B because its dataset and model size are about two orders of magnitude larger than those used for GPT-2, creating increased potential for contamination and memorization.

To investigate the impact of data contamination, the OpenAI team produced a "clean" version of the testing dataset for each downstream task, removing all potentially leaked examples, defined roughly as "examples that have a 13-gram overlap with anything in the training set". They then evaluated GPT-3 on these cleaned testing datasets and compared the scores to those on the original "un-cleaned" datasets.
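
OpenAI has not released its decontamination code, but the core idea can be sketched as follows. This is a minimal illustration only: it tokenizes by whitespace and keeps the whole n-gram index in memory, whereas the real pipeline works on the model's tokenizer output at a vastly larger scale.

```python
# Minimal sketch of 13-gram overlap filtering for benchmark decontamination.
# Illustrative only -- not OpenAI's actual implementation.

def ngrams(text, n=13):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_train_index(training_docs, n=13):
    index = set()
    for doc in training_docs:
        index |= ngrams(doc, n)
    return index

def clean_benchmark(examples, train_index, n=13):
    """Keep only benchmark examples with no n-gram overlap with the training set."""
    return [ex for ex in examples if not (ngrams(ex, n) & train_index)]
```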

The comparisons show that the performance on most benchmarks changed negligibly. However, a few tasks were significantly impacted by the data cleaning process; OpenAI flagged these tasks for further review.

Training the Model

GPT-3 is trained using next-word prediction, just like its GPT-2 predecessor. To train models of different sizes, the batch size is increased with the number of parameters, while the learning rate is decreased accordingly. For example, GPT-3 125M uses a batch size of 0.5M tokens and a learning rate of 6.0×10⁻⁴, while GPT-3 175B uses a batch size of 3.2M tokens and a learning rate of 0.6×10⁻⁴.

We are waiting for OpenAI to reveal more details about the training infrastructure and model implementation. But to put things into perspective, the GPT-3 175B model required 3.14×10²³ FLOPs of compute for training. Even at a theoretical 28 TFLOPS for the V100[6] and the lowest 3-year reserved cloud pricing we could find, a single training run would take 355 GPU-years and cost $4.6M. Similarly, a single RTX 8000, assuming 15 TFLOPS, would take about 665 years.
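
These figures are easy to reproduce from the numbers in this post. The sketch below assumes a sustained 28 TFLOPS per V100 and $1.50/hour pricing; real-world utilization would push both estimates higher.

```python
# Reproducing the 355 GPU-year / $4.6M estimate from the figures above.
total_flops = 3.14e23             # training compute reported for GPT-3 175B
v100_flops = 28e12                # theoretical FP16 throughput assumed for a V100
price_per_gpu_hour = 1.50         # Lambda V100 cloud pricing used in this post

gpu_seconds = total_flops / v100_flops
gpu_years = gpu_seconds / (3600 * 24 * 365)
cost_usd = gpu_seconds / 3600 * price_per_gpu_hour

print(f"{gpu_years:.0f} GPU-years, ${cost_usd / 1e6:.1f}M")
# -> ~356 GPU-years and ~$4.7M, in line with the ~355 years / $4.6M quoted above.
```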

Time is not the only enemy. The 175 billion parameters need 175B × 4 bytes = 700 GB of memory to store in FP32 (each parameter takes 4 bytes). This is one order of magnitude more than the maximum memory of a single GPU (48 GB on a Quadro RTX 8000). To train the larger models without running out of memory, the OpenAI team uses a mixture of model parallelism within each matrix multiply and model parallelism across the layers of the network. All models were trained on V100 GPUs as part of a high-bandwidth cluster provided by Microsoft.
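
A quick memory estimate makes the point (parameters only; optimizer states and activations would add several times more in practice):

```python
# Raw weight-storage requirements vs. a single GPU's memory (parameters only).
def weight_memory_gb(n_params, bytes_per_param):
    return n_params * bytes_per_param / 1e9

for name, n_params in [("GPT-2 1.5B", 1.5e9), ("GPT-3 175B", 175e9)]:
    print(f"{name}: {weight_memory_gb(n_params, 4):.0f} GB in FP32, "
          f"{weight_memory_gb(n_params, 2):.0f} GB in FP16")
# GPT-3 175B needs ~700 GB in FP32 (~350 GB in FP16), vs. 48 GB on a Quadro RTX 8000.
```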

In fact, the size of the SOTA language model has been increasing by at least a factor of 10 every year: BERT-Large (2018) has 355M parameters, GPT-2 (early 2019) reaches 1.5B, T5 (late 2019) further stretches to 11B, and GPT-3 (mid-2020) finally gets to 175B. The growth in the size of language models clearly outpaces the growth of GPU memory.[2] This implies that for NLP, the days of "embarrassingly parallel" are coming to an end, and model parallelization is going to be indispensable for researching SOTA language models.

Running Inference

This is where GPT models really stand out. Other language models, such as BERT or Transformer-XL, need to be fine-tuned for downstream tasks. For example, to use BERT for sentiment classification or question answering, one needs to incorporate additional layers that run on top of the sentence encodings. Since this requires one model per task, the solution is not plug-and-play.

However, this is not the case for GPT models. GPT uses a single model for all downstream tasks. Last year, OpenAI already showed GPT-2's potential as a turn-key solution for a range of downstream NLP tasks without fine-tuning. The new generation, GPT-3, uses a more structured approach to running inference and demonstrates even better performance.

It uses a paradigm in which zero, one, or a few examples are prefixed to the input of the model. For example, in the few-shot scenario, the model is presented with a task description, a few dozen examples, and a prompt. GPT-3 then takes all of this as context and starts to predict the output token by token. The zero-shot and one-shot settings work the same way; only the number of examples is reduced.

Let's use the task of English-to-French translation as a concrete example: the task description can be the sentence "Translate English to French:". The few dozen examples may include text such as "sea otter => loutre de mer" and "peppermint => menthe poivrée", etc. The prompt is the English word to be translated, for example, "cheese =>". The model is then expected to output the French word for cheese, which is "fromage".
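
A prompt in this paradigm is just a text string for the model to complete. The sketch below shows how such a few-shot prompt might be assembled; the helper function and the exact formatting (the "=>" separator, one example per line) are illustrative, following the example above rather than any fixed API.

```python
# Assembling a few-shot prompt for English-to-French translation.
# The formatting follows the example in the text; it is illustrative only.

def build_few_shot_prompt(task_description, examples, query):
    lines = [task_description]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")           # the model is asked to continue from here
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
)
print(prompt)   # the expected completion is "fromage"
```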

Results

Next, we briefly discuss the performance of GPT-3 on some of its downstream tasks.

Text Generation

This is GPT's rockstar application -- a conditional generative model that creates near-human-quality text. Given the beginning of an article, the model is asked to generate the rest of the story word by word.

More precisely, GPT-3 is presented with a title, a subtitle, and the prompt word "Article: ". It then writes short articles (~200 words) that fool humans most of the time. According to OpenAI's user study, "mean human accuracy at detecting articles that were produced by the 175B parameter model was barely above chance at ~52%".[4] In other words, humans are essentially making random guesses when asked to identify GPT-3-generated articles. In contrast, the mean human accuracy at detecting articles produced by the smallest GPT-3 model (125M) is 76%.

This is a big deal: "simply" increasing the size of the model by three orders of magnitude turns something that is half-working into something indistinguishable from human work. In plain English, it empirically shows that the number of model parameters, the FLOP/s-days, and the number of training examples need to grow as a power function of the desired improvement in model performance.[5]

Of course, GPT-3 may still produce non-factual content (such as suggesting that the popular U.S. TV program "The Tonight Show" is hosted by Megyn Kelly instead of Jimmy Fallon), and OpenAI did not claim the model is ready for writing the last two books of "A Song of Ice and Fire." Nonetheless, getting closer to the finish line of the Turing test for writing short articles is significant, and will no doubt have a great impact on social media.

General NLP Tasks

Although writing a new article is cool, the killer feature of GPT-3 is the ability to be 're-programmed' for general NLP tasks without any fine-tuning. This is where OpenAI's real ambition lies: having a single model do just about anything by conditioning it on a few examples.

The paper covers a dozen downstream tasks, ranging from the usual suspects such as machine translation and question answering to unexpected new tasks such as arithmetic computation and one-shot learning of novel words. Instead of reiterating the details of each task, the rest of this article will discuss some common patterns across the board.

The Role of the Few-Shots

What is the role of the few examples that are fed to the GPT-3 model before it makes predictions? Do more examples improve the results?

One way to think about these examples is that they "teach" the model how to do reasoning. It would be amazing if this were really the case, because it would show that the model can indeed be reprogrammed for new tasks very quickly. However, it is not clear how such a reprogramming process works under the hood. GPT-3 tends to perform better on language modeling tasks and less well on reasoning tasks. For example, GPT-3 really thrives at machine translation, especially when the target language is English. It even beats the fine-tuned SOTA on WMT Fr->En and WMT De->En. On the other hand, GPT-3 performs significantly worse than the fine-tuned SOTA on SuperGLUE's BoolQ task, which asks the model to read a short passage and then answer a related True/False question. The 15% gap between the fine-tuned SOTA and few-shot GPT-3 suggests that the model isn't particularly strong at reasoning over a passage it has not seen in training.

Another interesting view is that these examples function as "filters" that let the model search for highly relevant contexts or patterns in its training data. This is possible because the dataset is effectively compressed into the weights of the network. Examples that respond strongly to the filters are then "interpolated" to produce the output of the model. Naturally, the more examples you give to the model, the more precise the filter becomes, and consequently, the better the results.

At this stage, I find the second explanation more plausible. Language models are designed to generate readable text. They do not have a deep "understanding" of the physical world, nor are they trained to do sophisticated reasoning. Think about how we come to understand the world: reading newspapers and novels is not enough. Otherwise, there would be no need to study math, physics, engineering, etc.

One particularly interesting case is arithmetic calculation: the model gives a perfect score on two-digit addition and subtraction (100% accuracy), but "fails" at five digits (less than 10% accuracy). I find this rather interesting: there are in total 100,000 × 100,000 = 10 billion different combinations for five-digit addition. Every example takes at least five tokens (the two input numbers, the plus sign, the equals sign, and the output number), so it would require at least 5 billion tokens to store 10% of the examples. The entire training dataset has 300 billion tokens. So to argue that the network is purely memorizing the training data, at least 5 / 300 ≈ 1.7% of the training data would have to be five-digit addition. I honestly don't think five-digit addition appears that often in daily life. This indicates that the network is, to some degree, learning to work with numbers instead of memorizing their combinations.
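
Here is that memorization-budget argument written out, using the same assumptions as above (five tokens per stored example, 10% coverage):

```python
# The rough "memorization budget" argument for five-digit addition.
combinations = 100_000 * 100_000    # all five-digit + five-digit additions: 10 billion
tokens_per_example = 5              # two operands, '+', '=', and the result
coverage = 0.10                     # examples needed to explain ~10% accuracy

tokens_needed = combinations * coverage * tokens_per_example
training_tokens = 300e9

print(f"{tokens_needed / 1e9:.0f}B tokens, "
      f"or {tokens_needed / training_tokens:.1%} of the training set")
# -> 5B tokens, i.e. ~1.7% of the training data would have to be five-digit addition.
```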

The Next Level

GPT-3 has generated a lot of discussion on Hacker News. One comment I found particularly intriguing compares the human brain with where we are with language models: a typical human brain has over 100 trillion synapses, another three orders of magnitude more than the GPT-3 175B model. Given that it took OpenAI only about a year and a quarter to increase their GPT model capacity by two orders of magnitude, from 1.5B to 175B, models with trillions of weights suddenly look within reach.

If GPT-2 was "too dangerous to release" and GPT-3 nearly passes the Turing test for writing short articles, what can a trillion-parameter model do? For years the research community has been searching for chatbots that "just work"; could GPT-3 be the breakthrough? Is it really possible to have a massive pre-trained model, so that any downstream task becomes a matter of providing a few examples or descriptions in the prompt? On a broader scale, can this "data compilation + reprogramming" paradigm ultimately lead us to AGI? AI safety has a long way to go to prevent techniques like these from being misused, but it seems the day of having truly intelligent conversations with robots is just over the horizon.


Footnotes

1. We use Lambda GPU instance pricing of $1.50 / hour for a Tesla V100 (8x Tesla V100s are $12.00 / hour): 355 years × 365 days/year × 24 hours/day × $1.50/hour ≈ $4.6M.

2. NVIDIA's flagship server grade GPU increased its memory from 32GB to 40GB over the past two years.

3. With the exception that GPT-3 uses alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer ("Generating Long Sequences with Sparse Transformers", Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever, 2019).

4. The accuracy measures how successfully humans can detect machine-generated articles. A lower detection rate means better performance of the GPT model. The theoretical best case is humans making random guesses at a 50% success rate, because they cannot distinguish what is real from what is fake.

5. GPT-3 shows that it is possible to improve the performance of a model by "simply" increasing the model size, and consequently the dataset size and the computation (FLOPs) the model consumes. However, as the performance increases, the model size has to increase ever more rapidly. Precisely, the model size varies as some power of the improvement in model performance.

6. We double the V100's theoretical 14 TFLOPS of FP32 to get its theoretical 28 TFLOPS of FP16. Note that optimized ML models can leverage the V100's Tensor Cores (112 TFLOPS) to further speed up training. However, we do not expect an 8x speedup from Tensor Cores, because the real speedup of mixed-precision training compared to FP32 is usually shy of 2x for image models, and up to 4x for language models with high sparsity, even with implementations that are highly optimized for Tensor Cores.

 

 
