How an LLM becomes more coherent as we train it

Giles' blog


Train a 163-million-parameter model on about one billion tokens and it evolves from word salad into fluent, grammatical sentences.

  • The model produces coherent, grammatical text after only a third of its training budget.
  • Modern models rely on tokenizers, so the early gibberish appears as whole words rather than scattered letters.
  • The nearly two thirds of compute spent in the later stages goes toward making the model's output factually correct.

Feed a large language model with 163 million parameters roughly one billion tokens and it evolves from illogical word salad into fluent, grammatical sentences. Former OpenAI researcher Andrej Karpathy showed back in 2015 how neural networks learn to write character by character; with a modern Transformer architecture, we can now record precisely how a language model evolves over two days and 57 saved checkpoints.

An homage to Karpathy's experiment, and the GPT-2 baseline setup

Back in 2015, the mainstream in neural networks was still recurrent architectures: a model had to start from English letters and punctuation marks and slowly learn to piece together meaningful words. Modern LLMs (large language models) use a very different mechanism. To observe how a modern model builds up meaning, the author trained a lightweight model with a GPT-2-small-like architecture from scratch.

The model has 163 million parameters and was trained on the FineWeb dataset from Hugging Face. The full training set is about 3.2 billion tokens, which as plain text takes up roughly 12.8 GiB on disk.
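
Those numbers are consistent with a GPT-2-small shape. As a rough sanity check (the exact layer sizes of the author's model are an assumption here), a 12-layer, 12-head Transformer with 768-dimensional embeddings, a 1,024-token context, and the 50,257-entry GPT-2 vocabulary lands on about 163 million parameters when the output head is not weight-tied to the token embedding:

    # Back-of-the-envelope parameter count for an assumed GPT-2-small shape.
    V, T, D, L = 50257, 1024, 768, 12     # vocab, context, embed dim, layers

    tok_emb = V * D                       # token embedding table
    pos_emb = T * D                       # learned positional embeddings

    attn = (D * 3 * D + 3 * D) + (D * D + D)      # fused QKV + output projection
    mlp  = (D * 4 * D + 4 * D) + (4 * D * D + D)  # up- and down-projections
    norms = 2 * 2 * D                             # two LayerNorms per block
    block = attn + mlp + norms

    lm_head = D * V                       # untied output head, no bias
    total = tok_emb + pos_emb + L * block + 2 * D + lm_head
    print(f"{total:,}")                   # 163,037,184 -- about 163M

With the output head tied to the embedding table instead, the same shape gives the familiar 124M figure usually quoted for GPT-2-small.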

During the two days of training, the system saved the model weights at regular intervals, leaving 57 checkpoints in total. The test method is very simple: at each checkpoint, ask the model to continue the phrase "Every effort moves you", generating the next 20 tokens at a temperature of 1, and observe how the coherence of its output changes.
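
The post doesn't reproduce the probing code, so the following is only an illustrative sketch of that per-checkpoint test, assuming the checkpoints can be loaded as Hugging Face transformers GPT-2 models (the checkpoint path is hypothetical):

    # Hypothetical sketch: sample 20 tokens at temperature 1 from one checkpoint.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("checkpoints/step-00617")  # illustrative path
    model.eval()

    ids = tok("Every effort moves you", return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(
            ids,
            max_new_tokens=20,       # the post generates 20 tokens per probe
            do_sample=True,          # temperature only matters when sampling
            temperature=1.0,
            pad_token_id=tok.eos_token_id,
        )
    print(tok.decode(out[0]))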

Word salad and high-frequency-word learning in the first 617 steps

Before training starts (step 0), the freshly created model, which has not had a single weight update, responds with a string of completely unrelated words: it spits out things like "esoteric Suns 1896ricia", and even phrases like "despicable capitalists", piled into a meaningless heap.

At this stage one major difference from the 2015 experiment is already visible: the output consists of whole words and word fragments from the very start, not garbled letters. That is because modern models rely on a tokenizer for input and output: text is split into predefined tokens before it ever reaches the model.
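
You can see those predefined tokens directly by running text through the GPT-2 vocabulary, for instance with OpenAI's tiktoken (the library choice here is ours; any GPT-2-compatible tokenizer behaves the same way). The original post's footnote notes that " despicable capitalists", including its leading space, is exactly two tokens, ids 47034 and 32663:

    import tiktoken

    enc = tiktoken.get_encoding("gpt2")        # the 50,257-entry GPT-2 vocabulary

    ids = enc.encode(" despicable capitalists")
    print(ids)                                 # the post gives [47034, 32663]
    print([enc.decode([i]) for i in ids])      # [' despicable', ' capitalists']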

By training step 617, things have changed slightly. In the author's setup, at each step the model sees 96 sequences of 1,024 tokens and is updated according to its loss function (a measure of its prediction error). By this point it has seen about 60 million tokens, and it has begun to output the most common function words in the text, such as "and", "to", "was", and "the".
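
Those token counts follow directly from the batch shape, and can be checked with a few lines of arithmetic using only the figures given in the post:

    batch_size, seq_len = 96, 1024
    tokens_per_step = batch_size * seq_len     # 98,304 tokens per optimizer step

    print(617 * tokens_per_step)               # 60,653,568 -> ~60M tokens by step 617
    print(10_489 * tokens_per_step)            # 1,031,110,656 -> ~1.03B by step 10489
    print(33_164 * tokens_per_step)            # 3,260,153,856 -> the full ~3.2B-token run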

Moving on to steps 1234 and 2468, the outlines of real sentences gradually emerge. The overall logic still doesn't hold together, but grammatically plausible English phrases are already visible, such as "take the rest of his team" and "moves you to a different country".

Business text emerges past the one-billion-token mark

By step 9255, the generated content starts to show a clear stylistic lean. Because the training data was scraped straight from web pages, it contains a great deal of business-marketing language, and the model duly writes safe, middle-of-the-road sentences like "make sure that your clients are satisfied".

By step 10489, the model has even turned into a self-help guru, producing an utterly fluent motivational line: "Every effort moves you to be the best that you will ever have...". Part of that tone is probably being steered by the prompt itself.

At this point in the run, the model has processed about 1.03 billion tokens, roughly a third of the full training plan. The loss curve shows that almost all of the steep drop in loss happens within this phase: a lightweight model only a third of the way through training can already produce highly coherent text that reads naturally.

Formatting polish and markup from step 14191 to step 33164

In the training that follows, the model no longer makes leaps in grammar; instead it starts to show a grip on text formatting and finer points of logic. At the step-14191 checkpoint it uses bullet points for the first time to list items, advising the reader to "Develop meaningful habits ... that promote your business".

Looking at the outputs from steps 25297 and 26531, the tendency of small models to repeat themselves starts to surface: it produces phrases like "complex issue of complexity and complexity", or says "the company" twice in a row. This stuck-record behaviour was also a common flaw in the output of early chat models.

At step 27765 something technically interesting happens: after generating just a few words, the model emits the special control token <|endoftext|>. It has judged the current context to be finished and tries to start a brand-new document titled "Hip Hop: The New York Times", showing that it has learned where the boundaries between documents lie.
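
<|endoftext|> is a genuine entry in the GPT-2 vocabulary: token id 50256, the last of its 50,257 entries. Pretraining corpora are typically built by concatenating unrelated documents with this token between them, which is exactly the boundary behaviour the model is reproducing. A small sketch with tiktoken:

    import tiktoken

    enc = tiktoken.get_encoding("gpt2")

    # Special tokens must be explicitly allowed when encoding raw text.
    ids = enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})
    print(ids)                                 # [50256]

    # Documents are concatenated like this in the training stream, so the model
    # learns that 50256 means "this document has ended; a new one begins".
    doc_a, doc_b = "...moves you to the next level.", "Hip Hop: The New York Times"
    stream = doc_a + "<|endoftext|>" + doc_b
    print(enc.encode(stream, allowed_special={"<|endoftext|>"}))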

Training finally ends at step 33164. By now the sentence structures are more complex and can carry a contrast: step 28382 successfully deploys "however", and the last checkpoint writes a warning with genuine cause and effect: "you're rewarded, but not to your potential".

The remaining 2 billion tokens of training are about factual correctness

Looking back over the two-day training cycle, the most surprising thing is how quickly these simple language models reach the point of generating plausible-looking text. Having run just a third of the schedule, the model had already mastered the surface structure and basic grammar of human language.

To see the engineering significance, consider what model developers are actually after. What we need has never been just a generator that can quickly produce fluent sentences; we want that content to be both logical and correct.

Grinding through the remaining two thirds of the run at great computational cost is what binds in deeper factual knowledge. When a developer types "The capital of France is", we expect the model to answer "Paris", not to use its beautifully fluent grammar to deliver, with total confidence, a coherent but completely wrong answer like "Rouen".

Language models pick up grammatical structure far faster than you'd imagine; the huge late-stage compute investment is really there to tackle the "fluent lying" problem of hallucination and to build a correct mapping of knowledge.

Original post

I remember finding it interesting when, back in 2015, Andrej Karpathy posted about RNNs and gave an example of how their output improves over the course of a training run. What might that look like for a (relatively) modern transformers-based LLM?

I recently trained a GPT-2-small-style LLM, with 163 million parameters, on about 3.2 billion tokens (that's about 12.8 GiB of text) from the Hugging Face FineWeb dataset, and over the course of that training run, I saved the current model periodically -- 57 checkpoints over two days. Here's what it looked like -- the start, the end, and some interesting waypoints in between.

For each checkpoint, I asked it to generate a completion to the words "Every effort moves you". [1] When the model was first created, before any training had been done, it came up with this:

    Every effort moves youhhhh esoteric Suns 1896ricia enormous initially speculative arenaelse anth Zimmerman Insight Sketch demonstr despicable capitalists clamp flung condemnation

If you've read the Karpathy essay, you'll see one important difference -- it's already got words in there. His RNNs were generating complete noise at this stage. Even by the 100th iteration, he gives an example like this:

    tyntd-iafhatawiaoihrdemot lytdws e ,tfti, astai f ogoh eoase rrranbyne 'nhthnee e plia tklrgd t o idoe ns,smtt h ne etie h,hregtrs nigtike,aoaenns lng

That's an important difference between the RNNs he was talking about, which were character-based and had to learn about words and the like, and LLMs like this one, where the text is input and then output one token at a time. (More info here). Still, even though it has what looks like words, it's essentially content-free token salad with no structure or coherence. [2]

Let's see what happens if we train it more. In my training loop, it sees 96 sequences of 1,024 tokens, and then we update it based on its loss (an index of how wrong it was at predicting next tokens), so that's 98,304 tokens for each step. After 617 of these [3], it seems to have mostly learned something about which tokens are most common:

    Every effort moves you and to was, in the, a, The your of- and | to the The

By the next checkpoint at step 1234, we've got something that's starting to come together. It doesn't make sense, but there's some kind of glimmering of meaning:

    Every effort moves you’ll take the rest of the mainstay in all of his team. This year with a

And just a little while later, at the checkpoint at step 2468, we have something that actually makes some kind of sense (at least at the start)!

    Every effort moves you to a different country. For all the most part, a world map can only see the world map

Now, the training data I'm using was scraped from the Internet, and unsurprisingly there's a lot of somewhat cheesy business content there. By step 9255, we're starting to get a lot of stuff like this:

    Every effort moves you forward and it is important to make sure that your clients are satisfied. A number of people have

...or even more cheesy self-help stuff (step 10489):

    Every effort moves you to be the best that you will ever have. To be your best, you should be able to

To be fair, the starting point of "Every effort moves you" is probably biasing things a bit there. But let's be clear: by this point it's seen 1,031,110,656 tokens -- that is, it's about one third trained. And it's coming up with pretty coherent text!
The rest of the training run is more about refining things -- the loss chart for this training run looks like this:

    [Loss chart: training loss by step; the steep drop is over well before the end of the run]

Loosely speaking, the lower the loss number, the better the model is, so you can see that the bulk of the improvement had happened by this point. From here on, I'll just give a few of the more interesting samples.

By step 14191, it's started using bullet points...

    Every effort moves you towards your goals. - Develop meaningful habits or habits that promote your business - Keep personal and

Step 24680 -- more motivational stuff:

    Every effort moves you forward and keeps you motivated. You make sure you don’t leave it alone. A

Step 25297 -- small models like this do like repeating themselves. You might remember seeing ChatGPT output back in 2023 or so that had tics like this:

    Every effort moves you from a simple position to a complex issue of complexity and complexity. As soon as the book takes

And again at step 26531:

    Every effort moves you, the company, the company, the community and all those involved. I will be pleased to say

At step 27765 it decides that it has had enough after generating just a couple of words and tries to start a new document:

    Every effort moves you to the next level.<|endoftext|>Hip Hop: The New York Times, April 23, 2017

But step 28382 is actually rather good. I particularly like the "however":

    Every effort moves you, however, towards a better future, and that’s what counts as a win.

And finally, the training run finishes at step 33164 with these wise words of caution:

    Every effort moves you, and you’re rewarded, but not to your potential. You’ve got to

Well worth remembering, I'm sure we can all agree. I wonder what deep wisdom we'd have gained if I had asked it to generate more than 20 new tokens...

What I found most surprising when I first started playing with this is how fast even simple LLMs got to a stage where they could generate plausible text. Just one third of the way through the training run, this model was making some kind of sense. The problem, of course, is that we don't just want generators of plausible content -- we want that content to make sense and be correct. And that's why it's worth grinding through the other two thirds -- in the hope that when you ask it to complete "The capital of France is", it will reply with "Paris" rather than a coherent but wrong answer like "Rouen".

Footnotes:

[1] Technical details: 20 GPT-2 tokens generated on top of the initial text, with a temperature of 1. I've added line breaks to make it easier to read the samples.

[2] Well, it mentions " despicable capitalists", but I suspect that's just randomness rather than some kind of primitive political consciousness. Including the space at the start, that's tokens 47034 and 32663 in the GPT-2 tokeniser.

[3] So, 60,653,568 tokens seen.