CS336 Assignment - 1
1. Assignment Overview
In this assignment, you will build all the components needed to train a standard Transformer language model (LM) from scratch and train some models.
What you will implement
- Byte-pair encoding (BPE) tokenizer
- Transformer language model (LM)
- The cross-entropy loss function and the AdamW optimizer
- The training loop, with support for serializing and loading model and optimizer state
What you will run
- Train a BPE tokenizer on the TinyStories dataset.
- Run your trained tokenizer on the dataset to convert it into a sequence of integer IDs.
- Train a Transformer LM on the TinyStories dataset.
- Generate samples and evaluate perplexity using the trained Transformer LM.
- Train models on OpenWebText and submit your attained perplexities to a leaderboard.
What you can use
We expect you to build these components from scratch. In particular, you may not use any definitions from torch.nn, torch.nn.functional, or torch.optim except for the following:
- torch.nn.Parameter (for defining trainable parameters)
- Container classes in torch.nn (e.g., Module, ModuleList, Sequential, etc.)
- The torch.optim.Optimizer base class (for building custom optimizers)
You may use any other PyTorch definitions. If you would like to use a function or class and are not sure whether it is permitted, feel free to ask on Slack. When in doubt, consider if using it compromises the “from-scratch” ethos of the assignment.
Statement on AI tools
Prompting LLMs such as ChatGPT is permitted for low-level programming questions or high-level conceptual questions about language models, but using it directly to solve the problem is prohibited.
We strongly encourage you to disable AI autocomplete (e.g., Cursor Tab, GitHub Copilot) in your IDE when completing assignments (though non-AI autocomplete, e.g., autocompleting function names, is totally fine). We have found that AI autocomplete makes it much harder to engage deeply with the content.
What the code looks like
All the assignment code as well as this writeup are available on GitHub at:
https://github.com/stanford-cs336/assignment1-basics
Please git clone the repository. If there are any updates, we will notify you so you can git pull to get the latest.
cs336_basics/*: This is where you write your code. Note that there's no code in here—you can do whatever you want from scratch!
adapters.py: There is a set of functionality that your code must have. For each piece of functionality (e.g., scaled dot product attention), fill out its implementation (e.g., run_scaled_dot_product_attention) by simply invoking your code. Note: your changes to adapters.py should not contain any substantive logic; this is glue code.
test_*.py: This contains all the tests that you must pass (e.g., test_scaled_dot_product_attention), which will invoke the hooks defined in adapters.py. Don't edit the test files.
How to submit
You will submit the following files to Gradescope:
writeup.pdf: Answer all the written questions. Please typeset your responses.
code.zip: Contains all the code you've written.
To submit to the leaderboard, submit a PR to:
github.com/stanford-cs336/assignment1-basics-leaderboard
See the README.md in the leaderboard repository for detailed submission instructions.
Where to get datasets
This assignment will use two pre-processed datasets: TinyStories [Eldan and Li, 2023] and OpenWebText [Gokaslan et al., 2019]. Both datasets are single, large plaintext files. If you are doing the assignment with the class, you can find these files at /data of any non-head node machine.
If you are following along at home, you can download these files with the commands inside the README.md.
Low-Resource/Downscaling Tip: Intro
Throughout the course's assignment handouts, we will give advice for working through parts of the assignment with fewer or no GPU resources. For example, we will sometimes suggest downscaling your dataset or model size, or explain how to run training code on a macOS integrated GPU or CPU. You'll find these "low-resource tips" in a blue box (like this one). Even if you are an enrolled Stanford student with access to the course machines, these tips may help you iterate faster and save time, so we recommend you read them!
Low-Resource/Downscaling Tip: Assignment 1 on Apple Silicon or CPU
With the staff solution code, we can train an LM to generate reasonably fluent text on an Apple M3 Max chip with 36 GB RAM, in under 5 minutes on Metal GPU (MPS) and about 30 minutes using the CPU. If these words don’t mean much to you, don’t worry! Just know that if you have a reasonably up-to-date laptop and your implementation is correct and efficient, you will be able to train a small LM that generates simple children’s stories with decent fluency.
Later in the assignment, we will explain what changes to make if you are on CPU or MPS.
2. Byte-Pair Encoding (BPE) Tokenizer
In the first part of the assignment, we will train and implement a byte-level byte-pair encoding (BPE) tokenizer [Sennrich et al., 2016, Wang et al., 2019]. In particular, we will represent arbitrary (Unicode) strings as a sequence of bytes and train our BPE tokenizer on this byte sequence. Later, we will use this tokenizer to encode text (a string) into tokens (a sequence of integers) for language modeling.
2.1 The Unicode Standard
Unicode is a text encoding standard that maps characters to integer code points. As of Unicode 16.0 (released in September 2024), the standard defines 154,998 characters across 168 scripts. For example, the character “s” has the code point 115 (typically notated as U+0073, where U+ is a conventional prefix and 0073 is 115 in hexadecimal), and the character “牛” has the code point 29275. In Python, you can use the ord() function to convert a single Unicode character into its integer representation. The chr() function converts an integer Unicode code point into a string with the corresponding character.
>>> ord('牛')
29275
>>> chr(29275)
'牛'
Problem (unicode1): Understanding Unicode (1 point)
What Unicode character does chr(0) return?
'\x00'
How does this character's string representation (__repr__()) differ from its printed representation?
__repr__ produces an unambiguous, developer-facing string in which non-printable characters appear as escape sequences, whereas the printed representation renders the characters themselves:
>>> s="牛\na"
>>> s
'牛\na'
>>> print(s)
牛
a
What happens when this character occurs in text? It may be helpful to play around with the following in your Python interpreter and see if it matches your expectations:
>>> chr(0)
>>> print(chr(0))
>>> "this is a test" + chr(0) + "string"
>>> print("this is a test" + chr(0) + "string")
>>> chr(0)
'\x00'
>>> print(chr(0))
>>> "this is a test" + chr(0) + "string"
'this is a test\x00string'
>>> print("this is a test" + chr(0) + "string")
this is a teststring
2.2 Unicode Encodings
While the Unicode standard defines a mapping from characters to code points (integers), it’s impractical to train tokenizers directly on Unicode codepoints, since the vocabulary would be prohibitively large (around 150K items) and sparse (since many characters are quite rare). Instead, we’ll use a Unicode encoding, which converts a Unicode character into a sequence of bytes. The Unicode standard itself defines three encodings: UTF-8, UTF-16, and UTF-32, with UTF-8 being the dominant encoding for the Internet (more than 98% of all webpages).
To encode a Unicode string into UTF-8, we can use the encode() function in Python. To access the underlying byte values for a Python bytes object, we can iterate over it (e.g., call list()). Finally, we can use the decode() function to decode a UTF-8 byte string into a Unicode string.
>>> test_string = "hello! こんにちは!"
>>> utf8_encoded = test_string.encode("utf-8")
>>> print(utf8_encoded)
b'hello! \xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf!'
>>> list(utf8_encoded)
[104, 101, 108, 108, 111, 33, 32, 227, 129, 147, 227, 130, 147, 227, 129, 171, 227, 129, 161, 227, 129, 175, 33]
>>> print(len(test_string))
13
>>> print(len(utf8_encoded))
23
>>> print(utf8_encoded.decode("utf-8"))
hello! こんにちは!
By converting our Unicode codepoints into a sequence of bytes (e.g., via the UTF-8 encoding), we are essentially taking a sequence of codepoints (integers in the range 0 to 154,997) and transforming it into a sequence of byte values (integers in the range 0 to 255). The 256-length byte vocabulary is much more manageable to deal with. When using byte-level tokenization, we do not need to worry about out-of-vocabulary tokens, since we know that any input text can be expressed as a sequence of integers from 0 to 255.
Problem (unicode2): Unicode Encodings
What are some reasons to prefer training our tokenizer on UTF-8 encoded bytes, rather than UTF-16 or UTF-32? It may be helpful to compare the output of these encodings for various input strings.
UTF-8 has a much smaller footprint: for typical (ASCII-heavy) text it produces far shorter byte sequences than UTF-16 or UTF-32, so tokenizer training and encoding are more efficient.
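To make this concrete, here is a small snippet of our own (not from the handout) that compares the number of bytes each encoding produces for a few sample strings. UTF-8 is by far the most compact for ASCII-heavy text; note that Python's "utf-16"/"utf-32" codecs also prepend a byte-order mark (BOM), and their output for ASCII is dominated by zero bytes.

for s in ["hello", "牛", "hello! こんにちは!"]:
    # Print the length of each string's encoding under the three Unicode encodings.
    print(s, len(s.encode("utf-8")), len(s.encode("utf-16")), len(s.encode("utf-32")))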
Consider the following (incorrect) function, which is intended to decode a UTF-8 byte string into a Unicode string. Why is this function incorrect? Provide an example of an input byte string that yields incorrect results.
def decode_utf8_bytes_to_str_wrong(bytestring: bytes):
    return "".join([bytes([b]).decode("utf-8") for b in bytestring])
>>> decode_utf8_bytes_to_str_wrong("hello".encode("utf-8"))
'hello'
This code does not correctly handle multi-byte UTF-8 characters.
In UTF-8, a character is encoded as either a single byte or a multi-byte sequence:
- Single-byte characters (ASCII): 0xxxxxxx (the high bit is 0)
- Multi-byte characters: 2-4 bytes, where the number of leading 1 bits in the first byte indicates the total number of bytes
Because the function decodes each byte in isolation, it fails (raises a UnicodeDecodeError) as soon as the input contains a multi-byte character, e.g. "牛".encode("utf-8"), as shown below.
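For instance (our own interpreter check, reusing the function defined above): a single byte of a multi-byte sequence is not valid UTF-8 on its own, so decoding byte-by-byte breaks.

>>> "牛".encode("utf-8")
b'\xe7\x89\x9b'
>>> decode_utf8_bytes_to_str_wrong("牛".encode("utf-8"))
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 0: unexpected end of data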
Give a two byte sequence that does not decode to any Unicode character(s).
invalid_two_byte = b'\xC0\x80'
- First byte: 0xC0 = 11000000
- Second byte: 0x80 = 10000000
- Decoded code point: 000 00000000 = U+0000 (the NULL character)
UTF-8 requires that the code point U+0000 be encoded as the single byte 0x00; the two-byte encoding C0 80 is an overlong encoding, which the standard explicitly forbids, so this sequence does not decode to any Unicode character.
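You can confirm in the interpreter that Python's UTF-8 decoder rejects this overlong sequence:

>>> b'\xc0\x80'.decode("utf-8")
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 0: invalid start byte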
2.3 Subword Tokenization
While byte-level tokenization can alleviate the out-of-vocabulary issues faced by word-level tokenizers, tokenizing text into bytes results in extremely long input sequences. This slows down model training, since a sentence with 10 words might only be 10 tokens long in a word-level language model, but could be 50 or more tokens long in a character-level model (depending on the length of the words). Processing these longer sequences requires more computation at each step of the model. Furthermore, language modeling on byte sequences is difficult because the longer input sequences create long-term dependencies in the data.
Subword tokenization is a midpoint between word-level tokenizers and byte-level tokenizers. Note that a byte-level tokenizer's vocabulary has 256 entries (byte values are 0 to 255). A subword tokenizer trades off a larger vocabulary size for better compression of the input byte sequence. For example, if the byte sequence b'the' often occurs in our raw text training data, assigning it an entry in the vocabulary would reduce this 3-token sequence to a single token.
How do we select these subword units to add to our vocabulary? Sennrich et al. [2016] propose to use byte-pair encoding (BPE; Gage, 1994), a compression algorithm that iteratively replaces (“merges”) the most frequent pair of bytes with a single, new unused index. Note that this algorithm adds subword tokens to our vocabulary to maximize the compression of our input sequences—if a word occurs in our input text enough times, it’ll be represented as a single subword unit.
Subword tokenizers with vocabularies constructed via BPE are often called BPE tokenizers. In this assignment, we'll implement a byte-level BPE tokenizer, where the vocabulary items are bytes or merged sequences of bytes, which gives us the best of both worlds in terms of out-of-vocabulary handling and manageable input sequence lengths. The process of constructing the BPE tokenizer vocabulary is known as "training" the BPE tokenizer.
2.4 BPE Tokenizer Training
The BPE tokenizer training procedure consists of three main steps.
Vocabulary initialization The tokenizer vocabulary is a one-to-one mapping from bytestring token to integer ID. Since we’re training a byte-level BPE tokenizer, our initial vocabulary is simply the set of all bytes. Since there are 256 possible byte values, our initial vocabulary is of size 256.
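As a minimal sketch (the function and variable names here are our own, not required by the assignment), the initial vocabulary can be represented as a dict from integer ID to bytes, with special tokens (discussed below) appended after the 256 single-byte entries:

def init_vocab(special_tokens: list[str]) -> dict[int, bytes]:
    # IDs 0-255 map to the corresponding single byte.
    vocab = {i: bytes([i]) for i in range(256)}
    # Special tokens (e.g., "<|endoftext|>") get the next available IDs.
    for tok in special_tokens:
        vocab[len(vocab)] = tok.encode("utf-8")
    return vocab

>>> init_vocab(["<|endoftext|>"])[256]
b'<|endoftext|>'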
Pre-tokenization Once you have a vocabulary, you could, in principle, count how often bytes occur next to each other in your text and begin merging them starting with the most frequent pair of bytes. However, this is quite computationally expensive, since we'd have to take a full pass over the corpus each time we merge. In addition, directly merging bytes across the corpus may result in tokens that differ only in punctuation (e.g., dog! vs. dog.). These tokens would get completely different token IDs, even though they are likely to have high semantic similarity (since they differ only in punctuation).
To avoid this, we pre-tokenize the corpus. You can think of this as a coarse-grained tokenization over the corpus that helps us count how often pairs of characters appear. For example, the word 'text' might be a pre-token that appears 10 times. In this case, when we count how often the characters ‘t’ and ‘e’ appear next to each other, we will see that the word ‘text’ has ‘t’ and ‘e’ adjacent and we can increment their count by 10 instead of looking through the corpus. Since we’re training a byte-level BPE model, each pre-token is represented as a sequence of UTF-8 bytes.
The original BPE implementation of Sennrich et al. [2016] pre-tokenizes by simply splitting on whitespace (i.e., s.split(" ")). In contrast, we’ll use a regex-based pre-tokenizer (used by GPT-2; Radford et al., 2019) from github.com/openai/tiktoken/pull/234/files:
>>> PAT = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
It may be useful to interactively split some text with this pre-tokenizer to get a better sense of its behavior:
>>> # requires the `regex` package
>>> import regex as re
>>> re.findall(PAT, "some text that i'll pre-tokenize")
['some', ' text', ' that', ' i', "'ll", ' pre', '-', 'tokenize']
When using it in your code, however, you should use re.finditer to avoid storing the pre-tokenized words as you construct your mapping from pre-tokens to their counts.
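A sketch of what this might look like (the function name is ours; PAT is the GPT-2 pattern above): re.finditer lets us stream over matches and accumulate pre-token counts without ever materializing the full list of pre-tokens. Each pre-token is stored as a tuple of single-byte bytes objects, ready for the merge step.

import regex as re
from collections import Counter

PAT = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

def count_pretokens(text: str) -> Counter:
    # Maps each pre-token, represented as a tuple of single-byte bytes objects,
    # to the number of times it occurs in `text`.
    counts = Counter()
    for match in re.finditer(PAT, text):
        pretoken = match.group().encode("utf-8")
        counts[tuple(bytes([b]) for b in pretoken)] += 1
    return counts

assert count_pretokens("low low low low low")[(b" ", b"l", b"o", b"w")] == 4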
Compute BPE merges Now that we've converted our input text into pre-tokens and represented each pre-token as a sequence of UTF-8 bytes, we can compute the BPE merges (i.e., train the BPE tokenizer). At a high level, the BPE algorithm iteratively counts every pair of bytes and identifies the pair with the highest frequency ("A", "B"). Every occurrence of this most frequent pair ("A", "B") is then merged, i.e., replaced with a new token "AB". This new merged token is added to our vocabulary; as a result, the final vocabulary after BPE training is the size of the initial vocabulary (256 in our case), plus the number of BPE merge operations performed during training. For efficiency during BPE training, we do not consider pairs that cross pre-token boundaries. When computing merges, deterministically break ties in pair frequency by preferring the lexicographically greater pair. For example, if the pairs ("A", "B"), ("A", "C"), ("B", "ZZ"), and ("BA", "A") all have the highest frequency, we'd merge ("BA", "A"):
>>> max([("A","B"),("A","C"),("B","ZZ"),("BA","A")])
('BA','A')
Special tokens Often, some strings (e.g., <|endoftext|>) are used to encode metadata (e.g., boundaries between documents). When encoding text, it’s often desirable to treat some strings as “special tokens” that should never be split into multiple tokens (i.e., will always be preserved as a single token). For example, the end-of-sequence string <|endoftext|> should always be preserved as a single token (i.e., a single integer ID), so we know when to stop generating from the language model. These special tokens must be added to the vocabulary, so they have a corresponding fixed token ID.
Algorithm 1 of Sennrich et al. [2016] contains an inefficient implementation of BPE tokenizer training (essentially following the steps that we outlined above). As a first exercise, it may be useful to implement and test this function to check your understanding.
Example (bpe_example): BPE training example
Here is a stylized example from Sennrich et al. [2016]. Consider a corpus consisting of the following text
low low low low low
lower lower widest widest widest
newest newest newest newest newest newest
and the vocabulary has a special token <|endoftext|>.
Vocabulary We initialize our vocabulary with our special token <|endoftext|> and the 256 byte values.
Pre-tokenization For simplicity and to focus on the merge procedure, we assume in this example that pretokenization simply splits on whitespace. When we pretokenize and count, we end up with the frequency table.
{low: 5, lower: 2, widest: 3, newest: 6}
It is convenient to represent this as a dict[tuple[bytes], int], e.g. {(l,o,w): 5 …}. Note that even a single byte is a bytes object in Python. There is no byte type in Python to represent a single byte, just as there is no char type in Python to represent a single character.
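Written out concretely, this is just the frequency table above in that representation (note that each element such as b"l" is a length-1 bytes object):

freqs: dict[tuple[bytes, ...], int] = {
    (b"l", b"o", b"w"): 5,
    (b"l", b"o", b"w", b"e", b"r"): 2,
    (b"w", b"i", b"d", b"e", b"s", b"t"): 3,
    (b"n", b"e", b"w", b"e", b"s", b"t"): 6,
}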
Merges We first look at every successive pair of bytes and sum the frequency of the words where they appear: {lo: 7, ow: 7, we: 8, er: 2, wi: 3, id: 3, de: 3, es: 9, st: 9, ne: 6, ew: 6}. The pairs es and st are tied at 9, so we take the lexicographically greater pair, st. We would then merge the pre-tokens so that we end up with {(l,o,w): 5, (l,o,w,e,r): 2, (w,i,d,e,st): 3, (n,e,w,e,st): 6}.
In the second round, we see that (e, st) is the most common pair (with a count of 9) and we would merge into {(l,o,w): 5, (l,o,w,e,r): 2, (w,i,d,est): 3, (n,e,w,est): 6}. Continuing this, the sequence of merges we get in the end will be ['s t', 'e st', 'o w', 'l ow', 'w est', 'n e', 'ne west', 'w i', 'wi d', 'wid est', 'low e', 'lowe r'].
If we take 6 merges, we have ['s t', 'e st', 'o w', 'l ow', 'w est', 'n e'] and our vocabulary elements would be [<|endoftext|>, [...256 BYTE CHARS], st, est, ow, low, west, ne].
With this vocabulary and set of merges, the word newest would tokenize as [ne, west].
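Below is a deliberately naive sketch of the merge loop (our own code, following the steps above rather than the efficient version discussed in Section 2.5). Run on the frequency table from this example, it reproduces the first six merges listed above.

from collections import Counter

def naive_bpe_merges(freqs: dict[tuple[bytes, ...], int], num_merges: int) -> list[tuple[bytes, bytes]]:
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair inside each pre-token, weighted by the pre-token's count.
        pair_counts = Counter()
        for token, count in freqs.items():
            for pair in zip(token, token[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Highest count wins; ties are broken by taking the lexicographically greater pair.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Apply the merge inside every pre-token.
        new_freqs: dict[tuple[bytes, ...], int] = {}
        for token, count in freqs.items():
            merged, i = [], 0
            while i < len(token):
                if i + 1 < len(token) and (token[i], token[i + 1]) == best:
                    merged.append(token[i] + token[i + 1])
                    i += 2
                else:
                    merged.append(token[i])
                    i += 1
            new_freqs[tuple(merged)] = new_freqs.get(tuple(merged), 0) + count
        freqs = new_freqs
    return merges

# The pre-token frequency table from the example above.
freqs = {tuple(bytes([b]) for b in w.encode("utf-8")): c
         for w, c in {"low": 5, "lower": 2, "widest": 3, "newest": 6}.items()}
assert naive_bpe_merges(freqs, 6) == [
    (b"s", b"t"), (b"e", b"st"), (b"o", b"w"), (b"l", b"ow"), (b"w", b"est"), (b"n", b"e"),
]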
2.5 Experimenting with BPE Tokenizer Training
Let’s train a byte-level BPE tokenizer on the TinyStories dataset. Instructions to find / download the dataset can be found in Section 1. Before you start, we recommend taking a look at the TinyStories dataset to get a sense of what’s in the data.
Parallelizing pre-tokenization You will find that a major bottleneck is the pre-tokenization step. You can speed up pre-tokenization by parallelizing your code with the built-in library multiprocessing. Concretely, we recommend that in parallel implementations of pre-tokenization, you chunk the corpus while ensuring your chunk boundaries occur at the beginning of a special token. You are free to use the starter code at the following link verbatim to obtain chunk boundaries, which you can then use to distribute work across your processes:
This chunking will always be valid, since we never want to merge across document boundaries. For the purposes of the assignment, you can always split in this way. Don't worry about the edge case of receiving a very large corpus that does not contain <|endoftext|>.
Removing special tokens before pre-tokenization Before running pre-tokenization with the regex pattern (using re.finditer), you should strip out all special tokens from your corpus (or your chunk, if using a parallel implementation). Make sure that you split on your special tokens, so that no merging can occur across the text they delimit. For example, if you have a corpus (or chunk) like [Doc 1]<|endoftext|>[Doc 2], you should split on the special token <|endoftext|>, and pre-tokenize [Doc 1] and [Doc 2] separately, so that no merging can occur across the document boundary. This can be done using re.split with "|".join(special_tokens) as the delimiter (with careful use of re.escape since | may occur in the special tokens). The test test_train_bpe_special_tokens will test for this.
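For example (a small sketch of the splitting step only; variable names are ours):

import regex as re

special_tokens = ["<|endoftext|>"]
# Escape each special token so regex metacharacters such as `|` are matched literally,
# then split the corpus on any of them before pre-tokenizing each piece separately.
split_pattern = "|".join(re.escape(tok) for tok in special_tokens)
pieces = re.split(split_pattern, "the cat sat<|endoftext|>on the mat")
assert pieces == ["the cat sat", "on the mat"]
# Each piece is then pre-tokenized on its own, so no merge can cross the boundary.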
Optimizing the merging step The naïve implementation of BPE training in the stylized example above is slow because for every merge, it iterates over all byte pairs to identify the most frequent pair. However, the only pair counts that change after each merge are those that overlap with the merged pair. Thus, BPE training speed can be improved by indexing the counts of all pairs and incrementally updating these counts, rather than explicitly iterating over each pair of bytes to count pair frequencies. You can get significant speedups with this caching procedure, though we note that the merging part of BPE training is not parallelizable in Python.
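One way this incremental update can look for a single pre-token (a sketch under our own naming; a full implementation also needs an index from pairs to the pre-tokens that contain them): when the pair (A, B) is merged at some position, only the merged pair itself and its immediate left and right neighbor pairs change, so we adjust exactly those entries.

from collections import Counter

def apply_merge(token: tuple[bytes, ...], pair: tuple[bytes, bytes],
                count: int, pair_counts: Counter) -> tuple[bytes, ...]:
    # Merge every occurrence of `pair` inside one pre-token (which occurs `count`
    # times in the corpus) and incrementally update the global `pair_counts`:
    # only pairs overlapping a merge site change, so nothing else is recounted.
    merged, i = [], 0
    while i < len(token):
        if i + 1 < len(token) and (token[i], token[i + 1]) == pair:
            new_tok = token[i] + token[i + 1]
            pair_counts[pair] -= count                        # (A, B) disappears
            if merged:                                        # (prev, A) -> (prev, AB)
                pair_counts[(merged[-1], token[i])] -= count
                pair_counts[(merged[-1], new_tok)] += count
            if i + 2 < len(token):                            # (B, next) -> (AB, next)
                pair_counts[(token[i + 1], token[i + 2])] -= count
                pair_counts[(new_tok, token[i + 2])] += count
            merged.append(new_tok)
            i += 2
        else:
            merged.append(token[i])
            i += 1
    return tuple(merged)

# E.g., merging (s, t) inside "widest" (count 3) only touches (e, s), (s, t), and (e, st).
pair_counts = Counter({(b"e", b"s"): 9, (b"s", b"t"): 9, (b"d", b"e"): 3})
new_token = apply_merge((b"w", b"i", b"d", b"e", b"s", b"t"), (b"s", b"t"), 3, pair_counts)
assert new_token == (b"w", b"i", b"d", b"e", b"st")
assert pair_counts[(b"s", b"t")] == 6 and pair_counts[(b"e", b"st")] == 3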
Low-Resource/Downscaling Tip: Profiling
You should use profiling tools like cProfile or scalene to identify the bottlenecks in your implementation, and focus on optimizing those.
Low-Resource/Downscaling Tip: “Downscaling”
Instead of training your tokenizer directly on the full TinyStories dataset, we recommend first training on a small subset of the data (a "debug dataset"). For example, you can train the tokenizer on the TinyStories validation set, which contains 22K documents rather than the full 2.12M. This reflects a general strategy: downscale whenever possible to speed up development, e.g., use a smaller dataset or a smaller model size. Choosing the size of the debug dataset or the hyperparameter configuration requires care: you want the debug set to be large enough to reproduce the bottlenecks of the full configuration (so that the optimizations you make will generalize), but not so large that it takes forever to run.