Talk:Byte pair encoding
This article is rated Start-class on Wikipedia's content assessment scale. It is of interest to the following WikiProjects: | ||||||||||||||||||||||||
|
1994?
[edit]I find it a little hard to believe that this algorithm wasn't publicly discussed until 1994. For instance, the basic algorithm was used in the US version of Final Fantasy, which was released in 1990, which can be confirmed by analyzing the game's ROM data. I cannot be certain that they used exactly the same algorithm as Gage, but it seems likely that it was close. This compression method is also found in a number of games from the 8-bit and 16-bit eras (we ROM hackers call it "DTE", for "dual-tile encoding"), though I don't know when the earliest example is from. - furrykef (Talk at me) 15:57, 31 August 2012 (UTC)
- The basic idea is simple enough to be re-invented multiple times by asm and video game developers. You can add mentions and references to similar techniques I suppose. Wqwt (talk) 03:13, 25 January 2022 (UTC)
Poor example
[edit]The example, although it technically shows the method, is unfortunate in that it does not actually result in less bytes. Jtgd (talk) 07:17, 21 April 2014 (UTC)
- Byte pair encoding doesn't work well on very short strings. You need a few byte pairs that appear several times to get a noticeable improvement, and this might need a kilobyte of text. What's a suitably repetitive string I could use? --Damian Yerrick (talk) 23:02, 11 April 2015 (UTC)
- True. The original article uses a hash table and pair table. Actually, BPE does work for short strings (with repeated pairs) because it's so simple.
- The example should be modified to state it is simplified for educational purposes. Wqwt (talk) 03:08, 25 January 2022 (UTC)
Relationship to Huffman coding
[edit]There are strong similarities between this algorithm and Huffman coding, which needs to be discussed in this article. — Preceding unsigned comment added by 66.219.222.118 (talk) 20:33, 7 October 2024 (UTC)
Use in LLMs
[edit]The use in LLMs seems important. The Large language model article refers to this one. Does this need higher priority as a support to what is said in the LLM article?
Reading the two explanations side by side, the LLM article seems to include most of what it needs. This article adds references and an example.
Assuming that this article ought to be updated to support the LLM article: (1) If I remember the culture of the time, the original algorithm was concerned with compression but not with scanning. (2) Perhaps the concept of a token could be pulled in here from the LLM article, to denote the current set of units recognized by the scanner. (3) The use of a nearly-identical algorithm in LLMs should be separated from the discussion of "modification".
I'll try to make updates based on these concepts. Bruce Esrig (talk) 12:07, 23 November 2024 (UTC)