Talk:Consistent Overhead Byte Stuffing
This article is rated C-class on Wikipedia's content assessment scale.
Consistent Overhead Byte Stuffing—Reduced (COBS/R)
[edit]I've made a small improvement to the COBS algorithm, and called it Consistent Overhead Byte Stuffing—Reduced (COBS/R). I'm not sure if it's "notable" enough for Wikipedia, though. Would it be worthwhile to publish it, e.g. to arXiv? Cmcqueen1975 (talk) 21:15, 23 November 2010 (UTC)
Python and C implementations
[edit]I've made a Python implementation of COBS and COBS/R. Also a C implementation, which is more robust against buffer overflows etc. I have no idea about implementations in other languages. Could a list of implementations in various languages be added to the article? Cmcqueen1975 (talk) 21:19, 23 November 2010 (UTC)
- The change to prevent stray trailing NUL bytes and buffer overflows is fairly straightforward. Before assigning the NUL, compare the destination of the assignment to the end pointer. The quoted algorithm for encoding has issues as well. I'd re-write all the code, but don't want to plotz a glob of C into Wikipedia ... Tall Girl (talk) 01:58, 3 September 2014 (UTC)
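To make the fix concrete, here is a minimal decoder sketch (hypothetical names and signature, not the article's code) with exactly that kind of guard: every write, including the implied zero between groups, is compared against an end pointer first.

    #include <stddef.h>
    #include <stdint.h>

    /* Decode COBS-encoded src into dst; returns bytes written, 0 on error. */
    size_t cobs_decode(const uint8_t *src, size_t len,
                       uint8_t *dst, size_t dst_len)
    {
        const uint8_t *end = src + len;
        uint8_t *out = dst, *out_end = dst + dst_len;

        while (src < end) {
            uint8_t code = *src++;
            if (code == 0 || (size_t)(end - src) < (size_t)(code - 1))
                return 0;                /* malformed input */
            for (uint8_t i = 1; i < code; i++) {
                if (out >= out_end)
                    return 0;            /* would overflow dst */
                *out++ = *src++;
            }
            if (code != 0xFF && src < end) {
                if (out >= out_end)
                    return 0;            /* check before assigning the NUL */
                *out++ = 0;              /* implied zero between groups */
            }
        }
        return (size_t)(out - dst);
    }

The src < end test on the implied zero is also what suppresses the stray trailing NUL after the final group.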
Other References
[edit]I found this to be a more readable reference for COBS:
Consistent Overhead Byte Stuffing, Stuart Cheshire and Mary Baker, IEEE/ACM Transactions on Networking, Vol. 7, No. 2, April 1999. Cmcqueen1975 (talk) 22:40, 23 November 2010 (UTC)
C algorithm issue?
[edit]I'm not a C dev, but I'm pretty sure the last conversion example will be encoded incorrectly by the presented C algorithm. In the case of an end block with 0xFE non-null bytes, FinishBlock will be executed twice, resulting in an extra 0x01 byte.
- The example C code also has a sequence point issue in its FinishBlock macro definition as code_ptr is both written and read. — Preceding unsigned comment added by 88.130.49.159 (talk) 14:17, 10 December 2017 (UTC)
- This will translate into an extra 0 byte in the result. I have covered this issue in detail below, but the root cause of the error is that group header bytes should only be added if there is remaining data to be encoded. Adding a group header byte because you break the 254-byte boundary, with no data left, will exhibit this behavior. This only happens, though, if the last byte in the group is 0 and the group is exactly 254 bytes. Arhax (talk) 21:00, 27 April 2018 (UTC)
The other linked C algorithm does an actual check on this, by breaking out of the while loop before the code == 0xFF check once the complete source length has been processed. — Preceding unsigned comment added by BasiKobrA (talk • contribs) 14:39, 14 July 2015 (UTC)
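For illustration, a sketch of an encoder structured that way (hypothetical names, not the article's code); a new group is opened after a full 254-byte run only if input remains, so no empty trailing group (the extra 0x01) is ever emitted:

    #include <stddef.h>
    #include <stdint.h>

    /* Encode src into dst (dst needs room for len + len/254 + 1 bytes).
       Returns the encoded length, excluding the final 00 delimiter. */
    size_t cobs_encode(const uint8_t *src, size_t len, uint8_t *dst)
    {
        const uint8_t *end = src + len;
        uint8_t *out = dst;
        uint8_t *code_ptr = out++;   /* where the group length will go */
        uint8_t code = 1;

        while (src < end) {
            if (*src == 0) {
                *code_ptr = code;    /* a zero byte closes the group */
                code_ptr = out++;
                code = 1;
            } else {
                *out++ = *src;
                code++;
                /* the guard: open a new group for a full 254-byte run
                   only when more input remains to be encoded */
                if (code == 0xFF && src + 1 < end) {
                    *code_ptr = code;
                    code_ptr = out++;
                    code = 1;
                }
            }
            src++;
        }
        *code_ptr = code;
        return (size_t)(out - dst);
    }

Writing the group-closing logic inline rather than as a FinishBlock macro also sidesteps the sequence point problem mentioned above.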
The example code generates different results for test case 6 and 7. — Preceding unsigned comment added by 190.5.48.10 (talk) 01:35, 9 June 2020 (UTC)
Encoding Description Issues
[edit]The encoding example's "red" and "green" descriptions seem inconsistent with the algorithm described in the (second) referenced document. Both are examples of "length" codes, as per the document's Table 1.
The "red" description seems like it could not be used to describe a packet that starts with a 0-byte.
Similarly, the "green" seems like it could only be use to describe a sequence that begins with 0-byte.
The original description in the document, the "length" codes describe a run of non-zero data followed by a single zero.
The description in the article seems like an attempt to avoid introducing the concept of the trailing 0-byte, but it introduces other problems. — Preceding unsigned comment added by 24.19.250.237 (talk) 09:32, 13 December 2015 (UTC)
- Regarding packets that start with a 0-byte: see examples 1 and 2, which specifically address this situation. As for the "trailing 0-byte" on every packet, I assume you mean the "packet delimiter", which is introduced and explained in the lede and referenced throughout the article. BTW, please note that the delimiter is not required to be zero (it may be any particular byte value), but is usually zero to improve efficiency. Your observation that both "red" and "green" values indicate length is correct. However, a "red" byte differs from "green" bytes because it is required at the beginning of every packet (regardless of whether the first data byte is zero or not) and, unlike "green" bytes, it does not replace a data byte; it is an additional byte that must always be prepended to the packet -- this is why it's called "overhead" (and why "green" bytes are not overhead). Please explain your concerns more clearly. In particular, why do you think the "red" and "green" descriptions are inconsistent with earlier sections? And what more could be said in the lede to "introduce the concept" of the packet delimiter? Lambtron (talk) 17:20, 13 December 2015 (UTC)
- Examples 1 and 2 seem incorrect to me. "00" would be encoded simply as "01 <00>", where "<00>" is the packet delimiter. "00 00" would be encoded as "01 01 <00>". EnterYourUsernameWasTaken (talk) 06:38, 23 February 2016 (UTC)
- To see why examples 1 and 2 are correct, simply apply the algorithm. Lambtron (talk) 16:00, 23 February 2016 (UTC)
- I think you are correct. I missed the part in the original paper about the logically appended zero also being encoded. EnterYourUsernameWasTaken (talk) 18:49, 23 February 2016 (UTC)
- You actually were correct. Examples 1 and 2 are wrong. If you read the paper, they explicitly state that after decoding a trailing zero is to be removed. The example C program will encode "00" to "01 01 00" as it should. Blackclaws (talk) 14:45, 8 April 2017 (UTC)
Poor description
[edit]I think the overall description of this algorithm is actually pretty poor. The issue shows up when you examine examples 6 and 1 together. Based on the version of the paper on Stuart Cheshire's own website, Blackclaws turns out to be wrong in the preceding section; there is no trailing 01 in the COBS-encoded packet because this version of the paper explicitly mentions a "small optimization" where a 0xFF data block may omit the encoding of its implicit trailing zero because the packet delimiter allows the decoder to determine that it can skip that implicit trailing zero (search for "small optimization" in the paper). This "small optimization" section is not present in the sigcomm.org version. But knowing that we can use the packet delimiter to let us omit some of the COBS encoding means we can simplify the COBS packet of example 1 to just {01 00} rather than {01 01 00} as shown on the page now. The only purpose that second 01 serves is to encode the implicit trailing zero. And if we're allowed to infer the original content from the packet delimiter, then there is no need for that second 01, just like there is no need for the trailing 01 in example 6 either.
This brings up the fact that the C code (which is a near-verbatim reproduction of the code in the paper) for StuffData does not actually implement this "small optimization"; it will write a trailing 01 to example 6 as noted in a previous section on this Talk page.
I think the fundamental issue is that there are two possible implementations: one relies on an optimization made possible by the packet delimiters and one doesn't. Unfortunately, this page is not consistent about which implementation is being described, and neither is the stuartcheshire.org version of the paper itself. The C code, example 1, and example 2 all do not implement the optimization. Example 6 does implement that optimization.
This leads to confusion in the "Packet framing and stuffing" section. This section uses the term "data set", which does not appear in the paper. Instead, the paper uses "block" (something which is limited to 254 bytes) and "chunk" (something which has the zero-byte packet delimiter appended to the end of it). "data set" means neither of these things, because a "chunk" is not limited to 254 bytes and a "block" only has a zero byte appended when the chunk size is under 255 bytes.
Furthermore, it's false that "The overhead of COBS encoding is constant regardless of data content." The paper says that "All packets up to 254 bytes in length are encoded with an overhead of exactly one byte." The overhead of larger packets does vary with data content, but the maximum is 1 byte in 254.
I'm hesitant to jump in and rewrite a bunch of stuff, but that's what I'm inclined to do unless someone has any objections here. Bjp716 (talk) 23:51, 29 November 2017 (UTC)
- I'd go ahead and change it. I've looked at the exact same thing before and stumbled over the same inconsistencies. It's wrong as it stands, so it really needs fixing. If you can spare the time to do that, you'll be doing the world a service. MichiHenning (talk) 03:11, 2 December 2017 (UTC)
- Regarding this edit: https://wiki.riteme.site/w/index.php?title=Consistent_Overhead_Byte_Stuffing&curid=29719643&diff=814297276&oldid=800788037 The preprocessor hackery makes it quite difficult to actually comprehend what is going on. A good step would be to illustrate the algorithm in code that doesn't try to be clever. People who want to implement the algorithm can apply whatever optimisations they deem appropriate. MichiHenning (talk) 11:32, 9 December 2017 (UTC)
Overhead is not constant
[edit]@Lambtron: The "C" stands for "consistent", meaning not wildly varying, but it's not constant. As the source paper clearly explains, the maximum is ⌈n/254⌉ and the minimum is 1.
This is exactly one byte for 254 bytes or less, but the encoding overhead for 255 bytes can vary between 1 and ⌈255/254⌉ = 2 bytes, depending on whether at least one of the first 254 bytes is zero or not:
00 01 02 ... FC FD FE → 01 FF 01 02 ... FC FD FE 00 (255 bytes to 256, plus the trailing 00)
01 02 03 ... FD FE FF → FF 01 02 03 ... FD FE 02 FF 00 (255 bytes to 257, plus the trailing 00)
02 03 04 ... FE FF 00 → FF 02 03 04 ... FE FF 01 01 00 (255 bytes to 257, plus the trailing 00)
03 04 05 ... FF 00 01 → FE 03 04 05 ... FF 02 01 00 (255 bytes to 256, plus the trailing 00)
Any run of 254 non-zero bytes which is not at the end of the packet will force an additional overhead byte. (Equivalently, any prefix byte of FF which is not at the end of the packet adds an extra overhead byte.)
This should also be obvious on information-theoretic grounds. Given an input string of 1416 bytes, there are 256^1416 = 2^11328 ≈ 10^3410.06779 possible input strings. If encoding overhead were fixed at one byte, there would be only 255^1417 ≈ 2^11327.99882 ≈ 10^3410.06743 possible output strings. This is not enough to unambiguously encode all possible inputs. The first number is larger than the second by a factor of 1.00081822814, or a difference of 9.556746×10^3406. 23.83.37.241 (talk) 23:37, 17 March 2018 (UTC)
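The figures are easy to check; a throwaway C snippet, for anyone who wants to reproduce them:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* count of 1416-byte inputs vs. 1417-byte outputs that avoid
           the delimiter value (255 usable values per output byte) */
        double in  = 1416.0 * log10(256.0);   /* ≈ 3410.06779 */
        double out = 1417.0 * log10(255.0);   /* ≈ 3410.06743 */
        printf("inputs  = 10^%.5f\n", in);
        printf("outputs = 10^%.5f\n", out);
        printf("ratio   = %.11f\n", pow(10.0, in - out)); /* ≈ 1.0008182 */
        return 0;
    }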
@Lambtron: Since 12 days (and 31 unrelated edits) have passed with no response, I'm reinstating my edits. I'll try to straighten out the confusion about zero bytes vs. delimiters; although it didn't come out well, that was an attempt at improving the article: making clear that any byte can be reserved as a delimiter byte; zero is just an example which makes the explanation particularly simple. 23.83.37.241 (talk) 02:15, 30 March 2018 (UTC)
Example 9 is incorrect
[edit]Example 9, provided here:
02 03 04 ... FE FF 00 end → FF 02 03 04 ... FE FF 01 01 00
is incorrect. It shows the beginning and end of the input and output. I will be using end to show where the decoded data ends (i.e., the length of the resulting buffer).
The short explanation is that there is an extra group separator byte at the end of the data, that should not exist.
The last few bytes of the encoded output, FE FF 01 01 00, should decode back to FE FF 00 end. But because one 01 followed by a 00 signifies that the input packet ends with a 0 (0 end), the sequence FE FF 01 01 00 will actually translate to FE FF 00 00 end in the resulting packet. This is different from the input data. This can more easily be understood by comparing example 9 with examples 1 and 2.
The correct output is FE FF 01 00, since this is the sequence that will translate to the desired input FE FF 00 end. I have also made this edit on the page.
Since I discovered this issue while developing a COBS library for myself, and used these examples (along with the paper's examples) as my reference, and I'm likely not the only one doing it, I have to wonder, how many other libraries out there implement this badly?
I also understand why the software that generated those examples might have failed, having encountered this error in a previous example myself: The first input group is 254 bytes long, so after copying 254 input bytes it has to append a new group header byte (its position also has to be tracked, so that at the next group header byte, or at the end of the packet, it can be rewritten with the group length). But at this point, there is no data left to encode, so now we mark the packet end by appending a 00 at the end of the output data and replacing the previous group header byte with the length of the group. In this case, the remaining data is 0 bytes, so the new group header byte should become the end-of-packet marker. But the code disregards this edge case: it adds the group separator anyway, expects more data, which does not exist, keeps the group separator, and then adds the end-of-packet marker.
If you add a data byte from input to output, but you have to generate the group header byte (because you exceeded 254 bytes), you also have to check if there are more bytes left; if there aren't, delete the group header byte (or don't add it, or make it the end-of-packet marker). The easy way out, code-wise, is to add the group header byte in the code branch which checks that the input byte is not 0, before actually appending the input data. This makes sure that no group header byte is added if there is no data left. --Arhax (talk) 20:41, 27 April 2018 (UTC)
- @Arhax: H'm... let me work it through. I was trying to construct an example of where two overhead bytes are required, but I may have screwed up.
- Input data: 02 ... FE FF 00 end
- Prefixed block algorithm (sketched in C after these steps)...
- Append a single zero byte: 02 ... FE FF 00 00 end
- Break into blocks ending with a zero byte or 254 non-zero bytes: 02 ... FE FF / 00 / 00 end
- For each block, delete the trailing zero byte and add a length prefix: FF 02 ... FE FF 01 01 end
- (Finally, we can replace end with 00.)
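A sketch of that procedure in C (hypothetical names; out needs room for len + len/254 + 1 bytes, and the caller appends the 00 delimiter):

    #include <stddef.h>
    #include <stdint.h>

    /* Split the input (plus the logically appended zero) into blocks that
       end either at a zero byte or after 254 non-zero bytes, and emit each
       block as a length prefix followed by its non-zero bytes. */
    size_t cobs_encode_blocks(const uint8_t *src, size_t len, uint8_t *out)
    {
        uint8_t *start = out;
        size_t i = 0;

        for (;;) {
            uint8_t *code_ptr = out++;       /* reserve the length prefix */
            uint8_t code = 1;
            while (i < len && src[i] != 0 && code < 0xFF) {
                *out++ = src[i++];
                code++;
            }
            *code_ptr = code;
            if (code == 0xFF)   /* block ended by length; no zero consumed */
                continue;
            if (i >= len)       /* block ended by the appended zero: done */
                break;
            i++;                /* block ended by a real zero: consume it */
        }
        return (size_t)(out - start);
    }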
- Linked list algorithm (sketched in C after these steps).
- Insert a zero byte at the beginning, and after every instance of 254 non-zero bytes: 00 02 ... FE FF 00 00 end
- Replace all zero bytes with offset to the next zero byte, or the end of the packet: FF 02 ... FE FF 01 01 end
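And a sketch of this formulation (hypothetical names; buf needs room for len + len/254 + 1 bytes, and the caller appends the 00 delimiter):

    #include <stddef.h>
    #include <stdint.h>

    size_t cobs_encode_list(const uint8_t *src, size_t len, uint8_t *buf)
    {
        size_t n = 0, run = 0;

        /* step 1: a zero byte at the beginning, and another one
           after every instance of 254 non-zero bytes */
        buf[n++] = 0;
        for (size_t i = 0; i < len; i++) {
            buf[n++] = src[i];
            run = (src[i] == 0) ? 0 : run + 1;
            if (run == 254) {
                buf[n++] = 0;
                run = 0;
            }
        }

        /* step 2: replace each zero byte with the offset to the next
           zero byte, or to the end of the packet for the last one */
        size_t last = 0;
        for (size_t i = 1; i < n; i++) {
            if (buf[i] == 0) {
                buf[last] = (uint8_t)(i - last);
                last = i;
            }
        }
        buf[last] = (uint8_t)(n - last);
        return n;
    }

Both sketches produce FF 02 ... FE FF 01 01 for the example #9 input, matching the working above.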
- I think this is properly implementing the algorithm as described, which in turn corresponds to the original source paper. Now, you may have found a case where it's possible to change the algorithm to strip off an extra byte. But by the documented algorithm, FF 02 ... FE FF 01 00 is a redundant encoding for 02 ... FE FF end.
- Let's look at a few variants to make it clear. (I'm omitting the end delimiter for simplicity.)
- If we change the final 00 to 01, the encoding is FF 02 ... FE FF 02 01. Two overhead bytes (255 encoded to 257).
- If we keep the 00 and append a 01, the encoding is FF 02 ... FE FF 01 02 01. Also two overhead bytes (256 encoded to 258). The 01 is required to distinguish this from the previous case.
- If we change the final 00 to 01 00, the encoding is FF 02 ... FE FF 02 01 01. Also two overhead bytes (256 encoded to 258).
- Given that all of these variants require two overhead bytes, why is it legal to omit that byte in the example 9 case?
- I think you're right that if the packet ends with exactly 254 non-zero bytes followed by any number of zeros (but no more non-zero bytes until end of packet), it's possible to unambiguously avoid adding an extra byte, but that requires an actual change to the algorithm.
- That change would also imply that a packet consisting of only k 00 bytes would encode to k 01 bytes, and not k+1 as currently documented. So either example 9 (after your edit) is wrong, or examples 1 and 2 (which you have not edited) are wrong. They can't all be right.
- I'll go back to the original paper and double-check its description. 23.83.37.241 (talk) 16:25, 30 April 2018 (UTC)
- P.S. Here's the exact wording of the original source paper applied to the example #9 input:
- "COBS first takes its input data and logically appends a single zero byte."
- 02 ... FE FF 00 00
- "COBS then locates all the zero bytes in the packet (including the added one), and divides the packet at these boundaries into one or more zero-terminated chunks. Every zero-terminated chunk contains exactly one zero byte, and that zero is always the last byte of the chunk. A chunk may be as short as one byte (i.e. a chunk containing just a solitary zero byte) or as long as an entire packet."
- 02 ... FE FF 00 / 00
- "COBS encodes each zero-terminated chunk using one or more variable length COBS code blocks. Chunks of 254 bytes or fewer are encoded as a single COBS code block. Chunks longer than 254 bytes are encoded using multiple code blocks, as described later in this section."
- So the first 255-byte chunk will be encoded as multiple code blocks, and the final chunk will be a single code block.
- "For codes 0x01 to 0xFE, the meaning of each code block is that it represents the sequence of data bytes contained within the code block, followed by an implicit zero byte. The zero byte is implicit — it is not actually contained within the sequence of data bytes in the code block. These code blocks encode data without adding any over- head. Each code block begins with one code byte followed by n data bytes, and represents n data bytes followed by one trailing zero byte. Thus the code block adds no overhead to the data: a chunk (n+1) bytes long is encoded using a code block (1+n) bytes long.
- So the final chunk is encoded as a single code block 01.
- Since it's an overhead byte to begin with, I'll write it in red: {{mono|01}}
- "Chunks longer than 254 bytes are encoded using one or more of these special maximum length code blocks, followed by a single normal code block. Each special maximum length code block has no implied trailing zero, but the final (normal) code block does include an implied trailing zero at the end, so this aggregate se- quence of code blocks correctly encodes the required long chunk of non-zero data with a single zero byte at the end."
- So the first chunk is encoded in two code blocks: FF 02 ... FF 01
- Concatenating the two encoded chunks, the final encoding is FF 02 ... FF 01 01, which is what example #9 originally showed.
- I think you have a valid optimization, but it's not the algorithm described in the source paper. 23.83.37.241 (talk) 16:41, 30 April 2018 (UTC)
- I think you meant Example 10 (not 9), and I agree, it is incorrect.
- 02 03 04 ... FE FF 00 --> FF 02 03 04 ... FE FF 01 00
- Was banging my head against the wall with that one until I realized it was wrong. RayFuller (talk) 18:55, 2 November 2022 (UTC)
An old I-D describing COBS on PPP also exists
[edit]A couple of decades ago, I wrote a draft with Stuart Cheshire showing how COBS and PPP could co-exist.