
Talk:Popularity of text encodings

From Wikipedia, the free encyclopedia

Proposed deletion


Probably OK to delete this; it was created to remove a block of bloat people kept adding to the UTF-8 page, mostly covering which encoding is the second most popular behind UTF-8 in various countries. The rest of the text is filler I added to try to give this article an actual subject. Something must be done to prevent people from re-adding all this to UTF-8, however. Spitzak (talk) 19:28, 16 May 2023 (UTC)[reply]

yep 217.174.52.77 (talk) 15:53, 5 August 2023 (UTC)[reply]
I think this is good as its own topic. The distribution of text encodings is an interesting subject. 50.46.252.164 (talk) 21:29, 12 September 2023 (UTC)[reply]
I don't favor deletion. The topic is important and, as pointed out, not appropriate to be shoved into the UTF-8 article.
However, I'm not very certain of the data quality. The figures cited for UTF-8 are a little higher than in other sources I've seen, for example. 50.46.252.164 (talk) 21:34, 12 September 2023 (UTC)[reply]

The Cyrillic comment about being 2x as efficient as UTF-8 is misleading


The statement says that the native Cyrillic codepage is twice as efficient as UTF-8; however, most Cyrillic websites still use UTF-8 despite that.

However, website content consists largely of markup and tags that are not in the target language of the page, and that markup is almost entirely ASCII. So a Cyrillic web page is only slightly less efficient in UTF-8 than in a native codepage. The same holds for most scripts/languages when comparing UTF-8 with a native codepage. 50.46.252.164 (talk) 21:28, 12 September 2023 (UTC)[reply]
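
To make the point concrete, here is a quick sketch (my own illustration, not from the article) comparing the byte size of a small Cyrillic HTML fragment in UTF-8 versus the legacy Windows-1251 codepage:

```python
# Illustration: the Cyrillic letters double in size under UTF-8,
# but the ASCII markup dominates the page, so the total barely grows.
html = (
    '<!DOCTYPE html><html lang="ru"><head><meta charset="utf-8">'
    '<title>Пример</title></head><body><p>Привет, мир!</p></body></html>'
)

utf8_size = len(html.encode("utf-8"))
cp1251_size = len(html.encode("cp1251"))

# Only the 15 Cyrillic letters cost an extra byte each in UTF-8;
# every ASCII character is one byte in both encodings.
print(utf8_size, cp1251_size)
```

For real pages, with scripts, stylesheets, and attribute-heavy markup, the ASCII share is even larger, so the relative UTF-8 overhead shrinks further.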

The GB18030 statement is also misleading


Typically, Chinese webpages use GB2312/GBK, or effectively Windows-936, rather than GB18030. 50.46.252.164 (talk) 21:32, 12 September 2023 (UTC)[reply]
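
One reason the labels blur together (a sketch of my own, not from the article): GB18030 extends GBK while keeping the byte sequences GBK already assigned, so common Chinese text is byte-identical in both:

```python
# Illustration: for characters GBK already encoded, GB18030 produces
# the same bytes, so a page labeled one way often decodes identically
# the other way.
text = "中文网页"

gbk_bytes = text.encode("gbk")
gb18030_bytes = text.encode("gb18030")

print(gbk_bytes == gb18030_bytes)
```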

The argument for UTF-8 over UTF-16 internally is subjective


"Recently it has become clear that the overhead of translating from/to UTF-8 on input and output, and dealing with potential encoding errors in the input UTF-8, vastly overwhelms any savings UTF-16 could offer" seems to be an unsupported opinion.

For example, "dealing with potential encoding errors in the input UTF-8" is just words. If the input UTF-8 is corrupt, then natively handling UTF-8 will also have to deal with the corrupted UTF-8 stream.

Additionally, most character-property processing libraries, such as ICU, depend on data tables that are UTF-16. If you want to sort a bunch of Unicode strings linguistically, you're going to be converting them to UTF-16 to discover the sort weights (or your library will need to do it for you). The same applies if you're interested in character properties or normalization of the strings.
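
That conversion step can be sketched like this (my illustration; the UTF-16-based library itself is assumed, not shown):

```python
# Illustration: a program storing text as UTF-8 that calls into a
# UTF-16-based library must convert at the boundary, both on the
# way in and on the way out.
utf8_input = "héllo wörld".encode("utf-8")

# Convert to UTF-16 code units before handing the text to the library...
utf16_form = utf8_input.decode("utf-8").encode("utf-16-le")

# ...and convert the library's result back to UTF-8 afterwards.
round_trip = utf16_form.decode("utf-16-le").encode("utf-8")

assert round_trip == utf8_input
```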

UTF-8 is certainly a valid choice, and good for many applications. However, I find the statement "vastly overwhelms any savings UTF-16 could offer" to be narrow-minded. 50.46.252.164 (talk) 21:41, 12 September 2023 (UTC)[reply]

I don't know about "vastly", but you misunderstand the work needed to deal with corrupt UTF-8. Many programs that use UTF-8 internally can ignore corrupt data; for instance, they can successfully copy a UTF-8 stream from one location to another by copying the bytes, requiring no code at all to detect or handle errors. A program that does not use UTF-8 internally has to figure out what to do with errors in the UTF-8 when it translates it to its internal form; this requires more than zero code and thus is literally "infinitely more complicated". Spitzak (talk) 19:51, 10 December 2024 (UTC)[reply]
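
The asymmetry can be sketched in a few lines (my illustration, not part of the discussion above): a byte-for-byte copy passes corrupt UTF-8 through untouched, while converting to a non-UTF-8 internal form forces an error-handling decision at the boundary:

```python
# 0xFF and 0xFE can never occur in valid UTF-8.
corrupt = b"valid text \xff\xfe more text"

# Pass-through: copying bytes needs no decoding and no error handling.
copied = bytes(corrupt)
assert copied == corrupt

# Conversion: the program must choose a policy for the bad bytes,
# e.g. substituting U+FFFD replacement characters (a lossy choice).
decoded = corrupt.decode("utf-8", errors="replace")
print(decoded)
```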