Talk:Popularity of text encodings
This page was proposed for deletion by Thumperward (talk · contribs) on 13 March 2023.
This article is rated C-class on Wikipedia's content assessment scale.
Proposed deletion
Probably OK to delete this. It was created to remove a block of bloat people kept adding to the UTF-8 page, mostly covering which encoding is the second most popular in the world behind UTF-8 in various countries. The rest of the text is filler I added to try to give this article an actual subject. Something must be done to prevent people from re-adding all of this to UTF-8, however. Spitzak (talk) 19:28, 16 May 2023 (UTC)
- yep 217.174.52.77 (talk) 15:53, 5 August 2023 (UTC)
- I think this is good as its own topic. The distribution of text encodings is an interesting subject. 50.46.252.164 (talk) 21:29, 12 September 2023 (UTC)
- I don't favor deletion. The topic is important and, as pointed out, not appropriate to be shoved into the UTF-8 article.
- However, I'm not very certain of the data quality. The figures cited for UTF-8 are a little higher than other sources I've seen, for example. 50.46.252.164 (talk) 21:34, 12 September 2023 (UTC)
The Cyrillic comment about being 2x as efficient as UTF-8 is misleading
The statement says that the native Cyrillic codepage is twice as efficient as UTF-8; however, most Cyrillic websites still use UTF-8 despite that.
But website content consists largely of markup and tags that are not in the target language of the page, and that markup is mostly ASCII. So a Cyrillic web page is only very slightly less efficient in UTF-8 than in a native codepage. The same is true of most scripts/languages when comparing UTF-8 with a native codepage. 50.46.252.164 (talk) 21:28, 12 September 2023 (UTC)
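To make the arithmetic concrete, here is a minimal Python sketch (the snippet of markup is made up, not taken from any real page) comparing the byte size of a short Cyrillic string in UTF-8 and in the legacy Windows-1251 codepage, with and without surrounding ASCII markup:

    # Byte sizes of Cyrillic text in UTF-8 vs. the legacy Windows-1251 codepage.
    plain = "Привет, мир"
    markup = '<p class="greeting">Привет, мир</p>'  # same text inside ASCII markup

    for label, text in (("plain text", plain), ("with markup", markup)):
        utf8 = len(text.encode("utf-8"))
        cp1251 = len(text.encode("cp1251"))
        print(f"{label}: UTF-8 = {utf8} bytes, cp1251 = {cp1251} bytes, "
              f"ratio = {utf8 / cp1251:.2f}")

The bare string is about 1.8x larger in UTF-8, but with even this small amount of markup the ratio drops to about 1.3x; a real page, which is mostly markup, scripts, and styles, shrinks the gap further.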
The GB18030 statement is also misleading
Typically, Chinese web pages use GB2312/GBK (or, effectively, Windows code page 936), not GB18030. 50.46.252.164 (talk) 21:32, 12 September 2023 (UTC)
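One reason the labels blur in practice: GB18030 is a backward-compatible superset of GBK (and GBK of GB2312), so for characters GBK covers, the two encoders produce identical bytes, and a GB18030 decoder reads a page declared as GB2312 or GBK without trouble. A minimal Python sketch (the sample string is arbitrary):

    # For characters covered by GBK, GB18030 produces the same bytes,
    # so the declared label often matters less than the actual byte stream.
    sample = "中文网页"
    gbk_bytes = sample.encode("gbk")
    print(gbk_bytes == sample.encode("gb18030"))   # True for GBK-covered text
    print(gbk_bytes.decode("gb18030") == sample)   # GBK bytes decode fine as GB18030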
The argument for UTF-8 over UTF-16 internally is subjective
[edit]"Recently it has become clear that the overhead of translating from/to UTF-8 on input and output, and dealing with potential encoding errors in the input UTF-8, vastly overwhelms any savings UTF-16 could offer" seems to be an unsupported opinion.
For example, "dealing with potential encoding errors in the input UTF-8" is just words. If the input UTF-8 is corrupt, then natively handling UTF-8 will also have to deal with the corrupted UTF-8 stream.
Additionally, most character-property processing libraries, such as ICU, depend on data tables that are UTF-16. If you want to sort a bunch of Unicode strings linguistically, you're going to be converting them to UTF-16 to discover the sort weights (or your library will need to do it for you; see the sketch after this comment). The same applies if you're interested in character properties or normalization of the strings.
UTF-8 is certainly a valid choice, and good for many applications. However, I find the statement "vastly overwhelms any savings UTF-16 could offer" to be narrow-minded. 50.46.252.164 (talk) 21:41, 12 September 2023 (UTC)
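For what it's worth, the ICU point above can be illustrated with the third-party PyICU binding (a minimal sketch, assuming PyICU is installed; the word list and locale are arbitrary). Collation produces binary sort keys, and per the comment above those weights come from UTF-16-based tables, so strings held as UTF-8 get converted somewhere along the way:

    import icu  # PyICU, a Python binding to the ICU C/C++ libraries

    words = ["zebra", "Äpfel", "apple", "Öl"]
    collator = icu.Collator.createInstance(icu.Locale("de_DE"))
    # getSortKey() returns the binary collation weights used for linguistic sorting.
    print(sorted(words, key=collator.getSortKey))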
- I don't know about "vastly", but you misunderstand the work needed to deal with corrupt UTF-8. Many programs that use UTF-8 internally can ignore corrupt data; for instance, they can successfully copy a UTF-8 stream from one location to another just by copying the bytes, requiring no code at all to detect or handle errors. A program that does not use UTF-8 internally has to figure out what to do with errors in the UTF-8 when it translates it to its internal form; that requires more than zero code and thus is, literally, "infinitely more complicated". Spitzak (talk) 19:51, 10 December 2024 (UTC)
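To illustrate the distinction being drawn here, a minimal Python sketch with a deliberately corrupted byte string (no real program is being quoted): a byte-oriented pass-through of UTF-8 needs no error policy at all, whereas converting to a different internal form forces a choice about invalid sequences:

    data = b"caf\xc3\xa9 \xff broken"   # valid UTF-8 plus one stray 0xFF byte

    # A program that keeps UTF-8 internally can pass the data through untouched:
    copied = bytes(data)                # no decoding, no error handling required
    assert copied == data

    # A program with a different internal form must decode first and must pick
    # a policy for the invalid byte (raise, replace, skip, ...):
    decoded = data.decode("utf-8", errors="replace")
    print(decoded)                      # the 0xFF byte becomes U+FFFD ('�')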