Talk:Popularity of text encodings

This page was proposed for deletion by Thumperward (talk · contribs) on 13 March 2023.

This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has not yet received a rating on the project's importance scale.

Writing systems

This article falls within the scope of WikiProject Writing systems, a WikiProject interested in improving the encyclopaedic coverage and content of articles relating to writing systems on Wikipedia. If you would like to help out, you are welcome to drop by the project page and/or leave a query at the project's talk page.
This article has not yet received a rating on the project's importance scale.

Typography

This article is within the scope of WikiProject Typography, a collaborative effort to improve the coverage of articles related to Typography on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has not yet received a rating on the importance scale.

Proposed deletion

Probably OK to delete this. It was created to move out a block of bloat that people kept adding to the UTF-8 page, which mostly covered which encoding is the second-most-popular behind UTF-8 in various countries. The rest of the text is filler I added to try to give this article an actual subject. However, something must be done to prevent people from re-adding all this to UTF-8. Spitzak (talk) 19:28, 16 May 2023 (UTC)[reply]

yep 217.174.52.77 (talk) 15:53, 5 August 2023 (UTC)[reply]

I think this is good as its own topic. The distribution of text encodings is an interesting subject. 50.46.252.164 (talk) 21:29, 12 September 2023 (UTC)[reply]

I don't favor deletion. The topic is important, and, as pointed out, not appropriate to be shoved into a UTF-8 topic.

However, I'm not very certain of the data quality. The figures cited for UTF-8 are a little higher than other sources I've seen, for example. 50.46.252.164 (talk) 21:34, 12 September 2023 (UTC)[reply]

The Cyrillic Comment about Being 2x as efficient as UTF-8 is misleading

The statement says that the native Cyrillic codepage is twice as efficient as UTF-8, but that most Cyrillic websites still use UTF-8 despite that.

However, website content primarily consists of markup and tags that are not in the target language of the page. The markup is usually primarily ASCII. So, a Cyrillic web page is only very slightly less efficient in UTF-8 than a native codepage. This is true of most scripts/languages and UTF-8 vs a native codepage. 50.46.252.164 (talk) 21:28, 12 September 2023 (UTC)[reply]
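The size argument above can be checked with a rough sketch. This is a toy comparison only: the sample strings and the helper names (`encoded_sizes`, `utf8_overhead`) are made up for illustration, Windows-1251 (`cp1251`) is assumed as the native codepage, and real pages will have different markup-to-text ratios.

```python
# Toy comparison of UTF-8 against a legacy Cyrillic codepage (Windows-1251).
# The sample strings are invented for illustration; real pages will differ.

def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (utf8_bytes, cp1251_bytes) for the given text."""
    return len(text.encode("utf-8")), len(text.encode("cp1251"))


def utf8_overhead(text: str) -> float:
    """Ratio of UTF-8 size to Windows-1251 size (1.0 means no overhead)."""
    utf8_size, legacy_size = encoded_sizes(text)
    return utf8_size / legacy_size


# Bare Cyrillic prose: almost every character costs 2 bytes in UTF-8.
PROSE = "Привет, мир"

# The same prose wrapped in ASCII markup, as it would appear in a web page.
PAGE = (
    '<!DOCTYPE html><html lang="ru"><head><meta charset="utf-8">'
    "<title>Привет</title></head><body>"
    '<p class="greeting">Привет, мир</p></body></html>'
)

if __name__ == "__main__":
    for label, sample in (("prose", PROSE), ("page", PAGE)):
        utf8_size, legacy_size = encoded_sizes(sample)
        print(f"{label}: utf-8={utf8_size} cp1251={legacy_size} "
              f"ratio={utf8_overhead(sample):.2f}")
```

For bare prose the ratio approaches 2, but once the ASCII markup is counted the ratio drops towards 1. How close it gets depends entirely on how markup-heavy the page is, so this does not settle the question for any particular site.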

The GB18030 statement is also misleading

Typically, Chinese webpages are using GB2312/GBK, or possibly effectively Windows 936, and not GB18030. 50.46.252.164 (talk) 21:32, 12 September 2023 (UTC)[reply]
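One reason the label on a page may matter less than it seems: for common characters the three encodings produce the same bytes, and they differ only in how much of Unicode they can represent. The sketch below uses Python's standard `gb2312`, `gbk` and `gb18030` codecs; the sample text and the helper names (`encode_all`, `fits`) are chosen only for illustration, and it says nothing about what real sites actually declare.

```python
# Compare how Python's built-in GB-family codecs encode the same text.
# Sample strings are illustrative only.

CODECS = ("gb2312", "gbk", "gb18030")


def encode_all(text: str) -> dict[str, bytes]:
    """Encode text with each GB-family codec; raises if any codec cannot."""
    return {name: text.encode(name) for name in CODECS}


def fits(text: str, codec: str) -> bool:
    """Return True if the codec can represent every character in text."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


COMMON = "中文"   # common characters, present in all three encodings
EMOJI = "😀"      # outside the Basic Multilingual Plane

if __name__ == "__main__":
    for name, data in encode_all(COMMON).items():
        print(name, data.hex())
    for name in CODECS:
        print(name, "can encode emoji:", fits(EMOJI, name))
```

Under these codecs the common sample yields identical bytes in all three, while the emoji is representable only in GB18030 (as a four-byte sequence).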

The Argument for UTF-8 over UTF-16 internally is subjective.

"Recently it has become clear that the overhead of translating from/to UTF-8 on input and output, and dealing with potential encoding errors in the input UTF-8, vastly overwhelms any savings UTF-16 could offer" seems to be an unsupported opinion.

For example, "dealing with potential encoding errors in the input UTF-8" is just words. If the input UTF-8 is corrupt, then natively handling UTF-8 will also have to deal with the corrupted UTF-8 stream.

Additionally, most character property processing libraries, such as ICU, depend on data tables that are UTF-16. If you want to sort a bunch of Unicode strings linguistically, you're going to be converting them to UTF-16 to discover the sort weights (or your library will need to do it for you). The same applies if you're interested in character properties or normalization of the strings.

UTF-8 is certainly a valid choice, and good for many applications. However, I find the statement "vastly overwhelms any savings UTF-16 could offer" to be narrow-minded. 50.46.252.164 (talk) 21:41, 12 September 2023 (UTC)[reply]

I don't know about "vastly", but you misunderstand the work needed to deal with corrupt UTF-8. Many programs that use UTF-8 internally can ignore corrupt data. For instance, they can successfully copy a UTF-8 stream from one location to another by copying the bytes, requiring no code at all to detect or handle errors. A program that does not use UTF-8 internally has to figure out what to do with errors in UTF-8 when it translates the input to its internal form. This requires more than zero code and thus is literally "infinitely more complicated". Spitzak (talk) 19:51, 10 December 2024 (UTC)[reply]
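The distinction drawn in the reply above can be sketched as follows. This is a minimal illustration, not a claim about any particular program: the byte string and the function names (`copy_bytes`, `roundtrip_via_str`) are hypothetical, and Python's `str` stands in for "some internal form other than UTF-8".

```python
# Sketch: passing bytes through untouched vs. converting to an internal form.
# The input and the function names are hypothetical examples.

# 0xFF and 0xFE can never appear in well-formed UTF-8.
CORRUPT = b"abc\xff\xfedef"


def copy_bytes(data: bytes) -> bytes:
    """Byte-level copy: no decoding, so no error handling is needed."""
    return bytes(data)


def roundtrip_via_str(data: bytes, errors: str = "strict") -> bytes:
    """Decode to an internal string form, then re-encode as UTF-8.

    The caller must choose an error policy: 'strict' raises on bad input,
    'replace' substitutes U+FFFD and therefore changes the data.
    """
    return data.decode("utf-8", errors=errors).encode("utf-8")


if __name__ == "__main__":
    print(copy_bytes(CORRUPT) == CORRUPT)          # bytes pass through as-is
    try:
        roundtrip_via_str(CORRUPT)
    except UnicodeDecodeError as exc:
        print("strict decode failed:", exc.reason)
    print(roundtrip_via_str(CORRUPT, "replace") == CORRUPT)  # data was altered
```

The byte copy preserves the invalid sequence without any decision being made, whereas the converting path has to either reject the input or alter it. Whether that extra work is significant for a given application is a separate question.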