User talk:BilledMammal/Average articles
Methodology queries
How does your algorithm count Dogtown, St. Louis, which has punctuation errors, or Balkan Rhapsodies: 78 Measures of War, which has multi-sentence quotations? I think I'd count the latter as 6 sentences, and someone else might count it as 11, but it's recorded as 13. I count 12 in Campbell v. Clinton and IEEE 802.11 (legacy mode), but your script says 13 for them. 1640 in paleontology is reported as 1 sentence, but it contains 5. WhatamIdoing (talk) 19:05, 14 August 2024 (UTC)
To address your specific questions:
- It doesn't always correctly parse abnormal punctuation, but that is a very difficult problem and the script is still reasonably effective at doing so.
- It does not exclude quotes, and I don't think it should.
- It uses the source code of the page, so while 1640 in paleontology appears to have five sentences, its source has only one; most of the sentences are brought in by templates.
Overall, I think it's sufficient to establish that the median number of sentences is much higher than 4, even though one day I might want to come back and tweak it. BilledMammal (talk) 19:30, 14 August 2024 (UTC)
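To illustrate the two behaviours described above (template text is not counted, and unusual punctuation can throw the count off), here is a minimal sketch. It is not the script under discussion; the function names, the regular expressions, and the splitting rule are all assumptions chosen for illustration.

```python
import re

def strip_templates(wikitext: str) -> str:
    """Remove {{...}} templates, innermost first, so nested ones vanish too."""
    innermost = re.compile(r"\{\{[^{}]*\}\}")
    previous = None
    while previous != wikitext:
        previous = wikitext
        wikitext = innermost.sub("", wikitext)
    return wikitext

def count_sentences(wikitext: str) -> int:
    """Naive sentence count over page source (assumed rule, not the real script)."""
    text = strip_templates(wikitext).strip()
    # Split after ., ! or ? when followed by whitespace and a capital letter.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    # Only count fragments that contain at least one word character.
    return sum(1 for part in parts if re.search(r"\w", part))

# Sentences that live inside templates never reach the counter.
templated = (
    "'''1640 in paleontology''' lists events of the year. {{Year nav|1640}}\n"
    "{{Fossil list|{{Taxon|Foo}} was named. It was large.}}"
)
print(count_sentences(templated))

# An abbreviation is enough to make this naive rule overcount.
print(count_sentences("Dr. Smith arrived. He left."))
```

The second call shows why abnormal punctuation is hard: a rule this simple treats "Dr." as a sentence end, and handling such cases needs either an abbreviation list or a proper sentence tokenizer.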
- I think this is great. I'd love to be able to provide some basic information, particularly to NPP and AFC reviewers, about what's typical.
Figure: Relative frequency of article lengths, as measured in sentences (statistical outliers excluded)
- For example, in this sample, the mean is 47 sentences, the median is 13, the mode is 2, and the standard deviation is ±27, with a range of 0 to 1,056 sentences. I think the inner ranges tell more about the data set, though. The inner 80% range (1,000 to 9,000) is 2 to 60 sentences. The inner 90% (500 to 9,500) is 2 to 95 sentences. The inner 99% (100 to 9,900) is 1 to 234 sentences. The quartiles fall at 5, 13, and 29 sentences. Anything above 67 sentences is a statistical outlier.
- The same numbers for the word count are a mean of 746 words, a median of 338 words, and a standard deviation of ±1,400, with a range of 0 to 33,873 words. The inner 80% range is 47 to 1,667 words. The inner 90% is 32 to 2,692 words. The inner 99% is 21 to 6,813 words. Anything above 1,770 words is a statistical outlier.
- In other words, if you get more than a couple thousand words, it's an outlier, but even the shortest articles are normal.
- In addition to the number of words/sentences, I'm particularly interested in knowing the number of links and the number of sources in typical articles.
- Based on prior manual investigations (example), the leads of FAs tend to have 1.5 internal wikilinks per sentence and one per ~16 words. I'd expect the overall articles, especially longer articles, to have a lower link density.
- I don't know what to expect for a refs:sentence ratio, but I would not be surprised if the ratio was around one ref for every two or three sentences. It would probably be appropriate to exclude abnormally long articles from such calculations. WhatamIdoing (talk) 23:19, 14 August 2024 (UTC)
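The "inner range" figures above can be read as the values found at particular ranks of the sorted sample, for example ranks 1,000 and 9,000 for the inner 80%. A minimal sketch of that reading follows. The sample size of 10,000 is inferred from the quoted ranks rather than stated in the thread, and `inner_range` and the synthetic data are hypothetical.

```python
def inner_range(values, fraction):
    """Return (low, high) bounding the central `fraction` of the sorted values.

    Ranks are 1-based, so for 10,000 values and fraction=0.8 this reads
    the values at ranks 1,000 and 9,000.
    """
    ordered = sorted(values)
    n = len(ordered)
    tail = (1 - fraction) / 2
    low_rank = max(int(round(n * tail)), 1)
    high_rank = min(max(int(round(n * (1 - tail))), low_rank), n)
    return ordered[low_rank - 1], ordered[high_rank - 1]

# Synthetic stand-in data: the value at each rank equals the rank itself,
# which makes the selected ranks easy to see.
synthetic = list(range(1, 10001))
print(inner_range(synthetic, 0.8))
print(inner_range(synthetic, 0.9))
```

On real sentence or word counts the same function would return the counts at those ranks instead of the ranks themselves.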
- About the script: I'm surprised that it did so well. I see that it's picking up the ==External links== section, which might tend to inflate the count, if descriptions are given. Overall, though, I suspect that the undercounts and overcounts approximately balance each other for sentence counts.
- For word counts, though, it may be systematically undercounting. Consider 1964 Malaysian state elections, which is almost entirely tables. There are obviously more than 21 words on the page, but actually counting them would be difficult. WhatamIdoing (talk) 23:34, 14 August 2024 (UTC)
- I've updated with the number of wikilinks and references. Currently, the reference check is a little simplistic; it looks for reference tags or any of the 4200 citation templates. This means that general references used without reference tags, like at 143rd New York Infantry Regiment, are missed. I'll look into how to address that.
- For word counts, that's possible, although in the case of 1964 Malaysian state elections, the low count is because most of the words are inside templates, which are excluded from the count. BilledMammal (talk) 21:47, 15 August 2024 (UTC)
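A rough sketch of the kind of reference check described above (reference tags, or citation templates used outside them), and of why plain-text general references go undetected. This is an illustration under assumptions, not the actual script: the five-name template list is a hypothetical stand-in for the roughly 4,200 templates mentioned, and the way tags and templates are combined here is a guess.

```python
import re

# Hypothetical, tiny stand-in for the full list of citation templates.
CITATION_TEMPLATES = {"cite web", "cite book", "cite journal", "cite news", "citation"}

REF_PAIR = re.compile(r"<ref\b[^>/]*>.*?</ref\s*>", re.IGNORECASE | re.DOTALL)
REF_SELF = re.compile(r"<ref\b[^>]*/>", re.IGNORECASE)
TEMPLATE_NAME = re.compile(r"\{\{\s*([^|{}]+?)\s*(?=\||\}\})")

def count_refs(wikitext: str) -> int:
    """Count <ref>...</ref> pairs plus citation templates outside ref tags."""
    tagged = len(REF_PAIR.findall(wikitext))
    # Drop ref tags so a template wrapped in <ref> is not counted twice;
    # self-closing tags are reuses of a named ref and are not counted.
    remainder = REF_SELF.sub("", REF_PAIR.sub("", wikitext))
    untagged = sum(
        1
        for name in TEMPLATE_NAME.findall(remainder)
        if name.strip().lower() in CITATION_TEMPLATES
    )
    return tagged + untagged

tagged_article = (
    'Fact.<ref>{{cite web|url=http://example.org}}</ref> More.<ref name="a" />\n'
    "==References==\n* {{cite book |title=Regimental history}}"
)
plain_general_refs = "A regiment.\n==References==\n* Smith, J. (1900). A History."

print(count_refs(tagged_article))
print(count_refs(plain_general_refs))
```

The second sample has a general reference written as plain text, with neither a tag nor a template, so a check of this shape reports zero for it.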
- Thanks. Here's a quick summary of the numbers:
- Number of refs per article: Range of 0 to 452. Mean 8.63, median 4, mode 1, standard deviation 17.7. Quartiles are 2, 4, 9. Anything above 20 is an outlier (according to https://www.calculatorsoup.com/calculators/statistics/descriptivestatistics.php). The most common numbers were zero detected refs (9.35%), 1 ref (14.45%), 2 refs (14.17%), 3 (10.65%), 4 [the median] (7.58%), 5 (6.28%), and from there everything else (~44%) has less than 5% each.
- Number of internal links per article: Range of 0 to 4,624 (it's possible that September 2011 in sports should be considered a list article, in which case the range becomes 0 to 1,458), with a mean of 45, a median of 23, and a mode of 8. The standard deviation is 90. The quartiles fall at 12, 23, and 46. Anything at or above 98 is an outlier. A frequency table is interesting; whereas all the others cluster towards small numbers (e.g., two sentences, one ref...), 95% of articles have 5+ links.
- Combining this with the above numbers, the "perfect median" article has
- 338 words
- 13 sentences
- 4 refs
- 23 links
- and the inner quartiles (25th percentile to 75th percentile) – the most middling 50% of Wikipedia's articles, and therefore obviously "typical" content – have these ranges:
- 123–782 words
- 5–29 sentences
- 2–9 refs
- 12–46 links
- WhatamIdoing (talk) 04:40, 18 August 2024 (UTC)
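The outlier thresholds quoted above for words, refs, and links are consistent with the common 1.5 × IQR rule (Tukey's upper fence) applied to the reported quartiles. That the linked calculator uses exactly this rule is an assumption, and `tukey_upper_fence` is a hypothetical helper name.

```python
def tukey_upper_fence(q1, q3, k=1.5):
    """Upper outlier fence: Q3 plus k times the interquartile range."""
    return q3 + k * (q3 - q1)

# First and third quartiles as reported in this thread.
quartiles = {
    "words": (123, 782),
    "refs": (2, 9),
    "links": (12, 46),
}

for metric, (q1, q3) in quartiles.items():
    print(metric, tukey_upper_fence(q1, q3))
```

Under that assumption the fences come out at 1770.5 words, 19.5 refs, and 97.0 links, which lines up with "above 1,770", "above 20", and "at or above 98" once counts are rounded to whole numbers.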
- Here's another fun fact: Most articles in this sample set are statistical outliers for at least one of those four metrics.
- Looking only at the 80% of articles that are "normal" on at least one of the four metrics, there is an average of 4.3 sentences per detectable ref, 18.4 words per wikilink, or one wikilink for every 1.3 sentences. The shorter the article, the greater the density on all elements.
- Looking at the 23% of articles that are "normal" on all four of the metrics, there is an average of 3.6 sentences per ref, 15 words per link, and one wikilink for every 1.6 sentences.
- The shorter half of the "all normal" articles (using the number of sentences to split them) have 2.6 sentences per ref and 12.8 words per wikilink (or 2 wikilinks per sentence). The longer half of these articles have 4.7 sentences per ref and 18.6 words per wikilink (or 1 wikilink per 1.25 sentences). WhatamIdoing (talk) 22:02, 26 August 2024 (UTC)