Wikipedia talk:Database reports/Archive 1
This is an archive of past discussions about Wikipedia:Database reports. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
Archive 1 | Archive 2 | Archive 3 | → | Archive 5
Uncategorized categories
- Thanks for setting this up. I use Wikipedia:Database reports/Uncategorized categories a lot, and I have some ideas for improving it:
- I'd appreciate it if the report were run more often than once a week. How about twice a week? Every other day?
- The first thing I do when looking at an uncategorized category is to check the page history for recent vandalism. It'd be nice if the report included "(hist)" links to the category page histories -- similar to Special:Watchlist.
- Many of the categories reported are empty ones that will eventually be speedily deleted. It would help if the report gave some indication of whether the category was empty when the report was run -- perhaps even the number of pages in the category.
- Many of the categories reported are blanked ones awaiting deletion. It would help if the report gave some indication of whether the page was blank when the report was run -- perhaps even a bytecount for the category page.
- That's all for now. Keep up the good work! - Stepheng3 (talk) 20:20, 29 October 2008 (UTC)
- Done. The script will now run at midnight on Monday, Wednesday, and Friday. I've wrapped the page titles in a template, Template:clh, so they now include a (hist) link. I put all of the data into a sortable table with columns for page length and category members. Note: some of the category members fields are blank. This isn't a bug in the script, this is more of an issue with the way MediaWiki operates. The only thing remaining that I might want to add is a counter field before the page title so that it gives some sort of idea how many rows are in the table. Let me know if there's anything else you need / want. Cheers. --MZMcBride (talk) 02:50, 30 October 2008 (UTC)
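For reference, here is a minimal sketch of the kind of query such a report could be built on, against the Toolserver's enwiki_p copy of the standard MediaWiki schema (this is not necessarily the script's actual query):
SELECT page_title,
       page_len,                      /* bytecount of the category page */
       cat_pages AS category_members  /* NULL when no category table row exists yet */
FROM page
LEFT JOIN categorylinks ON cl_from = page_id
LEFT JOIN category ON cat_title = page_title
WHERE page_namespace = 14             /* Category: namespace */
  AND page_is_redirect = 0
  AND cl_from IS NULL;                /* the category page is itself uncategorized */
The category table is only populated lazily by MediaWiki, which would account for the blank member-count fields mentioned above.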
- Very nice. Thanks for the quick response. I'm thinking about ways to improve Wikipedia:Database reports/Empty categories, so I should have more suggestions for you soon. - Stepheng3 (talk) 03:36, 30 October 2008 (UTC)
- One minor thing that was lost in the upgrade: it used to be easy to tell how many pages were in the list. As far as I'm concerned, this isn't a big issue, but it would be nice to have again at some point. - Stepheng3 (talk) 06:42, 30 October 2008 (UTC)
- Done. --MZMcBride (talk) 07:48, 30 October 2008 (UTC)
Now that we've got Last edit and Last user columns for WP:Database reports/Empty categories, how hard would it be to add the same features to WP:Database reports/Uncategorized categories? - Stepheng3 (talk) 21:36, 5 November 2008 (UTC)
- Done. --MZMcBride (talk) 00:28, 6 November 2008 (UTC)
- Thanks! - Stepheng3 (talk) 00:39, 6 November 2008 (UTC)
Empty categories
I've begun working through the Wikipedia:Database reports/Empty categories, and I have a few suggestions:
- It looks like some of these categories have subcats. (See Category:North_Carolina_sportspeople for example.) I think categories with subcats should be excluded from the report. If they are to be included, then please give a subcat count for each page on the list.
- This is another report in which "(hist)" links would be handy.
- Many of the categories reported are blanked ones awaiting deletion. It would help if the report segregated these or gave some indication of whether the page was blank when the report was run -- perhaps a bytecount for each Category: page.
- It would be convenient to be able to sort empty cats by the date of the most recent edit on their Category: pages. That would make it easier to find empty cats that haven't been edited in the past four days. (Category talk: pages should probably be ignored here.)
- I'd also like to be able to see/sort empty cats by the username of the last editor of each Category: page. This would make it easier to find/recognize pages I've tagged. It would also allow extra attention to be given to IP edits.
- Also, please consider excluding or segregating project categories whose names match the following regular expressions:
- /_articles_needing_/
- /_articles_with_comments$/
- /_articles_without_infoboxes$/
I hope these ideas inspire someone to upgrade this valuable report. Thanks in advance, - Stepheng3 (talk) 07:35, 30 October 2008 (UTC)
- I'm pretty sure categories with subcategories are excluded. If you look at your example, you will see that all 3 subcats were recently added to that category, and it was probably caught in between when the report was running and when they were added. As for your last point, I'm not so sure those categories are exempt from deletion. If you look at WP:CSD#C1 (which I wrote the wording for), it says "project categories that by their nature may become empty on occasion" - are these categories that by their nature are to be empty on occasion? An argument can be made that they aren't, and hence are not exempt under C1 (personally I haven't deleted these types of categories myself, but I wouldn't object to their deletion. IMO we don't need empty categories waiting around to be populated, and if they are needed they can be easily restored). VegaDark (talk) 15:42, 30 October 2008 (UTC)
- I think it's reasonable to expect _articles_needing_ and _articles_without_infoboxes to become empty from time to time, as the needs are met and the infoboxes are supplied. I'm less sure about _articles_with_comments, so maybe keep those in the report for now. - Stepheng3 (talk) 23:04, 30 October 2008 (UTC)
- I'm actively working on this. The main issue is that there are about 1 million Category: pages and the revision table is simply enormous (about 280,000,000 rows), so it's going to require a very optimized query or using the API to come up with the output for this data. I can tell you that #2 on your list is trivial and #1 is unnecessary as the number of subcats will always be zero. #3 comes from the page table so it's trivial as well. #4 and #5 are the ones that require joining on the revision table. --MZMcBride (talk) 23:14, 31 October 2008 (UTC)
- The difficult changes can wait. After all, this information is just a single click away.
- I have another idea. How about archiving these reports for a week or two, so that an editor tagging with {{db-catempty}} could reference the relevant report(s) without having to worry about them getting overwritten before an admin comes along? - Stepheng3 (talk) 23:23, 31 October 2008 (UTC)
- Isn't the page history a sufficient archive? VegaDark (talk) 00:42, 1 November 2008 (UTC)
- Good point! The old Special: pages didn't have history, and it hadn't dawned on me that (of course) the project pages have history. Hm, I think I can put this to good use. - Stepheng3 (talk) 02:05, 1 November 2008 (UTC)
(unindent) w00t! Looks like I've got it working now. :D I'm going to hold off on #6 while discussion continues (though it is trivial to implement once there's a decision about what words / phrases to filter out). --MZMcBride (talk) 20:14, 2 November 2008 (UTC)
- Congrats. I've been busy too.
- Other phrases to filter out or segregate:
- _articles_missing_
- _articles_in_need_of_
- _articles_undergoing_
- _articles_to_be_
- _articles_not_yet_
- To me, all these phrases imply that the articles in question are waiting for something to happen, and I would be reluctant to tag them for C1. Just to be safe, I suggest collecting empty categories that match these patterns in a separate table where human beings can browse them if desired. Same goes for _articles_needing_ and all the phrases currently filtered out. I wouldn't want something along the lines of Category:Tax assessors to get overlooked. - Stepheng3 (talk) 05:09, 4 November 2008 (UTC)
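For reference, a sketch of how such a name filter might be bolted onto an empty-categories query; this assumes the category table's cat_pages counter is used as the emptiness test (the actual report may do it differently), and the pattern list is simply whatever gets agreed on here:
SELECT page_title
FROM page
JOIN category ON cat_title = page_title
WHERE page_namespace = 14
  AND page_is_redirect = 0
  AND cat_pages = 0                                       /* empty category */
  AND page_title NOT LIKE '%\_articles\_needing\_%'       /* underscores escaped: _ is a wildcard in LIKE */
  AND page_title NOT LIKE '%\_articles\_missing\_%'
  AND page_title NOT LIKE '%\_articles\_in\_need\_of\_%'
  AND page_title NOT LIKE '%\_articles\_undergoing\_%'
  AND page_title NOT LIKE '%\_articles\_to\_be\_%'
  AND page_title NOT LIKE '%\_articles\_not\_yet\_%';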
Indefinitely semi-protected articles
We could use some review of the long-term semi-protection on many articles. We could run the report daily, or even weekly. Thanks, HiDrNick! 15:47, 12 November 2008 (UTC)
- Done. Couple of notes about current implementation:
- It includes both redirects and non-redirects
- The log comment field is <nowiki>'d to avoid template expansion (this could probably be adjusted to a regex that changes "{{" to "{{tl|")
- There are about 60 titles that are not included in the list because they do not have an entry in the logging table, even though the pages themselves are semi-protected (list available here); from a quick glance, the issue appears to be that the pages were protected under a different title and then subsequently moved
- --MZMcBride (talk) 22:06, 12 November 2008 (UTC)
- Sweet, thanks. Now I just need to find an admin who needs to grind their unprotecting skill. HiDrNick! 20:08, 15 November 2008 (UTC)
- With the help of CBM and bjweeks, this list is now much improved. The page_title of redirects are now wrapped in italics and the issue with the missing 60 titles has been resolved. I've decided to stick to using <nowiki> for the log_comment field because other types of links and syntax can cause problems (e.g., [[Image: or [[Category:). Cheers. --MZMcBride (talk) 06:11, 18 November 2008 (UTC)
additional report ideas from Wikipedia:Criteria for speedy deletion/Deletion templates
- How about a report listing talk pages for pages that don't exist? (Including Category, Wikipedia, User, Image, Template, and Portal talk pages.)
- How about one for user pages of nonexistent users?
- And one for redirects to user pages (not including User talk pages)?
If these reports already exist, or if these pages are tagged/deleted by bots, that would be nice to know. If not, let's start with monthly reports and see how much cleanup work there is. - Stepheng3 (talk) 22:54, 31 October 2008 (UTC)
- #3 already exists at Wikipedia:Database reports/Cross-namespace redirects. #1 is a bit tricky. Generally you have to exclude any title that contains a / because of things like archives. Image_talk: pages are often "orphaned" because the image (and its description page) exists at Commons. User_talk: pages are very regularly not part of a set, so that namespace will have to be excluded entirely. User pages of non-existent users (#2) is a bit tricky because (a) user names are stored without underscores while page titles are stored with underscores (so a direct comparison first has to use a REPLACE); (b) often accounts are renamed and the old user page is redirected, so they appear to not belong to a user but they really do; (c) very old accounts have been "deleted", so to speak, in some cases, esp. if the user originally registered with a lowercase name. That doesn't preclude these reports from being generated, per se, it just complicates matters a bit and will make the results not what is expected in a lot of cases.
Also, you may be interested in the /Magic subpage in my userspace where I list all of the things you've listed here and more. Cheers. --MZMcBride (talk) 23:23, 31 October 2008 (UTC)
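For #2, a rough sketch of the REPLACE comparison described above (with the caveat that renamed and very old accounts will show up as false positives):
SELECT page_title
FROM page
LEFT JOIN user ON user_name = REPLACE(page_title, '_', ' ')  /* titles use underscores, user names use spaces */
WHERE page_namespace = 2           /* User: namespace */
  AND page_title NOT LIKE '%/%'    /* skip subpages */
  AND page_is_redirect = 0
  AND user_id IS NULL;             /* no matching account */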
categories that contain themselves
How about a report showing categories that contain themselves as members? I've run into this situation dozens of times in the past year. Some editor (who is probably a bit unclear on how categories are intended to work) creates a category Category:Foo with the text [[Category:Foo]] in it. Or someone copying categories from the main article forgets to delete the category in question. The resulting category is categorized, non-zero length, and non-empty. I think there might be some cleanup work to do in this area.
And if you get ambitious, maybe you could write a report to identify cycles in the category hierarchy.
A third alternative might be to modify Wikipedia:Database reports/Uncategorized categories so that the category in question is excluded from being counted as a parent. This would be the least thorough approach, though probably the easiest. - Stepheng3 (talk) 19:47, 1 November 2008 (UTC)
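For reference, a minimal sketch of what a self-categorized-categories query could look like (not necessarily what the report below ended up using):
SELECT page_title
FROM page
JOIN categorylinks ON cl_from = page_id
WHERE page_namespace = 14     /* Category: namespace */
  AND cl_to = page_title;     /* the category is a member of itself */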
- Done at Wikipedia:Database reports/Self-categorized categories. I won't set up a schedule for the query until you or I (or someone else) can determine the usefulness of the results. It looks like a lot of the self-categorizations are coming from templates, if that makes a difference. --MZMcBride (talk) 22:25, 2 November 2008 (UTC)
- Interesting report. I deleted several that have existed for years that contained nothing but themselves. A good chunk of the categories are user categories, and the usual reason for this is that users place the userbox in the category. 99% of userboxes don't have their coding set up so that the category is only added to pages in userspace, which is what needs to be done in order to prevent these categories from containing themselves (substing the templates and removing the portion of the code that adds the category would also work, but I wouldn't recommend that because 1) the userbox will no longer update when changes are made, and 2) often userboxes are added to more than just category pages, so making the code itself prevent the category from showing up outside of userspace pages will provide more benefit overall for the encyclopedia). VegaDark (talk) 00:05, 3 November 2008 (UTC)
- Looks like a good excuse for me to learn about template syntax. Any tips? - Stepheng3 (talk) 20:31, 3 November 2008 (UTC)
Well, the general trick is to wrap the [[Category:Foo]] links in a {{{category}}} parameter and then define it as null where you don't want to have the category show up. So, for example:
Template:
Foo {{{1}}} ''{{{2}}}''<includeonly>[[Category:Bar]]</includeonly><noinclude> Docs... </noinclude>
That would be changed to:
Foo {{{1}}} ''{{{2}}}''<includeonly>{{{category|[[Category:Bar]]}}}</includeonly><noinclude> Docs... </noinclude>
Transclusions of the template need to be changed from:
{{template}}
To:
{{template|category=}}
Essentially, you're defining a category parameter (and defaulting it to the regular category) while then setting it as null for specific pages. --MZMcBride (talk) 04:26, 4 November 2008 (UTC)
- Thanks. I think I can handle the simple cases now. Let's see how far I get... - Stepheng3 (talk) 22:48, 5 November 2008 (UTC)
- Hi.. I've been working my way through the self-catted cats. For fixing userboxes, I've been using:
<includeonly>{{#ifeq:{{ns:2}}|{{NAMESPACE}}|[[Category:FOO|{{PAGENAME}}]]|}}</includeonly>
- This means that any extant uses of the template/UBX don't change, and don't need params added; the category is automatically excluded when on category pages, and automatically included on userpages. //roux 05:17, 20 November 2008 (UTC)
possible bug in Uncategorized categories report
Why does Category:Slovenia national football team templates keep showing up in this report? - Stepheng3 (talk) 01:28, 22 December 2008 (UTC)
- Hmm. Not sure. I null edited the page. Let's see if it disappears from the list when the list updates again. --MZMcBride (talk) 18:28, 1 January 2009 (UTC)
- It's still there. - Stepheng3 (talk) 02:04, 5 January 2009 (UTC)
mysql> SELECT * FROM categorylinks WHERE cl_to="Slovenia_national_football_team_templates";
+----------+-------------------------------------------+-----------------------------------+---------------------+
| cl_from | cl_to | cl_sortkey | cl_timestamp |
+----------+-------------------------------------------+-----------------------------------+---------------------+
| 7314367 | Slovenia_national_football_team_templates | Slovenia Squad 2002 World Cup | 2006-10-06 04:20:14 |
| 17894927 | Slovenia_national_football_team_templates | Template:Slovenia Squad 2000 Euro | 2008-06-11 19:27:50 |
+----------+-------------------------------------------+-----------------------------------+---------------------+
2 rows in set (0.02 sec)
That looks right to me. The Toolserver is still a bit lagged (about four days), so I'm inclined to wait until the next update to see if this is resolved. --MZMcBride (talk) 04:53, 5 January 2009 (UTC)
- It looks like this is fixed now. Thanks. - Stepheng3 (talk) 01:24, 7 January 2009 (UTC)
is this thing working?
I notice that none of the reports were updated on Wednesday or today (Thursday) -- not even the ones marked "daily". Is something broken? - Stepheng3 (talk) 18:16, 1 January 2009 (UTC)
- Yep, the Toolserver is still partially down. At least the important bit is (the database access for enwiki_p). Worse, when it comes back online, there will likely be quite a bit of replag. :-( --MZMcBride (talk) 18:26, 1 January 2009 (UTC)
- Thanks for the quick response. I suppose life will go on with or without the reports. - Stepheng3 (talk) 18:42, 1 January 2009 (UTC)
Is there a way to generate these with more current data? The galleries display images included in {{infobox lake}} as "image_lake". The name of the lake displayed is from "lake_name" and links back to the lake's article. I break it down manually into pages with 200 images.
Currently I use Templatetiger. Its data is updated once or twice a year. Its January 2009 update is with data from 2008-10-08. -- User:Docu
- Hmm, this is possible, but it would likely require the use of the API, as I don't see any way for the info to come from just the Toolserver. So I guess it really just depends how much you want this. Though it seems specific enough that you may have much better luck at WP:BOTREQ. --MZMcBride (talk) 04:35, 23 January 2009 (UTC)
So I'd need to read Infobox lake and then each article directly from the site? Maybe I can get this to work with pywikipediabot. BTW is there one that reads template fields? -- User:Docu
- Template parsing is a real bitch. I doubt pywikipedia has any good tools for it, but somebody at WP:BOTREQ might. --MZMcBride (talk) 17:15, 30 January 2009 (UTC)
After looking through some of the tools, with a few changes, archivebot.py might do it. -- User:Docu
- Assuming each template parameter is on its own line, probably doable. But if not, that script will almost surely break. :-) --MZMcBride (talk) 19:35, 30 January 2009 (UTC)
- It did work. Thanks for your feedback. -- User:Docu
Pages with the most revisions
This report appears to be broken. Is anyone looking into this? - Stepheng3 (talk) 04:49, 24 January 2009 (UTC)
- I ran it on testwiki_p (http://test.wikipedia.org) first to make sure everything worked properly. It's currently running on enwiki_p. It has been running for 89302 seconds (about 24 hours). It's unclear whether it will ever finish.... --MZMcBride (talk) 05:17, 24 January 2009 (UTC)
- w00t, it finished. :-) But it's pretty clear that it can't be run on a set schedule as it takes about two days to finish. So I've updated the main page to indicate that it will be run manually. Cheers. --MZMcBride (talk) 20:38, 24 January 2009 (UTC)
- Thanks for the additional information. It certainly is a very interesting report. - Stepheng3 (talk) 23:28, 24 January 2009 (UTC)
Category:Wikipedia featured topics FOO featured/good content
I've been unsuccessful at trying to convince rst20xx to remove this exemption he added to the C1 criteria. Could you filter these categories out of the empty categories report? Here are a couple examples:
- Category:Wikipedia_featured_topics_ExampleTopicName_good_content
- Category:Wikipedia_featured_topics_German_Type_U_66_submarines_featured_content
While I've got your attention, I'm reiterating my request for categories whose names contain the following phrases to be filtered out, or at least segregated in some way from the main list of empty categories:
- _articles_missing_
- _articles_in_need_of_
- _articles_undergoing_
- _articles_to_be_
- _articles_not_yet_
In my experience, categories of this sort are "project categories that by their nature may become empty on occasion". --Stepheng3 (talk) 21:00, 12 February 2009 (UTC)
Article referencing statistics
Over at Wikipedia:Improving referencing efforts we're trying to improve article referencing. It would be helpful if we had statistics on the proportion of articles with references, average references per article, etc. Ideally, I would like to compare a year ago to this year, or even see monthly stats if that would be possible. Thanks. - Peregrine Fisher (talk) (contribs) 22:03, 15 December 2008 (UTC)
- Sorry, I completely missed this thread being posted. You're likely much better off with a request at WP:BOTREQ. This type of thing requires a database dump to get the page text and analyze it. Cheers. --MZMcBride (talk) 18:24, 13 February 2009 (UTC)
Pages with unusual protection levels
Pages with a protection level of [edit=none], [move=autoconfirmed], or with [edit=sysop] but not [move=sysop], as these confuse bots and protection templates (through {{PROTECTIONLEVEL}}). Thanks, Cenarium (talk) 14:19, 13 February 2009 (UTC)
- This should be doable.... --MZMcBride (talk) 18:33, 13 February 2009 (UTC)
Someone should probably check my logic here, but this should be what you want.
mysql> SELECT
-> CONCAT('# [[',ns_name,':',page_title,']]')
-> FROM page_restrictions AS pr1
-> JOIN page
-> ON pr1.pr_page = page_id
-> JOIN toolserver.namespace
-> ON page_namespace = ns_id
-> AND dbname = "enwiki_p"
-> WHERE pr1.pr_type = "move"
-> AND pr1.pr_level = "autoconfirmed"
-> AND NOT EXISTS (SELECT
-> 1
-> FROM page_restrictions AS pr2
-> WHERE pr1.pr_page = pr2.pr_page
-> AND pr2.pr_type != "move");
+----------------------------------------------+
| CONCAT('# [[',ns_name,':',page_title,']]') |
+----------------------------------------------+
| # [[:Lysergic_acid_diethylamide]] |
| # [[Talk:Serbo-Croatian_language/Archive_2]] |
| # [[:Scrubs_(TV_series)]] |
| # [[:Interscope_Records]] |
| # [[:Lost]] |
| # [[User talk:Anonymous_editor]] |
| # [[User talk:212.85.18.7]] |
| # [[Template:Intelligent_Design]] |
| # [[Portal:Taiwan]] |
| # [[User:Titoxd/Sandbox2]] |
| # [[Template:Essay]] |
| # [[:DADVSI]] |
| # [[User talk:210.217.33.187]] |
| # [[User talk:Hillman]] |
| # [[:George_Augustus_Stallings]] |
| # [[User talk:Jersyko/archive10]] |
| # [[User talk:Kafka_Liz]] |
| # [[:Glen_A._Wilson_High_School]] |
| # [[User talk:Gavin.collins/Archive_6]] |
| # [[User talk:Peter_Damian_(old)]] |
| # [[User talk:69.180.82.43]] |
| # [[User talk:EEMIV/Archive10]] |
| # [[User talk:EEMIV]] |
| # [[Talk:Serbo-Croatian_language]] |
+----------------------------------------------+
24 rows in set (0.18 sec)
mysql> SELECT
-> CONCAT('# [[',ns_name,':',page_title,']]')
-> FROM page_restrictions AS pr1
-> JOIN page
-> ON pr1.pr_page = page_id
-> JOIN toolserver.namespace
-> ON page_namespace = ns_id
-> AND dbname = "enwiki_p"
-> WHERE pr1.pr_type = "edit"
-> AND pr1.pr_level = "sysop"
-> AND NOT EXISTS (SELECT
-> 1
-> FROM page_restrictions AS pr2
-> WHERE pr1.pr_page = pr2.pr_page
-> AND pr2.pr_type = "move"
-> AND pr2.pr_level = "sysop");
+--------------------------------------------+
| CONCAT('# [[',ns_name,':',page_title,']]') |
+--------------------------------------------+
| # [[Template:WP_Mesoamerica_subpage]] |
| # [[User talk:Railer_198]] |
| # [[Template:IPsock]] |
| # [[User:Happy-melon/sandbox4]] |
| # [[User:Kingturtle/Warningkt]] |
| # [[User:Sceptre/contactme]] |
| # [[User:Sceptre/userpage]] |
| # [[User talk:Fatloss]] |
| # [[User talk:Stifle/wizard/images]] |
+--------------------------------------------+
9 rows in set (0.48 sec)
Don't have time at the moment to make a proper database report, but at least you have the results. :-) --MZMcBride (talk) 15:31, 14 February 2009 (UTC)
- Thanks :) There aren't that many of them. I adjusted the protection levels, so you won't need to make a report. Cenarium (talk) 18:41, 14 February 2009 (UTC)
Self redirects
SELECT *
FROM page
INNER JOIN pagelinks ON page_id = pl_from
WHERE page_namespace = 0
AND pl_namespace=0 /* optional */
/* AND page_len > 2024 to skip stubs */
AND page_title IN (SELECT rd_title
FROM redirect
INNER JOIN page ON rd_from = page_id
WHERE page_title = pl_title AND page_namespace=pl_namespace)
LIMIT 50;
This query will find links that redirect back to the originating page. This can happen because:
- It redirects to a section
- The page was recently moved and the navbox hasn’t been updated
- The redirecting page was merged into the main one
In all cases it is annoying for the reader to have to reload the same page. — Dispenser 17:25, 13 February 2009 (UTC)
Not sure I understand this query. Here are sample results of the query for enwiki_p with LIMIT 5.
mysql> SELECT *
-> FROM page
-> INNER JOIN pagelinks ON page_id = pl_from
-> WHERE page_namespace = 0
-> AND pl_namespace=0 /* optional */
-> /* AND page_len > 2024 to skip stubs */
-> AND page_title IN (SELECT rd_title
-> FROM redirect
-> INNER JOIN page ON rd_from = page_id
-> WHERE page_title = pl_title AND page_namespace=pl_namespace)
-> LIMIT 5;
+----------+----------------+-------------------------------------------+-------------------+--------------+------------------+-------------+----------------+----------------+-------------+----------+----------+--------------+-------------------------------------------+
| page_id | page_namespace | page_title | page_restrictions | page_counter | page_is_redirect | page_is_new | page_random | page_touched | page_latest | page_len | pl_from | pl_namespace | pl_title |
+----------+----------------+-------------------------------------------+-------------------+--------------+------------------+-------------+----------------+----------------+-------------+----------+----------+--------------+-------------------------------------------+
| 19813130 | 0 | "D"_Is_for_Dubby_-_The_Lustmord_Dub_Mixes | | 0 | 0 | 0 | 0.671963747482 | 20081214074626 | 257866114 | 1356 | 19813130 | 0 | "D"_is_for_Dubby_-_The_Lustmord_Dub_Mixes |
| 9582421 | 0 | "Them" | | 0 | 0 | 0 | 0.415525086755 | 20090211013015 | 265439711 | 4810 | 9582421 | 0 | "Them"_(King_Diamond_album) |
| 12571133 | 0 | "V"_Is_for_Vagina | | 0 | 0 | 0 | 0.364994536024 | 20090213022315 | 270356944 | 6165 | 12571133 | 0 | "V"_is_for_Vagina |
| 16079543 | 0 | "V"_Is_for_Viagra._The_Remixes | | 0 | 0 | 0 | 0.409772390435 | 20081216101637 | 258331209 | 3187 | 16079543 | 0 | "V"_is_for_Viagra._The_Remixes |
| 7630017 | 0 | "Weird_Al"_Yankovic_discography | | 0 | 0 | 0 | 0.110680981041 | 20090211013015 | 269028703 | 20424 | 7630017 | 0 | List_of_singles_by_"Weird_Al"_Yankovic |
+----------+----------------+-------------------------------------------+-------------------+--------------+------------------+-------------+----------------+----------------+-------------+----------+----------+--------------+-------------------------------------------+
5 rows in set (9.06 sec)
I don't think this is finding what you're after (self-redirects).
--MZMcBride (talk) 18:18, 13 February 2009 (UTC)
Oh, I think I just understood what you're talking about. This is probably not feasible for a wiki page (likely far too many results), but it can certainly be saved to a text file (either as a one-time thing or regularly). The pagelinks table is enormous, however, so it could take a very long time to run and may require LIMITs and such.... --MZMcBride (talk) 18:41, 13 February 2009 (UTC)
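For what it's worth, a sketch of the same check written as straight joins rather than a correlated IN subquery (untested at full scale; the pagelinks scan is still the expensive part):
SELECT p1.page_title AS linking_page,
       p2.page_title AS self_redirect
FROM pagelinks
JOIN page AS p1 ON pl_from = p1.page_id              /* the page containing the link */
JOIN page AS p2 ON p2.page_namespace = pl_namespace
               AND p2.page_title = pl_title          /* the link target */
JOIN redirect ON rd_from = p2.page_id                /* the target is a redirect... */
WHERE p1.page_namespace = 0
  AND pl_namespace = 0
  AND rd_namespace = 0
  AND rd_title = p1.page_title                       /* ...pointing back to the linking page */
LIMIT 50;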
Toolbox links
In the toolbox area on the left of the Wikipedia page on the "Special pages" page, several links don't work. How do I delete them? There is nothing on the "Unused files" page. There is nothing on the "wanted pages" page. There is nothing on the "unused templates page". I mentioned the issue on ANI and it was suggested I come here. ChildofMidnight (talk) 07:43, 9 March 2009 (UTC)
- Sorry, I was suggesting that instead of looking at special pages for things which have ceased updating, you find their replacements at database reports. I suspect that the composition of the "special page" links is something determined by the software and may not be manipulable at the editor level. Protonk (talk) 08:33, 9 March 2009 (UTC)
Polluted categories database report
I think we should have a template for exclusion from this report. Category:Wikipedia noindex pages, for example, should probably not be in it, even though it fits the formal criteria. עוד מישהו Od Mishehu 09:31, 9 March 2009 (UTC)
- If you make a template and start tagging pages, I should be able to exclude pages that use the template in the future. :-) --MZMcBride (talk) 04:15, 10 March 2009 (UTC)
Quick note. I forgot to sync this file to the Toolserver, so when it ran just recently, it was using the old query. I'm re-running it manually now to exclude the tagged category pages. --MZMcBride (talk) 09:21, 21 March 2009 (UTC)
Articles which are directly included in stub categories
I think a monthly report of articles with direct use (i.e. not using templates) of categories ending with the word "stubs" would be useful. עוד מישהו Od Mishehu 13:43, 11 March 2009 (UTC)
- This sounds difficult. Got an example page? --MZMcBride (talk) 19:14, 11 March 2009 (UTC)
- This version of Male prostitution would have been an example (directly included in Category:Sex stubs); the next one (uses a stub template) wouldn't. עוד מישהו Od Mishehu 06:18, 12 March 2009 (UTC)
- I don't think this is really possible. Detecting categories from templates versus categories directly inserted isn't possible from the database. You'd have to scan the page text looking for certain matches. --MZMcBride (talk) 03:53, 18 March 2009 (UTC)
Switched framework
I've switched nearly all of the scripts to wikitools from pywikipedia. Most of the changes were untested, so invariably something has broken. Let me know if you see anything and I'll fix it as soon as possible. Cheers. --MZMcBride (talk) 19:53, 11 March 2009 (UTC)
- Orphaned talk pages hasn't been updated in over two weeks.--Sloane (talk) 02:50, 19 March 2009 (UTC)
- Yes, that's sort of intentional. Blame ArbCom. --MZMcBride (talk) 20:15, 21 March 2009 (UTC)
Pages with test edits
There are many pages with remains of test edits, a database report would be an efficient way to detect them, maybe one for articles, and another one for non-user pages (it'll probably return a lot of talk pages, in talk:, file talk:, etc). We can start with pages containing one of those elements:
"'''Bold text'''", "''Italic text''", "[[Link title]]", [http://www.example.com link title]", "== Headline text ==", "[[File:Example.jpg]]", "[[Image:Example.jpg]]",
"[[Media:Example.ogg]]", "<math>Insert formula here</math>", "<nowiki>Insert non-formatted text here</nowiki>", "#REDIRECT [[Target page name]]", "<s>Strike-through text</s>", "<sup>Superscript text</sup>", "<sub>Subscript text</sub>", "<small>Small Text</small>", "<!-- Comment -->", "Image:Example.jpg|Caption" or "<ref>Insert footnote text here</ref>" .
Thanks, Cenarium (talk) 17:09, 25 March 2009 (UTC)
- If tags like <ref> cannot be searched, searching the text only, e.g. "Insert footnote text here" should give comparable results. Cenarium (talk) 18:11, 27 March 2009 (UTC)
- The Toolserver doesn't have any easy text access, so searching for specific text strings is only possible using database dumps. (WP:BOTREQ may have some people capable of handling those, though the dumps are currently rather old.) However, for things like links, a report can be generated though you'd likely get more up-to-date information by simply looking at places like Special:WhatLinksHere/Target page name and Special:WhatLinksHere/Image:Example.jpg. Cheers. --MZMcBride (talk) 18:24, 27 March 2009 (UTC)
Indefinitely protected talk pages, especially UT pages
^Somewhat self-explanatory. Monthly? –xeno (talk) 18:40, 30 March 2009 (UTC)
- Done. I excluded archives and set it to run weekly. --MZMcBride (talk) 22:01, 30 March 2009 (UTC)
- Thanks again =) –xeno (talk) 00:53, 31 March 2009 (UTC)
- It's probably a good idea to look at FULLY protected talk pages as well, if any exist. cheers, –xeno (talk) 13:36, 31 March 2009 (UTC)
- & perhaps exclude all subpages, or have them in a separate section? –xeno (talk) 13:43, 31 March 2009 (UTC)
- Done. It's a large data set. I didn't separate non-redirects from redirects because (a) it's a pain in the ass to do with pagination; and (b) there weren't that many redirects. It's my recommendation that a lot of these User_talk: pages be deleted. --MZMcBride (talk) 19:57, 31 March 2009 (UTC)
- My there's a lot of 'em, isn't there??? Thanks, –xeno (talk) 20:51, 31 March 2009 (UTC)
- Would it be possible to remove all subpages (not just archives) and also perhaps for the full protected ones, to zero in on fully protected talk pages of users who are actively editing? –xeno talk 15:01, 3 May 2009 (UTC)
- Hmm, gonna give this some more thought. Not sure the best way to implement this cleanly. Perhaps additional reports that differentiate between talk pages and talk subpages. Though that seems to be a bit excessive. If I exclude subpages, I want to do it only for the user space. (Maybe. Actually looking at these pages, the subpages in places like the Wikipedia: namespace are pretty boring....) So perhaps I'll just exclude all subpages and call it a day. It would also be nice if it added ?redirect=no to redirect links in the fully-protected report.... --MZMcBride (talk) 18:38, 3 May 2009 (UTC)
- Now that you mention it, the fully protected talk page redirects are pretty mundane as well (redirects from renames mostly), but I don't want to make the report too complicated =). –xeno talk 18:41, 3 May 2009 (UTC)
- Excluding redirects and subpages is easy enough. Done. In a subsequent commit I fixed the note at the top of the report, so that will be fixed the next time this report runs automatically (Thursday or so).
- Should cut out a lot of chaff. cheers mate. –xeno talk 18:53, 3 May 2009 (UTC)
Orphan subpages of AFD
There are subpages of Wikipedia:Articles for deletion that are created with strange content such as this or this. There are also AFD noms that are not transcluded, or transcluded very late (example). A list of subpages of AFD that are orphaned (orphan in project space if possible, to include the latter) and are not redirects would probably detect many of those and allow cleanup of any inappropriate content. If this is possible, monthly should be enough (as DumbBOT monitors recently created incomplete nominations). Cenarium (talk) 16:28, 4 April 2009 (UTC)
- Done. I didn't really understand your parenthetical ("orphan in project space if possible..."). Normally with over 1,000 results, I would add pagination. But since this report will likely decrease (dramatically) with subsequent runs and since there are only two columns in the table (No. and Page), I didn't see a need to paginate here. It's using external link syntax as a hack to avoid future reports screwing up the orphaned check. Cheers. --MZMcBride (talk) 09:56, 5 April 2009 (UTC)
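For reference, a rough sketch of the orphan check (AfD subpages with no incoming wikilinks); the real report additionally has to avoid counting links from the report page itself, hence the external-link hack mentioned above:
SELECT page_title
FROM page
LEFT JOIN pagelinks ON pl_namespace = 4              /* links targeting the Wikipedia: namespace */
                   AND pl_title = page_title
WHERE page_namespace = 4
  AND page_title LIKE 'Articles\_for\_deletion/%'
  AND page_is_redirect = 0
  AND pl_from IS NULL;                               /* nothing links to the subpage */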
I've just created this report. I don't deal with stubs, so I have no idea how useful it will be, but let me know if you think the page length check should be increased or decreased. Cheers. --MZMcBride (talk) 08:41, 5 April 2009 (UTC)
- I have a request... Is there any way that template data can be ignored when running this script? Synergy 22:18, 24 April 2009 (UTC)
- On second thought, ignore that. My original concern was that these pages are so large due to template information. While this should be ignored, some articles will only have one to two paragraphs and templated data (meaning there will never be any more information outside of updating templates, since a base description is all that will ever be needed). Since this can't be outright ignored, it's left up to the reviewer. /end babble /end rant Synergy 22:28, 24 April 2009 (UTC)
Hi MZMcBride, do you think it would be possible to periodically re-generate the list found at Special:MostLinkedTemplates? That list ceased to update itself a few months ago, and I found it quite useful for tracking widely used new templates so I could add them to my watchlist or check if they need protection. Since the official list isn't being regenerated at all, perhaps you could make a script that replicates its functionality at Wikipedia:Database reports instead. Thanks, --CapitalR (talk) 02:59, 8 April 2009 (UTC)
- If you wanted to provide some advanced information (which I occasionally generate myself for specific templates), you could list the most used templates and include a grid with the number of transclusions by namespace. --CapitalR (talk) 03:03, 8 April 2009 (UTC)
- Done. (The first part, at least.) The second part may or may not be doable easily. Going to require some thinking. --MZMcBride (talk) 07:45, 24 April 2009 (UTC)
- Figured out the query for the second part. Now I just need to pick a report title. Do you have any preference? "Templates with the most transclusions (detailed)"? "Top transcluded templates by namespace"? Pick a fitting title and I'll add a report. --MZMcBride (talk) 08:55, 24 April 2009 (UTC)
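For reference, a guess at the general shape of the namespace-breakdown query (not necessarily the one that was actually written):
SELECT tl_title,
       page_namespace,                /* namespace of the transcluding page */
       COUNT(*) AS transclusions
FROM templatelinks
JOIN page ON tl_from = page_id
WHERE tl_namespace = 10               /* Template: namespace targets */
GROUP BY tl_title, page_namespace
ORDER BY COUNT(*) DESC
LIMIT 1000;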
Wikipedia:Database reports/Empty categories
Just wondering if a few more items could be excluded from the Wikipedia:Database reports/Empty categories report
- _articles_with_
- _articles_without_
- _articles_needing_
-- WOSlinker (talk) 13:06, 4 May 2009 (UTC)
- Done. Running it manually now. --MZMcBride (talk) 13:13, 4 May 2009 (UTC)
Plainlinks
Would it be possible to run a report on articles using a "plainlinks" span tag? Special:Search/plainlinks is probably not exhaustive. While this formatting is useful in template namespace, I don't think it should be in article namespace. -- User:Docu
- The Toolserver does not allow regular users to have access to page text. Special:Search is probably as accurate as you're going to get without getting nasty (the internal search index updates daily). (Nasty being pulling the page text 10 pages at a time and scanning it with a script. Requires about 691,233 hits to the API.) Probably not worth it for something that's mainly cosmetic. Sorry I can't be more helpful. --MZMcBride (talk) 21:46, 4 May 2009 (UTC)
- I see, thanks for your reply. I'm still not entirely convinced that Special:Search gives an accurate result. I thought it might be easier to do this than for WP:CHECKWIKI. -- User:Docu
- 'nother idea... you could use the AbuseFilter to trigger an entry when the new page text contains "plainlinks". Over time, it would find more and more entries.... --MZMcBride (talk) 18:05, 5 May 2009 (UTC)
- There are a lot of things that could be done that way, but adding things to the filters is probably the most resource-intensive way to do it. I think they should be kept to basic stuff like page blanking etc. -- User:Docu
- Well, you can't spell "misuse" without "plainlinks." Or something. ;-) --MZMcBride (talk) 18:59, 5 May 2009 (UTC)
- (edit conflict) MZM does have a point though. Plainlinks in an article is rarely appropriate and could be used to rickroll our readers and other juvenile nonsense. –xeno talk 19:00, 5 May 2009 (UTC)
- I do agree that there is no reason to add the span in article namespace. I'm just a bit hesitant about the number of results we get compared to the number of articles scanned, but let's try it. -- User:Docu
Coordinates
Category:Templates generating hCards and Geo lists templates used by articles that should generally receive coordinates. I would be interested in knowing which of these articles don't transclude {{coord missing}} or {{coor URL}} yet. That means that they still need to be tagged with {{coord missing}}. If it helps, I can clean out the /doc subtemplates first, but it shouldn't matter, they aren't used in articles. -- User:Docu
The below should work. It should render at least Roma_Street_railway_station,_Brisbane. -- User:Docu 21:45, 5 May 2009 (UTC) - updated 21:59, 5 May 2009 (UTC)
SELECT p1.page_title, p2.page_title FROM categorylinks, page As P1, templatelinks As tl1, page AS P2
WHERE cl_to = 'Templates_generating_hCards_and_Geo'
AND cl_from = P1.page_id
AND p1.page_title = tl1.tl_title
AND tl1.tl_from = P2.page_id
AND P2.page_namespace = 0
AND P2.page_id NOT IN (SELECT tl_from FROM templatelinks WHERE tl_title = 'Coor_URL' OR tl_title = 'Coord_missing')
MySQL squawked about the capitalization of "P1" versus "p1". I ran the query with LIMIT 1 and these were the results:
mysql> SELECT p1.page_title, p2.page_title FROM categorylinks, page AS p1, templatelinks AS tl1, page AS p2
-> WHERE cl_to = 'Templates_generating_hCards_and_Geo'
-> AND cl_from = p1.page_id
-> AND p1.page_title = tl1.tl_title
-> AND tl1.tl_from = p2.page_id
-> AND p2.page_namespace = 0
-> AND p2.page_id NOT IN (SELECT tl_from FROM templatelinks WHERE tl_title = 'Coord' OR tl_title = 'Coord_missing')
-> LIMIT 1;
+--------------------+-------------------------+
| page_title | page_title |
+--------------------+-------------------------+
| Infobox_Skyscraper | 1000_Connecticut_Avenue |
+--------------------+-------------------------+
1 row in set (1 min 56.61 sec)
Seems a bit slow; I may be able to optimize the query a bit. Is this the type of output you want? And do you want an individual report (updated regularly) or do you want a one-time run (in a sandbox or something)? --MZMcBride (talk) 22:01, 5 May 2009 (UTC)
- The type of result is fine. There are neither coordinates nor {{coord missing}} in that article, and it's one that is suitable to be tagged. Ideally it would be a weekly report. I'm not quite sure how many results to expect though. -- User:Docu
To simplify processing the result, would you add the categories of the resulting articles as well? Can the query below still run on toolserver?
SELECT p2.page_title As article, p1.page_title As infobox, GROUP_CONCAT(cl2.cl_to) As categories
FROM categorylinks as cl1, page As p1, templatelinks As tl1, page AS p2, categorylinks as cl2
WHERE cl1.cl_to = 'Templates_generating_hCards_and_Geo'
AND cl1.cl_from = p1.page_id
AND p1.page_title = tl1.tl_title
AND tl1.tl_from = p2.page_id
AND p2.page_namespace = 0
AND p2.page_id NOT IN
(SELECT tl2.tl_from FROM templatelinks As tl2 WHERE tl2.tl_title = 'Coor_URL' OR tl2.tl_title = 'Coord_missing')
AND p2.page_id = cl2.cl_from
GROUP BY p2.page_id
I'd be glad if you'd set it up to run once a week on toolserver. -- User:Docu
Current progress:
Id: 1758577
User: mzmcbride
Host: nightshade.toolserver.org:52148
db: enwiki_p
Command: Query
Time: 15352
State: preparing
Info: SELECT p1.page_title, p2.page_title FROM categorylinks, page AS p1, templatelinks AS tl1, page AS p2
WHERE cl_to = 'Templates_generating_hCards_and_Geo'
AND cl_from = p1.page_id
AND p1.page_title = tl1.tl_title
AND tl1.tl_from = p2.page_id
AND p2.page_namespace = 0
AND p2.page_id NOT IN (SELECT tl_from FROM templatelinks WHERE tl_title = 'Coor_URL' OR tl_title = 'Coord_missing')
Over four hours and still running. Hmm. --MZMcBride (talk) 04:22, 6 May 2009 (UTC)
- Aie .. the one Dispenser is running once per week isn't quick either, but it's definitely worth it. BTW the second select statement has about 850,000 rows; would it be helpful to put this into a temporary table? -- User:Docu
And... this is what I get for not writing my own queries. ;-) The primary problem with your query is that it's not using an index when it selects from templatelinks. If you don't specify tl_namespace = 10, it requires scanning the entire table (about 134,616,320 rows). And doing it recursively. Ouch.
The other issue is that I'm fairly sure MySQL hates OR statements. I re-wrote the query (albeit before I saw you wanted the categories as well...).
mysql> SELECT
-> p1.page_title,
-> tl_title
-> FROM templatelinks
-> JOIN page AS p1
-> ON tl_from = p1.page_id
-> JOIN (SELECT
-> cl_from,
-> p2.page_title
-> FROM categorylinks
-> JOIN page AS p2
-> ON cl_from = p2.page_id
-> WHERE cl_to = "Templates_generating_hCards_and_Geo"
-> AND p2.page_namespace = 10) AS cltmp
-> ON cltmp.page_title = tl_title
-> WHERE tl_namespace = 10
-> AND p1.page_namespace = 0
-> AND page_id NOT IN (SELECT
-> tl_from
-> FROM templatelinks
-> WHERE tl_namespace = 10
-> AND tl_title ="Coord_missing")
-> AND page_id NOT IN (SELECT
-> tl_from
-> FROM templatelinks
-> WHERE tl_namespace = 10
-> AND tl_title ="Coor_URL")
-> LIMIT 1000;
+---------------------------------------------------------------+-----------------------------------+
| page_title | tl_title |
+---------------------------------------------------------------+-----------------------------------+
| Gijón | Infobox_City_Spain |
| Jerez_de_la_Frontera | Infobox_City_Spain |
| Almagro,_Ciudad_Real | Infobox_City_Spain |
| Bárcena_de_Pie_de_Concha | Infobox_City_Spain |
[Removed results here....]
| Shire_of_Capel | Infobox_Australian_Place |
| Shire_of_Carnamah | Infobox_Australian_Place |
| Shire_of_Carnarvon | Infobox_Australian_Place |
| Shire_of_Chapman_Valley | Infobox_Australian_Place |
| Shire_of_Collie | Infobox_Australian_Place |
| Shire_of_Cranbrook | Infobox_Australian_Place |
| Shire_of_Cue | Infobox_Australian_Place |
+---------------------------------------------------------------+-----------------------------------+
1000 rows in set (24.44 sec)
Seems to be much, much faster now. Do these results look correct?
I'm not sure how feasible this is for wiki pages. There are probably thousands and thousands of results. Do you want it in a different format (text file, CSV, etc.)? Or do you want me to throw a LIMIT on it? I should have the total number of results in a few minutes. --MZMcBride (talk) 04:57, 6 May 2009 (UTC)
- Currently there are about 156000 articles with {{coord missing}} [1]. It's easily possible that there are 10,000. CSV or text in a ZIP is fine. We would tag them all and next time there should be just new ones. -- User:Docu
- The entries from Spain and Australia are just what I was looking for. -- User:Docu
- The GROUP_CONCAT on categories - would this work? -- User:Docu
- The GROUP_CONCAT should work. I ran the query initially without it (it's gonna require me re-writing the current one); the results from the query are here: tools:~mzmcbride/docu-coord.txt (7,664 rows). After you run through the list, a database report is no problem. Just ping me when you're ready. --MZMcBride (talk) 05:34, 6 May 2009 (UTC)
- Thanks for your help, this is good news. I will have to sample the results and possibly remove one or the other infobox. I will also get some feedback from The Anome. All this is likely to take some time. For some infoboxes, the category is not needed to add the country to {{coord missing}}, but for others, it would be needed and it's easier to determine it from the category rather than from the article when adding the template. -- User:Docu
- I re-ran the query using GROUP_CONCAT. The query looks ugly, but only takes about ten minutes to run. Results are available here: tools:~mzmcbride/docu-coord2.txt.gz (7,517 results; gzipped so it won't accidentally crash your browser).
SELECT
lgtmp.page_title,
lgtmp.tl_title,
GROUP_CONCAT(ctgl.cl_to)
FROM categorylinks AS ctgl
JOIN (SELECT
p1.page_id,
p1.page_title,
tl_title
FROM templatelinks
JOIN page AS p1
ON tl_from = p1.page_id
JOIN (SELECT
cl_from,
p2.page_title
FROM categorylinks
JOIN page AS p2
ON cl_from = p2.page_id
WHERE cl_to = "Templates_generating_hCards_and_Geo"
AND p2.page_namespace = 10) AS cltmp
ON cltmp.page_title = tl_title
WHERE tl_namespace = 10
AND p1.page_namespace = 0
AND page_id NOT IN (SELECT
tl_from
FROM templatelinks
WHERE tl_namespace = 10
AND tl_title ="Coord_missing")
AND page_id NOT IN (SELECT
tl_from
FROM templatelinks
WHERE tl_namespace = 10
AND tl_title ="Coor_URL")) AS lgtmp
ON lgtmp.page_id = ctgl.cl_from
GROUP BY lgtmp.page_title;
- It's just what I was looking for. BTW I noticed that my 2nd source skips articles without categories (about 100). To add them to your last version too, the first join should be a right join. For now I already got them from the first run. -- User:Docu (6 May 2009)
SELECT
lgtmp.page_title,
lgtmp.tl_title,
GROUP_CONCAT(ctgl.cl_to)
FROM categorylinks AS ctgl
RIGHT JOIN (SELECT
p1.page_id,
p1.page_title,
tl_title
FROM templatelinks
JOIN page AS p1
ON tl_from = p1.page_id
JOIN (SELECT
cl_from,
p2.page_title
FROM categorylinks
JOIN page AS p2
ON cl_from = p2.page_id
WHERE cl_to = "Templates_with_coordinates_fields"
AND p2.page_namespace = 10) AS cltmp
ON cltmp.page_title = tl_title
WHERE tl_namespace = 10
AND p1.page_namespace = 0
AND page_id NOT IN (SELECT
tl_from
FROM templatelinks
WHERE tl_namespace = 10
AND tl_title ="Coord_missing")
AND page_id NOT IN (SELECT
tl_from
FROM templatelinks
WHERE tl_namespace = 10
AND tl_title ="Coor_URL")) AS lgtmp
ON lgtmp.page_id = ctgl.cl_from
GROUP BY lgtmp.page_title;
-- Updated version above. -- User:Docu
Number of users by range of number of edits and range of time since first edit
To help with the decisions at Wikipedia:Reviewers. Run as a one-off (maybe once per year in case there are any trends)? AndrewRT(Talk) 01:11, 9 May 2009 (UTC)
- I would suggest one year / 1000 edits for a first run. It's not a 'normal' database report, though. If we want to use the list efficiently for this purpose, it should skip admins (as they have the rights by default), and later skip reviewers, and show only users with recent edits. Cenarium (talk) 01:48, 9 May 2009 (UTC)
- Quite confused as to what exactly you want to get. There are about 9 million registered users. Most haven't ever edited. Do you want a table of User / Edit count / Date of first edit? If so, for all users? For users with above a certain edit count? Or users who have edited in the past year? (And what if they don't have any edits, but instead only have some sort of log action? Probably much easier to focus just on edits, though.) And as a side note to my side note, I should mention that when dealing with values like first edit, non-deleted edits are inaccessible to Toolserver users, so all data assumes the edit is still "live." Thus, the date of first edit may not correspond with the actual first edit, but it's usually close enough. "All users with over 1,000 edits and an edit within the past year who aren't in the bot or admins user groups"? That sound about right? --MZMcBride (talk) 04:40, 10 May 2009 (UTC)
- The aim is to have a list of users to be considered for being granted reviewer rights (and before that, for analysis), based on the edit count and the date since registration or first edit. So admins, bots and users who haven't recently edited should be skipped. The registration date is used in the FlaggedRevs autopromotion, I think we can use it too. So I'd say "Users with more than 1000 edits, registered for over a year and with at least one edit in the last month, who are not bots or admins." . Cenarium (talk) 19:50, 10 May 2009 (UTC)
The user_registration field isn't really very accurate. For example, Versageek is an admin and registered a long time ago (before me, even), yet here's her info:
mysql> SELECT * FROM user WHERE user_name="Versageek";
+---------+-----------+-------------------+----------------+
| user_id | user_name | user_registration | user_editcount |
+---------+-----------+-------------------+----------------+
| 146418 | Versageek | NULL | 19408 |
+---------+-----------+-------------------+----------------+
1 row in set (0.00 sec)
There are a lot of NULL values in the field. I'll use MIN(rev_timestamp). --MZMcBride (talk) 20:03, 10 May 2009 (UTC)
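For reference, a sketch of the stated criteria (slow as written, since it groups over the whole revision table; the actual script presumably does something smarter):
SELECT user_name,
       user_editcount,
       MIN(rev_timestamp) AS first_live_edit,
       MAX(rev_timestamp) AS last_edit
FROM user
JOIN revision ON rev_user = user_id
WHERE user_editcount > 1000
  AND user_id NOT IN (SELECT ug_user FROM user_groups
                      WHERE ug_group IN ('bot', 'sysop'))   /* skip bots and admins */
GROUP BY user_id
HAVING MIN(rev_timestamp) < DATE_FORMAT(NOW() - INTERVAL 1 YEAR, '%Y%m%d%H%i%S')   /* first surviving edit over a year ago */
   AND MAX(rev_timestamp) > DATE_FORMAT(NOW() - INTERVAL 1 MONTH, '%Y%m%d%H%i%S'); /* edited within the last month */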
- Done. I imagine you'll want to modify the parameters a bit—just let me know. I didn't set it to a schedule yet, either. I don't know how often you want it updated. Also, sample a few of the results randomly to ensure accuracy of the data. :-) --MZMcBride (talk) 01:25, 11 May 2009 (UTC)
- Great, thanks :). The list looks accurate. There won't be a need to update it regularly before the implementation goes live. Cenarium (talk) 02:08, 11 May 2009 (UTC)
"Excessively long blocks" should be "Unusually"
Yes, these blocks are unusually long, but some aren't excessive, e.g. the IP blocks of Scientology IPs per the recent Arbcom case. ⟳ausa کui × 11:31, 21 June 2009 (UTC)
- Acknowledging that I saw this. It's on my to-do list. I see four reports that need to be modified. Does that sound about right? Cheers. --MZMcBride (talk) 02:35, 23 June 2009 (UTC)
- Done. --MZMcBride (talk) 22:14, 1 July 2009 (UTC)
Templates with the most transclusions
Just wondering if it would also be possible to show the number of pages that the template is used on as well as the number of transclusions.
Would this SQL do it:
SELECT tl_title, COUNT(*) AS transclusion_count, COUNT(DISTINCT tl_from) AS page_count FROM templatelinks WHERE tl_namespace = 10 GROUP BY tl_title ORDER BY COUNT(*) DESC LIMIT 1000;
??? -- WOSlinker (talk) 20:50, 30 June 2009 (UTC)
- Hmm, not sure I understand you. If {{fact}} is used twice in an article, it's only listed once in Special:WhatLinksHere as a transclusion. --MZMcBride (talk) 20:52, 30 June 2009 (UTC)
- The reason I'm asking is due to a discussion here about Transclusion count. -- WOSlinker (talk) 20:55, 30 June 2009 (UTC)
- Is it also counted once on Wikipedia:Database reports/Templates with the most transclusions? — Edokter • Talk • 21:08, 30 June 2009 (UTC)
- As far as I'm aware. The query used is available at Wikipedia:Database reports/Templates with the most transclusions/Configuration. I was going to run a proper test, but apparently the Toolserver's testwiki_p isn't replicating properly. --MZMcBride (talk) 21:38, 30 June 2009 (UTC)
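As a rough substitute for that test, here is a sketch of a query that checks whether templatelinks ever holds more than one row per page/template pair (assuming the standard templatelinks columns); if it returns nothing, COUNT(*) and COUNT(DISTINCT tl_from) in the query proposed above will always agree:
SELECT
  tl_from,
  tl_title,
  COUNT(*) AS rows_per_pair /* should never exceed 1 if a transclusion is stored once per page */
FROM templatelinks
WHERE tl_namespace = 10
GROUP BY tl_from, tl_title
HAVING COUNT(*) > 1
LIMIT 10;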
Protection-related reports
Not sure if anyone has noticed, but the protection-related reports have been broken for some time. The query they were using stopped working; any instance of it hung indefinitely. Probably a MySQL bug or something. Anyway, with this commit I've updated all six reports to pull their data from the API instead. Preliminary tests indicate that the scripts work, but not all of them have been tested. Things like missing titles may break the reports entirely; I didn't have a chance to fully stress-test.
The reports will update at the beginning of Thursday UTC. If you notice any strangeness (or a lack of updating), please shout here or on my talk page. Cheers. --MZMcBride (talk) 08:24, 14 July 2009 (UTC)
Toolserver data corruption
On July 2, the Toolserver had a power outage causing data corruption in the S1 server (which holds en.wiki's database and a copy of Commons' database). This is a documented breakage and there's currently no estimated time for a re-import of the databases. Because there is data corruption, some reports will be inaccurate. Please act accordingly. Apologies in advance for the inconvenience. --MZMcBride (talk) 14:36, 14 July 2009 (UTC)
- oFF WITH 'IS HEAD ! –xenotalk 14:42, 14 July 2009 (UTC)
- S1 is being re-imported right now. --MZMcBride (talk) 08:23, 1 August 2009 (UTC)
- Yay! --Closedmouth (talk) 15:54, 1 August 2009 (UTC)
Resurrecting the non-free image abuse report
This may exist already. This report used to trawl the database and report any article which contained ten or more non-free images - it looked like this. Run monthly? Black Kite 08:02, 22 July 2009 (UTC)
- Should be doable. Is there a reason to limit to articles? I'd also like to see a report like this one run regularly. So perhaps two reports: (1) "Pages containing an unusually high number of non-free files" and (2) "Non-free files used unusually frequently", or something? The report titles could use work. Your thoughts? --MZMcBride (talk) 19:48, 22 July 2009 (UTC)
- Yup, that'd be perfect. No reason to limit to articles, although might be worth having a separate one for non-mainspace pages because there shouldn't be any non-free stuff there at all. Black Kite 09:12, 1 August 2009 (UTC)
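A minimal sketch of what the first report's query might look like, assuming the standard imagelinks/categorylinks schema and the Category:All non-free media tracking category used elsewhere on this page; a further join against page on il_from would be needed to split mainspace from non-mainspace pages:
SELECT
  il_from, /* page_id of the page using the files */
  COUNT(*) AS nonfree_files
FROM imagelinks
JOIN page
  ON page_namespace = 6
  AND page_title = il_to
JOIN categorylinks
  ON cl_from = page_id
  AND cl_to = 'All_non-free_media'
GROUP BY il_from
HAVING COUNT(*) >= 10
ORDER BY COUNT(*) DESC;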
Link format
I'd love to run AWB over these reports but it doesn't see the links, just templates. Please fix this or create a raw toolserver tool to use.--Ipatrol (talk) 19:33, 22 July 2009 (UTC)
- I have no idea what you're talking about. --MZMcBride (talk) 19:34, 22 July 2009 (UTC)
- I was unable to replicate this, using "Links on page" brings up an appropriate worklist. Please elucidate. –xenotalk 23:11, 18 August 2009 (UTC)
I know that technically it's not related to this page, but could you update Wikipedia:WikiProject Stub sorting/Stub type sizes/data on a monthly basis? עוד מישהו Od Mishehu 08:25, 17 August 2009 (UTC)
- Technically you want my talk page, but I clearly watch here too. Should be pretty trivial to update. Did you want a sortable chart or the same format it uses now? And, what are the exact criteria for that list? Do you just look at all categories ending in "stubs" or is there a master category somewhere? --MZMcBride (talk) 08:45, 17 August 2009 (UTC)
- The user who was originally responsible for that list, Alai, is no longer active, so I can only guess what his intentions were. It would appear, based on this revision that:
- He didn't only use the categories ending with the word "stubs", since Category:Stubs by region is included on the list.
- The fact that Category:Mixed Drinks stubs appeared on the list then (it was only a category redirect + categorization in Category:WikiProject Mixed Drinks redirects at the time) shows that he didn't only go by some master category. The only way he could have found this was through looking for categories ending with the word "stubs".
- I think, based on this, that the list should be the content of Category:Stub categories (until January this year, Category:Stubs by region was in it), and any category ending with the word "stubs".
- As to the format, the current one is good enough, although I personally would prefer (Alai didn't do this) that if two categories have the same number of articles, the one with fewer subcategories goes first. (I have no way to check now whether these numbers reflect articles or all non-category pages; I think articles is the correct one.) Please also note that the first two sections only apply to categories with no subcategories.
- עוד מישהו Od Mishehu 12:14, 17 August 2009 (UTC)
- After looking at the contents of Category:Stub categories, I thought it best to just include every category that contains "Stubs" or "stubs" (and that has a corresponding category description page). The report is finished, but it's pretty large (over 700,000 bytes). Do you want to exclude any data or do you want pagination? Are there any other problems with the report that need to be addressed? I won't set up a cronjob until I know you're satisfied with the output. --MZMcBride (talk) 00:11, 20 August 2009 (UTC)
- Looks good. I can't say why Alai did it on a single page, but it means that manually updating specific cases where I know there was a major change (such as when creating a new stub category) is easier. עוד מישהו Od Mishehu 07:45, 20 August 2009 (UTC)
- Set to monthly. For the future, it may be nice to use subpages with transclusion. We'll revisit that when this report gets too big to post to a wiki page, though. --MZMcBride (talk) 23:04, 22 August 2009 (UTC)
Mini stubs that have not been edited since creation
Per this discussion, is it possible to have a trawl through the database to find how many stubs of less than 250 bytes have not been edited in the last twelve months? It may help the discussion to see how many there are, and to have a list so a few examples could be reviewed. If this is not the correct venue, could some kind soul advise me what is? Cheers, LessHeard vanU (talk) 20:40, 21 August 2009 (UTC)
- Sounds like you want a list of rows in the page table WHERE page_namespace = 0 (articles) AND page_len < 250 (less than 250 bytes) AND page_is_redirect = 0 (not a redirect) AND SELECT MAX(rev_timestamp) < (Today - 1 year), and that use a category ending in "stubs"? Is there a better way to identify stubs besides looking at the category membership? Or, perhaps I don't need to look for stubs at all if you just want short pages. Thoughts? --MZMcBride (talk) 20:45, 21 August 2009 (UTC)
- Oh, one more minor point. Your header mentions "not edited since creation" but you don't seem to discuss this. You just want pages not recently edited, right? Otherwise, there are two ways to do "not edited since creation": (1) only list pages with 1 revision (bound to miss a lot of pages); or (2) only list pages with a single author (will still miss some pages, but not as many). I think you just want "no edits in the past year," but clarification would be nice. --MZMcBride (talk) 20:48, 21 August 2009 (UTC)
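A minimal sketch of those criteria (the "no edits in the past year" reading), assuming the standard page/revision tables; the stub-category and disambiguation checks are left out here:
SELECT
  page_namespace,
  page_title,
  page_len
FROM page
WHERE page_namespace = 0
  AND page_is_redirect = 0
  AND page_len < 250
  AND NOT EXISTS (SELECT
    1
    FROM revision
    WHERE rev_page = page_id
    AND rev_timestamp > DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 1 YEAR), '%Y%m%d%H%i%s')); /* no edits in the past year */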
- Is it easier to crawl stub categories or simply article categories? I suppose some articles may not have been templated as stubs, so all articles under 250 bytes would be fine. What I am looking for is the type of mini stub that was created, may have had a bot place a template or had a minor edit, but has languished unloved since. Are there parameters that would allow listing articles that have only been edited in the first month since creation, and not in the next 11+ months? I am thinking that any article of less than 250 bytes will not have had major edits anyhow (except an article reduced to stub size... perhaps a parameter is that it never exceeded 250 bytes). Whilst I am wishing for the moon, is there a chance of listing the total views in the past 12 months, another indicator of potential worth? Thanks, LessHeard vanU (talk) 20:59, 21 August 2009 (UTC)
- Here are the initial results: tools:~mzmcbride/lessheard.txt. That's all non-redirect pages in NS:0 that are less than 250 bytes and haven't been edited in the past year. I excluded disambiguation pages because they were causing far too many false positives. I can do further queries for things like "never been over 250 bytes," but given the small data set from this query, I'm not sure further queries are really necessary. Though, if you really want the data, it should be possible at some point.
I'm not sure what the actual objective of this list was / is; if that's made more clear, perhaps I can find another way to get the data you're after. 1500-ish random pages doesn't seem to be your primary goal. --MZMcBride (talk) 09:00, 22 August 2009 (UTC)
- That is excellent, thank you very much. I am off to badger some people over at the Village pump about the necessity of reviewing and weeding some of these. I presume the list will remain until no longer needed? If there are space considerations I will gladly host it on a sub page of mine. Again, thanks. LessHeard vanU (talk) 16:34, 22 August 2009 (UTC)
- Saved at User:LessHeard vanU/Mini stubs with some formatting. --MZMcBride (talk) 16:44, 22 August 2009 (UTC)
- The motive behind this (as I understand it) is to identify low-value articles that may actually benefit from being red-linked (both to not leave the reader wanting and to restore the encouragement that comes with redlinks - to provide a useful, substantive article, rather than a sub/stub...). –xeno talk 18:36, 22 August 2009 (UTC)
Pages with links to redirects to themselves
I'd like a monthly report on all pages which fit the following criteria:
- Some link on the page (including a navigation template on the page) is to a redirect back to the page itself.
- The page doesn't have the {{Documentation}}, {{shortcut}}, {{Shortcut-l}}, {{Policy shortcut}}, {{Ombox/Shortcut}} or {{Template shortcut}} templates. The first transcludes documentation which frequently links to redirects, and the others are explicitly intended for shortcuts, which are implemented using redirects. עוד מישהו Od Mishehu 08:38, 27 August 2009 (UTC)
- This was a nifty one. Took a few hours to get it to work completely correctly, but I seem to have it now. Do you want all pages or only pages in a specific namespace? For example, most of the templates you mention would never be used in the article namespace, and I imagine most of this work would focus on articles? (Or maybe you only want template pages?) From really poor extrapolation, it looks like the number of results for all namespaces could be around 1,000,000, though I'm querying right now for the exact figure. Regardless, this data set clearly won't fit on a wiki page reasonably. How would you like the data? Would a text file work? --MZMcBride (talk) 18:23, 28 August 2009 (UTC)
- Follow-up: Looks like there are about 111,000 results for all namespaces (and without any template use restrictions). To my knowledge, that includes multiple entries for a particular page, so if Page1 contains links to Page2 and Page3, both of which are redirects back to Page1, that would count as two entries. --MZMcBride (talk) 18:10, 29 August 2009 (UTC)
- I believe that most pages in the Wikipedia: namespace have links to shortcuts (redirects) on themselves, and several Template: pages (such as {{uncategorized}}) list redirects to the templates. עוד מישהו Od Mishehu 10:06, 30 August 2009 (UTC)
- The raw data is available here: tools:~mzmcbride/page-redirects-zip.txt.gz. I'm not really sure how you want me to proceed. As I noted, it's over 100,000 results, which obviously can't be placed on wiki pages. Some further refining could be done, based on this list (checking page_ids against page_ids that use certain templates, for example). --MZMcBride (talk) 23:41, 30 August 2009 (UTC)
Query used:
SELECT
pg5.page_id, pg5.page_namespace, pg5.page_title, /* page id, namespace, and title */
pg6.page_namespace, pg6.page_title /* redirect namespace and title */
FROM page AS pg6
JOIN (SELECT
*
FROM page AS pg3
JOIN (SELECT
pl_from,
rd_namespace,
rd_title,
rd_from
FROM page AS pg1
JOIN (SELECT
pl_from,
page_id
FROM pagelinks
JOIN page AS pg4
ON pl_namespace = pg4.page_namespace
AND pl_title = pg4.page_title) AS pgl1
ON pgl1.page_id = pg1.page_id
JOIN redirect
ON rd_from = pg1.page_id) AS pg2
ON pg2.pl_from = pg3.page_id
WHERE rd_title = page_title
AND rd_namespace = page_namespace) AS pg5
ON pg5.rd_from = pg6.page_id;
Almost identical to my Self redirects request; the only problem with the queries is that the section is not included in the database, so some of these might be valid redirects to sections. — Dispenser 07:02, 1 September 2009 (UTC)
- Looking down that list, I found the following major groups which should be excluded:
- Odd number namespaces (talk pages).
- Pages in the Wikipedia: namespace - these are probably due to the shortcuts listed on the pages. I figured this much before making the request, which is why I explicitly wanted to exclude pages with various shortcut templates. This should also reduce the number of pages from the User: and Category: namespaces.
- Pages in the Template: namespace - these are probably in the documentation. If the documentation is on the page using the {{documentation}} template, then everything is okay (I had this on the list of exclusion templates). If not, then this is a handle for us to find templates with documentation on the same page.
- Foreign relations - Here, the articles are "Foreign relations of X", and the redirects are "X-Y relations". I think these should also be excluded.
- Once these groups are excluded, how much better is the situation? עוד מישהו Od Mishehu 12:05, 3 September 2009 (UTC)
- By excluding pages in namespaces 1,3,4,5,7,9,10,11,13,15,101, there are now approximately 97,000 results: tools:~mzmcbride/page-redirects-zip-2.txt.gz. --MZMcBride (talk) 07:15, 10 September 2009 (UTC)
Templates that link to redirects
Per WP:R2D ("exceptions"), templates such as navboxes shouldn't link to redirects, as it can create self-referential linking similar to that described in the section above. A list of templates that link to redirects would be helpful for fixing these (maybe weekly or biweekly?) –Drilnoth (T • C • L) 20:58, 31 August 2009 (UTC) My bad, I see that this was within the scope of the last requests. Having a separate list for this would be helpful, though. –Drilnoth (T • C • L) 21:04, 31 August 2009 (UTC)
- Yes, somewhat similar to the above request. One issue to consider is presentation. If a particular template has multiple redirects in it, I suppose the best option is to use GROUP_CONCAT() which would put all the links in a comma-delimited format with one row per template. Or, you could just have multiple rows per template. Which would you prefer? --MZMcBride (talk) 21:07, 31 August 2009 (UTC)
- The former; one would go through the list one template at a time, not one redirect at a time. –Drilnoth (T • C • L) 21:12, 31 August 2009 (UTC)
- Another point of clarification: the above request was looking for pages where the page contained a link to a redirect and the redirect pointed back to the original page. For the report you requested in this section, you wouldn't care about the target of the redirect, right? You're looking for templates that simply contain links to redirects. I think you're going to have an extraordinarily high number of false positives if you do this. Perhaps you want to focus on a specific subset of templates, like only navboxes? (Is there a tracking category or something for navboxes?) For example, Template:Plnr contains a redirect link (WP:DBR). You wouldn't want that listed in a report, I think. So perhaps another check would be for links that are in namespace 0 only? Not sure how valuable this report will be given the possible high level of noise.... --MZMcBride (talk) 01:45, 1 September 2009 (UTC)
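If the navbox-only restriction were dropped, a sketch of the GROUP_CONCAT() approach might look like this (assuming the standard pagelinks/page schema, and limiting redirect targets to namespace 0 as suggested above):
SELECT
  tmpl.page_title AS template,
  GROUP_CONCAT(rd.page_title SEPARATOR ', ') AS redirects_linked /* one row per template */
FROM pagelinks
JOIN page AS tmpl
  ON tmpl.page_id = pl_from
  AND tmpl.page_namespace = 10
JOIN page AS rd
  ON rd.page_namespace = pl_namespace
  AND rd.page_title = pl_title
  AND rd.page_is_redirect = 1
WHERE pl_namespace = 0
GROUP BY tmpl.page_id
LIMIT 100;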
Files without templates
If a file doesn't have a template on it, it usually doesn't have a license. Having a list of these (perhaps weekly) would greatly help in tracking down copyright violations. –Drilnoth (T • C • L) 20:59, 31 August 2009 (UTC)
- This one is sort of on my to-do list (it got postponed due to database issues). My idea was to find all files that have no categories or templates and don't exist on Commons. Do you think two reports would be better? That is, "Files containing no templates" and "Files containing no templates or categories"? --MZMcBride (talk) 21:03, 31 August 2009 (UTC)
- Having both of those would work well. –Drilnoth (T • C • L) 21:05, 31 August 2009 (UTC)
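A minimal sketch of the stricter of the two reports ("Files containing no templates or categories"), assuming the commonswiki_p database is reachable as in the queries further down this page; dropping the categorylinks check gives the other report:
SELECT
  pg.page_title
FROM page AS pg
WHERE pg.page_namespace = 6
  AND pg.page_is_redirect = 0
  AND NOT EXISTS (SELECT
    1
    FROM templatelinks
    WHERE tl_from = pg.page_id)
  AND NOT EXISTS (SELECT
    1
    FROM categorylinks
    WHERE cl_from = pg.page_id)
  AND NOT EXISTS (SELECT
    1
    FROM commonswiki_p.page AS cw /* skip files that exist on Commons */
    WHERE cw.page_title = pg.page_title
    AND cw.page_namespace = 6);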
Files that have little content
Files which consist only of headlines and templates, other than {{Information}} and the various templates in Category:Non-free use rationale templates, usually lack essential information such as a source. Having a list of these (maybe biweekly?) would be helpful in tracking down copyvios. –Drilnoth (T • C • L) 21:01, 31 August 2009 (UTC)
- tools:~erwin85/shortpages.php may be better for this as it's dynamic.... --MZMcBride (talk) 21:04, 31 August 2009 (UTC)
- Ah, thanks for the link. I hadn't known about that; it will be most helpful. Thank you! –Drilnoth (T • C • L) 21:06, 31 August 2009 (UTC)
- Giving this a bit more thought, I think there's some virtue to creating a true database report. (1) You can't run things like AWB on a dynamic list from the Toolserver; and (2) the shortpages tool doesn't factor in Commons existence, which may have a substantial impact. I would like to see this request clarified a bit, though. Right now as written, I would only pull pages in namespace 6 that aren't redirects and are simply short (page_len < 50 or something). Is there anything more specific that you're looking for that wouldn't be covered by another section on this page (e.g., pages that don't use templates or categories would be covered by another section on this page)? --MZMcBride (talk) 01:14, 1 September 2009 (UTC)
- Many images contain a headline like == Licensing: == and a copyright template, with no source or other information. That's why in my original request I mentioned that it would list only those which contain only headers and templates (no non-template, non-headline text), but counting {{Information}} and the like as plain (non-template) text. This is more of a source-related issue than a license-related one... whereas categorized images and templated images often have licenses, they can lack source information, which would be what this scan would pick up. I'm not sure if that helps. –Drilnoth (T • C • L) 20:00, 1 September 2009 (UTC)
- I'm not sure a database report is exactly the right solution for you here. It sounds like you want a bot to go through file description pages and use some sort of regex to look for specific missing info (like a lack of source information). Regular Toolserver users don't have direct page text access; you'd likely need to query the API for each page and check the current page text. All of this is of course possible, but probably falls outside the scope of Database reports. BOTREQ might be a good resource. That said, if you wanted, for example, all pages that are a particular page length (a header and a template have a specific page length that seems fairly predictable, like page_len < 500 bytes) and only have X revisions or only have a single author, list generation like that is fairly trivial. (Just as an example, you could get a list of all files where the page_len < 250 and that only have a single revision and are non-free. A small page length like that would generally indicate not enough source information, and the one-revision check would ensure that you're not dealing with any file description pages that have been vandalized.) --MZMcBride (talk) 20:21, 1 September 2009 (UTC)
- Ah, okay; I hadn't been aware of the page-text restriction. In that case, the toolserver page that you linked to and a "no-category no-template" list should be more than enough. Thanks! –Drilnoth (T • C • L) 00:39, 2 September 2009 (UTC)
- Done (see section above). --MZMcBride (talk) 03:34, 10 September 2009 (UTC)
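For what it's worth, a sketch of the "short, single-revision, non-free" example mentioned above (assuming the 250-byte threshold and the Category:All non-free media tracking category; both are illustrative values rather than settled parameters):
SELECT
  pg.page_title
FROM page AS pg
JOIN categorylinks
  ON cl_from = pg.page_id
  AND cl_to = 'All_non-free_media'
WHERE pg.page_namespace = 6
  AND pg.page_is_redirect = 0
  AND pg.page_len < 250
  AND (SELECT
    COUNT(*)
    FROM revision
    WHERE rev_page = pg.page_id) = 1; /* only one revision, i.e. untouched since upload */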
Self-made images without metadata
The lack of metadata in an image is a common sign of copyvios. Any image that is tagged as being available under one of the "self" free licenses, like {{PD-self}}, {{PD-user}}, {{GFDL-self}}, or just the plain old {{Self}}, and which has no metadata, would make a useful list for finding copyvios. Perhaps biweekly? –Drilnoth (T • C • L) 21:03, 31 August 2009 (UTC)
- I believe there's a broad category for all self-made works. That would be much easier to work with rather than individual templates (and all of their redirects). Can you find out if such a category exists? A lack of metadata should be an easy thing to check for, though I'll need to examine the image table a bit further.... --MZMcBride (talk) 21:10, 31 August 2009 (UTC)
- I forgot about the category! Category:Self-published work. –Drilnoth (T • C • L) 21:11, 31 August 2009 (UTC)
- I'm a bit hesitant about this report for a few reasons. (1) I'm not sure there's a strong correlation between a lack of metadata and a copyright violation (though, admittedly, files are not my strong area); (2) there are potentially thousands of results, which means I'll likely have to place an arbitrary limit on the output of the report (which isn't necessarily a bad thing, but...); (3) there's no real way to mark files as "checked," so I worry about a lot of duplicate effort. I ran the query with a low limit (30 results). I'd like you to take a look at the results and tell me how helpful they are. If you think there should be a report, I'd be more than happy to set one up. --MZMcBride (talk) 01:39, 1 September 2009 (UTC)
SELECT
CONCAT('# [[:File:',page_title,']]')
FROM page
JOIN image
ON img_name = page_title
JOIN categorylinks
ON cl_from = page_id
WHERE cl_to = 'Self-published_work'
AND page_namespace = 6
AND page_is_redirect = 0
AND img_metadata = '0'
LIMIT 30;
+-------------------------------------------------------------------------------------------------------------------------+
| CONCAT('# File:',page_title,'') |
+-------------------------------------------------------------------------------------------------------------------------+
| # File:"Bebop-rebop"_early_bop_phrase.png |
| # File:"Riven_Rock"_wine_label_by_Alden_Marin.jpg |
| # File:'Standing_Youth',_cast_stone_sculpture_by_Wilhelm_Lehmbruck,_1913,_Museum_of_Modern_Art_(New_York_City).jpg |
| # File:(Az)Xeus.jpg |
| # File:-Logo_of_the_15_helicopter_battalion_and_battalion_of_air_transport.jpg |
| # File:-model-baxe.jpg |
| # File:001front.jpg |
| # File:009w.jpg |
| # File:01201_1t.jpg |
| # File:014w.jpg |
| # File:016w.jpg |
| # File:018w.jpg |
| # File:020202.JPG |
| # File:021w.jpg |
| # File:02210506l.jpg |
| # File:023_23.jpg |
| # File:023w.jpg |
| # File:024w.jpg |
| # File:03_heritage_monument.JPG |
| # File:033_33.jpg |
| # File:035_35.jpg |
| # File:036_36.jpg |
| # File:037_37.jpg |
| # File:04_Jan_2003_(11).jpg |
| # File:048_48.jpg |
| # File:049_49.jpg |
| # File:05_Flag_Presentation.jpg |
| # File:050_50.jpg |
| # File:051206d.JPG |
| # File:05deciraq2.gif |
+-------------------------------------------------------------------------------------------------------------------------+
30 rows in set (0.01 sec)
Thanks for running this. I'll take a look through them in the next couple days and see if I can refine the request. Thanks! –Drilnoth (T • C • L) 20:00, 1 September 2009 (UTC)
- Yes, the problem is that even a lot of self-made work doesn't have metadata, because people have digitally enhanced or cropped the work, or converted it to a different file format. I've just looked at my last five uploads and none of them have metadata because I've cropped and retouched them. A thought, though - a lot of copyvio uploads are of living people that people just take off the net - I wonder if a report based on the following criteria might be useful: a list of files that have no metadata, are templated as self-published, but are used in an article in Category:Living People? Black Kite 20:52, 1 September 2009 (UTC)
That's a pretty inspired idea. Assuming I wrote this query correctly, some sample data is provided below. Helpful? Unhelpful? --MZMcBride (talk) 23:00, 1 September 2009 (UTC)
+--------------------------------------------------------------------------------------------------------------------------+
| CONCAT('',pg.page_title,' | File:',il_to,'') |
+--------------------------------------------------------------------------------------------------------------------------+
| Alden_Marin | File:"Riven_Rock"_wine_label_by_Alden_Marin.jpg |
| Chong_Chul_Rhee | File:070616-Janggeom-Master.jpg |
| Huang_Yong_Ping | File:100_Arme_der_Guan-yin.jpg |
| Sui_Jianguo | File:100sj.jpg |
| Peter_Dobbie | File:10728-225-017t.jpg |
| Brian_Chase | File:1161-02chasemisterka14.jpg |
| Bruno_Wizard | File:1178-06homo09.jpg |
| Lisle_Ellis | File:1255-04_lisle13.jpg |
| Jack_Dishel | File:1342moldy09.jpg |
| Jan_Inge_Reilstad | File:149px-JanInge.jpg |
| Giorgio_Gomelsky | File:1528_giorgio08_crop.jpg |
| Chris_Caffery | File:16CC16a_CC_s.jpg |
| András_Petöcz | File:19-gurul_500.jpg |
| Wray_Carlton | File:1965WrayCarlton.png |
| Sonny_Black | File:1978-m07-d00_-_Bracknell_Folk_Festival_-_Bill_Boazman.jpg |
| Dave_Concepción | File:198009-012WillieMcCoveyLastCandlestick.gif |
| Willie_McCovey | File:198009-012WillieMcCoveyLastCandlestick.gif |
| Fawzi_Al_Shammari | File:19810_W400XH600.jpg |
| Maggie_Holland | File:1983-m08-d07_-_Goodwood_Folk_Festival_-_English_Country_Blues_Band_(Maggie_Holland).jpg |
| Boris_Dragojevic | File:1993.Leute_Moj..._47x37cm.jpg |
| George_W._Della,_Jr. | File:1gdella.jpg |
| D._Page_Elmore | File:1pelmore.jpg |
| Perry_McCarthy | File:2003_pq_paddocks-perry-mccarthy.jpg |
| Sean_Gabb | File:2003-08-sig-sweden.jpg |
| Fred_Wedlock | File:2003-m06-d21_(11)_-_Somerset_-_Fred_Wedlock_(Allerford_M-8bd.jpg |
| Mia_Michaels | File:2005_Mia-Michaels_face.jpg |
| Horacio_Lavandera | File:20050731elpepiesp_1_I_SCO.jpg |
| Bob_Chiarelli | File:2006_Chiarelli_vote.png |
| Jonathan_Roberts_(dancer) | File:2006_Jonathan-Roberts_face.jpg |
| Alex_Munter | File:2006_Munter_vote.png |
| Larry_O'Brien_(Canadian_politician) | File:2006_O'Brien_vote.png |
| Anna_Trebunskaya | File:2007_Anna-Trebunskaya_face.jpg |
| Ted_Vasin | File:2007_ted_vasin_the_i_behind_this_i_M.jpg |
| Jaume_Plensa | File:20070621_Crown_Fountain_Water.JPG |
| Piyaporn_Deejing | File:2008thailand.JPG |
| Pleumjit_Thinkaow | File:200px-PJ0@DOHA.jpg |
| Toby_Goodshank | File:212bday01.jpg |
| Hector_Luis_Bustamante | File:249657869_l-1-.jpg |
| Csaba_Markus | File:250px-Csaba_markus_965.jpg |
| Tim_Ellis | File:2645.jpg |
+--------------------------------------------------------------------------------------------------------------------------+
40 rows in set (5.61 sec)
- Only one dodgy image, sadly. You can usually work them out from the filenames. There's probably a more efficient way of doing this, but I'll have to have a think about it. Thanks anyway :) Black Kite 00:57, 2 September 2009 (UTC)
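The query behind that sample wasn't posted; a sketch that should produce output of the same shape, assuming Category:Living people and reusing the img_metadata = '0' check from the earlier query, might be:
SELECT
  pg.page_title, /* article about a living person */
  il_to /* self-published file without metadata used in it */
FROM imagelinks
JOIN page AS pg
  ON pg.page_id = il_from
  AND pg.page_namespace = 0
JOIN categorylinks AS cl1
  ON cl1.cl_from = pg.page_id
  AND cl1.cl_to = 'Living_people'
JOIN image
  ON img_name = il_to
  AND img_metadata = '0'
JOIN page AS pgf
  ON pgf.page_title = il_to
  AND pgf.page_namespace = 6
JOIN categorylinks AS cl2
  ON cl2.cl_from = pgf.page_id
  AND cl2.cl_to = 'Self-published_work'
LIMIT 40;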
Images again
The above ones gave me an idea: what about files that don't contain files, if that makes sense? I.e., where some user has created a file:example.jpg page but didn't upload a file to it, so it just contains text or something. Possible? Or would it just list every file that's been moved to Commons?--Jac16888Talk 21:12, 31 August 2009 (UTC)
- You're talking about "File description pages without an associated file." Should be easy enough to query for (off-hand it looks like there are about 1,000 such cases currently). I also considered the inverse of such a list, that is "Files without an associated file description page." I ran that query without any limits and it produced the results available below. I'm not sure if it's possible to delete files that don't have a corresponding file description page (you could possibly try to create a page for them and re-delete as a workaround hack). However, there aren't really enough to justify a separate report. I'll create the "File description pages without an associated file" report probably tomorrow (bit too drained today). Cheers. --MZMcBride (talk) 01:06, 1 September 2009 (UTC)
- That's a much better title :). No rush, I know you're otherwise engaged. Cheers--Jac16888Talk 19:59, 1 September 2009 (UTC)
- Files without an associated file description page
SELECT
img_name
FROM image
WHERE NOT EXISTS (SELECT
page_title
FROM page
WHERE img_name = page_title
AND page_namespace = 6)
AND NOT EXISTS (SELECT
page_title
FROM commonswiki_p.page
WHERE img_name = page_title
AND page_namespace = 6);
+--------------------------------------------------------+
| img_name |
+--------------------------------------------------------+
| WLVR_logo.jpg |
| 2005_jorge_ramos_new_3.jpg |
| Lunar_eclipse_chart_close-1944Dec29.png |
| Nas\x7e\x7e\x7e\x7e\x7e\x7e\x7e\x7e_bridgingt_101b.jpg |
| Ruben-studdard-love-is-album-cover.jpg |
| Hebb_bobby\x7e_sunny\x7e\x7e\x7e\x7e_101b.jpg |
| Church_2.jpg |
| St._Theresa_of_Lisieux_Church_2.jpg |
| AK-74,_Modernized.png |
| Lunar_eclipse_from_moon-2053Aug29.png |
| Lunar_eclipse_from_moon-1901May03.png |
| Lunar_eclipse_from_moon-1948Oct18.png |
| Birthplace_of_the_Computer.jpg |
| Lunar_eclipse_from_moon-1908Dec07.png |
| Destination_x.jpg |
| Mike_ford.jpg |
+--------------------------------------------------------+
16 rows in set (16.36 sec)
Done: Wikipedia:Database reports/File description pages without an associated file. --MZMcBride (talk) 16:08, 4 September 2009 (UTC)
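For the report itself, a minimal sketch of the inverse query (assuming the Commons image table is reachable as commonswiki_p.image, mirroring the commonswiki_p.page reference above, so that description pages for files hosted on Commons are skipped):
SELECT
  pg.page_title
FROM page AS pg
WHERE pg.page_namespace = 6
  AND pg.page_is_redirect = 0
  AND NOT EXISTS (SELECT
    1
    FROM image
    WHERE img_name = pg.page_title)
  AND NOT EXISTS (SELECT
    1
    FROM commonswiki_p.image AS ci /* no local file and no Commons file */
    WHERE ci.img_name = pg.page_title);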
Autoconfirmed users without 10 edits in content namespaces
Hi. Having sometimes seen disruptive users who could edit semi-protected articles and do other things restricted to autoconfirmed users thanks to edits in the talk or user namespaces, I was wondering if requiring 10 content edits for autoconfirmation (content edits being defined as those in the main, category, template, or portal namespaces) could be useful or would generate too much collateral damage. So it would be useful to have a representative subset of users who wouldn't have been autoconfirmed if we required content edits. A one-time report of autoconfirmed users without 10 edits in content namespaces, including blocked users, would be useful. Thanks, Cenarium (talk) 01:22, 3 September 2009 (UTC)
- Running this query now. It's a fairly slow query; I'm not sure if it will need to be further optimized or not (it may take so long that it will be killed or need to be aborted). We'll see. --MZMcBride (talk) 07:21, 10 September 2009 (UTC)
- It never finished and I ended up killing it. There are about 640,000 autoconfirmed users on the English Wikipedia by my count. Still trying to figure out the best way to get an accurate report for you.... --MZMcBride (talk) 05:01, 15 September 2009 (UTC)
- Trying a new query now. Hopefully it should finish in about 24 hours. --MZMcBride (talk) 21:34, 17 September 2009 (UTC)
- Okay, so here's the deal. The revision table for enwiki_p is enormous, making any queries really slow and expensive. I pulled the user_name, user_editcount, and count(article, template, category, and portal) info for all users who have a user_editcount greater than 10 (half the autoconfirmed requirement). This was 633,494 users. Then I ran the data through a script and printed all the rows where the count(article, template, category, and portal) was less than 11. This came out to 141,814 users. In four days from now, all of these users will be autoconfirmed for sure. Right now, it's a bit difficult to guess how many are autoconfirmed, though I imagine that a large percentage of these users were not created in the past four days. I imagine most accounts hit autoconfirmed awhile ago. The 141,814 rows are available here: tools:~mzmcbride/cen-log.txt and tools:~mzmcbride/cen-log-zip.txt.gz (careful when opening large text files in a browser). I may try to refine the data a bit more.... --MZMcBride (talk) 04:20, 19 September 2009 (UTC)
- Thank you for your work on this; this is useful data. Since the namespace counts don't include deleted edits, the list includes more users than would actually be affected, but I'm not sure it's feasible to refine the data; it's not especially important anyway... Maybe it would be more manageable if we had an explicit autoconfirmed userright. I'll look into this more closely in the coming days. Cenarium (talk) 04:40, 20 September 2009 (UTC)
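For reference, a rough sketch of the per-namespace counting step described above (assuming the 2009-era revision schema with rev_user; the "fewer than 11 content edits" filter was applied afterward in a script, since this query only returns users with at least one such edit):
SELECT
  user_name,
  user_editcount,
  COUNT(*) AS content_edits /* edits in the main, template, category, and portal namespaces */
FROM user
JOIN revision
  ON rev_user = user_id
JOIN page
  ON page_id = rev_page
WHERE user_editcount > 10
  AND page_namespace IN (0, 10, 14, 100)
GROUP BY user_id;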
Fair use images with multiple revisions
Basically, a list of all files in Category:All non-free media which: A) Have more than one image revision (e.g., a resizing or other edit made to the image itself, not to the description page), B) do not transclude {{Non-free reduce}}, and C) are not in Category:All orphaned non-free use Wikipedia files. The older revisions of these images usually need to be deleted per WP:CSD#F5 because they are unused non-free images. –Drilnoth (T • C • L) 15:47, 14 September 2009 (UTC)
Query:
SELECT
page_title
FROM page AS pg1
JOIN categorylinks
ON cl_from = page_id
WHERE cl_to = 'All_non-free_media'
AND pg1.page_namespace = 6
AND EXISTS (SELECT
1
FROM oldimage
WHERE oi_name = pg1.page_title)
AND NOT EXISTS (SELECT
1
FROM categorylinks
WHERE cl_from = pg1.page_id
AND cl_to = 'All_orphaned_non-free_use_Wikipedia_files')
AND NOT EXISTS (SELECT
1
FROM templatelinks
WHERE tl_from = pg1.page_id
AND tl_namespace = 10
AND tl_title = 'Non-free_reduce');
Sample results:
- Adobe_Reader_8_icon.png
- Vmware_fusion.png
- Swiftfox.png
- IChat_AV_icon.png
- Xslimmer.png
- Gvloganathan.jpg
- Windows_Firewall_Vista_icon.png
- HFocusicon.png
- Icns.png
- Diskutil.png
- BOMArchiveHelper.png
- ITunes-aacp.png
- APPL-ttxt.png
- Heliconfiltericon.png
- Frame_icon.png
- Spotlight.png
- Wlm_logo-ic.png
- SkEdit.png
- Flashpaper.png
- BBEdit.png
Please check the sample data. It looks like there are 38,000 results total, so I'll probably limit output to the first 1,000 entries. --MZMcBride (talk) 02:17, 17 September 2009 (UTC)
- Thanks; I'll take a look at these and report back. –Drilnoth (T • C • L) 15:50, 17 September 2009 (UTC)
- I went ahead and created Template:Orphaned fair use revisions to deal with these. –Drilnoth (T • C • L) 16:06, 17 September 2009 (UTC)
- So we're good to go? I suppose you also want pages using {{orphaned fair use revisions}} excluded? --MZMcBride (talk) 18:12, 17 September 2009 (UTC)
- Yes, but see also discussion at WT:NFC. –Drilnoth (T • C • L) 18:51, 17 September 2009 (UTC)
- Well, for the moment, the raw output is available here: tools:~mzmcbride/drilnoth-non-free-reduce.txt (warning, large file, you may want to "save as"). I can still make a report, if you'd like. --MZMcBride (talk) 21:07, 17 September 2009 (UTC)
- Thanks; that should be fine for now. If we need a full report after the discussion, I'll let you know. Thank you! –Drilnoth (T • C • L) 21:12, 17 September 2009 (UTC)