Wikipedia:Bots/Requests for approval/EarwigBot 1
- The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Approved.
Operator: The Earwig (Talk | Contributions)
Automatic or Manually Assisted: Automatic, unsupervised (although other users must confirm each copyvio before the request is denied.)
Programming Language(s): Python, Pywikipedia
Function Overview: Checks recent AfC submissions for copyright violations.
Edit period(s): Continuous (run from my own computer, so it won't operate when it's off, although my computer is on most of the time.)
Already has a bot flag (Y/N): N
Function Details: This is a bot that is responsible for doing a task that is completely ignored by the resident copyvio bot, CorenSearchBot. The bot's function will be that of a copyright violation-checking bot, similar to CorenSearchBot, which is currently the only running copyvio bot. However, instead of checking Special:NewPages for copyright violations, it checks Category:Pending Afc requests. Many new users submit copyrighted content for Articles for Creation, and CorenSearchBot does not seem to catch this. The bot will speed up the AfC process by placing this template on copyvio requests, saving reviewers time, and allowing them to spend more energy on reviewing other submissions. It will not deny requests: they will stay at pending status, but will have the message above them (example). The {{{URL}}} parameter will be replaced by a link to the bot's log page, which will have information about what site the violation was found on, and what specific strings are in violation of copyright. It then relies on other users (often that will be me) to check the request and deny it if applicable. Thus, this log page will serve a similar function to WP:SCV, because due to technical limitations with Pywikipedia, logging on SCV is much harder (although I might make it possible in the future). See this page for the bot's source code.
The bot was tested on my other bot account, EarwigBot I, which is used for one-time tasks (none yet) and for making edits on my and its own userspace (because this does not require bot approval). See EarwigBot I's log page for testing that has occurred. All aspects of the bot's code have been written, and testing without making edits (the debugging feature) passed successfully. The bot is both emergency shutoff and exclusion compliant: the first was coded by myself and the second is supported by the standard installation of Pywikipedia.
The bot's code works by first generating a list of all pages in Category:Pending Afc requests, then checking it against the Yahoo API (I have a key) to see if there are any copyright violations. These tasks are both handled by this chunk of code, which is a modified version of Copyright.py. Then, the bot places this template on each of the suspected copyright violation pages, and loads the details of each one onto User:EarwigBot II/Logs. I could go into more detail about this process, but that's probably not necessary.
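The check-then-log flow described above can be sketched roughly as follows. This is a minimal illustration only, not the bot's actual code: `Finding`, `run_session`, `format_log`, and the injected `fetch_text` and `search_matches` callables are hypothetical names standing in for the category listing, the page fetch, and the search-API query, and the log layout is assumed.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Finding:
    """One suspected copyvio: the page title plus the matching URLs."""
    title: str
    urls: List[str] = field(default_factory=list)


def run_session(
    pending_titles: Iterable[str],
    fetch_text: Callable[[str], str],
    search_matches: Callable[[str], List[str]],
) -> List[Finding]:
    """Check every pending submission and collect the suspected copyvios."""
    findings = []
    for title in pending_titles:
        # search_matches stands in for the search-API lookup (hypothetical).
        urls = search_matches(fetch_text(title))
        if urls:
            findings.append(Finding(title, urls))
    return findings


def format_log(findings: List[Finding]) -> str:
    """Render the findings as wikitext for a log page (layout is assumed)."""
    lines = []
    for finding in findings:
        lines.append("== [[%s]] ==" % finding.title)
        lines.extend("* %s" % url for url in finding.urls)
    return "\n".join(lines)
```

Passing the fetch and search steps in as callables keeps the sketch independent of any particular wiki or search library, and makes a no-edit dry run (like the debugging mode mentioned earlier) easy to simulate with stubs.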
Now for an explanation of the bot's usefulness. I am a rather active member at WikiProject Articles for Creation, so I have a lot of experience when it comes to copyright violations concerning that section of Wikipedia. Second to notability, or possibly third to verifiability, there is no doubt that it is one of the most common reasons why a request gets declined. Now, usually a sizeable portion (5–10%) of the AfC submissions have some form of blatant copyright infringement, and an even higher number have copied sentences from other sources in them that are often not caught until it's too late.
You may be thinking that the bot has no purpose, because someone is eventually going to catch the copyright violation. This is not the case, for two major reasons. First of all, Category:Pending Afc requests may not get backlogged a lot (this requires over 52 articles in the queue), but there are often submissions in the category that remain unreviewed for days when this could be avoided if there were fewer articles to review. This bot would make the process move faster, something that's good no matter what part of Wikipedia we're talking about.
Second, not all AfC submissions actually have their copyrighted content removed after they are declined or accepted. I have yet to see a blatant violation occur when the article is accepted, but I've noticed copyright violations in declined requests on several occasions. (this wasn't caught in this submission and this wasn't caught in this submission, for example, and that's only in the first 200 entries.) And it's obvious that the reason why we would want to have these removed is the same major reason that we want all copyright violations to be removed: legal issues. Just because an article is in the declined requests pile doesn't mean that someone won't eventually notice it. Heck, I certainly did!
As a final note, I must point out that the internet copyvio checker portion of my bot's code is by no means rudimentary. One of the biggest problems I've noticed with bot requests is that the code doesn't account for loopholes, such as how quotes with [sic] can affect spell-checker bots (one of the reasons we don't have them), or how free sources can affect copyvio bots (like this one). The Copyright.py module, developed by Francesco Cosoleto in 2006, not only maintains a client-side constantly-updated database of mirrors and forks from Wikipedia:Mirrors and forks, but it has an exclusions list and a protected-sites list. With these features, combined with a well-developed and tested module and a conservative template that does not straightaway deny requests, even the rare and dreaded false-positive will have little effect on how the bot will help Articles for Creation. Thank you for taking your time to read the details of this submission. I eagerly await any responses.
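To illustrate the idea of an exclusions list, here is a small sketch of how search hits might be filtered against a set of known mirror or free-content hosts. It does not reproduce how Copyright.py actually stores or applies its lists; `is_excluded`, `filter_matches`, and the example host names are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def is_excluded(url, excluded_hosts):
    """Return True if the URL's host is an excluded host or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in excluded_hosts)


def filter_matches(urls, excluded_hosts):
    """Keep only the search hits that are not on an excluded host."""
    return [url for url in urls if not is_excluded(url, excluded_hosts)]
```

Matching on the parsed host name rather than on a raw substring avoids treating an unrelated domain as a mirror just because its name happens to contain an excluded one.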
Discussion
Approved for trial (30 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. I'm curious to see how this works in practice. – Quadell (talk) 15:23, 6 May 2009 (UTC)
- I'll get started on that right away. Remember that 30 edits is more than 30 taggings; it's the taggings plus the information dumps to User:EarwigBot II/Logs (example). The data will be posted below when it starts to come in, and will be periodically updated from then onwards. The Earwig (Talk | Contributions) 21:56, 6 May 2009 (UTC)
EarwigBot II Logs: Trial 1
Last updated: The Earwig (Talk | Contributions) 02:46, 11 May 2009 (UTC)
Edits completed: 25/30
Session 1: Pages checked: 32. Suspected copyvio articles found: 6. Number of Yahoo queries: 325.
Session 2: Pages checked: 26. Suspected copyvio articles found: 2. Number of Yahoo queries: 261.
Session 3: Pages checked: 27. Suspected copyvio articles found: 2. Number of Yahoo queries: 261.
Session 4: Pages checked: 26. Suspected copyvio articles found: 1. Number of Yahoo queries: 222.
Session 5: Pages checked: 13. Suspected copyvio articles found: 1. Number of Yahoo queries: 124.
Code changes: I made a few changes to the code in light of the trial that is currently in progress. The bot is now running v1.1: The Earwig (Talk | Contributions) 22:47, 6 May 2009 (UTC)
- The bot now updates User:EarwigBot II/Logs before it updates the submissions by adding a template. (Done: 22:47, 6 May 2009 (UTC))
- The Copyright.py module now logs the date a little differently to avoid errors with anchoring. (Done: 22:47, 6 May 2009 (UTC))
The bot recognizes when a page already has the tag on it, and doesn't log or tag it. (Coding)
- Scratch that, not needed. I have done the other updates, and have decided that we don't really want or need the bot to be running constantly. Instead, I will always check to see the results of one run before moving on to the next, and assuming that not all of the entries will be checked by the time I run the bot the next time, I will do so. Therefore, there should never be cases where the bot tags a page multiple times. Maybe I could create some script in the future that activates the bot every ten minutes, for example, when I'm not around, but that's a future addition. As of writing time, two of the six copyvio findings by the bot were removed by other users. (Yay, it's working!) The remaining four have yet to be checked before I move on to another session, preferably at around 8 o'clock tonight (0:00:00 UTC). The Earwig (Talk | Contributions) 23:10, 6 May 2009 (UTC)
I'm beginning Session 2 soon, as Session 1 is complete. Two more of the suspected copyvios were removed by other users, and I removed the last two. All of the requests involved a complete removal of content, except for this tagging. I declined this for notability, not copyvio, because the copyvio string was taken from a quote. The copyright.py module recognizes quotes, but this one was missed because it wasn't formatted properly. Other than that, all taggings seemed accurate. The Earwig (Talk | Contributions) 00:25, 7 May 2009 (UTC)
- Okay, I have nothing. Copyright.py turned up zero results, so Category:Pending Afc requests is clear of copyvios for the time being; I guess the bot caught them all. If half a day brought us six results and seven edits, then it will probably be two and a half days (Saturday) before the trial is complete. By then we should have a good understanding of the bot's future. One important point though: the bot might go over the thirty edit limit. It is almost impossible to have it stop exactly at thirty edits, because as long as it finds something, I can't stop it from tagging all of its findings, unless I change the suspected copyvios file on my computer, which would involve tampering with the results. So if I screw up, please don't get confused. This concludes today's operation of the bot; I will resume tomorrow. The Earwig (Talk | Contributions) 01:21, 7 May 2009 (UTC)
- Thirty isn't an exact figure; anything between 20 and 40 would be fine. Thanks for being so thorough in your reporting. – Quadell (talk) 01:35, 7 May 2009 (UTC)
Session 2 results posted above. The bot tagged two articles. The Earwig (Talk | Contributions) 19:53, 7 May 2009 (UTC)
- Both are clearly copyvios, and the bot identified them correctly. But is there any way the bot can make it easier on the closers? An editor currently has to go to the bot log, figure out what it's saying, click on the link or links (usually many nearly-identical links) to check that it's a copyvio and not GFDLed, then replace the text with a copyvio notice, C&P-ing the url... and remove the data from the bot log. That's a lot of steps. Are all these necessary? Is there a way to streamline the process? – Quadell (talk) 20:24, 7 May 2009 (UTC)
- Valid point. I went through a few other ideas before I filed this BRFA. I know that I can't straightaway deny the request, because there will always be false positives, and this will mess everything up, especially for new users. I also thought that it might be a possibility for the bot to put the article on hold instead of denying it, but then I realized that this would create an enormous backlog and wouldn't really solve the problem. As I noted earlier, due to technical limitations, I am unable to have the bot choose a link from its report and tag the submission with that instead of a link to the log page. Changing that would require me to change this substantially, which is something I simply can't do. I eventually decided on this process, which, although it is a little more complicated, seemed like a good idea. I agree, however, that the process is a little time-consuming. Do you have an idea? The Earwig (Talk | Contributions) 20:37, 7 May 2009 (UTC)
- Some ideas: how about I make it easier on the closers by creating some sort of preloader template (like the AfC submissions banner has) that has links to a quick way to deny a request or leave it. It would be placed below each log report, and if clicked, would link to that page, with the required changes already made in the edit window. I know this probably sounds confusing, but it's all I can think of. The Earwig (Talk | Contributions) 20:45, 7 May 2009 (UTC)
- Another idea: how about the bot recognizes a certain template that is placed on the log page? Essentially, instead of following all of those steps, the closer would simply click on one link in the copyvio template after confirming it. The link would place a certain template underneath the report, the bot would notice this, then act accordingly. This will take a long period of time to code, though, and it would complicate everything for me, but it might make it easier for the closers. Ideas? The Earwig (Talk | Contributions) 21:42, 7 May 2009 (UTC)
- I think both of those are fantastic ideas. Anything you can do to make it easier on the closer is a plus, and having "push button" solutions in even some circumstances would be welcome. I would approve the bot either way, but it's better if we get all these things taken care of first. Just let me know what you plan, and whether you'd want another trial or what. – Quadell (talk) 00:09, 8 May 2009 (UTC)
(outdent) I think that I'll side with the first idea. It would be a small matter of an editintro and a preload template, which can easily be created and won't require much extra coding for the bot itself. (The second option is harder; maybe I can implement this at a later date, but not now.) As for the trial, if you think that the bot seems good so far, then I'll continue where the trial left off (i.e., 20 edits remaining) with the new template updates. By the time the next twenty edits or so are made, we can decide if there are any other necessary changes that I should make before the bot is approved. Thanks for your help! The Earwig (Talk | Contributions) 00:28, 8 May 2009 (UTC)
- That sounds fine. – Quadell (talk) 01:49, 8 May 2009 (UTC)
Can I just say, as a participant of this WikiProject, that this bot seems to be doing a great job and is really useful for us. Some tweaks and streamlining might be in order, but don't worry too much because this is much better than what we had before which was nothing. About the idea of putting it on hold. This might work well actually. If you passed a code (e.g. "cv-bot") as the second parameter and the link to the log page as the third parameter then the reviewer tools could be adapted to provide a useful link. — Martin (MSGJ · talk) 11:39, 8 May 2009 (UTC)
- That's an interesting idea, which would aid in removing some of the complications evoked by having two templates. But again, won't this just create a backlog? I updated the bot's template to be a little more intuitive, and have added links to the bot's report in my version of the copyright.py module, containing quick links to accept or deny a request. Otherwise, no code changing has been done. I'm about to start Session 3 to see how the changes work. Quadell, what is your opinion on Martin's idea? I'm curious to hear what you think, because I might want to change the code to do what he suggested. The Earwig (Talk | Contributions) 20:57, 8 May 2009 (UTC)
- Why would it create a backlog? We have to review all of these anyway ... — Martin (MSGJ · talk) 21:25, 8 May 2009 (UTC)
- When I said "backlog," I wasn't referring to the 52-submission backlog that is a technical feature. I was referring to the fact that hold submissions often remain unlooked-at by reviewers because they are "on hold". But after looking at the hold criteria, I decided that you're probably right, and this is probably a good idea. Implementing: it shouldn't be too difficult. Also, the Session 3 results, without the hold feature, have been posted. The Earwig (Talk | Contributions) 21:44, 8 May 2009 (UTC)
- Ugh, something's wrong with the regex in the pending-to-hold replacement code. wikipedia.replaceExcept() keeps sending the bot script into an infinite loop. I'm going to try something else. The Earwig (Talk | Contributions) 22:37, 8 May 2009 (UTC)
- The replace.py script will be able to change the pendings to holds, but this would result in two edits per submission page, which we can't do. Maybe I can implement it into the main code... The Earwig (Talk | Contributions) 23:08, 8 May 2009 (UTC)
- Never mind, that won't work. I'm just tired, probably. I can make it work, but an attempt at having that happen in the same edit keeps returning regex errors. When I fix those, I end up in an infinite loop. I'll try again soon. The Earwig (Talk | Contributions) 23:20, 8 May 2009 (UTC)
Session 4 results posted. I'm unable to have both the hold action and the template addition occur in one edit, so I just did the test with pending instead. The functionality for that, however, can be done. Hold on... The Earwig (Talk | Contributions) 02:20, 10 May 2009 (UTC)
- Implemented. Finally! I was being an idiot the whole time. The hold-adding feature has been implemented into the code, and the bot will not put its template on an article if one is already there. Both operations have been combined into one edit, and I am working on a feature to add "blacklist" pages for the bot to not check. See the bot's new code (v2.0) here. I feel much better now. Too bad the feature can't be tested right now (just made a copyvio run). The Earwig (Talk | Contributions) 02:49, 10 May 2009 (UTC)
- I tested it without edits at User:EarwigBot I/Sandbox 4, and I realized that I screwed up the regex, which I fixed. Now, the code is certified as correct. The Earwig (Talk | Contributions) 03:58, 10 May 2009 (UTC)
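For readers curious what a one-pass pending-to-hold rewrite could look like, here is a hedged sketch using only Python's standard `re` module. The wikitext forms are assumptions: the real AFC submission template syntax and the bot's actual regex are not shown in this discussion, so `{{AFC submission|pending|...}}`, the `hold|cv-bot|<log link>` parameter order (loosely following Martin's suggestion above), and the names `PENDING_RE` and `pending_to_hold` are all hypothetical.

```python
import re

# Hypothetical pending banner: "{{AFC submission|pending" followed by "|" or "}".
PENDING_RE = re.compile(
    r"\{\{\s*AFC submission\s*\|\s*pending\s*(?=[|}])",
    re.IGNORECASE,
)


def pending_to_hold(wikitext, log_link):
    """Rewrite the first pending banner to a hold banner that carries the log link.

    Returns (new_text, changed). A single subn() pass is used, so text that
    has already been rewritten is never scanned again.
    """
    replacement = "{{AFC submission|hold|cv-bot|%s" % log_link
    # A callable replacement avoids backslash/group escapes inside log_link.
    new_text, count = PENDING_RE.subn(lambda match: replacement, wikitext, count=1)
    return new_text, count == 1
```

Because the rewritten banner no longer matches the pending pattern, calling the function a second time on its own output leaves the text unchanged, which also gives a simple way to avoid tagging a page twice.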
Session 5 results posted. I noticed a template error (easily fixed), and I have confirmed that the blacklist feature works! The bot logged an article that was on its blacklist, but it didn't change it. The Earwig (Talk | Contributions) 17:41, 10 May 2009 (UTC)
- I misread what Martin said; now the tagging is merged into the submission template, so it looks like this. It makes User:EarwigBot II/Template pointless now (because its content has been merged into Template:AFC submission). That's as simple as it's going to get. The Earwig (Talk | Contributions) 19:54, 10 May 2009 (UTC)
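The blacklist behaviour reported for Session 5 (a blacklisted article is logged but not changed), together with the earlier note that the bot skips pages that already carry its tag, could be planned along these lines. This is a sketch under assumptions: `plan_edits`, the `cv-bot` marker string, and the choice to log already-tagged pages are illustrative and not taken from the bot's source.

```python
def plan_edits(suspects, blacklist, marker="cv-bot"):
    """Decide which suspected pages to log and which to tag.

    suspects maps page title -> current wikitext. Every suspect is logged
    (an assumption); a page is tagged only if it is not blacklisted and
    does not already contain the marker.
    """
    to_log, to_tag = [], []
    for title, text in suspects.items():
        to_log.append(title)
        if title in blacklist:
            continue  # logged, but the page itself is left untouched
        if marker in text:
            continue  # already tagged on an earlier run
        to_tag.append(title)
    return to_log, to_tag
```

Separating the decision from the actual page edits means the plan can be inspected in a no-edit test run before anything is saved.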
Is the trial complete? If not, just let us know when it is. – Quadell (talk) 22:48, 10 May 2009 (UTC)
- Er, I still want to run one more session before I put {{BotTrialComplete}} on this page, because I recently made a code change and I want to make sure that everything's OK. Thanks for reminding me, though. The Earwig (Talk | Contributions) 22:52, 10 May 2009 (UTC)
Trial complete. I ran the test I wanted, and it worked (although the bot didn't find any copyvios). The trial is now complete, with 25 edits made (although many of these were just logs done by the bot because I wanted to test a few things). I can now confirm that the bot is as well integrated into the AfC system as I can make it. Martin agrees. The Earwig (Talk | Contributions) 02:46, 11 May 2009 (UTC)
Approved. Looks good, and I hope it makes AFC run more smoothly and reliably. – Quadell (talk) 12:39, 11 May 2009 (UTC)