Wikipedia:Bots/Requests for approval/CorenSearchBot
- The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Approved.
Operator: Coren
Automatic or Manually Assisted: Automatic
Programming Language(s): Perl
Function Summary: Patrols new pages for copyright violations and duplicates of existing pages
Edit period(s) (e.g. Continuous, daily, one time run): Continuous
Edit rate requested: At most three edits per new page, with a hard limit of 12 per minute (further edits are queued)
Function Details: This bot patrols newly created pages in the main space, and matches the contents against a web search. Pages found to contain a significant portion of text taken from another web page are tagged (and categorized) for human attention according to some guidelines. Complete details are available on the bot's page.
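By way of illustration only, here is a minimal Perl sketch of the loop the function details describe. Every helper named below (fetch_new_pages, search_web, excluded, similarity, tag_and_notify) is a hypothetical stand-in, and the 0.8 threshold is invented; the bot's actual code is only described on its page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins for the bot's real internals.
sub fetch_new_pages { }   # poll Special:Newpages for mainspace creations
sub search_web      { }   # submit the page text to a web search engine
sub excluded        { }   # true if the hit is a known Wikipedia mirror
sub similarity      { }   # 0..1 score of article text vs. candidate page
sub tag_and_notify  { }   # tag the article and warn its creator

while (1) {
    for my $page (fetch_new_pages()) {
        for my $hit (search_web($page->{text})) {
            next if excluded($hit->{url});
            if (similarity($page->{text}, $hit->{text}) > 0.8) {
                tag_and_notify($page, $hit);
                last;
            }
        }
    }
    sleep 60;    # stay well under the requested rate cap
}
```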
Discussion
Note: I have changed the name from "CorenGoogleBot" to "CorenSearchBot" to forestall any trademark misuse problems. — Coren (talk) 22:06, 3 August 2007 (UTC)[reply]
This bot already exists as Wherebot. Betacommand (talk • contribs • Bot) 04:38, 30 June 2007 (UTC)[reply]
- Well, given that I do new page patrol very often, and I catch a fairly large number of copyvios via Google that Wherebot didn't catch (I suspect the search engine selection is a big factor, as well as the actual matching algorithms), I don't think SearchBot would be redundant. More like orthogonal/complementary or something.
- Also, my bot notifies the article creator and tags the page, which I think is a better way to attract attention to the problem. -- Coren (talk) 04:48, 30 June 2007 (UTC)[reply]
- Oh, and I think Wherebot actually ignores Wikipedia page copies, whereas SearchBot would tag those (they aren't copyvios, but are usually either vandalism or an error; it finds them via Google, so the original would have needed to be online for some time to percolate into the index). -- Coren (talk) 04:52, 30 June 2007 (UTC)[reply]
Please note that by accessing Google through automated means, you will be violating Google's Terms of Service. The name of your bot also infringes upon registered trademark 75978469, owned by Google, Inc. A SOAP API is no longer available for Google. — Madman bum and angel (talk – desk) 05:26, 30 June 2007 (UTC)[reply]
5.3 You agree not to access (or attempt to access) any of the Services by any means other than through the interface that is provided by Google, unless you have been specifically allowed to do so in a separate agreement with Google. You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers) and shall ensure that you comply with the instructions set out in any robots.txt file present on the Services.
- Oh, that is a good point, of which I was entirely unaware. I'll contact Google before I proceed. I am semi-confident I might get some approval for this given the purpose and (relatively) lightweight nature of my queries. -- Coren (talk) 05:43, 30 June 2007 (UTC)[reply]
- In the meantime, I'll look into other search engines that can give me reasonably parsable output. Changing the name of the bot to CorenSearchBot would be appropriate, yes? -- Coren (talk) 05:43, 30 June 2007 (UTC)[reply]
- I have just mailed a request to Google for permission to use their web search for this purpose. (copy here). I have no idea how long it will take before I get a response, however, and I could not find a much better channel for the request than the general form -- I suppose someone will forward it appropriately. They're not Evil after all. :-) So, at this point:
- Do I suspend the request for approval entirely?
- Do I request a name change immediately?
- I want to do this entirely above board. I think this could be very useful, though, so it's worth the effort. -- Coren (talk) 06:07, 30 June 2007 (UTC)[reply]
- For the name change, just register a new bot account. It doesn't need to retain its edit history, and it has no flag yet. Also, the BRFA can wait until Google replies; however, you may withdraw it yourself if you really want, then reapply when ready. Matt/TheFearow (Talk) (Contribs) (Bot) 08:58, 30 June 2007 (UTC)[reply]
- I've actually developed a PHP system to check a block of text on a given subject for copyright violations, more or less the same as what you're doing here. My system is more a copyscape.com alternative for Wikipedia:AFC and Wikipedia:WWF (without the usage restrictions), but you might be interested in my approach. I basically run the text against the Yahoo! database - see [1] for API details - and do some parsing of it, while passing it through one of my own algorithms for checking copy-ness (telltale signs that text has been copied). The Yahoo! API may be a suitable substitute for Google in your case. --Draicone (talk) 08:13, 2 July 2007 (UTC)[reply]
- Yes, but given that there is already a bot using the Yahoo database, chances are we'd mostly duplicate work rather than complement each other and cover the other one's blind spots. Hence my (for the moment) preference for Google— which may be moot if I don't manage to reach someone at Google who's actually able to discuss this rather than respond with a form email. — Coren (talk) 20:22, 26 July 2007 (UTC)[reply]
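For context, a hedged sketch of what a query against the Yahoo! Web Search V1 API that Draicone mentions would have looked like at the time; the endpoint has long since been retired, and the appid value and excerpt are placeholders, not anything from the bot.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use URI;

my $excerpt = 'a distinctive phrase lifted from the new article';

my $uri = URI->new('http://search.yahooapis.com/WebSearchService/V1/webSearch');
$uri->query_form(
    appid   => 'YOUR_APP_ID',   # placeholder; Yahoo! issued these per application
    query   => qq{"$excerpt"},  # quoted for an exact-phrase search
    results => 10,
);

my $ua  = LWP::UserAgent->new( agent => 'CorenSearchBot/0.1 (trial)' );
my $res = $ua->get($uri);
die $res->status_line unless $res->is_success;

# The response is XML; each <Result> element carries a <Url> and <Summary>
# that can be fetched and compared against the article text.
print $res->content;
```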
- Where are you with this bot? ~ Wikihermit 19:31, 26 July 2007 (UTC)[reply]
- At this point, I'm still in email exchange with Google. I'm still attempting to reach someone who has authority, but I expect the people whose job it is to respond to email are directed not to escalate as a rule— it's fair to suppose they get too many queries to always escalate. :-/ So, I'm still hoping. — Coren (talk) 20:18, 26 July 2007 (UTC)[reply]
- Why not use yahoo instead? ~ Wikihermit 06:08, 27 July 2007 (UTC)[reply]
- The user responded to this question above (redundancy, basically). — Madman bum and angel (talk – desk) 14:12, 27 July 2007 (UTC)[reply]
- Duplicating bots is a good thing - it means that in the downtime of one bot, others still do its work. Matt/TheFearow (Talk) (Contribs) (Bot) 05:38, 28 July 2007 (UTC)[reply]
- Any news? Has Google got back to you yet? Matt/TheFearow (Talk) (Contribs) (Bot) 23:39, 31 July 2007 (UTC)[reply]
- I think there would be probably a better response from Google if someone contacted them in a more official capacity. Since this sort of bot seems like it would be quite helpful, I wonder if there's any way we could get a Foundation representative to get in touch with Google? Andre (talk) 05:37, 1 August 2007 (UTC)[reply]
Will the bot post to Wikipedia:Suspected copyright violations? Will it not duplicate Wherebot (as in, if Wherebot posts a suspected copyvio before CorenSearchBot notices it, will CorenSearchBot know not to double-post it)? --Iamunknown 15:57, 1 August 2007 (UTC)[reply]
- Right now, it doesn't have any logic for avoiding duplicates— at least on where it posts (it will recognize tags on the article, however). But you're right that this would be useful/needed. Will do. — Coren (talk) 01:15, 3 August 2007 (UTC)[reply]
- User:TheFearow asked me to resend an email to Google from a Wikimedia Foundation email. Assuming it meets standards, please send it to info@wikimedia.org, attention Jessica Barrett. Please specify to: address, subject, and supply the full text of the email. I will reply here with the OTRS ticket when/if a reply is received. ~Kylu (u|t) 06:39, 2 August 2007 (UTC)[reply]
- This is likely to have a much better chance of reaching a human being who has the ability to make a decision rather than simply quote the TOS. :-) Thank you. — Coren (talk) 01:15, 3 August 2007 (UTC)[reply]
Yeay for string alignment
So, here I am tweaking the matching, and CorenSearchBot just randomly noticed Robert T. Longway Planetarium is a copyvio of [2]. Yeay. :-) — Coren (talk) 22:15, 4 August 2007 (UTC)[reply]
Hey, look: Hot sauce vs [3]. Do we get to edit external sites with a copyvio tag? :-) (This is why looking at random pages, as opposed to new pages, is less likely to be useful). — Coren (talk) 00:02, 5 August 2007 (UTC)[reply]
Switched to Yahoo
Well, I've switched to Yahoo (for the moment). Search results aren't quite as good, but it works.
The 'bot works for finding and tagging (see [4], [5]). Can I get the okay for a short test run on new pages? — Coren (talk) 15:32, 5 August 2007 (UTC)[reply]
- Make sure the bot doesn't count mirrors of wikipedia, such as answers.com. ~ Wikihermit 15:40, 5 August 2007 (UTC)[reply]
- I probably don't have most of 'em, but see User:CorenSearchBot/exclude. — Coren (talk) 15:44, 5 August 2007 (UTC)[reply]
I've got all of them in text files. I'll copy and paste here. ~ Wikihermit 15:46, 5 August 2007 (UTC)[reply]
- Great! Will merge now. — Coren (talk) 15:47, 5 August 2007 (UTC)[reply]
- (Oh, and yes, that's watched at runtime— which probably means we want to semiprotect it (and the /config) before we let CorenSearchBot loose). — Coren (talk) 15:47, 5 August 2007 (UTC)[reply]
- Wikipedia:Mirrors and forks. Same information, since the bot updates the text file before it's run. ~ Wikihermit 15:49, 5 August 2007 (UTC)[reply]
- Hm. Should I include all of them blindly? — Coren (talk) 15:56, 5 August 2007 (UTC)[reply]
- (It's worth noting that, since I only check new pages, mirrors should not have copies of them yet— let alone have those copies indexed by a search engine). — Coren (talk) 15:57, 5 August 2007 (UTC)[reply]
- Indeed. I'd be sure to include the major sites (answers.com), but I won't worry about the minor sites. ~ Wikihermit 16:00, 5 August 2007 (UTC)[reply]
- I've just run a read-only check of 100 random pages; the three mirrors I have already listed are the only ones that pop up high enough in search results to be checked individually. Given that it's trivial to add a site to the exclusion list, even while the bot runs, I wouldn't worry about it until some minor site pops up in practice. Sounds reasonable? — Coren (talk) 16:06, 5 August 2007 (UTC)[reply]
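To make the mechanism under discussion concrete, here is a sketch of a runtime-reloaded exclusion check; fetch_wiki_page is a hypothetical helper, and the one-domain-per-line layout of User:CorenSearchBot/exclude is an assumption, not its documented format.

```perl
use strict;
use warnings;

sub fetch_wiki_page { }   # hypothetical: returns the raw wikitext of a page

# Reload the exclusion list on each pass so additions take effect
# while the bot runs, as described above.
sub load_exclusions {
    my $text = fetch_wiki_page('User:CorenSearchBot/exclude') // '';
    my %skip;
    for my $line (split /\n/, $text) {
        $skip{ lc $1 } = 1 if $line =~ /^\s*\*?\s*([\w-]+(?:\.[\w-]+)+)/;
    }
    return \%skip;
}

# True if a search hit points at a listed mirror.
sub excluded {
    my ($url, $skip) = @_;
    my ($host) = $url =~ m{^https?://([^/:]+)} or return 0;
    $host =~ s/^www\.//i;
    return $skip->{ lc $host } ? 1 : 0;
}
```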
As I said on the Wikipedia:SCV talk page, tagging articles directly is a bad idea, due to the unavoidable number of false positives. Just reporting to Wikipedia:SCV is enough to get the article handled properly. --W.marsh 14:12, 7 August 2007 (UTC)[reply]
Templates
Could any of you look at the various templates that my bot places on pages (User:CorenSearchBot/pageincluded, User:CorenSearchBot/pageincludes, User:CorenSearchBot/wikipage) and on user talks (User:CorenSearchBot/notice-pageincluded, User:CorenSearchBot/notice-pageincludes, User:CorenSearchBot/notice-wikipage)?
Hoping for: comments on wording, grievous errors, etc. — Coren (talk) 04:23, 5 August 2007 (UTC)[reply]
- Well, if the bot is going to edit the article, I'd have it blank it and add a {{copyvio}}, then add the page to the list at Wikipedia:Copyright problems. That's how human editors flag a possible copyright violation, and I think it'd be best if there was only one process. The message left on the user's talk page could be somewhat like {{nothanks-web}}, but assuming more good faith, as the bot may be in error. — madman bum and angel 16:44, 5 August 2007 (UTC)[reply]
- That might be presuming more intelligence from my matching than is likely to be true. Because of differences in formatting, there is leeway for a certain amount of "fuzz", and it's always possible I get false positives. A template to bring attention to human editors and a nice notice seem more appropriate to me. — Coren (talk) 17:04, 5 August 2007 (UTC)[reply]
- It appears that new page patrollers notice the tags left by the 'bot and often place a speedy-G12 on the page. I expect that gives us the usual process, only helped with a fast pair of "eyes". — Coren (talk) 03:24, 6 August 2007 (UTC)[reply]
Trial
Well, it looks good. For using the new templates, assuming you are finished with most identified bugs, Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. That is using Yahoo, or Google with permission. The templates look good, but I would recommend moving them into templatespace to make them look more standard alongside other tags. Matt/TheFearow (Talk) (Contribs) (Bot) 21:02, 5 August 2007 (UTC)[reply]
- Will do. I didn't place them into template space originally because they were still tentative. I'm going to be moving them before the trial run. — Coren (talk) 21:06, 5 August 2007 (UTC)[reply]
As authorized by Matt/TheFearow [6], the trial will use the (older) User:CorenGoogleBot account to work around the captcha caused by the external links. — Coren (talk) 22:05, 5 August 2007 (UTC)[reply]
- Hmm. Should Luigi Mannochio have been marked vs [7]? It might just be my string alignment getting too smart for its own good, but it does look like a slightly edited version. Same events, same order. It was flagged as 'borderline'. — Coren (talk) 23:48, 5 August 2007 (UTC)[reply]
- Another question, right now my bot skips over pages that were tagged speedy before it got to them. Good thing? Bad thing? — Coren (talk) 01:31, 6 August 2007 (UTC)[reply]
Oh, by the way, the edits are done with User:CorenSearchBot after all: I had to approve edits manually anyway; I just changed the code so that instead of asking me Y/N it asks for the captcha. — Coren (talk) 01:42, 6 August 2007 (UTC)[reply]
- Flagged bots are not given captcha challenges. — madman bum and angel 02:08, 6 August 2007 (UTC)[reply]
- Odd. Wikipedia just stopped asking for captcha. I wonder why. Number of edits? — Coren (talk) 02:59, 6 August 2007 (UTC)[reply]
Hm. Not sure how I can fix that: the page got deleted between detection and tagging— that didn't give me an edit conflict, but it ended up causing the bot to warn itself (as the first contributor). — Coren (talk) 02:26, 6 August 2007 (UTC)[reply]
- When you submit your change to the page, make sure you send the wpStarttime you got from the edit form. That way, if the page gets deleted before your edit is submitted, you will be asked to confirm that you wish to recreate (which you don't, but as long as you don't tell the bot to send wpRecreate, you'll be fine.) — madman bum and angel 03:03, 6 August 2007 (UTC)[reply]
- Ah. Excellent. Thank you. — Coren (talk) 03:07, 6 August 2007 (UTC)[reply]
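Here is a sketch of madman's advice against the index.php edit form of the day (api.php editing did not exist yet); $title and $tagged_text are placeholders, and the field-scraping regexes assume the 2007 form markup rather than reproduce it.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $title       = 'Some_new_article';          # placeholder
my $tagged_text = '{{copyvio|url=...}} ...';   # placeholder

my $ua   = LWP::UserAgent->new( cookie_jar => {} );
my $form = $ua->get(
    "http://en.wikipedia.org/w/index.php?title=$title&action=edit"
)->decoded_content;

# Replay the hidden fields from the edit form. wpStarttime is the key:
# if the page is deleted after this point, MediaWiki asks for recreate
# confirmation instead of silently saving.
my ($starttime) = $form =~ /name=['"]wpStarttime['"] value=['"](\d+)['"]/;
my ($edittime)  = $form =~ /name=['"]wpEdittime['"] value=['"](\d+)['"]/;
my ($token)     = $form =~ /name=['"]wpEditToken['"] value=['"]([^'"]+)['"]/;

$ua->post(
    "http://en.wikipedia.org/w/index.php?title=$title&action=submit",
    {
        wpTextbox1  => $tagged_text,
        wpStarttime => $starttime,
        wpEdittime  => $edittime,
        wpEditToken => $token,
        wpSave      => 'Save page',
        # wpRecreate deliberately omitted, per the advice above
    },
);
```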
Fixed One more small bug squished: URLs containing "=" broke template substitution. — Preceding unsigned comment added by Coren (talk • contribs)
- Great, looking good. I am slightly changing the trial: at the end of the trial, run it for 10 taggings on full-auto. That helps check whether that is working well, and lets some new-page patrollers see the sort of things it's doing and raise any concerns. Matt/TheFearow (Talk) (Contribs) (Bot) 04:14, 6 August 2007 (UTC)[reply]
- Do you mean the last 10 edits of the 50, or 10 more after the 50? — Coren (talk) 04:24, 6 August 2007 (UTC)[reply]
- For that matter, when you say n edits, do you mean n tagged articles or n individual edits? Since every detected page creates two edits, I had taken the trial run to mean "25 articles". — Coren (talk) 04:34, 6 August 2007 (UTC)[reply]
A request: Can you program the bot to list the pages it tags as copyvios on a centralized page so that they can be reviewed by human editors? I have noticed, upon examining the bot's contributions, that some pages which it legitimately tags as copyvios are then untagged by anon editors, the page creator, etc. If the pages were listed in a centralized location, the blue links could be examined and re-tagged if they are truly copyvios. --Iamunknown 04:49, 6 August 2007 (UTC)[reply]
- Like Wikipedia:Suspected copyright violations? I had considered it, but that would push me to 3 edits per article... although I only get a couple an hour after all. Should I make the reports visibly different from Wherebot's? Consider it done and configurable via User:CorenSearchBot/config (as in, tomorrow). — Coren (talk) 05:00, 6 August 2007 (UTC)[reply]
- I at first didn't realize that the bot would automatically tag articles itself. But even so, logging at WP:SCV would be appreciated, if it doesn't push the edit limit. So that editors know the listed articles are tagged, I would prefer
* (tagged by CSBot) [[Article name]] -- [Link%20to%20URL name of URL]. Reported on ~~~~~
as the format. Thank you for the new bot, btw. :-) --Iamunknown 05:54, 6 August 2007 (UTC)[reply]
Trial paused
Me needs sleep! :-) I've stopped the 'bot for the night, at 20 tagged articles (40 edits) since the trial started.
Looks good, if I may say so myself. Despite a few quickly fixed buglets, the only false positive I got was not really false (it was PD originally, but copied to a copyrighted site). NP patroller reaction seems very positive, too! — Coren (talk) 05:07, 6 August 2007 (UTC)[reply]
- Ok, to make it clear, the trial was 50 edits. After that trial completes, I mean 10 more articles tagged. Also, it is a good idea to list the detected copyvios at a user subpage, to make it easier to keep track. Matt/TheFearow (Talk) (Contribs) (Bot) 07:08, 6 August 2007 (UTC)[reply]
Trial resumed
Fixed copyvios are now posted on a (configurable) subpage of SCV.
Fixed intra-wiki copies should not post to SCV.
The 50 edits are complete (51, actually). All looks good, so I'm about to let CorenSearchBot loose for 10 articles. — Coren (talk) 16:17, 6 August 2007 (UTC)[reply]
First, let me say thanks and congratulations for this promising tool for catching copyvios. Wherebot is awesome, but we can always use more eyes (human or machine) looking over new pages. I think your proposed templates are spot on.
I have a couple of concerns. First, if you include Wikipedia pages, you are going to get a lot of splitoffs, page moves and articles coming out of Wikipedia:AFC unless you arrange to screen these out. Wherebot actually picks these up from mirrors. And that's my other concern. I skimmed the above discussion and didn't see any mention of the many WP mirror sites. You don't want to take up your bot's precious time and edit count by flagging articles that appear on mirrors. Finally, I assume you've compared notes with User:Where and User:Wikihermit to get past their learning curves (one still in progress).
Good luck getting this up and running full time. I'm looking forward to reviewing CSB's submissions at Wikipedia:SCV! -- But|seriously|folks 07:31, 6 August 2007 (UTC)[reply]
- Mirrors are ignored (at least the ones high up enough in results to count); see the Switched to Yahoo section above. Matt/TheFearow (Talk) (Contribs) (Bot) 08:53, 6 August 2007 (UTC)[reply]
There we go. The post-trial trial (the 10 articles) is complete. Looks like it went fine, too! — Coren (talk) 19:04, 6 August 2007 (UTC)[reply]
Question: Should CSB log directly on Wikipedia:Suspected copyright violations as opposed to a transcluded subpage? — Coren (talk) 19:56, 6 August 2007 (UTC)[reply]
Question: Should wikipedia page copies be logged on SCV for human review? (Like [8], which happens to have been a legitimate nearly identical copy in that case). — Coren (talk) 19:56, 6 August 2007 (UTC)[reply]
Question: Do we want CSB to look for explicit attribution to {{DANFS}}
(and other similar templates) to avoid tagging PD contents because some other random site also copied it? (Like [9] or [10]). Or is this giving willful spammers too easy a workaround? (The alternative is to rely on human review). — Coren (talk) 19:56, 6 August 2007 (UTC)[reply]
- I am not sure where I should comment, but as one of the people often stalking Wikipedia:SCV, I think you should not rely directly on the bot to tag copyvios.
Wherebot, for example, is wrong something like 20% of the time, and ensuring some human eyes are really reviewing the content before playing whack-a-mole on all {{db-copyvio}}ed articles might be a good idea. And that allows non-admins to help reviewing articles. If you are reporting to Wikipedia:SCV (and I can't wait to catch even more copyvios at birth), is it easily possible to ensure that you are not duplicating Wherebot's reports? -- lucasbfr talk 22:09, 6 August 2007 (UTC) woops, striking irrelevant part. -- lucasbfr talk 23:26, 6 August 2007 (UTC)[reply]
- Well, second point first: yes, if I report to SCV itself (as opposed to a subpage) it's trivial to ensure it's not a duplicate. That, indeed, would be the case if I report there (I already double check against duplicates on the page regardless of which page CSBot reports on).
- As for the first point, well, I think there are good reasons to tag the article preemptively;
- it lets the editor know about, and fix, the problem quickly (as happened here and here, for instance);
- CSBot's false positive rate is, to date, insignificant (out of the 35-odd articles it tagged, one was a false positive— though the external document did contain the same text as the article— three were justifiable copies, and one was two articles with very similar contents from the same editor); and
- it is, IMO, friendlier. Editor response to date has been fairly positive, and at least two learned about what is allowable in an article from the encounter.
- This is my opinion, and I agree with it. :-) It's also to be noted that the two PD articles mistakenly tagged would not have been if I turned on the attribution detection code (but there are arguments against doing that as well).
- As for why my false positive rate is low, it has everything to do with how willing I am to throw processing time at the problem. CSBot uses a genome-matching string alignment algorithm that is very expensive to run, but returns a very good quantification of "how much X is like Y". It was designed for long strings of TCGA, but works just as well with words as the selection unit. :-) There's a paper lurking in that code for some day, I'm guessing. :-) — Coren (talk) 22:24, 6 August 2007 (UTC)[reply]
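To make the "genome-matching string alignment" concrete, here is an illustrative word-level Smith-Waterman local alignment in Perl. The scoring constants are invented and the real bot's algorithm and tuning are not published; this only shows the general technique being alluded to.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Word-level Smith-Waterman: returns roughly "portion of the shorter
# text that aligns with the longer one" on a 0..1 scale.
sub local_align_score {
    my ($a_text, $b_text) = @_;
    my @a = split ' ', lc $a_text;
    my @b = split ' ', lc $b_text;
    my ($match, $mismatch, $gap) = (2, -1, -1);   # invented scores

    my @H = map { [ (0) x (@b + 1) ] } 0 .. @a;   # DP score table
    my $best = 0;
    for my $i (1 .. @a) {
        for my $j (1 .. @b) {
            my $s = $a[$i-1] eq $b[$j-1] ? $match : $mismatch;
            $H[$i][$j] = max(0,
                $H[$i-1][$j-1] + $s,     # align word i with word j
                $H[$i-1][$j]   + $gap,   # skip a word in text A
                $H[$i][$j-1]   + $gap);  # skip a word in text B
            $best = $H[$i][$j] if $H[$i][$j] > $best;
        }
    }
    my $len = @a < @b ? scalar @a : scalar @b;
    return $len ? $best / ($match * $len) : 0;
}

print local_align_score(
    'the quick brown fox jumps over the lazy dog',
    'a quick brown fox jumped over a lazy dog'
), "\n";
```

Filling the table is O(n·m) in the two word counts, which fits the "very expensive to run" remark above.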
- Transcluded subpage is the best bet; it makes it easier to manage. For the WP page copies, I have two ideas. In the case of identical copies, replace with a redirect, and notify the creator. Another is to report on a separate page, that is linked, that can be watched separately. The false positive rate is VERY low for this type of bot. Since there have been some more questions/comments/concerns, I am extending the trial (again), for a full-auto Approved for trial (2 days). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Sorry about yet another trial, but it is a more controversial request (anything tagging articles and identifying violations etc. is considered controversial, due to the issue of false positives), and it is better to get the community's opinion on the bot. Matt/TheFearow (Talk) (Contribs) (Bot) 22:37, 6 August 2007 (UTC)[reply]
- The funny thing is, the way my code works, I couldn't say "identical" if my life depended on it. By the time an article (or a web page's contents) gets to the comparison stage, it's been so modified and shuffled around that I can no longer perform exact equality tests. But the separate page is probably best for reporting "too similar".
- I don't mind the extra trial— it's not like I'm in a rush. I still have the occasional captcha which requires human input, however, so until I have a bot flag or the account becomes old enough (how long is that, four days?) I have to keep it running visibly— so it will have to run intermittently. (And I don't remember CorenGoogleBot's password *blush*). — Coren (talk) 23:13, 6 August 2007 (UTC)[reply]
- Ok, so just do a normal 2-day trial; it doesn't have to be fully auto. Also, the exact one was a sort of bonus, but wouldn't two exactly identical pieces of text be modified in the exact same way? Ahh well, maybe you need to pass the original text in a separate variable to the comparison functions? Anyway, continue with a 2-day trial. By the time all the trials are done, it should be able to run full auto, then that only requires a small trial, then we can do the semi-final trial, then it should be ready for the last trial, then another one with the bot flag. Can't wait! Matt/TheFearow (Talk) (Contribs) (Bot) 23:43, 6 August 2007 (UTC)[reply]
- A note: The above part about many trials was a joke, responding to the line "it's not like I'm in a rush". Matt/TheFearow (Talk) (Contribs) (Bot) 23:47, 6 August 2007 (UTC)[reply]
- Har-de-har-har. :-) Yes, they (two identical pieces) would be modified the same way, but so would two very slightly different pieces of text; the conversion is one-way and loses some information (like a hash), so collisions are possible even if not likely. — Coren (talk) 01:22, 7 August 2007 (UTC)[reply]
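As an illustration of the lossy, one-way conversion described here (the exact transformations are guesses, not the bot's code):

```perl
use strict;
use warnings;

# Sketch: markup and punctuation are thrown away before comparison, so
# distinct inputs can collapse to the same word stream — hence the
# "collisions are possible" remark above.
sub normalize {
    my ($text) = @_;
    $text = lc $text;
    $text =~ s/<[^>]*>//g;                            # strip HTML tags
    $text =~ s/\[\[(?:[^|\]]*\|)?([^\]]*)\]\]/$1/g;   # unwrap [[wiki|links]]
    $text =~ s/[^a-z0-9\s]+/ /g;                      # drop punctuation
    $text =~ s/\s+/ /g;                               # collapse whitespace
    return [ split ' ', $text ];                      # word list for the aligner
}

my $words = normalize('The <b>Quick</b> [[brown fox|Brown Fox]]!');
print "@$words\n";    # prints: the quick brown fox
```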
- Ok, that seems like something not too easy to implement, so leave it for now. For everything else, continue with a 2 day trial. Matt/TheFearow (Talk) (Contribs) (Bot) 01:39, 7 August 2007 (UTC)[reply]
The bot should report directly to Wikipedia:SCV like wherebot does. This makes it much easier to manage the list of pending copyvios, since you just need to edit one page (and can view the whole page while editing it, which you can't do with transcluded pages). --W.marsh 17:14, 7 August 2007 (UTC)[reply]
Fixed Given the consensus on SCV talk, I've switched CSBot to append notices directly to Wikipedia:SCV. — Coren (talk) 19:10, 7 August 2007 (UTC)[reply]
Things do look pretty good to me, I'd say you're good for the green light (we need more copyvio bots) - as is it seems very useful -- Tawker 22:46, 7 August 2007 (UTC)[reply]
- Well, everything is working well. All we need now is a trial on full auto. I noticed some that were detected incorrectly, as they were based on another WP page - see Kato Asea for one I noticed. Once you run it for several days on full auto, I'll approve. Matt/TheFearow (Talk) (Contribs) (Bot) 23:54, 7 August 2007 (UTC)[reply]
- It's on full auto now (has been for an hour or so). Have you looked at how nearly identical Kato Asea was to Marmaria, Arcadia? As far as I am concerned, the tag was correct— well, it worked as designed anyway; I can always shut that function down. — Coren (talk) 00:14, 8 August 2007 (UTC)[reply]
- Hmm. Just hit another Greek city. I suppose that is a problem when someone enters a long sequence of nearly identical stubby articles (although not a very big one; taking the tag off doesn't take long). Should I turn that function off? — Coren (talk) 00:19, 8 August 2007 (UTC)[reply]
- Will ask on SCV. Copying Wikipedia pages by C&P is a copyright problem if done for the wrong reasons (it loses the history). — Coren (talk) 00:20, 8 August 2007 (UTC)[reply]
- The thing is, they are copying and pasting but changing the information - they are mainly just keeping the format. I guess it isn't a huge problem. The trial looks good, so far no major issues remaining. Matt/TheFearow (Talk) (Contribs) (Bot) 02:03, 8 August 2007 (UTC)[reply]
See [11], tagging a redirect page as a copyvio. I think I've seen it do this other times, too. --W.marsh 17:18, 8 August 2007 (UTC)[reply]
- Hm. Race condition. See [12]. What seems to have happened:
- Page created with copyvio
- CSBot grabs page
- Admin deletes
- Admin recreates as redirect
- CSBot found a match and tags page (which is now a redirect)
- I'm not sure there's much to do about that. :-( — Coren (talk) 17:29, 8 August 2007 (UTC)[reply]
- Maybe you could store the exact text it read when it grabbed it, then grab the text again right before it tags. If they aren't the same, skip. That would likely stop that sort of thing. You could also perform redirect checks etc. on the second grabbed revision, but just seeing if it's different is probably suitable. Matt/TheFearow (Talk) (Contribs) (Bot) 21:40, 8 August 2007 (UTC)[reply]
- Although that would prevent tagging a copyvio that is being edited, even if just to wikify the copy. Hmmm. How about I requeue the page for a check if it changed? — Coren (talk) 21:41, 8 August 2007 (UTC)[reply]
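A sketch of the requeue approach floated here; get_text, requeue, and tag_page are hypothetical stand-ins for the bot's internals:

```perl
use strict;
use warnings;

sub get_text { }   # hypothetical: fetch the current wikitext of a page
sub requeue  { }   # hypothetical: push the title back on the scan queue
sub tag_page { }   # hypothetical: place the copyvio template

my ($title, $scanned_text, $match) = @ARGV;

# Compare the live text against the revision that was actually scanned;
# if it changed, requeue rather than tag a revision the match was never
# computed against.
my $live = get_text($title) // '';
if ($live ne ($scanned_text // '')) {
    requeue($title);             # scan the new version from scratch
}
else {
    tag_page($title, $match);    # still the text that matched
}
```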
- You could have the bot follow the redirect. ~ Wikihermit 22:21, 8 August 2007 (UTC)[reply]
- That's probably a bad idea; the redirect could be to an old page that has since been ripped off— I'd end up tagging the old article as a copy of the stolen web page. :-) This is why CSBot should never be run on old pages. — Coren (talk) 00:01, 9 August 2007 (UTC)[reply]
- I'd skip redirects then. For example, have it skip a page if it contains "#REDIRECT". ~ Wikihermit 00:02, 9 August 2007 (UTC)[reply]
Fixed CSBot will now refuse to touch pages that are redirects at the time of edit. — Coren (talk) 00:12, 9 August 2007 (UTC)[reply]
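The fix just described amounts to a last-moment guard along these lines (get_text and skip_page are again hypothetical stand-ins):

```perl
use strict;
use warnings;

sub get_text  { }   # hypothetical: fetch the current wikitext
sub skip_page { }   # hypothetical: drop the page from the work queue

my $title = 'Some_new_article';   # placeholder

# Refuse to touch anything that has become a redirect between
# detection time and tagging time.
my $live = get_text($title) // '';
if ($live =~ /^\s*#REDIRECT/i) {
    skip_page($title, 'became a redirect before tagging');
}
```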
Well, the two days are up. I've stopped CSBot until you guys get around to checking up its edits. — Coren (talk) 00:11, 9 August 2007 (UTC)[reply]
- No major issues, and the people at SCV seem to like it. Approved. Matt/TheFearow (Talk) (Contribs) (Bot) 01:11, 9 August 2007 (UTC)[reply]
- The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.