Wikipedia:Bots/Requests for approval/ClueBot NG/Trial 2
Trial 2 discussion
Question
What does "possible vandalism by 41.252.6.218 to version by SicaSunny" mean?
- It means the bot incorrectly identified the edit as vandalism. This false positive looks like it was caused by the bot not recognizing HTML color codes as such. This will be fixed as soon as the parser is complete. Crispy1989 (talk) 16:02, 18 November 2010 (UTC)
Thank you
[1] Philip Trueman (talk) 04:49, 19 November 2010 (UTC)
False positive experience
It was not clear to me from the warning I got that I could revert the bot action. All the text was harsh, with none of the FAQ comments (addressed up page). Also, I was not sure if additional attempts to edit would make the bot treat me more and more as a vandal (as can happen with spam catchers on blogs that act up). 72.82.33.250 (talk) 08:19, 20 November 2010 (UTC)
- In this case, the false positive looks like it was partially a result of the earlier vandalism warning. Also, if you do not have any prior warnings, the first warning it gives is much nicer, and is clearer about what to do in the case of a false positive. If you have any ideas on how to make it clear (on subsequent warnings) that the bot will not revert the edit twice, without making it obvious to vandals (and easy to re-vandalize), we'd love to hear them. We've been trying to think of a good solution to this ourselves. Crispy1989 (talk) 17:08, 20 November 2010 (UTC)
I think the initial vandal tagging was a bit of an over-reaction for the type of error I committed, but I don't want to get into it more. I'm not "wounded". Just a datapoint among many. WRT the bot, I would leave it as is, in terms of harsh remarks for tagged vandals. The collateral damage is probably small and the benefits high. Just keep an eye on it.
Also, for whatever it is worth, the mechanism for reporting a false positive seems pretty daunting, especially for a new user (I basically blew it off, for example). I suspect that the average false positive will not be reported through the mechanism now required. So if your 0.25% is the reported false positives, then the true rate is probably significantly higher. Some manual surveying ought to show that. (Still, 0.25% may be the right setting, but just realize that actual collateral damage is higher.) Of course, if I don't understand this, so be it...just trying to help.
- 0.25% false positives is based on a dataset trial with random, human-verified, edits. It is not only accurate, but the actual rate is less, because post-processing prevents reversions in some circumstances that will remove some false positives. The number of reported false positives has no bearing on these reported statistics.
- If you have an idea how to improve the false positive reporting, go ahead and make any changes you feel could improve it. Just tacking on a new section was getting unmanageable. I don't have any particular preference on exactly how it works, but I think that actual discussion should be clearly separated from false positive reports, and the false positive reports should be represented in a concise manner. Crispy1989 (talk) 20:49, 20 November 2010 (UTC)
I'm not sure mechanically how to improve it. It's just discouraging for a user to feel like he is prey to a machine and then that the appeals process is arduous. No biggie, just a datapoint...
ClueBot NG Reverting good faith edits
For example here. Access Denied – talk to me 16:44, 20 November 2010 (UTC)
I also found this. Wow. Access Denied – talk to me 16:46, 20 November 2010 (UTC)
- False positives with Cluebot-NG are (essentially) inevitable. The amount of caught vandalism depends on a set false positive rate. Currently, the FP rate is set at 0.25% - this has generally been deemed an acceptable price for eliminating over 50% of vandalism. Of the false positives that do exist, most are poor quality edits in some way (like the first of these two edits) that share traits with vandalism. There are occasionally unexpected reverts that don't appear to have any vandalism traits. This is a consequence of using a neural network as a core, and these should virtually disappear as the dataset grows. Crispy1989 (talk) 16:57, 20 November 2010 (UTC)
- If you accept that 0.25% of edits it catches are FPs, then I think it needs constraints on how many times it will revert the same editor. If it's possible, 1RR would be advisable for edits that aren't certain vandalism. It can afford to be more aggressive on things like profanities and blanking. HJ Mitchell | Penny for your thoughts? 01:19, 22 November 2010 (UTC)
- The bot already adheres to 1RR. It does not revert the same user/article combination more than once in the same day. This allows users that are reverted as a false positive to simply redo their edit without being reverted. The bot does not contain simple heuristics, so we cannot make it more aggressive for certain offenses. However, it may be possible to override the 1RR (this rule does make the bot miss a fair amount of second-time vandalism) in some strict circumstances, such as where the edit has a very high score, and more than half of the user's previous edits have been vandalism, or something like that. But before overriding 1RR under any circumstances, there should be significant community discussion on the issue. Crispy1989 (talk) 01:25, 22 November 2010 (UTC)
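The 1RR behaviour described above (no more than one revert per user/article combination per day) can be sketched as follows. This is only an illustration, not the bot's actual code: the `RevertLimiter` class, its method names, and the choice to model "same day" as a rolling 24-hour window are all assumptions made for the example.

```python
from datetime import datetime, timedelta


class RevertLimiter:
    """Allows at most one revert per (user, article) pair within a time window."""

    def __init__(self, window=timedelta(days=1)):
        # Assumption: "same day" is modelled as a rolling 24-hour window.
        self.window = window
        self._last_revert = {}  # (user, article) -> time of the last revert

    def may_revert(self, user, article, now):
        last = self._last_revert.get((user, article))
        return last is None or now - last >= self.window

    def record_revert(self, user, article, now):
        self._last_revert[(user, article)] = now


limiter = RevertLimiter()
first = datetime(2010, 11, 22, 1, 25)
limiter.record_revert("ExampleUser", "Example article", first)

# The same user redoing the edit an hour later is left alone.
print(limiter.may_revert("ExampleUser", "Example article", first + timedelta(hours=1)))
# A different article, or the same pair a day later, is eligible again.
print(limiter.may_revert("ExampleUser", "Another article", first + timedelta(hours=1)))
print(limiter.may_revert("ExampleUser", "Example article", first + timedelta(days=1)))
```

Keying on the user/article pair rather than on the user alone is what lets a wrongly reverted editor simply redo the edit, at the cost of missing some second-time vandalism on the same page.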
Additional: [2] this is reverting the addition of an internal English Wikipedia link, formatted as an external link. Presumably it looks like spam, but perhaps this can be tweaked. Rd232 talk 20:17, 21 November 2010 (UTC)
New headers each warning??
Hi - it seems that ClueBot NG makes a new header for the month for every first warning they give out - there are 3 November 2010 headers here : [3] - if you can fix it that would be great :) --Addihockey10e-mail 18:16, 22 November 2010 (UTC)
- This is a known issue and appears to be intermittent. Cobi is working on fixing it. Crispy1989 (talk) 19:08, 22 November 2010 (UTC)
- It's an issue with the fact that ClueBot NG (and ClueBot) simply append a subst'd template to the end of the talk page.[Specifically, this one] Someone decided to add a header to the level 1 template. I will see about fixing this properly in the code, as it cannot be done in a template, but it is somewhat low-priority right now. -- Cobi(t|c|b) 19:42, 22 November 2010 (UTC)
Race condition?
This edit by the bot restored a bit of vandalism that had just been reverted in the same second. My guess is that it appropriately identified the vandalism but missed that it had been changed before it got to do it itself? --John (User:Jwy/talk) 20:37, 22 November 2010 (UTC)
- It's not a race condition, just a normal false positive. It presently has a few issues with vandalism reverts, because there are few/none present in the dataset. This should stop with time when the review interface dataset becomes large enough to use as a training set. Crispy1989 (talk) 20:40, 22 November 2010 (UTC)
- Thanks for the reply. I'll assume you will take care of getting it into your database if useful? --John (User:Jwy/talk) 21:34, 22 November 2010 (UTC)
- Blanking a section of an article, and replacing it with "DONKEY BALLS" isn't even remotely acceptable behavior for a bot. It's also 100% preventable, without dataset extension: the bot simply needs to evaluate the reversion it is about to make, as though it were considering whether an edit made by another user would be identified as vandalism. If so, the reversion should be suppressed. I assume that the bot would consider section blanking and replacement with "DONKEY BALLS" to be vandalism if done by a non-whitelisted user, as this is the sort of malicious edit that is most easily identified by anti-vandalism bots. Peter Karlsen (talk) 04:46, 23 November 2010 (UTC)
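The check proposed in this thread (score the bot's own pending reversion as if another user had made it, and suppress it if it would be flagged) might look roughly like the sketch below. The `should_apply_revert` function and the `toy_score` stand-in are hypothetical; the real bot scores edits with a neural network whose interface is not described here.

```python
from typing import Callable

# A scorer takes (text before the edit, text after the edit) and returns a
# vandalism score between 0 and 1. This signature is an assumption.
Scorer = Callable[[str, str], float]


def should_apply_revert(score: Scorer, current_text: str,
                        restore_text: str, threshold: float) -> bool:
    """Return False when the revert itself would be classified as vandalism."""
    return score(current_text, restore_text) < threshold


def toy_score(old_text: str, new_text: str) -> float:
    # Toy stand-in for the real classifier: flags edits that remove most of
    # the text and leave only upper-case content.
    shrank = len(new_text) < 0.5 * len(old_text)
    shouting = new_text.isupper()
    return 0.95 if (shrank and shouting) else 0.05


clean = "The donkey is a domesticated member of the horse family."
vandalized = "DONKEY BALLS"

# Restoring the vandalized text over the clean one is suppressed.
print(should_apply_revert(toy_score, clean, vandalized, threshold=0.9))
# Restoring the clean text over the vandalized one goes ahead.
print(should_apply_revert(toy_score, vandalized, clean, threshold=0.9))
```

The check costs one extra classification per revert and needs no new training data, which is the point made in the comment above.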
false positive rate
After reviewing hundreds of bot edits, I'm concerned that the false positive rate may be set too high. The 0.25% false positive rate sounds impressive until you consider more intuitive measures of performance. Assuming 10% of edits are malicious and the bot reverts 60% of those, a 0.25% false positive rate implies that 3.6% of the bot's reverts (1 in 28) are false positives. If the bot makes 2,500 reverts per day, that's 2,410 good reverts and 90 false positives per day.
If you view Wikipedia purely as a data repository, that looks like great progress. However, Wikipedia is also a community of editors, one constantly in need of "new blood". I believe that false positives do great harm to the Encyclopedia by driving away good-faith contributors. Most of the editors hit with false positives are newcomers with less than ten edits. If an experienced editor gets wrongly reverted, she presumably knows enough to take it with a grain of salt. But a good-faith user whose first or fifth contribution is reverted three seconds later by a bot is unlikely to return. Most don't bother to report the error. Of course, such harm must be balanced against our workload as vandal-fighters and the harm that might occur if more vandalism went undetected. I raised these issues at User_talk:ClueBot_Commons#false_positive_rate and got many interesting replies, and now I am moving the discussion here. --Stepheng3 (talk) 20:01, 28 November 2010 (UTC)
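The arithmetic in the paragraph above can be reproduced with a short sketch. The function name `revert_false_positive_share` is made up for this illustration, and the inputs (10% malicious edits, 60% catch rate, 0.25% false positive rate, 2,500 reverts per day) are the assumptions stated in that paragraph, not measured values.

```python
def revert_false_positive_share(vandalism_fraction, catch_rate, false_positive_rate):
    """Fraction of all reverts that hit legitimate edits."""
    good_fraction = 1.0 - vandalism_fraction
    true_reverts = vandalism_fraction * catch_rate      # vandalism that gets reverted
    false_reverts = good_fraction * false_positive_rate  # good edits that get reverted
    return false_reverts / (true_reverts + false_reverts)


share = revert_false_positive_share(0.10, 0.60, 0.0025)
daily_reverts = 2500
false_per_day = share * daily_reverts

print(f"{share:.1%} of reverts are false positives (about 1 in {1 / share:.0f})")
print(f"{false_per_day:.0f} false positives and "
      f"{daily_reverts - false_per_day:.0f} good reverts per day")
```

Because the share depends only on the ratio between the two rates, scaling the false positive rate and the catch rate by the same factor leaves the per-revert figure unchanged.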
- To summarize the discussion on the talk page (not in chronological order):
- Initially, several people misunderstood the meaning of "False Positive Rate", although it has been clearly explained in multiple places that it means "portion of legitimate edits that are incorrectly classified as vandalism".
- Using estimations from several users on actual number of false positives, the actual live false positive rate was calculated to be well within the stated 0.25%.
- Someone pointed out that many of the supposed "false positives" reported by the user(s) opposed to the bot's current performance are not actually false positives, and were indeed correctly reverted as vandalism. Even so, the counted number of false positives was within the expected 0.25%.
- A user used the bot's administrator shut-off (intended to be used when the bot is behaving unexpectedly) to stop the bot's operation. The same user reversed this decision about a day later.
- There was some misunderstanding about the accuracy of the false positive rate. 0.25% false positives is not based on number of reported false positives, but is an accurate number based on a dry trial of random edits not used for training.
- Much of what is explained on the bot's user page and on this BRFA was reiterated, including how the FP rate is calculated, and how a certain number of FPs are necessary for the bot's proper operation.
- The impact of vandalism and importance of human vandal fighters' time was reiterated by myself and several other impartial users.
- It was implied that all users subject to a false positive will leave Wikipedia and never edit again. This was given without proof and is incorrect.
- The statistic that "1 in 400 incorrectly reverted legitimate edits is worth 200 in 400 correctly reverted vandalism edits" was put forth, and debated.
- Whether or not the time that human vandal fighters spend patrolling edits is significant, was debated.
- Whether or not human vandal fighters catch 100% of vandalism immediately, was debated.
- The fact that ClueBot NG's false positives are not what one would expect from a normal bot and often are not triggered by things such as bad words, was reiterated. This makes it much clearer to users that are subject to false positives that they did not do something wrong.
- It was suggested by an impartial user that his/her own human false positive rate is likely greater than the bot's.
- My position, and that of several others posting there, is that reducing human vandal fighter workload by half or more allows them to contribute significantly more new material to the encyclopedia. It also prevents half or more of the vandalism that currently gets through, from getting through, keeping Wikipedia twice as clean from undetected vandalism. I believe this is well-worth the minimal impact of less than 1 in 400 false positives, particularly considering that the warning makes it clear the revert may have been a false positive, and provides instructions for undoing the revert. Crispy1989 (talk) 21:16, 28 November 2010 (UTC)
- Your comments are incorrect, and self-contradictory. If users presently fighting vandalism reduce the amount of time spent on it to "contribute significantly more new material to the encyclopedia", the purported vandalism-reduction benefits of the bot will be blunted by the diminution of human effort in this area. However, if users do indeed contribute less time to vandalism reversion, the more likely outcome will be a reduction in their total contributions, since most users with a desire to write content are already doing so. Have many users actually said that they would write more for Wikipedia, if only they weren't tied up with RC patrol?
- For the purpose of comparing this bot's false positive rate to that of human users, it is absolutely imperative that the rate be quoted in the same terms that would intuitively be used to measure the accuracy of human anti-vandalism efforts: the percentage of the total edits reverted that are false positives. Once the false positive rate is provided in a comprehensible format, I believe that the ugly truth will become apparent: one would be hard pressed to find nearly as many false positives in 250 edits reverted by an experienced, skilled human user, as there would be in the same number of edits reverted by this bot during approximately the same time. Any edits which can be automatically identified as almost certainly vandalism, without an unacceptably high false positive rate, are already blocked by the edit filter. Contributions which are accepted by the filter currently require human judgment to evaluate, to avoid automated violations of WP:BITE and discouragement of new editors by an unacceptably inaccurate bot.
- The time spent on developing this bot shouldn't be considered wasted, however. Perhaps the neural network feature could be integrated with a human-assisted anti-vandalism program such as Huggle, as a means of identifying edits which are probably vandalism, with a user-adjustable threshold score for identification. Peter Karlsen (talk) 23:58, 28 November 2010 (UTC)
- The statements you say are incorrect are not my saying. They are summarized from the talk page discussion. Please take the time to read there. And yes, there are users that have made these statements.
- Whether or not the bot is used to revert vandalism is up to the BAG and will be decided at the completion of its trial. The benefits of using it in a human vandalism program are limited, as it is designed as a first line of defense. Considering that the false positive rate is adjustable and can be easily changed (I don't know why I have to keep saying this, people just don't seem to understand), there's no reason it shouldn't be approved.
- I find it unfortunate that I have to spend so much time repeating over and over things I have already said, when I could be spending the time improving the code. But apparently this is a necessity in getting community approval. Crispy1989 (talk) 00:08, 29 November 2010 (UTC)
- I am also surprised that, in all this complaining, nobody has suggested simply using an alternate false positive rate. I'll even take suggestions for thresholds. Every time Cluebot NG reverts an edit, it leaves a score. Suggestions for a score threshold or false positive rate (within reason) will be considered, and I can post stats on bot effectiveness given either the threshold or FP rate. Crispy1989 (talk) 00:31, 29 November 2010 (UTC)
- I am well aware that the target false positive rate is adjustable by the bot operators only. A member of BAG could presumably require you to reduce the rate. However, I have taken notice of the fact that, despite the mounting criticism of the bot's incorrect reversions, you have not actually reduced the false positive target. There would certainly be no objection to the bot running at a lower false positive rate than the one under which it was approved for the trials. This refusal to modify a clearly problematic bot task until a BAG member actually forces you to do so is worrisome. Therefore, I am evaluating the bot based on its present mode of operation, rather than some hypothetical alternate configuration that might exist, had you been more responsive to the community's concerns.
- The value of integrating a neural network into an application like Huggle is that existing filters used to present possibly malicious edits for human examination are extremely primitive. I would venture to say that over half of the "filtered" edits are not vandalism, while much of the vandalism that this bot catches is missed. The benefits of using a neural network to allow human users to identify likely vandalism in a flood of other edits would be extraordinary, especially considering that the target false positive rate could safely be set much higher in a manually-confirmed reversion application.
- Your claim that "in all this complaining, nobody has suggested simply using an alternate false positive rate" is untrue [4] - and the fact that you made it shows a disregard for community input. Whether there's a reason the bot shouldn't be approved depends largely on whether you are willing to respond to the community's critiques by lowering the false positive target now. (I still believe that my suggestion of 0.1% above is prudent.) The ball is in your court. Peter Karlsen (talk) 01:04, 29 November 2010 (UTC)
- I have not lowered it below 0.25% because there has been no consensus. The BAG does have the final say, but if there was a community consensus, I would immediately adjust it. Looking at the bot talk page, there have been a number of instances of people happy with the bot's current performance. Additionally, at least one user has explicitly stated that they are happy with the aggressiveness. Without consensus, the only sane option is to delegate to the BAG for arbitration.
- Before you say that I do not listen to community input, you should take note of the fact that the original FP rate was set at 0.5%. I reduced it to 0.25% very early on, because at that point, there was consensus that 0.25% was preferable to 0.5%. I also evaluated your 0.1% suggestion, and determined that it would cause a significant drop in the bot's catch rate.
- 0.1% may be an acceptable value with decent performance, but the trial dataset is not currently large enough to accurately calculate the threshold. I will be able to accurately evaluate its effectiveness and calculate the threshold when the dataset from the review interface is approximately doubled in size.
- In lieu of a larger trial dataset, I can at present evaluate a given threshold, although as the bot changes, a set threshold can vary significantly. Reviewing the reported false positives (with a grain of salt - some of them aren't really false positives) may allow you to suggest a threshold. I would be open to running the bot for a day or two with a given set threshold within reason, to see if its performance in that mode is acceptable. Crispy1989 (talk) 01:26, 29 November 2010 (UTC)
- If it's unclear whether any given reduced false positives target would retain sufficient performance to significantly decrease the rate of false positives per edits reverted (which is what the community is actually concerned about), then why did you claim that because "the false positive rate is adjustable and can be easily changed... there's no reason it shouldn't be approved", as though this would solve all of the bot's problems? Given the uncertainty you've described, I find it reasonable to evaluate the performance of the bot under its present configuration, without assuming that there necessarily is a better one.
- BRFAs require consensus for approval. In light of the strong concerns many editors have expressed about the bot's excessive incorrect reversions, I don't believe that such a consensus exists. If the probability of an improvement in the bot's configuration is significant, then this request could be left open until it occurs, or is clearly not possible. Peter Karlsen (talk) 02:30, 29 November 2010 (UTC)
- I believe BRFAs require consensus among the BAG. The reason the BAG was created is that non-members often don't have the knowledge or perspective to make informed decisions on automated processes. As neither you nor I are members, it's not up to either of us to decide, and we should leave it to them.
- About the ease of changing the false positive rate, it is exactly as I have described. The automatic threshold calculation is a helpful feature on top of the core. For excessively low false positive rates, it requires a very large trial dataset to accurately calculate. At 0.1% false positives, our current trial dataset would only yield a single false positive. I could run it with this, but I'd rather not, because there are some people who would interpret this as an inaccurate claim. Instead, as I have already explained (and you, once again, ignored), the threshold can be manually adjusted based on observed false positives.
- As I already stated, I'd much rather spend my time actually working on improvements, as it's a continual process, instead of repeating myself and arguing. Whether or not the bot is approved in its current state is up to the BAG as soon as the trial ends. Crispy1989 (talk) 03:04, 29 November 2010 (UTC)
- I have increased the threshold. With the new threshold, our trial dataset (containing 963 good edits) has zero false positives. So the false positive rate should now be approximately 0.1%. The catch rate has decreased a fair amount (it's hard to tell exactly how much, again due to dataset size), but it should still be at least twice as effective as the old Cluebot. Crispy1989 (talk) 03:34, 29 November 2010 (UTC)
- I certainly hope that the change in the threshold for reversion ultimately produces better results than this. If the per edit examined false positive rate is halved, but the amount of vandalism caught is decreased by a similar factor, then the per-revert false positive rate, which is intuitively used to measure the accuracy of anti-vandalism work, will remain unchanged. Comments at this BRFA and at User talk:ClueBot Commons suggest that this continued high level of inaccuracy would still be unsatisfactory. The latest complaint about the bot, User_talk:ClueBot_Commons#Problem_with_the_bot, came in after the threshold for reversion was increased. The system for reporting false positives, and lack of individualized responses, was also critiqued. The claim that (with three bot operators) everyone responsible for the bot is simply too busy to articulate responses to the false positive reports [5] is troubling. The accepted Wikipedia standard for responses to claimed errors in automated tools designed to stop malicious edits, as shown at Wikipedia:Edit filter/False positives/Reports, is that each and every false positive report is examined, and receives a response to determine whether it is genuine. When an edit filter produces an actual false positive, it is usually possible to modify it to prevent a recurrence. However, you state that this bot is a "black box"[6] such that the cause of errors often cannot be ascertained and immediately corrected. Instead, false positives are generically attributed to "the dataset being too small". The only solution offered is to increase the size of the bot's dataset. But if the bot is not yet adequately configured to avoid an unacceptably high (per-revert) level of false positives, then why is it making live edits at all? 
Continuing the dry run, and examining which edits it would have reverted, would provide adequate data on new false positives, while relieving the bot operators of the burden of responding on-wiki to false positive reports.
- I apologize if my critiques of the bot's operations, and those of my colleagues, seem inadequately appreciative of your software development efforts. The theory of the bot's operation is original and intellectually intriguing; the present code, configuration, and dataset can already identify edits that are probably vandalism, and could be used to improve human-assisted anti-vandalism programs. With sufficient refinement, the bot may one day be an acceptable fully automated anti-vandalism tool. No deprecation of your contributions is intended in the candid observation that the bot is not yet ready for mainspace live-editing approval. Peter Karlsen (talk) 04:15, 1 December 2010 (UTC)
- The comment you link to is mostly out of frustration. Despite your apparent surety that the bot is inadequate, you are one of only two people that I can see to strongly complain about the false positive rate, where many people have been happy and satisfied with it. I find myself making these FP rate changes, just because it's very time consuming to carry on these debates about the same topic, where the pertinent information has already been stated in various places.
- The complaint was about the FP rate, not percentage of reverts that are false positives. The FP rate has been more than halved. In fact, the percentage of reverts that are false positives has also been decreased, due to the effect postprocessing has on the results. The FP rate is determined by the core, but the final decision to revert may be overridden by some set metrics in the Wikipedia interface. Because these metrics apply less-often to higher-scored edits, increasing the threshold lowers the percentage of would-be reverts stopped by the post-processing filters. Therefore, overall, even the percentage of reverts that are false positives has been decreased.
- I'd also like to point out that, due to these post-processing filters, the given FP rates, whether 0.25% or 0.1%, are maximums. Observed FP rate is likely to be significantly below this, as many FPs are caught and eliminated by the post-processing filters.
- To support this, take a look at one of the recent comments on the bot talk page, made after the threshold increase. The user states that they had to review over 100 diffs/reverts from Cluebot-NG to find a single false positive. While this isn't a wide sample set, it should give you some idea of the "accuracy" after the change.
- You may wonder why I still disagree with such a low FP rate, even if I know that it increases overall "accuracy" - the reason is that, for an antivandal bot to really make a difference, it has to revert a significant portion of vandalism. Bots like the old ClueBot reverted an estimated 5% of vandalism. This is enough to get it noticed by human editors when they're beaten to a revert, but doesn't significantly decrease the time necessary for human patrollers to spend. Even with the lowered 0.1% false positive rate, ClueBot-NG is more than five times as effective as the old ClueBot, but the entire purpose of an antivandal bot should be to make a real difference.
- You mention a recent complaint - but it is unrelated to the FP rate, or number of false positives at all. Rather, it is related to the handling of false positives. The discussion there clearly spells out our reasoning, and is mostly supported by at least one independent and impartial user.
- It really is very time-consuming to respond to every false positive manually, and even with three bot operators, there's not enough spare time to go around. One of the bot operators has a wife, two jobs, and school to worry about, and still finds time to work on bot development. Another of the bot operators spends most of his time working on dataset management, which as we've repeatedly stated, is what can most improve the bot - his remaining time is spent on real-life commitments. The third wasn't really involved in core development, and doesn't know enough about it to respond to false positives with anything more than a form-letter response.
- What's more, individual responses are not necessary. As you point out, each one should be reviewed to make sure it's actually a false positive. And each one is. As stated in multiple places, each reported false positive is submitted to the review interface, where we can draw on community effort to classify them.
- While the system for reporting false positives was criticized, no suggestions were offered on how to improve it (by the primary user doing the criticizing). Another user spent the time to find a false positive and report it, and not only determined that most of the criticism was invalid, but also did indeed give some suggestions, which are being discussed and will likely be implemented very soon.
- The neural network is indeed a form of a "black box", but this is not the only reason that simply examining false positives will not directly and immediately help accuracy. As explained in multiple places, a certain number of false positives are absolutely necessary for the bot's operation. The choice to be made is simply how many false positives are acceptable, and the bot operates as well as it can, given that number. Ideally, with time, the FP rate can be decreased without hindering the bot's performance much or at all (as the dataset is improved), but false positives as a whole, and individual occurrences, can never be entirely eliminated.
- Extending this, it can be seen that your following comment about the bot not yet being ready is invalid. The dataset will always be able to be improved, more and more. It's a continual process. There will never be a point at which we can say "Stop. It's as good as it can be. Nothing more can be done." Just because there is still room for improvement doesn't mean the bot is not ready to make live edits. If this were so, the bot would never be ready. A point has to be set at which the FP rate is considered acceptable - you've suggested the FP rate of 0.1%, and that has been acted upon.
- Your suggestion about continuing a dry run is noted, but would never work in practice. Keep in mind that around or less than 1% (estimated as per the above-mentioned user that went through a series of diffs) of CBNG edits are now false positives. Reviewing data from a dry run would take 100x more time than responding to individual false positives - if there isn't time for the latter, there would never be time for the former. Another important consideration is that we cannot improve the dataset by ourselves, and nobody wants to spend time on a review interface for a bot that isn't active. The live edits, even in the current state, are not only extremely worthwhile for Wikipedia, but also bring in a steady stream of contributors who continue to help improve the dataset. Stopping the bot now would all but eliminate these contributions, and this would probably mean that it would never actually be approved.
- My comments about "lack of appreciation" on the user talk page discussion do not apply to you. While I disagree with you on most of the points you bring up (for reasons I believe are correct and stated above), it's also clear that you are trying to help, with nothing but good faith. I mentioned "lack of appreciation" on the talk page because the user in that context was engaging in nonconstructive flaming and even making up quotes from the dev team to try to make us look bad. Nonconstructive complaining, and, worse, flaming, are not welcome at all. All other forms of comments and suggestions, even if we disagree with them, are welcome, and at the very least open to consideration.
- The upshot of this all is, the bot is already in a much-more-than-adequate state to be running live. There is indeed still room for improvement, but there will always be room for improvement. The bot is already much improved from all predecessors, and only seems to be having more issues with false positives because of its much higher overall edit rate - so much so that things such as minor bugs in the Wikipedia interface, that have remained unnoticed and unreported for the three years the original ClueBot has been running, are now being noticed and fixed very rapidly. Even in trial, CBNG is making a significant difference, more than noticeable by vandal fighters and other users alike, as clearly evidenced by numerous comments on the user page and talk page. There are no significant outstanding problems - particularly when significance of problems is compared with previous AV bots. Crispy1989 (talk) 09:48, 1 December 2010 (UTC)
- I realize that information and logic about behavior of the bot, particularly related to false positives, has been spread out over multiple discussions in different places. To make it easier to follow along, I have consolidated the information in a few places, and tried to explain it simply and concisely: FAQ on CBNG False Positives, Detailed Information on CBNG False Positives, CBNG Algorithms. Crispy1989 (talk) 10:53, 1 December 2010 (UTC)
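The threshold calculation discussed in this thread (choosing a score cutoff so that a target share of known-good trial edits would be reverted) can be sketched as below. This is an illustration under stated assumptions, not the bot's actual calculation: the `threshold_for_target` function is hypothetical, the scores are synthetic, and the revert rule is assumed to be "score strictly greater than the threshold". Only the dataset size of 963 good edits and the 0.25% and 0.1% targets come from the discussion.

```python
import math


def threshold_for_target(good_scores, target_fp_rate):
    """Pick a cutoff so that at most target_fp_rate of good edits score above it."""
    allowed = math.floor(target_fp_rate * len(good_scores))
    ranked = sorted(good_scores, reverse=True)
    if allowed >= len(ranked):
        return 0.0
    # Edits are reverted when score > threshold, so the (allowed + 1)-th highest
    # good score is the lowest cutoff that keeps false positives within budget.
    return ranked[allowed]


# Synthetic, evenly spread scores standing in for 963 human-verified good edits.
good_scores = [i / 963 for i in range(963)]

for target in (0.0025, 0.001):
    cutoff = threshold_for_target(good_scores, target)
    false_positives = sum(score > cutoff for score in good_scores)
    print(f"target {target:.2%}: cutoff {cutoff:.4f}, "
          f"{false_positives} false positives in {len(good_scores)} good edits")

# The smallest non-zero rate this dataset can measure is one edit in 963.
print(f"resolution: {1 / len(good_scores):.3%}")
```

With 963 good edits, a 0.1% target allows less than one false positive, so the cutoff ends up above every good score in the trial set. This is why the resulting rate can only be described as approximately 0.1% until the dataset grows.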