User talk:West.andrew.g/Archive 1
This is an archive of past discussions with User:West.andrew.g. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
Archive 1 | Archive 2 | Archive 3 | → | Archive 5 |
STiki?
Hi, I was wondering two things about STiki.
- When will STiki be generally available?
- What makes STiki different from Lupin's Anti-Vandal Tool, Twinkle, or any other vandalism fighting program?
Also, on a side note, you may want to change the STiki edit summaries. The summary gives a redlink for the STiki page, pointing to the main namespace (article space). Happy Editing! Hamtechperson 20:35, 26 February 2010 (UTC)
- Hi Hamtechperson. The fundamental difference between STiki and existing anti-vandalism tools is that STiki examines exclusively revision metadata. Never once does it look at the revision/diff text when making a classification decision. In practice, my classifications might best complement (i.e., be an additional feature of) other vandalism tools. At present, however, I am focusing more on the academic merit of my spatio-temporal hypothesis. Features are pulled from IP geolocation data, time of day, article history, user history, revision comment length, etc. (a rough sketch of such a feature vector appears at the end of this thread).
- STiki is currently under development, but nearing beta testing. Its release is also pending the publication of some academic writings. It is written in Java and works as a desktop application (rather than the JavaScript/Wikipedia-layer approach that seems to be the norm). I will ping you when it becomes available. While a big fan of Wikipedia, I am only a casual editor, at best. It would be nice to have the perspective of a Wikipedia user like yourself who understands the administrative nuances, etc.
- Regarding the edit summaries -- I was trying to make a place-holder so that when I (eventually) write the STiki article, previous reverts will link to it. I didn't want to write that article yet, because I feared it would go straight to speedy deletion, since the tool is not yet available, etc. How would you suggest I handle this? West.andrew.g (talk) 20:58, 26 February 2010 (UTC)
- Instead of having a page in the article space, which would probably be deleted anyway even if STiki were out, you should have a page such as Wikipedia:STiki, and in your edit summaries link it as [[Wikipedia:STiki|STiki]], which produces STiki. By doing this, you need not have the article in mainspace. You could also use this as a place to post news updates about the development of STiki.
- About the article: please, please, please read WP:COI before writing the STiki article. Also, you should read WP:NPOV. If anything, consider this: Twinkle and Friendly, two of the most well-known Wikipedia tools, don't have their own Wikipedia articles. These are both tools that will likely be better known than STiki, and unless STiki gets reliable, third-party sources, it too will be ineligible for its own Wikipedia article.
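A rough illustration of the metadata-only approach West describes above: a minimal Java sketch of a feature vector built without ever reading the diff text. All class, method, and feature names here are hypothetical assumptions; this is illustrative only, not STiki's actual code.

    import java.util.LinkedHashMap;
    import java.util.Map;

    /** A minimal sketch (not STiki's actual code) of a metadata-only
     *  feature vector. All field and feature names are hypothetical. */
    public class EditFeatures {

        /** Build features from revision metadata alone; the diff text is never read. */
        public static Map<String, Double> extract(long unixTimestamp,
                                                  int utcOffsetMinutes,   // from IP geolocation
                                                  int priorEditsByUser,
                                                  int priorRevertsOnArticle,
                                                  int commentLength,
                                                  boolean isAnonymous) {
            Map<String, Double> f = new LinkedHashMap<>();
            long secondsIntoDay = (unixTimestamp + utcOffsetMinutes * 60L) % 86400;
            f.put("local_time_of_day", secondsIntoDay / 86400.0);       // time-of-day signal
            f.put("user_edit_count", (double) priorEditsByUser);        // user history
            f.put("article_revert_history", (double) priorRevertsOnArticle); // article history
            f.put("comment_length", (double) commentLength);            // revision comment length
            f.put("is_anonymous", isAnonymous ? 1.0 : 0.0);
            return f;
        }
    }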
Mr Tumnus
Hi, looks like you accidentally reverted into the middle of a multi-edit vandalism. Good luck with STiki by the way, seems like an interesting project. PaleAqua (talk) 23:45, 26 February 2010 (UTC)
Rollback
I have granted rollback rights to your account; the reason for this is that after a review of some of your contributions, I believe you can be trusted to use rollback correctly, and for its intended usage of reverting vandalism, and that you will not abuse it by reverting good-faith edits or to revert-war. For information on rollback, see Wikipedia:New admin school/Rollback and Wikipedia:Rollback feature. If you do not want rollback, just let me know, and I'll remove it. Good luck and thanks. JamieS93❤ 22:53, 8 March 2010 (UTC)
STiki on other projects
Hey, I stumbled upon your new anti-vandalism tool a few minutes ago, and I just want to know if it will be available for other projects (e.g., the Swedish Wikipedia) in the future? Thanks, tetraedycal 23:16, 8 March 2010 (UTC)
- Hi tetraedycal. It is not my intention to port STiki to other languages/projects. However, STiki will be released as open source, and given the fact that it operates over spatio-temporal properties and not language-based ones -- it would be extremely easy for someone to make and host an edition for another language. Really, the only language-based changes necessary would be (1) changing the user-facing GUI text, and (2) programming in the expected format of rollback strings so the tool can automatically find and learn over previous bad edits (a sketch of such a matcher appears at the end of this thread). If you are interested, I'll give you a ping when the software is released. Thanks. West.andrew.g (talk) 23:28, 8 March 2010 (UTC)
- OK, great. :) And yes please, give me a notice when it's released. tetraedycal 23:34, 8 March 2010 (UTC)
- I too would be interested in testing this out when it's up and going. Congrats on the paper, btw. Ottawa4ever (talk) 10:03, 11 March 2010 (UTC)
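Since STiki learns from prior reverts, a language port mainly needs to recognize its own wiki's rollback summaries, as West notes above. A minimal sketch follows; the pattern is only an approximation of the English Wikipedia rollback summary, and a port would substitute its own wiki's strings.

    import java.util.regex.Pattern;

    /** Hypothetical sketch: recognizing rollback edit summaries per language.
     *  The pattern below is illustrative, not the exact string any wiki uses. */
    public class RollbackMatcher {
        // English rollback summaries look roughly like:
        //   "Reverted edits by [[...|Example]] to last version by Other"
        private static final Pattern EN = Pattern.compile(
            "Reverted edits by \\[\\[.*?\\]\\].*to last (version|revision) by .*");

        public static boolean isRollback(String summary) {
            return EN.matcher(summary).find();
        }

        public static void main(String[] args) {
            System.out.println(isRollback(
                "Reverted edits by [[Special:Contributions/1.2.3.4|1.2.3.4]] to last version by Example"));
        }
    }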
Ping
I would also be interested in its release. Mlpearc MESSAGE 19:19, 15 March 2010 (UTC)
Sign me up
When you are ready to roll STiki out, I will be happy to test it out for you. It's positive contributions like this that make me glad to be here. Keep up the good work! Avicennasis @ 15:54, 19 March 2010 (UTC)
Vandalism reverts
When you revert an edit to remove vandalism, please make sure you're actually removing the vandalism, and not simply moving it around, as happened here. --Carnildo (talk) 22:06, 25 March 2010 (UTC)
Ping List for STiki
If you are interested in being contacted when STiki becomes available, please sign below. Thanks, West.andrew.g (talk) 05:36, 22 March 2010 (UTC)
- Hamtechperson 20:35, 26 February 2010 (UTC)
- tetraedycal 23:16, 8 March 2010 (UTC)
- Ottawa4ever (talk) 10:03, 11 March 2010 (UTC)
- Mlpearc MESSAGE 19:19, 15 March 2010 (UTC)
- Avicennasis @ 15:54, 19 March 2010 (UTC)
Vandalism patrol
I don't have a handy barnstar, but if I did I'd give you one. Nice work. We need a lot more, and if you're developing a new tool, all the better. See you around.
Can the software issue warnings?
Is it possible for the software you created to issue a warning to the person who made a disruptive edit, as well as doing the revert? I must say I do like how it goes deeper into revision feeds than Huggle would, so it's possible to catch older incidents that were missed on the first pass. Thanks, happy editing. Ottawa4ever (talk) 10:33, 21 March 2010 (UTC)
- Hi Ottawa4ever. This is a feature I am planning to implement. Should I append or pre-pend the warning template onto the user-page? Which template is considered standard? Thanks, West.andrew.g (talk) 15:54, 21 March 2010 (UTC)
- (wordy response follows) It's entirely up to you where you go with warnings; each vandal fighter is a bit different in their reverting preferences. A list of typical warnings for disruptive editing is located at Wikipedia:VANDALISM#Warnings (there are more, though these are the typical, often-used ones). Personally (and this is just myself; others may differ) I have favoured the feature in Huggle where it gives the user the option to (or not to) leave a warning on the disruptive editor's talk page after you have done the revert (it also has a revert-and-warn option). But a problem I've noticed in Huggle: if you select revert-and-warn, it can often fail to revert and just warn, causing trouble. So I would think warning on the user's talk page after the revert is safest. The key to the warnings is that they allow the editing behaviour of a user account to be tracked and, if necessary, build a case for reporting at AIV (in hopes of preventing further disruption). So on the first incident a level-1 warning is issued, on the second a level 2, and so forth up to level 4; beyond level 4, report at WP:AIV. I think older versions of Huggle allowed you to specify which level, but the newer versions simply do this automatically. Anyway, let me know if you need any clarifications (I've tried to be broad here and you likely already knew this stuff). Again, it's a very intriguing piece of software that you've written. I especially think it will be quite useful for catching sneaky and overlooked disruptive editing. I'll probably give it a bigger test in the coming week. Ottawa4ever (talk) 17:01, 21 March 2010 (UTC)
- Hi again. I've made some changes so that a "warn user" feature (checkbox) is now available as part of my GUI. I've decided to use warn level 2 by default, since my tool has human confirmation of vandalism and is not a bot. I think the incrementing use of warn levels is very intriguing and useful -- though it will take me a bit more time to implement something along these lines (I assume it will involve parsing the existing user talk page to see if any warning templates are already present?). If the reversion fails to go through for any reason (intermediate edit), then the warning will NOT be left. In the future, I will consider adding support for variable warning templates. Thanks again for your interest. West.andrew.g (talk) 06:45, 22 March 2010 (UTC)
- Actually, I took the time today to implement the incrementing user warnings (now available for download at the STiki page); a sketch of the template-parsing approach appears at the end of this thread. I think it is a slightly imperfect art: Twinkle, Huggle, and all the home-brewed strategies seem to have minor formatting differences, and it is ultimately impossible to compensate for all of them -- but my strategy now seems solid and quite encompassing. If someone vandalizes with a recent uw-4, they are reported on the AIV page. I don't issue warnings if a vandalism incident occurred long in the past. Warnings only take place if the revert succeeds. Again, thanks for your testing and I look forward to your feedback in the future. West.andrew.g (talk) 04:34, 25 March 2010 (UTC)
- I intend to give it a good long test run today. I'll report my experiences on the STiki talk page. Typically my preference is not to warn an IP if the edit is older. This is mostly because IPs rotate, and it's pretty rough when someone gets a warning saying they vandalized when they didn't and someone else did. So unless it's a fresh edit, I'll pass on warning (unless it's severely against policy, in which case my warning would be mostly custom to the editor). I think this tool has a distinct advantage in that it's digging into older edits that wouldn't so much be in the Huggle queue, which, as you said, complements rather than competes. Looking good so far :)... talk soon, Ottawa4ever (talk) 08:52, 25 March 2010 (UTC)
- OK, I gave the software some glowing feedback, but also suggested some fixes; please have a look at the STiki talk page. It handles very well; you should be super proud of the software. Ottawa4ever (talk) 15:17, 25 March 2010 (UTC)
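A minimal sketch of the incrementing-warning logic discussed in this thread, assuming the standard {{uw-vandalism1}} through {{uw-vandalism4}} template family is parsed out of the user's talk page wikitext. The class and method names are hypothetical, and a real version would also check warning recency, as noted above; this is not STiki's actual implementation.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /** Sketch of the incrementing-warning idea. Not STiki's code. */
    public class WarnLevel {

        private static final Pattern UW =
            Pattern.compile("uw-vandalism([1-4])", Pattern.CASE_INSENSITIVE);

        /** Scan a user's talk page wikitext and return the next warning
         *  level to issue; a return of 5 means "report to AIV instead". */
        public static int nextLevel(String talkPageWikitext) {
            int highest = 0;
            Matcher m = UW.matcher(talkPageWikitext);
            while (m.find())
                highest = Math.max(highest, Integer.parseInt(m.group(1)));
            return Math.min(highest + 1, 5);
        }

        public static void main(String[] args) {
            System.out.println(nextLevel("{{uw-vandalism2}} ... {{uw-vandalism3}}")); // prints 4
        }
    }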
David Cross
It is relatively obnoxious to receive a message from some kid who has sequestered himself in one specific area of study, and thus acts and admonishes before he actually knows. Being a computer scientist is not tantamount to being a polymath, so you should refrain from your knee-jerk responses and admonishments until you know more about a particular topic, in this instance David Cross. This is a common problem amongst the technically-talented: belief that they are more gifted than they are (which is absurd when you consider the paltry number of direct competitors in your specific area of concentration) which, as is the case now, often materializes as this quasi-authoritative B.S., at your delusional whim. It is a cute little program that you have, but as it exists now (and apparently you have been beta testing it on actual information/articles, when it's clear that it has limitations) it is only as viable as the breadth of knowledge of the user, which means that your program is trying to operate from the midst of one of the main problems with regard to Wikipedia, namely subjective limitation/skewing. As for the David Cross article, he frequently admits to only "fucking" Amber Tamblyn as his book and interviews like this one [1] illustrate nicely. The word "fucking" is the word Mr. Cross chooses to use when describing his relationship with Ms. Tamblyn, so who are you to edit that out using your little program and no understanding of the topic in question, thereby rolling the edit back to a less accurate version, and then to send me a message admonishing me? Are you the self-appointed information police? Apparently, Russ Tamblyn, Amber's father, does not care that David Cross keeps telling everyone that he is "currently fucking Amber Tamblyn."
- STiki doesn't flag vandalism automatically -- it requires humans to look at edits. Thus its limitations are more inconvenience than inaccuracy, but that's beside the point. I completely agree with your comment about subject skewing, but in this case I stand by my decision to revert your edit: You can't just go dropping the f-bomb in the middle of an article. There are more appropriate ways to paraphrase the same thing. Further, you could have quoted the word and provided a reference to the source you listed above. Had any of these criteria been met, I wouldn't have reverted you. I take no issue with the accuracy of your change, just how it was presented. West.andrew.g (talk) 16:09, 25 March 2010 (UTC)
Fair enough, Andrew. Thanks for your response, and my apologies with regard to my abrasiveness. Like you, I strive for objective truth and accuracy, even if this particular issue is somewhat unimportant, per se. —Preceding unsigned comment added by 69.114.38.15 (talk) 03:59, 26 March 2010 (UTC)
Researcher API rights
Andrew, you should get in touch with Erik Moeller at the Wikimedia Foundation. DarTar (talk) 10:42, 4 June 2010 (UTC)
STiki
I have noticed that if you miss a bit of vandalism on the page you can't go back a slide, so I thought you might be able to make a forward and back button. Cheers and all the best. Let me know what you think of this idea. Gobbleswoggler (talk) 18:54, 7 June 2010 (UTC)
- The back button is something on my to-do list. I feel like I notice a good bit of vandalism in the milliseconds after I press a classification button. This is especially prevalent when I'm moving very fast through edits. I tend to move quickest using the keyboard shortcuts -- if you don't know about them: after you do a single mouse-based classification, you can use the "v", "p", and "i" keys to classify very quickly, without the hassle of moving the mouse (available in newer versions, which you likely have). Thanks for your feedback and I'll send you a message if/when this improvement gets implemented. Thanks, West.andrew.g (talk) 19:01, 7 June 2010 (UTC)
- When do you think the back button will be completed? Gobbleswoggler (talk) 19:14, 7 June 2010 (UTC)
- The button has been implemented. After some testing and documentation, I plan to upload a new version today. I'll post on your talk page when I do. West.andrew.g (talk) 15:58, 8 June 2010 (UTC)
- The new version has been uploaded with the improvement (version 2010/6/8). I've tested pretty thoroughly -- but let me know if you notice any strange behavior with respect to the "back" button. You can go back at most one edit, and cannot use the button to revisit an edit that was reverted. Secondly, I notice you use the "pass" button far more often than the "innocent" one. If you're pretty confident an edit is not vandalism, go ahead and classify it as "innocent" -- it helps maintain the edit queue and will make your user experience a little faster. Thanks, West.andrew.g (talk) 18:03, 8 June 2010 (UTC)
- Just out of interest, how does STiki filter which slides to show, and how does it know an edit might contain vandalism? Also, I have an idea: how about also showing registered users who may vandalize? Gobbleswoggler (talk) 18:12, 8 June 2010 (UTC)
- I have published an academic paper describing the STiki philosophy. It is quite technical in nature. Briefly, it does machine learning over prior edits to identify the patterns (both in metadata and, more recently, natural language) that are common among vandalism. This model is applied to new edits, which produces a "vandalism score" that determines the order in which edits get shown (assuming no one else edits the page). Secondly, on the point of registered users: analysis has shown such users are a very small part of the vandalism problem, so it is not a top priority. Indeed, I fear the inclusion of such edits may increase the false-positive rate. Currently I am working on additional ML features to improve the accuracy of the current system. Thanks, West.andrew.g (talk) 19:30, 8 June 2010 (UTC)
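A minimal sketch of the scored queue just described: a trained model assigns each incoming edit a vandalism score, and edits are served to reviewers in descending order of suspicion. The ScoredEdit type and all names are assumptions for illustration, not STiki's code.

    import java.util.Comparator;
    import java.util.PriorityQueue;

    /** Sketch of a priority queue ordered by model-assigned vandalism score. */
    public class EditQueue {

        public record ScoredEdit(long revisionId, double vandalismScore) {}

        // Highest-scoring (most suspicious) edits come out first.
        private final PriorityQueue<ScoredEdit> queue = new PriorityQueue<>(
            Comparator.comparingDouble(ScoredEdit::vandalismScore).reversed());

        public void enqueue(long rid, double score) {
            queue.add(new ScoredEdit(rid, score));
        }

        /** Most-suspicious unreviewed edit, or null if the queue is empty. */
        public ScoredEdit next() {
            return queue.poll();
        }
    }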
Re: Logs of Reversions
Hi Gurch, this is the author of STiki. We had some previous conversation about the relationship of our tools -- but I now come to you for a different reason. I was curious if you had logs of the RIDs for which your tool issued a particular warning. In particular, I would like to glean the edits your users classified as spam.
Of course, one could go searching through the UserTalk namespace and look for template additions and then parse out the 'diff/RID' in question. I thought you might have a quicker listing. The "edit summaries" left by Huggle seem a little generic, in that they don't provide RIDs or the reason for reversion, correct? Thanks, West.andrew.g (talk) 14:28, 13 June 2010 (UTC)
- Hi Gurch. Thank you for your long and thoughtful comments. For convenience, I've tried to provide my response "in-line" below. West.andrew.g (talk) 06:42, 14 June 2010 (UTC)
- No, I don't have any such logs. And there would be no feasible way to get them if I wanted them; Huggle clients work independently and do not communicate with any central server other than the wiki itself. Not to mention the privacy issues that would probably result in Huggle, and possibly myself, being banned from the project; there are a lot of privacy wonks here.
- Unfortunately (or perhaps, fortunately?) my tool has yet to find the popularity that seem to bring out such controversy and inconvenience. Your handling of the many feature requests, bug reports, and the like on your user-talk page is certainly admirable. I indeed have a central server and hope this does not become an issue in the future. I have had the good fortune to secure a presentation at WikiMania '10 -- I hope this brings a larger user-base, but few issues.
- Edit summaries do not include revision IDs because it is not particularly helpful -- and often infeasible -- to do so. The summary tells the reader at a glance that the revision was reverted, and when viewing the page history it is already clear which revisions were reverted because the author of the revision reverted to is identified. Nine-digit numbers are not very human-readable and when multiple revisions are reverted, listing all their IDs would quickly cause the summary to become too long.
- Agreed. Certainly not useful for humans (especially in the rollback case) -- but of course tool authors like myself wouldn't mind their inclusion. :-)
- Providing a reason for reversion is similarly problematic. Huggle has no concept of a reason for reversion, only for warnings. One reason is that if no reasonless reversion option were provided, users would overuse the vandalism reversion option and we would be left with many summaries claiming revisions to be vandalism when strictly speaking they weren't. Another reason is because of the dumb rules this project has that restrict what you supposedly can and can't do by certain technical means (rollback, undo) even when the end result is the same (effect of a revision undone) and the only difference is the resource demands on the client and server (and speed, of course). By restricting the concept of a reason to warnings users can revert edits they know are unacceptable, and then -- only if they desire to leave a warning -- they select an appropriate warning template, or leave their own message. In this way they can remove things like attempts to embed remote images with URLs in exactly the same manner as they'd revert anything else without administrators threatening them because they're "misusing rollback". In cases where users feel a detailed summary is required for reversion, Huggle already provides a mechanism for that.
- Again, I agree with your reasons from a tool and community perspective. To a large extent, the type of warning issued speaks to the "nature" or "cause" of the reversion -- though it is safe to assume users will over-use the standard "vandalism" option.
- If you are looking for a machine-generated log of "bad edits" to do some kind of machine-training on, as I suspect you are, you're going to run into trouble whichever way you go about it. This is not a problem that would be solved if I somehow had logs of all Huggle activity. (Also, I've tried such "training", and it works less well than the abuse filter, which is to say badly).
- The abuse filter prevents many supposedly "bad" edits from even happening, so you might think the logs of that would be a good starting point, but there are both too many false positives and too many filters that do things like enforce guidelines for that to be of any use.
- I don't play with the abuse filter. I know you are not terribly positive about it, either.
- Next you might consider looking at page histories, identifying reverts and then inferring from those which revisions were bad. That has many problems. People make mistakes and revert things by mistake, then correct themselves by reverting their revert, which would leave you with two revisions that looked bad but weren't really. People sometimes revert their own, otherwise good, edits if they change their mind and decide to put something else there. People revert edits that are either good or suspicious but not in a vandalism sense during edit wars, often back and forth between two revisions both of which have problems that are content- or style-related, which would again be misleading. And of course, vandals revert good edits; they usually get reverted themselves, of course, but how do you (automatically) know which one is the vandal and which not?
- It is not a fool-proof strategy, but this is largely the one that STiki applies: I identify rollbacks (via edit summaries), and then search back through the article history to find the offending edit(s); a sketch of this appears at the end of this thread. I do not consider cases where one rolls back to themselves. An offending edit isn't recorded if the rollback initiator doesn't have rollback rights -- so this, for the most part, avoids edit warring.
- Users without rollback -- including anonymous users -- account for a surprisingly large portion of (correct) vandalism reverts, so you are possibly missing out on those. (And conversely, edit warring can happen between rollback users too, unfortunately.)
- The other option, which you seem to be going for, is warnings. This too has issues. Identifying which revision a Huggle warning was targeted at is simple because the revision ID is included in the URL in the warning message. However, multiple consecutive bad edits by a user will usually only result in one warning message, sometimes users will only revert and not leave a warning, if the user already has a final warning then Huggle will never leave a warning because it would be pointless, and most other patrolling tools do not identify the revision reverted in the warning message. And of course vandals will leave fake warning messages for legitimate users; yes, these are usually reverted when someone sees them, but vandals will also remove genuine warnings from user talk pages, so again you can't (automatically) distinguish the vandal and the legitimate user.
- I don't play much outside of NS0 -- and would prefer not to go there. Enough said.
- Possibly admirable but difficult to stick to if you want to gather information on problematic users, particularly as not only warnings but vandalism reports are located there.
- You'd be correct about my efforts to build a spam corpus. From an academic perspective, my methods need not be flawless. So long as I can have a set of RIDs which are primarily spam and another which are primarily ham, I can begin to make some property distinctions. My main concern is that most wiki-based detection methods don't represent a random sampling of spam. Spam which is detected via rollback just represents the "naive and immediately detected" attempts. What about that super-clever editor who pulled some bait-and-switch tactic with an XLink and got it embedded in an article? If I had a way to find that, then I would have something!
- My guess is that for the most part the only common ground you'd find between spam edits is that they added an external link. And despite what many of Wikipedia's more influential contributors like to think (ever tried adding an external link while logged out?), treating all external links as spam isn't helpful. If we're only looking at the content of a revision and any other data that can be derived from the wiki itself, I don't think there's anything further that can be done to detect spam specifically -- sure, new users are more likely to spam than established ones, but they're more likely to be vandals, and even more likely to be neither. You'd probably have more luck detecting spam by identifying external links added, then accessing those links and looking at the content, but then we're into territory that isn't really specific to wikis (and probably not that amenable to machine learning either).
- The "super clever editor who pulled some bait-and-switch tactic with a XLink" replaced one external link on an article with another and nobody happened to notice it. The latter part of that is for the most part just luck. It's also something that's hard to detect automatically (reverting such an action would also appear to replace one link with another, as would fixing a dead link, as would trimming unnecessary query parameters, as would converting a plain link into a citation, and so forth). The difficulty from the point of view of the spammer is creating enough accounts / using enough IP addresses that they can do this enough times to get lucky and not have it noticed one time, and more likely by the time that's happened the link has ended up on the spam blacklist.
- The wiki has a huge number of things designed to stop spam, most of which just get in the way and make things unpleasant for newcomers, all in the name of combating something that really isn't that much of a problem. I'm not entirely convinced that more are needed.
- MediaWiki:Spam-blacklist. Not only can nobody add any links on that list, if a link is added to that list when it's already on a page, now nobody can edit that page at all. And it uses regular expressions so you can guarantee someone with a poor understanding of them will break it now and then. The list is what can only be described as freaking enormous, most of the links on there chances are nobody knows when or why they were added and indeed the websites they pointed to probably don't even exist any more. Because editing just isn't slow enough without running a few zillion regexes on every save.
- As I previously said, trying to add external links as an anonymous user is automatically assumed to be malicious. This makes vandalism patrolling as an anonymous user pretty much impossible -- every time a vandal messes up content that happens to include an external link, you've got to answer a stupid captcha again. And if you want to link to a diff, or history page, or old revision, or something else on Wikipedia itself? Sorry, we still think you're a spambot, please answer this captcha.
- User:XLinkBot. Logged in? Got a link to add? Not on the blacklist? Yay, you might think, until this bot decides to revert you and warn you about it. The bot's list of links to revert is almost as long as the real spam blacklist, and just as dubious.
- Several "anti-spam" components of MediaWiki, some of which -- unlike the rest of MediaWiki and indeed pretty much everything powering Wikipedia -- are closed-source. With these MediaWiki will just flat out refuse to accept your edit.
- External links policies drafted by the small number of vocal contributors that make most of the policies, that give administrators the power to remove pretty much any external link they like.
- Even more insane external links policies that were rejected by the community but are still de facto in effect because of the consequences for any contributor who voices opposition to them.
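As promised mid-thread, a minimal sketch of the corpus-building strategy West describes: spot a rollback, then walk the article history collecting the offending revisions, discarding self-reverts and reverts by users without rollback rights. The Revision type and all names are hypothetical, not STiki's actual code.

    import java.util.ArrayList;
    import java.util.List;

    /** Sketch of mining offending edits from a detected rollback. */
    public class OffendingEdits {

        public record Revision(long rid, String user, boolean userHasRollback) {}

        /** history.get(0) is the rollback edit itself; earlier revisions follow
         *  in reverse-chronological order. Returns the reverted revisions, or an
         *  empty list if the case is excluded. */
        public static List<Revision> find(List<Revision> history, String revertedToUser) {
            Revision rollback = history.get(0);
            if (!rollback.userHasRollback()) return List.of();            // skip non-rollbackers
            if (rollback.user().equals(revertedToUser)) return List.of(); // self-revert: discard
            List<Revision> offending = new ArrayList<>();
            for (Revision r : history.subList(1, history.size())) {
                if (r.user().equals(revertedToUser)) break; // reached the restored version
                offending.add(r);                           // an edit the rollback undid
            }
            return offending;
        }
    }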
Stiki
I've got another great idea: how about putting in a filter so you could enter, for example, a bad word and see what pages it is on? Then, if that word is on a page it shouldn't be on, you can delete it. What do you think? Gobbleswoggler (talk) 18:23, 14 June 2010 (UTC)
- Hi again Gobbleswoggler. This is something I will think about -- but it is also something a lot of other people are doing (including STiki, to a certain extent). First, ClueBot operates exclusively over simple regular expressions (i.e., bad words). Plenty of bad words do get through ClueBot, though (since it is a bot, its rules must be conservative to avoid false positives).
- Second, there is the Wikipedia edit filter. This is not my area of expertise, but I am sure there are plenty of filter-rules along these lines.
- Third, there is STiki. STiki counts the number of bad words added by an edit and uses this as a machine-learning feature (along with the several spatio-temporal ones). Part of the challenge here is what constitutes a "bad word". Obviously a stand-alone bad word counts. For example, "you are an ass" is trivial, but we can't expect vandals to use proper grammar. Instead they might write "youAREass" -- clearly profane -- but the pattern match for something like this would also match the very innocent word "glass" (a sketch of this matching problem appears at the end of this thread). What are your thoughts on this? I might be able to use my existing bad-word count data to create a revision filter, though -- just to give things a trial run.
- Finally I'll note that a surprising number of the "bad words" on Wikipedia are legitimate. Between song names, accurate quotings and the like -- I am not sure this would have the great hit-rate you might expect. Another thought is that I could highlight bad words (using some color), making them easy to pick out when quickly patrolling.
- How can i download/get this edit filter? Gobbleswoggler (talk) 19:12, 14 June 2010 (UTC)
- It has yet to be implemented -- though I have most of the data in the back-end. I want to give this some thought, and would be interested to hear how you think the scoring and ranking of edits should proceed. Thanks, West.andrew.g (talk) 19:33, 14 June 2010 (UTC)
- Have you thought about adding a subject filter, so you could check just certain areas of Wikipedia, e.g. football or talk pages? What do you think? Gobbleswoggler (talk) 19:45, 14 June 2010 (UTC)
- I think he's talking about searching existing page content, not filtering recent changes. For example, searching for all pages with the word "crap" on them and then looking for any unwanted instances. The problem with doing this is that Wikipedia's search results are not current but always a few days old, so most of the unwanted instances that turn up in the results will already have been removed, and even if you find and remove some yourself, they won't disappear from the search results, nor will new ones show up, until a few days later. Google and other search engines have the same problem. Gurch (talk) 06:26, 15 June 2010 (UTC)
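A minimal sketch of the matching problem raised in this thread ("ass" must be caught in "youAREass" but not in "glass"). The case-transition splitting shown is one hypothetical compromise, not STiki's actual filter.

    import java.util.regex.Pattern;

    /** Sketch of the "glass problem": bare substring matching flags innocent
     *  words, while a word-boundary match misses glued-together vandalism.
     *  One compromise: split on letter-case transitions before matching. */
    public class BadWordFilter {

        private static final Pattern BAD =
            Pattern.compile("\\bass\\b", Pattern.CASE_INSENSITIVE);

        public static boolean containsBadWord(String addedText) {
            // "youAREass" -> "you ARE ass"; plain words like "Hello" are untouched.
            String split = addedText.replaceAll(
                "(?<=[a-z])(?=[A-Z])|(?<=[A-Z]{2})(?=[a-z])", " ");
            return BAD.matcher(split).find();
        }

        public static void main(String[] args) {
            System.out.println(containsBadWord("a pane of glass")); // false
            System.out.println(containsBadWord("youAREass"));       // true
        }
    }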
Talkback
Message added 21:09, 16 June 2010 (UTC). You can remove this notice at any time by removing the {{Talkback}} or {{Tb}} template.
I'm currently writing a suggestion, please check back at my talk page soon... Cit helper (talk) 21:09, 16 June 2010 (UTC)
Stiki update
Hi, Gobbleswoggler here yet again. I noticed you have published a new update of STiki, but I can't tell what's been added or changed! Gobbleswoggler (talk) 15:39, 18 June 2010 (UTC)
- Hi Gobbleswoggler. In this particular update there were changes on the back-end (which determines the edits that get displayed), not to the client-side GUI application (which you use). From your perspective, nothing should change (except maybe seeing more vandalism). The back-end change reflected an advance in how "dirty words" are scored, partially from your suggestions. Note that in the *.ZIP file of every distribution there is a file called CHANGELOG.txt -- this will provide you a description of the changes. Thanks, West.andrew.g (talk) 20:32, 18 June 2010 (UTC)
STiki
What has changed on STiki this time? Gobbleswoggler (talk) 16:40, 28 June 2010 (UTC)
- See my note above about the CHANGELOG.txt file in the ZIP distribution. The last two updates have been minor. One affects the back-end processing, and the other is a tool for research use. Nothing too exciting on the client side. However, I am working on integrating the 'rollback' function into STiki (natively for those who have the right, and in software for those who don't). That should be helpful for cases of multi-edit vandalism.
Misdirected Testing?
Checkuser results suggest that one of your linkspam related software tests may inadvertently be pointing to the English Wikipedia rather than test wiki. Please check your settings & adjust accordingly. Thanks, --Versageek 03:08, 14 July 2010 (UTC)
Hello.
I have blocked this account (amongst others) for the recent issues regarding tests done on Wikipedia's articles. Please contact the Arbitration Committee via email at arbcom-l@lists.wikimedia.org at your earliest convenience to discuss this. SirFozzie (talk) 16:37, 21 July 2010 (UTC)
Response to unblock request
The Arbitration Committee has reviewed your block and the information you have submitted privately, and is prepared to unblock you conditionally. The conditions of your unblock are as follows:
- You provide a copy of the code you used for your "research" to Danese Cooper, Chief Technical Officer and to any other developer or member of her staff whom she identifies. [Note - this step has been completed]
- You review any future research proposals with the following groups: the wiki-research-l mailing list <https://lists.wikimedia.org/mailman/listinfo/wiki-research-l>; the wikimedia-tech mailing list for any research relating in whole or in part to technical matters; and your faculty advisor and/or university's research ethics committee for any research that involves responses by humans, whether directly or as an indirect effect of the experiment. Please note that your recent research measured human responses to technical processes; you should be prepared to provide evidence that those aspects have been reviewed in advance of conducting any similar research.
- Should this project, the Wikimedia Foundation, or an inter-project group charged with cross-site research be developed, they may establish global requirements for research which may supersede the requirements in (2) above.
- Any bots you develop for use on this project, whether for research or other purposes, must be reviewed by the Bot Approvals Group (WP:BAG) in advance of use, unless otherwise approved by the WMF technical staff.
- You must identify all accounts that are under your control by linking them to your main account. The accounts used in your July 2010 research will remain blocked.
Please confirm below that you agree to abide by these conditions when participating in this project. Once you have done so, a member of the Arbitration Committee will unblock.
For the Arbitration Committee,
Risker (talk) 12:55, 11 August 2010 (UTC)
- I agree to these conditions, and offer a sincere apology to the community. Thanks, West.andrew.g (talk) 13:04, 11 August 2010 (UTC)
- Some context on this block was later published in a paper at WECSR'12. West.andrew.g (talk) 17:56, 13 February 2012 (UTC)
You are now a Reviewer
Hello. Your account has been granted the "reviewer" userright, allowing you to review other users' edits on certain flagged pages. Pending changes, also known as flagged protection, is currently undergoing a two-month trial scheduled to end 15 August 2010.
Reviewers can review edits made by users who are not autoconfirmed to articles placed under pending changes. Pending changes is applied to only a small number of articles, similarly to how semi-protection is applied but in a more controlled way for the trial. The list of articles with pending changes awaiting review is located at Special:OldReviewedPages.
When reviewing, edits should be accepted if they are not obvious vandalism or BLP violations, and not clearly problematic in light of the reason given for protection (see Wikipedia:Reviewing process). More detailed documentation and guidelines can be found here.
If you do not want this userright, you may ask any administrator to remove it for you at any time. Courcelles (talk) 01:08, 18 June 2010 (UTC)
Feedback
I've been using Stiki for a few days and probably ran through a few thousand diffs. Thought I'd give some feedback:
- Hi Ocaasi. Thanks for your kind words and apologies for my slow response. I'll make some comments in an "in-line" fashion here.
1. Great program. Unbelievable that some glaring vandalism slips through. Warnings work great. Interface is clean and intuitive.
2. Give "Penis" and "Gay" (fag, cock, and stupid) quadruple the weighting. Probably only .01 percent of edits including these terms are not vandalism. You mentioned color-coding "bad words"; I think basically all bad words could get automatic review with STiki, since they often slip through other bots. I recognize you're looking for the algorithm and not just the end result, but there might be use for a little parallelism. STiki could use some filters (even more than currently) that are separate from the learning algorithm. Those edits could just get flagged for having bad words but not necessarily factored into the algorithm. You might get useful data for refining the bad-words filter currently in use.
- Obviously, a list of "bad-words" is one of the features I use. The training-phase of machine learning determines how much "weight" this feature will get. I suppose I could have two lists to separate the "bad" from the "really bad" -- and see if this is of any use. However, given that these words are low-hanging fruit, I am not sure how many slip through to become embedded. After all, ClueBot already handles the worst cases.
3. STiki sometimes flags edits where nothing seems to have changed in the diff (maybe empty space formatting). i.e. this edit
- This is perplexing to me as well. (Also, why does WP even allow such an edit to be committed? Or am I missing something?). Other factors are obviously causing STiki to believe it "could" be vandalism though (meaning the timestamp, IP-geolocation, IP-history, etc.). I will author some code that prevents this "completely empty" corner-case, though -- it's on my TODO.
4. STiki sometimes flags corrections to wikilink formatting, often suspecting corrections like [ -> [[ of being vandalism, when they're just correct formatting. I know you're using an alphanumeric filter, maybe the key here is to exclude edits which are neither alpha nor numeric but only symbols (particularly: [ { * ); they are usually false positives (flagged by STiki, not vandalism).
- This will be considered as an additional feature.
5. STiki often flags changes to numbers, particularly updates to sports scores and other annual data. Not a problem, but there's often no way to determine if changes are accurate without digging into the references. I guess as long as they are in the right places the presumption is that they aren't vandalism.
- You mentioned the "alpha-numeric" filter above. There is a feature that says "what % of additions are numeric" (a sketch appears at the end of this section). My thought was that few purely numerical changes would be cited as vandalism -- and this would stop showing people these "boring" and "un-classifiable" diffs. It has reduced them -- but not stopped them. I think these basic changes might constitute a large portion of edits to WP. I'll think about this.
6. Often there are edits which are not improvements but also are in good faith. In those cases, it'd be nice to have an option to quickly revert but not call it vandalism. I currently use a message that is neutral with regard to the reason and then toggle the warning. Perhaps a fourth button could trigger a different message, no warning, or a non-vandalism warning (something like undoing good faith edit, not vandalism). Along those lines, I imagine there are some unintended algorithmic effects from people using STiki to undo edits which are marginally less good but not vandalism. They might skew the machine-learning, muddying the salient traits of likely vandals. If so, it would make sense to distinguish between a plain undo (which wouldn't be used to refine the algorithm) and vandalism (which would).
- On the agenda (and has been for some time).
7. Sometimes the edit comment is cut off. It usually doesn't matter, but it'd be nice to be able to read the full text.
8. The external links work for me, but they still trigger the "link cannot be called" message, even as the browser is called and correctly accesses the page.
- What is your OS? And how do these errors manifest themselves? In the terminal? A pop-up dialog?
9. I'm curious about the benefits of more friendly/personal edit warnings. I think they might have some benefits over 'harder' approaches which sound like they come from bots rather than people. A short, personal looking message with no 'templatey' feel might have more influence than a message which seems automated: Something like, Hi [Vandal], this is [Reverter]. I'm leaving you a [first/second/third/final] note about your edits, after looking at [Article Page]. It looked like a problematic edit to me, so I reverted it. If I was wrong, please change it back. We do keep an eye on contributions to keep the encyclopedia looking good, and we'd love your help to make the encyclopedia better. If you have any questions, leave a note on my talk page.
- Unfortunately, this is something that is out of my hands. The rules at WikiProject:Spam dictate the templates that should be used. Since most vandalism detection is done using software (STiki, Huggle, etc.) -- these programs are looking for specific templates so they can automatically issue the next-most-severe warning. I suppose I could hide/place the requisite text to convince these programs that my custom comment is a "template equivalent" -- but this is not something I intend to do.
Thanks for the program ...! Ocaasi (talk) 15:47, 21 July 2010 (UTC)
- Thanks for all your comments. Many of the things you discuss are (and have been) on my mind -- but I lack the time to implement them due to progressing graduate study. You seem to speak knowledgeably about many of the technical aspects. If you are interested, and have the technical ability, I'd be open to letting you help out with the development process. Thanks, West.andrew.g (talk) 15:03, 11 August 2010 (UTC)
- Andrew, any "rules at WikiProject:Spam" are always up for discussion and improvement. Feel free to bring up ideas on the talk page at Wikipedia talk:WikiProject Spam --A. B. (talk • contribs) 15:00, 17 August 2010 (UTC)
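A minimal sketch of the "% of additions are numeric" feature mentioned in the reply to point 5 above; the method name and the whitespace handling are assumptions for illustration, not STiki's actual code.

    /** Sketch of a fraction-numeric feature over added text. */
    public class NumericFraction {

        /** Fraction of non-whitespace added characters that are digits. */
        public static double numericFraction(String addedText) {
            int digits = 0, total = 0;
            for (char c : addedText.toCharArray()) {
                if (Character.isWhitespace(c)) continue;
                total++;
                if (Character.isDigit(c)) digits++;
            }
            return total == 0 ? 0.0 : (double) digits / total;
        }

        public static void main(String[] args) {
            // A score update like "23-17" is nearly all numeric; a high fraction
            // can down-rank these hard-to-verify but rarely-vandalous edits.
            System.out.println(numericFraction("23-17")); // 0.8
        }
    }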
Wikipedia:WikiProject Vandalism studies
Could you please post the results of your study, or a link to them, at Wikipedia:WikiProject Vandalism studies. The project may not be great, but I am hoping it can at least be a central repository of any such studies. Thanks. 69.3.80.34 (talk) 13:44, 17 August 2010 (UTC)
- For obvious reasons, I am going to decline this request. This is certainly a point of controversy and there is no need for the "strategies" used to be in the public domain at this time. West.andrew.g (talk) 13:56, 17 August 2010 (UTC)
Signpost article
Were you aware of Wikipedia:Do not disrupt Wikipedia to illustrate a point when you planned this experiment? —Stepheng3 (talk) 16:55, 17 August 2010 (UTC)
- If they don't do such testing from a vandal's point of view, how could they develop tools to counter it? OhanaUnitedTalk page 15:39, 23 August 2010 (UTC)
- How can the Pittsburgh police fight crime without themselves committing vandalism, drug-dealing, robbery, rape, and murder?
- Somehow a lot of other Wikipedians over the years have managed to create vandal-fighting tools without attacking the project. I would suggest interviewing past and present vandals, observing the behavior of vandals and users, experimenting in testbeds, modeling and simulating, using logic and thought experiments, and undertaking controlled experiments on participants who have given their informed consent. Andrew's web C.V. says he's a Ph.D. candidate. I hope his program includes a course in scientific ethics.--Stepheng3 (talk) 16:18, 23 August 2010 (UTC)
Research Question: Blacklist stats
I just saw your unanswered query at Wikipedia talk:Spam blacklist#Research Question: Blacklist stats. I don't have any direct answers for you, but I posted several links where you might ask around for information. --A. B. (talk • contribs) 19:25, 17 August 2010 (UTC)