Wikipedia talk:WikiProject Vandalism studies/Study2/Archive1
This is an archive of past discussions on Wikipedia:WikiProject Vandalism studies. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
some suggestions for this study, new perspectives
User:DavidCBryant's comments from the last study
- To do simple statistical analyses like the one you're after, spreadsheet software is appropriate. Has anyone entered the data into a spreadsheet program, like Excel, or Open Office? If not, I'll try to get it done in the next couple of days.
- Minitab would be even better. If someone has access to that it's possible to get some great regression analysis. --Spangineerws (háblame) 02:25, 28 March 2007 (UTC)
- Using the "random article" button to select articles for analysis makes a lot of sense. Selecting only the month of November for analysis makes less sense. Ideally, you might want to generate a random integer 1 through 12 (using a pseudorandom number generator) for each article selected for analysis, then analyze the edits for that month for that particular article. The problem with the procedure you used is that it may have introduced an unintentional systematic bias. Human behavioral patterns vary with the seasons, so it may be that you got an exceptionally high reading (or an unusually low reading) because people are grumpy (or benevolent?) in November, on average. Not that it's a big deal. Call it an opportunity for improvement.
- There are quite a few 'bot authors on Wikipedia. The process of extracting the raw data, at least, might be automated somehow. For example, a 'bot might select the random articles, select a month at random, then extract only the edit history records you're interested in and dump the whole thing into one page somewhere, where you guys could study the data without doing so much data collection. Just a thought.
- When you write your report, you might want to present the numbers two ways – both with and without the randomly selected articles that you discarded because no edits occurred in November. I'm only suggesting this because if you include that number of articles you can compute the likelihood that a randomly selected article is going to be edited in November, at least once in three years. OK, you'd probably want to dress it up a little and present it as the probability that a randomly selected article gets edited within one month. Anyway, it would just be good statistical practice to report how many articles you bypassed because there were no edits in November. It's part of full disclosure.
- You've divided the edits into two classes, "vandalism" and "not vandalism". I think three classes might be more appropriate: "vandalism", "revert vandalism", and "not related to vandalism". I think the distinction is meaningful, and probably not too hard to make. Anyway, I'm not sure how you counted reverts in your raw data, but maybe I didn't read the report closely enough.
User:CMummert's comments from the last study
- This (last) study was very interesting and informative, but the small sample size (100) makes the final numbers subject to a large margin of error. With the sample of 100, the estimate of 5% of edits being vandalism has a margin of error of about 4% at 95% confidence; so I conclude from your numbers that there is a very high chance the real vandalism rate is less than 9%. In order to have a 2% margin of error with 95% confidence, if the real percentage of vandalism edits is 5%, you need to sample about 475 articles. Fortunately, the total number of WP articles doesn't matter, only the number that you sample.
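- For anyone who wants to double-check figures like these, the standard normal-approximation formula for a proportion's margin of error takes only a couple of lines. The sketch below is purely illustrative; it assumes a true vandalism rate of 5% and a 95% confidence level:

// Margin of error for a proportion under the normal approximation: z * sqrt(p*(1-p)/n).
// Illustrative only; assumes p = 0.05 and z = 1.96 (95% confidence).
var z = 1.96, p = 0.05;
function marginOfError(n) {
  return z * Math.sqrt(p * (1 - p) / n);
}
function sampleSizeFor(targetMargin) {
  return Math.ceil((z * z * p * (1 - p)) / (targetMargin * targetMargin));
}
console.log(marginOfError(100));  // about 0.043, i.e. roughly the 4% quoted above
console.log(sampleSizeFor(0.02)); // 457, in the same ballpark as the ~475 quoted above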
- A second, more interesting, problem is that you are measuring the average percentage of edits per article that are vandalism. But there is another statistic that is equally valid: the average percentage of total WP edits that are vandalism. To see the difference, think about the extreme case where only one article on WP is ever vandalized, but it received 100% vandalism edits. Then your survey would show 0% average vandalism unless you were lucky enough to find that one article with the random article button. To measure overall averages, you would need to take a random sample of 1000 edits (I have no good idea how to do that without using a database dump) and determine how many of them are vandalism.
User:Reywas92's comments from the last study
- In my opinion, the first vandalism study's group of articles was too small. Much more than 100 should be used for this one. Multiple percentages were created for what type of vandalism, reversion speed, and who made the edits, but only 31 acts of vandalism is much too small a basis for such statements.
- For this, perhaps the random article selection should be less random, purposely choosing some more popular articles to find more vandalism. Some random articles have very few links to them, so vandals can hardly even find them: only 31 acts of vandalism in 100 articles. I doubt a lot of those articles have ever been vandalized.
- Also, the time frame could be improved. Looking only at November edits is not comparable to now. We should look at current vandalism to understand the future, right? A historical look is good, but I think the present is more important. This would also make looking through histories simpler by not having to look so far back and skip the others.
User:Xiner's comments from the last study
I second DavidCBryant's concern about focusing on any particular month. I just read the other day that the summer is a period of doldrums on Wikipedia, for example.
I second Reywas92's concern about the small sample size of acts of vandalism. While the number of 100 articles is itself small, the number of 32 vandalism edits definitely means that the conclusions regarding the authors of those edits (IP vs. registered) were invalid.
Perhaps Wikipedians knowledgeable about statistics can help ensure that your next study will be devoid of these deficiencies, because your efforts are definitely needed, and I don't want to see anyone's time wasted, especially when the topic is of such high importance. Thanks a lot, guys. Xiner (talk, email) 03:03, 29 March 2007 (UTC)
structure
i was thinking that the Scientific method is something we should start following for study 2, seeing as how the structure has been tried and true. it also makes for a better understanding of where all the data is going - study 1 was kind of a data-before-the-study kind of deal, and i think this one should be more formal and easier to read in an intro/hypothesis/data/results/conclusion kind of way. how do others feel? JoeSmack Talk 04:05, 2 March 2007 (UTC)
Non-vandalous Edit %s of registered and unregistered users
Background: The recent Wikipedia talk:WikiProject Vandalism studies/Study1#Draft conclusion for a sample of 100 articles states that 25% of vandalism reverting is done by unregistered editors and 75% is done by registered Wikipedians. It also states that "in a given month approximately 5% of edits are vandalism and 97% of that vandalism is done by anonymous editors."
For articles with a relatively high percentage of vandalism to total Edits or high volume of Edits, that's a good argument for semiprotect (administratively requested & possibly granted) status: It's discouraging for serious continuous editing to be distracted or deflected by a high volume of vandalism ("Why bother with this mess?"). The downside of semiprotection is no Edits (or reverts of vandalism) by unregistered editors.
In trying to decide whether semiprotect is worth the costs, here's a reasonable question to ask in construction of Study 2. Is the following ratio high or low for only nonminor Edits relative to total Edits?
(percent of nonvandalous Edits by unregistered users)/(percent of Edits by registered users)
If it is well below 1 (say, .25) and if the non-minor Edits improve the article more than minor Edits (big ifs), the ratio can be interpreted as indicating the article is not getting proportionate help from unregistered users, imposing higher Edit costs on registered editors. For articles with a low volume of vandalous Edits of course, semiprotect is not going to be so important (and conversely).
(The same question can be posed for minor Edits.)
If you have ever been interested in an article with a high volume of vandalous Edits and in need of big improvements, the above suggested stats and conjectures are likely to make the desirability of semiprotect more prominent. I hope Study 2 could collect the above stats to assist in trying to decide how articles might be improved fastest. Vandalous edits are a kind of tax on nonvandalous Edits and article improvement. Study 2 might be helpful in suggesting the benefits relative to costs of semiprotect. Comments welcome. --Thomasmeeks 11:29, 25 March 2007 (UTC)
- How exactly would you set up the procedure for this? I'm curious to how this could be run. JoeSmack Talk 16:08, 25 March 2007 (UTC)
- Well, the above refers only to data gathering, which would provide background, much as Study1 does. The above template by itself does nothing (not even deter, if my experience is representative). Clicking "request unprotection" in the above template takes one to a section of Wikipedia:Requests for page protection. That project page is where one can request semiprotection. If the request is granted at that end, someone there puts up the template and activates it, so that only registered users can do Edits.
- As for a procedure for seeking semiprotect help, one could look at the last 50 Edits and count the number of reverts (checking that they were for vandalism). The data could help in a semiprotect request. If I have misinterpreted your question, I'll try again. Thx. --Thomasmeeks 18:20, 25 March 2007 (UTC)
- So you're saying you want to count how many vandalisms there are before semi-protection is activated and a boilerplate is put up and how many vandalisms occur after? Basically you want to know how well semi-protection works? Interesting, I'd love to know how effective it is... JoeSmack Talk 21:46, 25 March 2007 (UTC)
- Well, I wasn't trying to be that specific. I witnessed something very dramatic for one article (Economics). Something close to 30 of the 100 Edits previous to semiprotect status on Feb. 17 were reverted. All the reverted Edits were by unregistered users. After semiprotect became effective (through an administrative request granted as per above), quick automatic reverts virtually disappeared. What Study1 found applied very well to that article: Vandalism is overwhelmingly a problem from unregistered users. Only 3 Edits were reverted that I could detect in the 100 Edits that followed (rather than say 30). Those were not from vandalism and were accompanied by Edit summaries that well explained why there was a difference of editing opinion. So, semiprotect seems to work very well. --Thomasmeeks 23:51, 25 March 2007 (UTC)
- Interesting indeed. Well, how would you like to turn this into Study 2? I'd love to explore this area of vandalism, and I think it'd be valuable! JoeSmack Talk 00:34, 26 March 2007 (UTC)
- Oh and something like this proposal is currently running over at Obama article study. Check it out. Should we still make this study 2? JoeSmack Talk 00:38, 26 March 2007 (UTC)
- I'm sure this has relevance as well: Don't protect Main Page featured articles/December Main Page FA analysis.JoeSmack Talk 00:42, 26 March 2007 (UTC)
- I think that a big conclusion is highly likely to hold also for a sample larger than 1 (namely Economics). And data would be available for articles before and after semiprotect. Scanning the last article, what's missing is a comparison with registered users, which is what the above was getting at. --Thomasmeeks 02:05, 26 March 2007 (UTC)
- I was thinking a much larger sample size than 1 article as well. I was thinking 50-100. What do you think? And how would you randomize the sample selection, or how would you approach the issues of selection bias? JoeSmack Talk 02:11, 26 March 2007 (UTC)
- That might be one for statistics folks on their Talk page or that of Talk:Regression analysis. As small a sample of articles as would do the job (say 50 or less) would make it easier. High volume articles subject to frequent reverts would add relevance. If there was a way of determining which general areas had the highest traffic volume (say by Wiki searches), that might be used. Clustering by different general topics (social sciences, philosophy, etc.) or controversial subjects (Death and resurrection of Jesus, etc.) are possibilities. I think the object should be not randomization but clustering to address practical problems. Assignment to persons on those Talk pp. would spread the work load. --Thomasmeeks 12:26, 26 March 2007 (UTC)
(unindent) well we can't leave messages on Talk:Statistics and stuff like that, talk pages for articles are about the articles and not experts in the field. what you can do is ask a statistician from Category:Wikipedian_statisticians to give some thoughts, but don't ask like 10 (no spamming 'round here). as for work load nothing much needs to get spread round, we did the last study with three people and that was fine. so, what sort of initial mini-list of articles were you thinking of making? i'm curious as to how you would cluster articles, and i'm not sure how your highest traffic volume approach would work; do searches get a traffic score that i don't know about? JoeSmack Talk 12:44, 26 March 2007 (UTC)
- Points well taken. I'm assuming that Wiki searches via Google do give weight to traffic volume. I gave examples of clustering. I can't do better than that. If anyone thinks of a cluster that looks interesting, so be it. It was only a suggestion. --Thomasmeeks 13:23, 26 March 2007 (UTC)
- I guess wikipedia has 100 most popular wikipedia articles, but I don't know of any google traffic rating system. The examples you gave above peter out after about 4; I was wondering if you had a category list in mind, like the one WP:1.0 uses or something? I guess I'm wondering a) how generalizable we want to make this and b) how to accomplish that without getting sticky with bias. JoeSmack Talk 13:34, 26 March 2007 (UTC)
- Well, here's a way to address that. Consider a broad category (what I called "general topics" above) within which there are data points (called "observations" below) such as top 100, entertainment, social sciences, religion, etc. I'm assuming a search of Wiki within a category of Wiki using Google reflects traffic to some degree. The clustering (classifying) could be according to those categories, such as 10 from the top 100, 10 from entertainment, 10 from social sciences, 10 from religion, and the rest. The advantage is that for each category, there would be some measure of vandalism frequency as to each observation in the sample. That would be the dependent variable. Independent variables would be the 0-1 dummy variables for each category data point ("no" = 0, "yes" = 1 for a given category). Then a linear regression could be run for vandalism frequency as a function of each of the categories to determine statistical significance of any differences in categories as to vandalism rates. --Thomasmeeks 18:44, 27 March 2007 (UTC)
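- In other words, with hypothetical dummies for the example categories above, the regression being described would look roughly like: vandalism_rate_i = b0 + b1*D_top100_i + b2*D_entertainment_i + b3*D_socialsci_i + b4*D_religion_i + e_i, where each D is 1 if article i was drawn from that category and 0 otherwise. The leftover "rest" group is the baseline picked up by b0, and the significance test on each b then asks whether that category's vandalism rate differs from the baseline.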
- I'm still not clear on exactly how you intend to get the top 10 social sciences articles etc. via a google traffic measure. that top 100 link is the only article popularity resource i know of...
- Also, what happens when one item falls clearly into two categories? JoeSmack Talk 04:59, 28 March 2007 (UTC)
- It could be that X different social sciences would more or less coincide with the top 10. A Wiki search even without Google but with added terms to narrow the search in successive searches might work, as might an advanced Google search of Wiki with more and more NOT terms in successive searches. Stating the method used would give transparency, even if subsequent bias was found. More than one category is fine, in fact good if interaction effects are suspected. Then the combined effect might be significantly more (or less) than the sum of their parts, picked up by a multiplied-category variable. --Thomasmeeks 11:54, 28 March 2007 (UTC)
Study3?
With Study2 wrapped up, let's look ahead. Actually the following could be part of Study2:
- What's the effect of semiprotect on the rate of non-vandal-or-revert Edits?
In favor of an increase in that rate, registered users might be encouraged to do less reverting (since say 95 percent of vandalous Edits are by unregistered users), leaving more time for non-revert Edits. In favor of a decrease in that rate is that non-registered users would be blocked from any Edits with semi-protect. I'd guess that if the percentage of Edits by registered users is high, you'd get one result and vice versa, so that percentage would be a good control variable (an additional independent variable in the regression analysis). --Thomasmeeks 00:58, 29 March 2007 (UTC)
- Do you mean you'd like to make this Study 3, leaving study 2 here to the recent-changes vandalism approach? Either way I think it sounds like a terrific subject for a study - count me in. JoeSmack Talk 02:25, 29 March 2007 (UTC)
- I didn't know about the recent-changes vandalism approach. I was trying to be facetious (as to "Study3"). But I am serious about the substance of this section. It's a real drain in energy and time to deal with vandalism or bad Edits, overwhelmingly from unregistered users, which <s>Study2</s> Study1 has persuasively documented. Further documenting the effects, whatever they are, of registered vs. non-registered users' bad Edits I hope could suggest constructive action. --Thomasmeeks 12:35, 29 March 2007 (UTC)
- Ahh, I think you mean Study 1 has documented it; there is the confusion. You already have my vote as noted above; I agree. You should start setting up your procedure! :) JoeSmack Talk 12:50, 29 March 2007 (UTC)
- Your correction is noted above. (Yes, I do sometimes miss the obvious ;) I'm assuming by procedure you mean nitty gritty details to make it work. Someone looking at the above could do that as well or better than me (recruited for example through a Help Desk or on this page). I'd do what I could as a resource person, but I feel overwhelmed by other matters. One way of getting focus might be to concentrate on 1-month-before and 1-month-after the date of semi-protect. There are archived records of semi-protect requests granted through Wikipedia:Requests for page protection (assuming it lasted at least a full month). The sample selection there would already be skewed toward a high bad/good Edits ratio for articles getting semi-protect, but that's where bad Edits are likely to be most serious and where effects of semiprotect on the number of Edits by registered and unregistered users would be available from the article history. If there was a way of accessing semi-protects by category (social science, religion, etc.), then the above regression framework might be feasible. Otherwise, simple before and after comparisons without categories might be a default. Suppose someone with a few keystrokes could access all 2006 articles granted semi-protect and with a few more keystrokes restrict them by their categories (social science, etc.), say by a find-on-this-page search. From that, one could narrow the pool of articles by category. If that is not possible, then simple before-and-after semi-protect comparisons could still be very interesting. --Thomasmeeks 14:24, 29 March 2007 (UTC)
Correlate page popularity with amount of vandalism
I'd be interested to know what sort of relationship there is between how many edits a page gets in a certain amount of time and what percentage of those edits are vandalism. For this study, using random article to find lesser-viewed articles might be appropriate, but more popular pages would also have to be selected, perhaps an article from the list of most-frequently vandalized pages. If I had to guess, I'd say that the graph of percentage of edits that were vandalism, as compared to useful edits, vs. page popularity would be exponential, and the graph of percentage of edits that were vandalism, as compared to total edits, vs. page popularity would be logarithmic, but that's just a hypothesis. Feel free to drop me a line on my talk page if you'd like my help with anything, I love numbers. shoy 16:00, 27 March 2007 (UTC)
- This sounds something like the Obama article study. Popular page, freq. vandalized, comparing useful edits to vandalism ones. You might use that one as an extreme and a lesser-known one as another and use those two as a pilot and see how you'd like to approach this more specifically. Interested? JoeSmack Talk 05:02, 28 March 2007 (UTC)
- I could use some of the data from your first study as well. shoy 16:59, 30 March 2007 (UTC)
Recent Changes?
Forgive me for asking a question that may have been asked before, but couldn't you also just go through the list of the Recent Changes that have been made? Yes, it is biased to the immediate, but if you analysed the last 50 edits at (say) three times during the day on three days of the week, wouldn't that work just as well? Iorek85 23:02, 27 March 2007 (UTC)
Yes, I came here to suggest something along the same line. I don't think that analyzing "random articles" for vandalism offers a good account. 80% of random articles are crud -- the long tail of the encyclopedia that have little attention paid to them by anyone, including vandals. I suggest a vandalism survey by "diff". Vandalism is per edit, not per article, so I think it makes sense to survey the edit pool, not the article pool; it will be naturally weighted toward what people are actually doing with (to!) Wikipedia. If you want to constrain the pool, you can survey some diff numbers to find the date range you want. If you randomly generate numbers within this range, and throw out the non-article-space diffs, you end up with a better sample of what's going on. –Outriggr § 03:42, 28 March 2007 (UTC)
- Wow, an approach that was right in front of my eyes! It's only a few items down in the sidebar, recent changes! ;) A much bigger picture approach too, very different than what I was conceptualizing of vandalism.
- When we were piecing together Study 1, i think the idea was to show how much vandalism you might bump into in the wild. No one likes reading through information about their favorite scientist only to find the end of the paragraph says "johhny is a total alcoholic, lLOL!!!". Vandalism from an 'integrity' standpoint: how much of 'good content' is bad content waiting to be discovered. Also it gave an opportunity to show how vandalism has increased or decreased over the years (although it appears not very much in either direction).
- I think, thanks to recent changes patrol and a lot of hard work on the community's part, a shit-ton of vandalism and cruft articles never see more than 5 minutes of live time before being removed or deleted. The first 5 minutes of an article's or edit's life would be a completely different picture and indeed perspective on vandalism, and I think it'd be really cool to pursue this idea. Anyone up for it? JoeSmack Talk 05:08, 28 March 2007 (UTC)
- I can see where you are coming from with Study 1; article integrity is important, as is the time to revert, and both of those are missed with recent changes. But for simple % vandalism edits, RC would be the place to look, I think. As you say, with RC patrollers, the vast majority (especially obvious vandalism reverted by bots) of vandalism doesn't last long. You could work out a rough estimate of how much vandalism escapes by subtracting the 'revert' edits from the 'vandalism' ones, but one problem is that to look back five minutes would require manually checking about 500 edits, by the look of it.
- Another idea that just came to me would be to take the list of say 100 recent changes. Take down the names of the articles that are vandalised, and the time of vandalism. Then check back (at any time afterwards, which is the good part) and see how long it took for that vandalism to be reverted. It'd just be another way of gaining 'time to revert', but biased to the more heavily edited articles.
- And if you're less masochistic, on the article integrity front, you could give a 'chance the page you are looking at is vandalised' statistic, which would be cool. At least it would be simple; just use the random button and note if the page you are viewing is vandalised or not vandalised. Iorek85 11:57, 28 March 2007 (UTC)
- Recent changes studies on a very small scale HAVE been done. For examples, see one I did: [1] and one done by Opabinia regalis (talk · contribs): [2]. These may serve as models for how to do study 2. There are LOTS of good possible studies that could be done by tracking recent changes:
- Correlation between anon/logged status and type of edit (Good/Test/Vandalism)
- Correlation between time of day and type of edit (Good/Test/Vandalism)
- Correlation between level of activity and type of edit
- Correlation between length of article and type of edit
- Correlation between type of article and type of edit
- Correlation between user experience and type of edit (do users with more edits make better edits, regardless of anon/logged status?)
- Just some ideas to mill over. There are LOTS of studies to be done, and we may not get to all of them, but they could ALL be useful in driving Wikipedia policy decisions, and helping to improve the encyclopedia. Again, check out the earlier RC studies. They could shed light on how this and future studies can be done. --Jayron32|talk|contribs 18:10, 28 March 2007 (UTC)
(unindent) i think that this recent changes approach would be a great method both in its simplicity and its efficacy. I do think it is most susceptible, as already suggested by Jayron32 (thanks for those other studies, awesome), to time-of-day vandals. For instance, check out this graph made by User:Nick showing that at 10pm there are about twice as many external links being added as at 10am, all days of the week. A similar trend might be present in recent-changes vandalism. I think this kind of study would reveal that too, which would be interesting as hell; a strength and a weakness! JoeSmack Talk 02:37, 29 March 2007 (UTC)
Volunteering
You guys did a terrific job on Study 1. I'm not sure how you divide up the duties, but when you have settled on a method for Study 2, please feel free to drop me a message assigning me a task (e.g., Data Points 20 - 30). Would love to be some small help, as time permits. Jonathan Stokes 04:30, 28 March 2007 (UTC)
- Absolutely! We'll rouse you for tasks, and of course value any input before and after tasks too. :) JoeSmack Talk 05:10, 28 March 2007 (UTC)
- I don't have much input on your methodology...I'm happy to be a workerbee. FYI, I just blogged your first study, hopefully with all appropriate credits and disclaimers. Tomorrow, I expect this should get picked up in this Wikipedia blog aggregator and this one, too. My hope is to help you draw attention and volunteers to the project. You're doing impressive work here, and it deserves recognition. Jonathan Stokes 05:56, 28 March 2007 (UTC)
- Thanks for the praise. Your volunteering made me realize that we should probably set up a volunteer section so I have done that below. Remember 12:40, 28 March 2007 (UTC)
- We should also do what we can to make it really easy to volunteer. We should put together tasks that are very specific, with detailed instructions on how to do them. The way I envision the random edit study is that we'll pull our list of random edits, put them here, and then have each edit as a specific task that we're looking for someone to complete. That, coupled with good instructions, should make it fairly easy for casual interested people to take a few swings with the proverbial axe. The reason this is important is that the strength of our study will have a lot to do with the amount of data we've collected. We've got to think in terms of numbers far greater than ourselves. A volunteer drive of sorts. Martschink 03:52, 1 April 2007 (UTC)
Volunteers section
Please place your name here if you are willing to help us gather or analyze data for our next study:
- Remember 12:38, 28 March 2007 (UTC)
- Jonathan Stokes 16:56, 28 March 2007 (UTC)
- JoeSmack Talk 02:26, 29 March 2007 (UTC)
- Xiner Xiner (talk, email) 02:56, 29 March 2007 (UTC)
- --Martschink 04:25, 30 March 2007 (UTC)
- shoy 16:59, 30 March 2007 (UTC)
Random edit study
I also had an interesting discussion on my talk page that I thought I would move here. In order to get a random sampling of edits we could randomly choose numbers and then go to that specific edit number in the wikipedia edit index (which I had no idea existed). What do others think? Remember 12:37, 28 March 2007 (UTC)
- The only suggestion I would have for your vandalism study would be to take a random sample of edits instead of a sample of articles. The reason being, as it stands, heavily edited pages will have (what would seem) a disproportionate weight in the resulting statistics. If you are trying to estimate the true rate of vandalism in terms of vandalized edits per total edits, then the effect of this will be increased variance of your statistics in the best case (that vandalism rates are uniform across articles), and increased variance PLUS BIAS in the worst case (that some articles are systematically vandalized at a higher rate than others). So if the goal is to estimate the encyclopedia-wise rate of vandalizing edits, my recommendation would be to randomly sample edits, not articles. Btyner 22:35, 26 March 2007 (UTC)
- Yes but how can we do that when there is no random edit button. Any ideas? Remember 22:43, 26 March 2007 (UTC)
- Do the randomization with your own software; randomly draw from integers {a,...,b} where a corresponds to (say) the first edit of the year and b corresponds to a fairly recent edit. Say your pseudorandom number generator tells you to check edit number 87310788. Then go to the corresponding edit and compare to the previous version to see if it was vandalism. There may be a more automatic solution but you'd have to ask someone more advanced in such things. Btyner 22:53, 26 March 2007 (UTC)
- I had no idea that you could check edit number 87310788. Where can I find out about this edit index? Remember 22:57, 26 March 2007 (UTC)
- Try Wikipedia:Odometer which has information in this vein. Note that for some reason the link I gave above originally had /w/ instead of /wiki/ in the title but now I've fixed it. Btyner 23:02, 26 March 2007 (UTC)
- Woooow, this method would be iron clad but a whole lot of work. This is a strong procedure, but we'd need some really technical folk to make sure it all worked out in the wash. Are people into this? JoeSmack Talk 02:40, 29 March 2007 (UTC)
- A random edit thing would require a datadump, I think. I'd like to point out, too, that I've tried RCP a few times, but have been discouraged because there seem to be so few vandals there. I've heard that you can only effectively find vandals on RCP with software tools. Xiner (talk, email) 03:05, 29 March 2007 (UTC)
- that's not entirely true; i use the recent changes button, click "hide logged in user", and with some practice it isn't hard to spot vandals. usually more than 1 in 5 of the ones i check like that are vandals; but that's far from random, since you can see the edit summary, and change of number of characters. 131.111.8.96 08:59, 29 March 2007 (UTC)
- This seems to be a good idea to randomly sample edits. As pointed out above, the random article option gives too much weight to articles with a small number of edits. Using this method it also isn't hard to focus on a specific time period.
- For example, say we wanted to sample 100 edits between January 1 and February 1. Then we just choose 100 random numbers between 97631500 and say 104715003 (edits 104715000 to 104715002 turned out to be edits to talk pages). Choosing 100 random numbers in that range shouldn't be too hard. Then use the links as i've written, for example http://wiki.riteme.site/wiki/index.php?oldid=97631500 and click on "diff" at the far left of the header to see what the edit actually was. 131.111.8.96 09:32, 29 March 2007 (UTC)
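- A minimal sketch of that sampling step (JavaScript, purely illustrative; the ID range below is the January–February example above and the URL form is the standard index.php?oldid= one, so adjust both as needed):

// Draw pseudorandom edit IDs in a chosen range and print URLs to check by hand.
// Non-article-space hits would be discarded and redrawn, as discussed above.
function randomEditUrls(count, minId, maxId) {
  var urls = [];
  for (var i = 0; i < count; i++) {
    var id = minId + Math.floor(Math.random() * (maxId - minId + 1));
    urls.push('http://wiki.riteme.site/w/index.php?oldid=' + id);
  }
  return urls;
}
randomEditUrls(100, 97631500, 104715003).forEach(function (url) {
  console.log(url);
});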
Data to collect for random edit study
I've started this section to put together the possible data to collect for each given edit. This assumes we do a random edit study. What data can we collect to help make the study a robust source of data? With more data, we can possibly add more to the sum of knowledge about vandalism beyond just the anonymous/registered distinction. Some of these have been mentioned above, but I thought this was important enough to collect together under its own section. And this is far from complete. These aren't even specific proposals, just thoughts. Dig it? Please feel free to add. For each edit:
- article length
- number of wikipedia links to the article
- number of edits to the article within 7 days (or some time period) of the edit
- time between chosen edit and immediately preceding and following edit
- number of categories the article falls under
- edit date and time
- article age
- was article protected within previous N days?
- was article protected within subsequent N days? Martschink 22:57, 1 April 2007 (UTC)
- how long it took to revert the article
- who reverted the article
Remember 23:51, 1 April 2007 (UTC)
Formal proposal for study
Since there have been several different studies proposed and discussed on this talk page, I think it would be helpful if we dedicate this section to formalizing each proposed study and then cast votes of support for which study is the most useful and which should be done next. Remember 15:51, 29 March 2007 (UTC)
- It's looking obvious that people want both of these studies done, so they will probably end up being Study 2 and Study 3. I think we should give the voting another week (until April 7th) and then figure out which study to do and set that up on the study 2 project page. Remember 01:06, 1 April 2007 (UTC)
- Well, I think it is obvious that the random edit study won since there have been no votes in a while. Let's get it started. Remember 12:06, 6 April 2007 (UTC)
- I agree: to the next section! let's work out the foundation for this baby. JoeSmack Talk 19:47, 6 April 2007 (UTC)
- Protection study- a study designed to see the effects of protecting an article from use by unregistered users.
- Proposed methodology -
- Support or Oppose and whether you think it should be the next study (please place your name and any comments indented below)
- Support. I'd lean more toward this if it's either-or.* Here's where the biggest problems occur, so it might be easiest to detect effects from semi-protect status. Each article in the study would be a sample observation with measures of the dependent and independent variables of interest. All the observations would comprise the data set on which the regression analysis could be done. The question for either study should be what's to be learned beyond Study1. *With multiple regression, however, it can be both: it implies all 0s for category variables from the random part of the sample (if categories are used) and more 0s for the semiprotect variable (since most articles at random are not semiprotected). --Thomasmeeks
- Support - i think this study would directly support or help change policy for the better in Wikipedia, and thus would be of the utmost value. JoeSmack Talk 20:01, 30 March 2007 (UTC)
- Support - I completely agree with Joe. Certainly, since it's something that has an immediate effect on a continuously used policy. JackSparrow Ninja 20:17, 30 March 2007 (UTC)
- Support - Completely agree. Alex Jackl 20:48, 30 March 2007 (UTC)
- Random edits study - a study that would examine a random sampling of edits to further discover the proportion of edits that are vandalism-related and the time it took to revert these edits. The study could also explore the proportion of edits contributed by non-registered and registered users.
- Proposed methodology -
- Support or Oppose and whether you think it should be the next study (please place your name and any comments indented below)
- Support It is important to determine what the activity of non-reg users looks like vis-a-vis the activity of reg users. A random sampling of edits to random articles spread over several times a day and several days should be good. We should aim to look at each edit, classify it as a "Keep" edit (for edits whose changes, whether major or minor, represent a positive contribution to Wikipedia), a "Revert-neutral" (like test edits, original research, or otherwise good faith but poor execution) or a "Revert-vandalism" (for edits that represent an obvious attempt to disrupt wikipedia). We should also note whether each edit was by a registered username or an IP user. Analysis and conclusions should focus on what portion of IP edits and what portion of Username edits are represented by each category. --Jayron32|talk|contribs 18:25, 29 March 2007 (UTC)
- Support per my original comment. With the effort of multiple users, a larger sample/samples could be reviewed. I'd help if this methodology is used. –Outriggr § 23:30, 29 March 2007 (UTC)
- Support- Now that we know how to gather random edits, I think we can put together an improved study that will tell us much more about contributions to wikipedia. Remember 02:40, 30 March 2007 (UTC)
- Comment: Outriggr, I agree with your earliest comment at Wikipedia talk:WikiProject Vandalism studies/Study2#Recent Changes?. That suggests a non-random selection of articles. Could you clarify? A random sample can have advantages, e.g., as to predicting an election outcome. But it's an expensive way of answering a question when interest is in a category within articles, for example high traffic-volume articles, those in the social sciences, etc. Within each of those article categories, selection should be random. But that is what I took Random edits study immediately above to mean. --Thomasmeeks 15:56, 30 March 2007 (UTC)
- Thomasmeeks, what I meant was what others have since mentioned. Each article edit has a number associated with it, and the edit can be retrieved by URL. If the first edit on February 1 was numbered 101,000,000, and the last edit on February 28 was numbered 104,200,000, then we have a range to draw random numbers from to sample edits in February. It is a random edit sample and a non-random article sample, because edits are not distributed evenly among articles... by a long shot. The only caveat is that a randomly chosen number may lead to a non-article-space edit, in which case an edit one higher is selected, or another random number is used, until an article edit is found. My interest in this approach is that it is based on current article activity. One article may be 100 times less likely a victim of vandalism than another, such that random article analysis might have results all over the map (I would think).
- Here's my example. In Excel I generated 10 numbers between 111,500,000 and 119,000,000 (approximating the month of March), concatenated them to the URL that provides a diff, and loaded 'em up.
- http://wiki.riteme.site/w/index.php?title=Index.php&diff=117783178 - article, not vandalism
- http://wiki.riteme.site/w/index.php?title=Index.php&diff=115702034 - missing diff, go to
- http://wiki.riteme.site/w/index.php?title=Index.php&diff=117955306 - a, presume nv
- http://wiki.riteme.site/w/index.php?title=Index.php&diff=113985539 - a, vandalism
- http://wiki.riteme.site/w/index.php?title=Index.php&diff=115054479 - a, nv
- http://wiki.riteme.site/w/index.php?title=Index.php&diff=118664730 - missing diff, go to
- http://wiki.riteme.site/w/index.php?title=Index.php&diff=118664731 - user talk, go to
- http://wiki.riteme.site/w/index.php?title=Index.php&diff=118664732 - user talk, go to
- http://wiki.riteme.site/w/index.php?title=Index.php&diff=118664733 - talk, go to
- http://wiki.riteme.site/w/index.php?title=Index.php&diff=118664734 - talk, go to...
- Lesson learned - consecutive edits may not be independent. Generate another number instead.
- http://wiki.riteme.site/w/index.php?title=Index.php&diff=114589697 -a, vandalism
- http://wiki.riteme.site/w/index.php?title=Index.php&diff=114011320 - a, nv
- http://wiki.riteme.site/w/index.php?title=Index.php&diff=112108400 - a, nv
- http://wiki.riteme.site/w/index.php?title=Index.php&diff=111677724 - a, nv
- http://wiki.riteme.site/w/index.php?title=Index.php&diff=116537809 - a, nv
- There you have it, 20% of edits are vandalism. :^O –Outriggr § 00:28, 31 March 2007 (UTC)
- Thx, Outriggr. Your explanation and use of italicized terms was very clear. Hope you'll add your voice for non-random article selection above. --Thomasmeeks
- Support. The only comment I'd add is that an analysis like this should tally ordinary edits, vandalism, and reversions of vandalism, correlating them with anonymous and registered editors. A possible extension would be to split registered editors into editors with real (sounding) names and editors with clear pseudonyms. I'm not certain what, if anything, that would turn up but it is a variable worth considering. --SteveMcCluskey 22:46, 31 March 2007 (UTC)
- Support. I think the study of vandalism in general has more broad appeal than the policy-specific study. Broad appeal can help us get people involved who we can then turn to for help later for a subsequent policy-specific study. Martschink 03:30, 1 April 2007 (UTC)
- Support We should try for something like 500 edits though, just to make sure we catch enough vandalism. Xiner (talk, a promise) 16:11, 1 April 2007 (UTC)
- Study on "masked" vandalism A study of a select series of articles in which vandals disguise their edits as legitimate edits and often accuse legitimate editors defending the site as vandals. I am not sure how to select the sample or what the criterion would be to evaluate that. I guess this is more of a problem statement than a study suggestion. I will think on how to do that. Alex Jackl 20:46, 30 March 2007 (UTC)
- Weak Oppose In my experience, this form of vandalism is rare. It'd be difficult to find a large enough sample. Perhaps I don't frequent controversial articles enough, though. Xiner (talk, a promise) 16:10, 1 April 2007 (UTC)
- Maybe I frequent them too much!!!!! Been spending some time in the last year on the Landmark Education site so I may be jaded. I agree. I couldn't come up with good controls or criteria, and without a rubric you can't do a meaningful study, so I am going to just throw my hat behind whatever the consensus decides!!!! Alex Jackl 00:32, 2 April 2007 (UTC)
- Other studies- suggest other studies here
Random Edits study, formulation and structure
Alright, as it looks like the Random Edits study won the vote above by 2:1, let's get the blueprints laid out. I was thinking this time we should follow a breakdown using the scientific method. Let's start filling in areas we can and set things up to start work. (omitted until work is done: Abstract, Data, Results and Discussion, Conclusion, References). JoeSmack Talk 20:06, 6 April 2007 (UTC)
- I went ahead and added this to the main page. I think we can start working on stuff there now unless anyone objects. Remember 16:45, 7 April 2007 (UTC)
- I was WP:BOLD myself and proposed an outline procedure. Make any changes and comments as needed, including scrapping the whole thing. This is just one idea. --Jayron32|talk|contribs 17:51, 7 April 2007 (UTC)
- Some have talked about a recent changes study, but I think the consensus was that to get a truly random sampling we need to look at randomly selected edits (which would be selected by a number generator and then going straight to that edit). So I have moved your suggestions to below for further comments. Any other feelings? Remember 18:19, 7 April 2007 (UTC)
- My understanding is that we're going to go forward with the random edit study, not the recent changes version. That said, I'd like to see the scope of this study be massive. We need to look at far more edits than we will be able to look at ourselves. I think we should proceed by (1) Figuring out what data we want to look at for each edit. See the working list above. (2) Spelling out a step-by-step procedure for examining an edit. This way we can easily harness the work of new volunteers. (3) Generating the list of random edit numbers that we need data for. (4) Start filling in the data. I've talked this over with another contributor, so I'm going to make two further specific proposals. First, that we examine 5000 edits. That sounds like a huge number, but it will give the study credibility and attention. That, coupled with a significant amount of information about each edit, will also make this study useful. Economists dig massive amounts of information, and our study can help provide that (along with whatever our initial conclusions are). Which leads me to my second proposal, that we collect the data into a comma separated value (CSV) chart. CSV is a common denominator for spreadsheets and databases and will make it easy to load for sophisticated analysis. I believe that is the way to add additional value to the study. Martschink 00:29, 8 April 2007 (UTC)
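- If the CSV route is taken, the header could simply mirror the "Data to collect" list above; the field names below are only illustrative, not a settled schema, and each examined edit would become one row:

edit_id,article,editor_type,edit_class,article_length,inbound_links,edits_within_7_days,seconds_since_prev_edit,category_count,edit_timestamp,article_age_days,protected_before,protected_after,seconds_to_revert,reverted_by

A file like that loads directly into Excel, OpenOffice, or any stats package for the kind of analysis discussed above.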
Moved proposed procedure
- Each participant who is willing is assigned a time to analyze edits from the Recent Changes page.
- Sampling should include all the article edits listed in the 50 total edits at a single loading of the Recent Changes page
- Times and days should be evenly spread over a long enough period to eliminate Time of Day and Time of Week effects.
- PROPOSAL: Times are chosen so that edits can be checked about once an hour for a week. To minimize total work load, consider the following times (all times UTC):
- Monday: 00:00, 03:15, 06:30, 09:45, 12:00, 15:15, 18:30, 21:45
- Tuesday: 01:05, 04:20, 07:35, 10:50, 13:05, 16:20, 19:35, 22:50
- Wednesday: 02:10, 05:25, 08:40, 11:55, 14:10, 17:25, 20:40, 23:55
- Thursday: 00:00, 03:15, 06:30, 09:45, 12:00, 15:15, 18:30, 21:45
- Friday: 01:05, 04:20, 07:35, 10:50, 13:05, 16:20, 19:35, 22:50
- Saturday: 02:10, 05:25, 08:40, 11:55, 14:10, 17:25, 20:40, 23:55
- Sunday: 00:00, 03:15, 06:30, 09:45, 12:00, 15:15, 18:30, 21:45
- Alternate times could be chosen to reduce or increase the total workload depending on the level of detail we want.
- Each data point will consist of all of the article edits listed in the 50 total edits at a single loading of the RC page at the times determined (possibly as above)
- Each edit is classified by the user making the edit:
- IP address
- Registered user
- Each edit is assigned a quality measurement
- Good faith, beneficial (an edit that is likely to be kept as is or with minimal editing)
- Good faith, but bad edit (obvious POV, very bad copyediting, unreferenced OR, etc.)
- Vandalism, minor (test edits, etc.)
- Vandalism, major (screed, nonsense, obvious vandalism)
- Revert of vandalism
- Once gathered, all the data points will be analyzed to calculate the following (a tally sketch follows this list):
- For each class of user (IP or Registered), the % of total edits that fall into each quality category.
- As a secondary analysis, besides the total numbers, each data point could be combined with other data points as needed to find data categorized by Time of Day, Day of the Week, etc. etc.
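A minimal sketch of that per-class tally (JavaScript, illustrative only; it assumes each recorded edit is an object with editorType and quality fields, which are made-up names rather than a settled format):

// Tally edits by editor class and quality category, then convert counts to percentages.
// The field names (editorType, quality) are illustrative assumptions only.
function tallyByEditorClass(edits) {
  var counts = {};
  edits.forEach(function (e) {
    var c = counts[e.editorType] || (counts[e.editorType] = { total: 0 });
    c.total += 1;
    c[e.quality] = (c[e.quality] || 0) + 1;
  });
  Object.keys(counts).forEach(function (editorType) {
    var c = counts[editorType];
    Object.keys(c).forEach(function (q) {
      if (q !== 'total') {
        c[q] = (100 * c[q] / c.total).toFixed(1) + '%';
      }
    });
  });
  return counts;
}
// e.g. tallyByEditorClass([{ editorType: 'IP', quality: 'vandalism-major' }, ...])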
Questions for Vandalism study
I posted the following on the Village Pump and received the following reply Remember 19:52, 30 April 2007 (UTC):
Village Pump posting
We are working on conducting our second vandalism study here Wikipedia:WikiProject Vandalism studies/Study2, and I had some technical questions that I thought someone here could help us with. Is there any way to determine the amount of data added or subtracted from an article by a specific edit? I know that this is displayed on the recent changes page, but can you get that information for any edit? Also is there any way to easily measure the size of an article at a specific point in its history? If anybody has any expertise in these areas please let me know. Remember 02:44, 15 April 2007 (UTC)
- You might be able to get the content of two revisions, calculate their length, and then subtract the two values. Using api.php for this seems best, since it's quick. For example, take the following function in JavaScript, which returns the change in size between revisions using Ajax in Firefox (I should learn another programming language for doing this rather soon):
// Relies on MediaWiki's built-in sajax_init_object() and the wgServer/wgScriptPath
// globals, so it is meant to be run as a user script on Wikipedia itself.
var XMLobj;
var XMLdoc;
var diffNum;

// Request the content of the given revision and the one before it via api.php.
function getNumDiff(title, crevid) {
  try {
    XMLobj = sajax_init_object();
    XMLobj.overrideMimeType('text/xml');
    XMLobj.onreadystatechange = getDiffContent;
    XMLobj.open('GET', wgServer + wgScriptPath
      + '/api.php?action=query&prop=revisions&rvprop=content&format=xml&rvlimit=2'
      + '&rvstartid=' + crevid + '&titles=' + encodeURIComponent(title), true);
    XMLobj.send(null);
  } catch (anError) {}
}

// Callback: once both revisions have arrived, alert the difference in their lengths.
function getDiffContent() {
  if (XMLobj.readyState != 4) { return; }
  if (XMLobj.status != 200) { alert('There was an error.'); return; }
  var XMLdoc = XMLobj.responseXML.documentElement;
  if (!XMLdoc) { return; }
  alert(getRvData(XMLdoc.getElementsByTagName('rev')[0]).length
      - getRvData(XMLdoc.getElementsByTagName('rev')[1]).length);
}

// Concatenate the text content of a <rev> element.
function getRvData(rvObj) {
  var rvNodes = rvObj.childNodes;
  var rvString = '';
  for (nodes in rvNodes) {
    rvString += rvNodes[nodes].nodeValue || '';
  }
  return rvString;
}
Then try, say, getNumDiff('Kitten', 122843426). This alerts the difference in revision size—in this case, 39 characters—in .25 seconds. There is probably a better way of doing it... an actual programming language may be an option. Then there's always wget. GracenotesT § 04:34, 15 April 2007 (UTC)
- For the first question, you could wait until revision 20221 of MediaWiki is activated on Wikimedia Foundation wikis which will add the number of bytes add/removed to all edits in article histories. However this requires an update of all the rows containing edit information (over 127,000,000 on the English Wikipedia alone at time of writing), so it hasn't been activated yet. Graham87 12:00, 15 April 2007 (UTC)
- See special:version for the current revision number. Graham87 12:01, 15 April 2007 (UTC)