User:Doc James/PC trial

Previous data is here [1] and [2]

Proposal

I propose a trial of pending changes on WP:MED's good articles, of which there are about 100. The trial will generate numerical data that will help us determine the risks versus benefits of this tool. The methods of data analysis will be determined ahead of time to decrease the risk of data mining.

Methods

Randomly assign articles to PC or no protection for one month, then reverse the assignment for a second month.
Data should be analyzed by at least two editors who are for PC and two who are against. Discrepancies in interpretation will be debated before the final presentation.
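The crossover assignment described above could be sketched roughly as follows. This is only an illustration: the function name `assign_crossover`, the fixed seed, and the placeholder article titles are assumptions, not part of the proposal.

```python
import random

def assign_crossover(articles, seed=2010):
    """Randomly split articles into two groups and build a two-month schedule.

    Each article gets one month of PC and one month of no protection;
    only the order differs between the two groups.
    """
    rng = random.Random(seed)   # fixed seed so the assignment can be reproduced and audited
    shuffled = list(articles)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2

    schedule = {}
    for title in shuffled[:half]:
        schedule[title] = ("PC", "none")   # month 1 under PC, month 2 unprotected
    for title in shuffled[half:]:
        schedule[title] = ("none", "PC")   # month 1 unprotected, month 2 under PC
    return schedule

# Placeholder titles standing in for the roughly 100 good articles.
articles = [f"Article {i}" for i in range(1, 101)]
schedule = assign_crossover(articles)
```

Publishing the seed and the article list before the trial starts would let editors on either side of the debate check that the assignment was not hand-picked.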

Analysis

Count the number of edits on each article (from both new and established users) under each system.
This will give an indication of whether pending changes stops new users from editing (i.e. will PC drive new users away?).
Determine the number of edits that are vandalism (from both new and established users).
This will determine whether PC affects the overall rate of vandalism.
Determine the length of time, in minutes, that vandalism is "live" under each system.
This will determine whether PC is effective at decreasing live vandalism and, if so, by how much.
Determine the number of inappropriate reverts by reviewers.
This will show whether pending changes increases or decreases inappropriate reverts (i.e. will this be a tool with which one class of editors can more easily lord power over another?).
Data will be presented for each individual article and combined together.
This will give us an idea whether PC works for some article types yet does not work for others.
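One way the measures above might be tallied, per article and combined, is sketched below. The record layout (`article`, `arm`, `new_user`, `vandalism`, `live_minutes`, `wrongly_reverted`) is hypothetical, and the sample records are made-up placeholders rather than real data. Classifying an edit as vandalism or a revert as inappropriate would still have to be done by hand by the analyzing editors.

```python
from collections import defaultdict

def summarize(edits):
    """Tally the proposed measures per (article, arm) and combined as ("ALL", arm)."""
    totals = defaultdict(lambda: {
        "edits": 0,
        "new_user_edits": 0,
        "vandalism": 0,
        "vandalism_live_minutes": 0.0,
        "inappropriate_reverts": 0,
    })
    for e in edits:
        # Each edit counts once for its own article and once for the combined row.
        for key in ((e["article"], e["arm"]), ("ALL", e["arm"])):
            row = totals[key]
            row["edits"] += 1
            row["new_user_edits"] += int(e["new_user"])
            row["vandalism"] += int(e["vandalism"])
            if e["vandalism"]:
                row["vandalism_live_minutes"] += e["live_minutes"]
            row["inappropriate_reverts"] += int(e["wrongly_reverted"])
    return dict(totals)

# Made-up placeholder records, for illustration only.
edits = [
    {"article": "Obesity", "arm": "PC", "new_user": True, "vandalism": True,
     "live_minutes": 0.0, "wrongly_reverted": False},
    {"article": "Obesity", "arm": "none", "new_user": True, "vandalism": True,
     "live_minutes": 42.0, "wrongly_reverted": False},
    {"article": "HIV", "arm": "PC", "new_user": True, "vandalism": False,
     "live_minutes": 0.0, "wrongly_reverted": True},
    {"article": "HIV", "arm": "none", "new_user": False, "vandalism": False,
     "live_minutes": 0.0, "wrongly_reverted": False},
]
summary = summarize(edits)
```

Keeping the per-article rows alongside the combined rows matches the last point of the analysis: it allows checking whether PC behaves differently on some article types.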

Determination of effectiveness

Once the final data is presented another straw poll can be conducted to see how the community feels about the results.

Suggestion

One should think a bit about statistical validity here. It seems likely to me that Animal testing, Homeopathy, and HIV get more abuse than everything else put together. If they ended up in the same group, the groups would be very unbalanced. I would be happier if one or two specific null hypotheses could be identified, so that procedures could be set up to properly test them. Looie496 (talk) 05:28, 6 September 2010 (UTC)

Above I state that all articles will have a one month period of PC and a one month period of no protection. So articles can be compared not only to themselves but also to each other in groups. But yes, we will add obesity to those three and make sure two are in each group. Doc James (talk · contribs · email) 05:34, 6 September 2010 (UTC)
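The balancing described in this exchange amounts to stratified randomization: shuffle the heavily targeted articles separately from the rest, then split each stratum in half. A minimal sketch, assuming a hypothetical helper `stratified_split` and placeholder titles for the remaining articles:

```python
import random

def stratified_split(high_risk, others, seed=2010):
    """Split each stratum in half at random so both groups get equal shares."""
    rng = random.Random(seed)   # fixed seed keeps the split reproducible
    group_a, group_b = [], []
    for stratum in (high_risk, others):
        pool = list(stratum)
        rng.shuffle(pool)
        half = len(pool) // 2
        group_a.extend(pool[:half])
        group_b.extend(pool[half:])
    return group_a, group_b

# The four articles named in the discussion; the other titles are placeholders.
high_risk = ["Animal testing", "Homeopathy", "HIV", "Obesity"]
others = [f"Article {i}" for i in range(1, 97)]
pc_first, pc_second = stratified_split(high_risk, others)
```

With four high-risk articles, each group receives exactly two of them, whichever two the shuffle happens to pick.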

Thoughts

My thoughts on the proposal:

It is an interesting idea to trial a limited form of blanket PC-protection. I can see some people taking issue with a whole class of articles being PC protected, irrespective of a history of vandalism. Mind you, many others are arguing for that sort of thing so it may be worth at least trialling it (but that would not be my preferred approach).
I like the fact that you have thought about the methodology in advance of the trial. That was seriously lacking in the previous trial. You are obviously using some ideas from clinical trials, which is a good idea, but I must caution that a double-blind trial is obviously impossible so some more thought must be applied to the methodology. Specifically:
- The idea of involving a WikiProject that is in favour of Pending Changes seems sensible – much better than imposing it on people who do not like it – but it does run the risk of biasing the results – i.e. people checking for pending changes with great enthusiasm but not checking their watchlist quite as much as they might have done otherwise.
- It may become known amongst vandals that the trial is taking place on medical articles and so it will be in their power to skew the results by targeting one set of articles or the other.
- Articles which are repeatedly targeted by vandals (either because of the above point, or by chance) may have to be semi-protected and hence you will not get the clean comparison of non-protected to PC-protected envisaged in the methodology. A useful statistic to measure would be the number of pages that end up being semi-protected, but semi-protection is a relatively rare event so you would need a much bigger sample size to get meaningful data on this.

Yaris678 (talk) 08:18, 6 September 2010 (UTC)

WRT excess vandalism, we simply deal with it during the one month trial period. If vandalism occurs it will provide data on why we need edit controls. We should keep in mind that some people think "all" should have the same right to edit no matter what.

WRT blinding, it is difficult, which is why the data shall be analyzed by different editors. This will mean people cannot claim the results are just because someone who liked / didn't like the idea analyzed the data.

WRT vandals deliberately trying to affect the trial, I consider this unlikely at best. Many vandals are school students who are here to insult their friends, not investigate the inner workings of Wikipedia. Doc James (talk · contribs · email) 14:32, 6 September 2010 (UTC)

OK. Now I am slightly less impressed with your understanding of clinical trials. Double-blind means that neither the doctor nor the patient is aware of which treatment is being given. It is used to negate a placebo effect, which could otherwise skew the results. It has nothing to do with who analyses the data. In this context, I was using the term to refer to the fact that it will be obvious to the "doctors" patrolling the articles which ones are PC protected. Therefore, it would help if we didn't skew our sample of "doctors" by considering pages related to a WikiProject that is all in favour of PC protection.

I think your characterisation of vandals is correct in most instances, but there are definitely some who just enjoy seeing how much they can cock things up.

I didn't understand your first point. Obviously vandalism will occur. I am talking about cases where there is a lot of vandalism and the page needs to be semi-protected so that reviewers don't get overloaded.

Yaris678 (talk) 17:04, 6 September 2010 (UTC)

Yes, I am well aware of what "double blinding" is. If you can think of a way to do this, great. I cannot, but maybe it is just my lack of imagination at this point. Double blinding is more important when one is dealing with subjective results. As these results will be less subjective, double blinding is less essential.

One thing about research is studies get done on topics that people are interested in. I have proposed a medical subject area as this is what I am interested in. If someone is interested in another area we could do another trial there. The more trials the better as even a single trial rarely completely answers a question.

As this is a trial, seeing whether reviewers will get overloaded is an interesting question. This will help us decide at what level of vandalism we should move from PC to semi protection.

My first point concerns using semi protection during the trial. While we could, I think it would be better if we did not and just dealt with the vandalism. We have people who say semi protection is not needed either. Doc James (talk · contribs · email) 18:59, 6 September 2010 (UTC)

I can't think of a way to do a double-blind trial of PC. That was my point. Let me be clear: I can see some merit in the methodology you are proposing. However, I would prefer to take your ideas and adapt them, rather than do the specific trial you propose. That said, I can see some merit in your idea of running a number of different trials so there is an argument that you should just be left to get on with it. However, (1) not everyone is as willing as me to let bits of Wikipedia be experimented with (2) you need to be aware of some of the weaknesses in the methodology. I have expressed my thoughts on the weaknesses of your proposed methodology so you can reflect on them or reject them as you wish. Yaris678 (talk) 04:44, 7 September 2010 (UTC)

Yes appreciate the feedback. Doc James (talk · contribs · email) 16:31, 7 September 2010 (UTC)

My pleasure. Yaris678 (talk) 18:30, 7 September 2010 (UTC)

UncleDouggie actually suggested something similar to this, except that he also encouraged that the transparency that Wikipedia depends on to combat abuse of the system be removed, thus making it impossible to determine if the trial is going correctly and/or if pending changes is being abused. —Jeremy ^{(v^_^v PC/SP is a show-trial!)} 20:34, 3 October 2010 (UTC)

Objection

One of the major issues I have with this proposal is that there is a strong minority of editors who are absolutely opposed to pending changes, so doing an even more advanced trial will only increase the number of users who are against it.--Gniniv (talk) 09:48, 6 September 2010 (UTC)

Huh? How does this have any bearing on the trial? The discussion is supposed to be based on reasoned argument. One needs data to support these arguments. Doc James (talk · contribs · email) 14:23, 6 September 2010 (UTC)

Volume

Two of the problems with the trial were that the small number of articles and large number of reviewers made it an unrealistic test, with reviewers tripping over each other to review edits, and that, because a large proportion of the articles it was applied to had previously been semi protected, it became a test of whether IPs could make useful edits rather than of whether we could detect a higher proportion of vandalism. We know from the German experience that flagged revisions on all articles doesn't reduce IP editing, but it does reduce the chance of vandalism slipping in on the occasions when recent changes is swamped. So if we want to test the effectiveness of this at screening out vandalism from unwatched BLPs, I would suggest the following trial:

Identify a list of, say, 50,000 BLPs and randomly subdivide them into two groups.
Protect one list of 25,000 with pending changes for one month.
At the end of the month announce the list of the other 25,000.
Give both sides a further month to trawl through the edits to the two groups of articles to find examples where vandalism and improvements took place in that month, and then try to compare them on:

Improvements: Did pending changes scare off the IPs who might have improved the 25,000 in the trial?
Vandalism: Of the edits that took place during the month and which weren't reverted, how many for each group were vandalism?
IP edits and total edits: How many edits did each group of articles have in the month before the trial and how many did they have in the month of the trial?

I think this would give us some objective measures, and if no-one knows which the control sample is during the trial it would be relatively hard to game this test. Ϣere SpielChequers 14:58, 8 September 2010 (UTC)
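For the vandalism comparison, one possible objective measure is a two-proportion z-test on the share of unreverted edits in each group that turned out to be vandalism. This is a swapped-in standard technique, not something specified in the suggestion above; the function name `two_proportion_z` and the counts in the example call are made-up placeholders.

```python
import math

def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    hits_* = unreverted edits judged to be vandalism in each group
    n_*    = all unreverted edits examined in each group
    Assumes at least one hit and one non-hit overall (otherwise the
    standard error is zero and the test is undefined).
    """
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided tail of the normal distribution
    return z, p_value

# Placeholder counts, not real results: PC group first, control group second.
z, p_value = two_proportion_z(30, 1000, 60, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Agreeing on the test and its significance threshold before the lists are revealed would fit the aim of settling the methods of analysis ahead of time.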

Yes, that too would be an interesting trial and I would support someone who wished to carry it out. I am willing to carry out the above trial, and while I agree it is small in size, it is something that would not be too difficult to analyze. Also, the majority of editors at WP:MED support PC. I have little interest in BLPs; however, I am trying to introduce Wikipedia editing into the medical curriculum. Doc James (talk · contribs · email) 04:19, 9 September 2010 (UTC)

Why would you do that when Wikipedia is supposed to be a general resource, not a specialized one? —Jeremy ^{(v^_^v PC/SP is a show-trial!)} 20:28, 3 October 2010 (UTC)