Wikipedia talk:WikiProject Medicine/Copyright

Links mostly broken

User:Jmh649, User:ערן: Great to see this moving forward! I'll be following with great interest. In the updates so far, it looks like the links to matching content are being truncated, and most of them are broken. Also, are you planning to filter out the diffs that are actually reverts to previous versions?--Sage (Wiki Ed) (talk) 19:00, 14 August 2014 (UTC)

@Sage (Wiki Ed), I made some improvements to the tool, which should appear in future updates of the page:

It uses only the new/replaced text instead of including some surrounding text.
Truncated links are fixed.
Reverts (with the default summary) are no longer considered.

For broken or non-working links: the internal (non-public) interface of iThenticate gives some more detail, including the date on which the source page was crawled. Some common reasons for broken links:

The link is truncated.
The source page isn't available anymore (the original site is down).
The content isn't publicly available, and only users with access can view it (as with some scientific journals).

The current plan is to run the bot every 5 hours, each time covering diffs from the last 6 hours (so some diffs may be duplicated).

Eran (talk) 19:37, 14 August 2014 (UTC)
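As an illustration of the schedule described above (a run every 5 hours over a 6-hour window), here is a minimal sketch of how the one-hour overlap could be handled so that the same diff is not reported twice. This is not the bot's actual code: the function names, the `(rev_id, timestamp)` shape of the input, and the in-memory `seen_rev_ids` set are all assumptions made for the example.

```python
from datetime import datetime, timedelta

RUN_INTERVAL = timedelta(hours=5)  # interval between runs, from the comment above
LOOKBACK = timedelta(hours=6)      # window each run covers, from the comment above


def window_for_run(run_time):
    """Return the (start, end) window that a run at run_time would cover."""
    return run_time - LOOKBACK, run_time


def new_diffs(diffs, run_time, seen_rev_ids):
    """Select diffs inside the window that no earlier run has reported.

    diffs: iterable of (rev_id, timestamp) pairs (hypothetical shape).
    seen_rev_ids: set of revision IDs already reported; updated in place.
    """
    start, end = window_for_run(run_time)
    fresh = []
    for rev_id, ts in diffs:
        if start < ts <= end and rev_id not in seen_rev_ids:
            seen_rev_ids.add(rev_id)
            fresh.append(rev_id)
    return fresh


if __name__ == "__main__":
    first_run = datetime(2014, 8, 14, 12, 0)
    diffs = [
        (1, first_run - timedelta(hours=5, minutes=30)),
        (2, first_run - timedelta(minutes=30)),  # falls in both windows
        (3, first_run + timedelta(hours=2)),
    ]
    seen = set()
    print(new_diffs(diffs, first_run, seen))
    print(new_diffs(diffs, first_run + RUN_INTERVAL, seen))
```

Revision 2 falls inside both windows, but the second run skips it because its ID is already in `seen_rev_ids`. A real bot would need to persist that set between runs.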

Eran: Exciting! No rush, but I'm interested in poking around the tool source when it's available. As Doc James might have told you, Wiki Ed Foundation is also very interested in this type of tool, although we're looking to do somewhat different things with it.--Sage (Wiki Ed) (talk) 19:47, 14 August 2014 (UTC)

Interesting, though we will go mad trying to keep up with it. Should/can reversions to previous versions be excluded? Obviously this is likely to throw up both acceptable material from old PD sources (EB and CE seem to be in today's) and mirror/rip-off sites that have taken original WP content. Wiki CRUK John (talk) 10:07, 15 August 2014 (UTC)

It is picking up a good number of true positives. Yes, we need to exclude reverts. This should be fairly easy to do. Doc James (talk · contribs · email) (if I write on your page reply on mine) 06:33, 23 August 2014 (UTC)

Done. Wiki CRUK John, thank you for the comment. Eran (talk) 07:50, 23 August 2014 (UTC)
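For readers wondering how reverts might be excluded, here is a rough sketch combining two checks. It is not a description of Eran's implementation, which the thread says only that it skips reverts with the default summary. The summary patterns below are assumed approximations of the default undo/rollback summaries on English Wikipedia, and the content-hash comparison is an extra technique added here to catch reverts that carry a custom summary.

```python
import re

# Assumed approximations of default undo/rollback edit summaries.
DEFAULT_REVERT_SUMMARY = re.compile(
    r"^(Undid revision \d+ by |Reverted \d+ edits? by |Reverted edits by )"
)


def is_revert(summary, new_sha1, earlier_sha1s):
    """Return True if an edit looks like a revert.

    summary: the edit summary (may be None or empty).
    new_sha1: content hash of the page after the edit.
    earlier_sha1s: content hashes of earlier revisions of the same page.
    """
    # Check 1: the summary starts like a default revert summary.
    if DEFAULT_REVERT_SUMMARY.match(summary or ""):
        return True
    # Check 2 (added assumption): the page is now identical to an earlier
    # revision, i.e. an "identity revert" regardless of summary.
    return new_sha1 in set(earlier_sha1s)
```

The hash check costs one extra lookup of the page history per diff, but it also catches manual restorations of an old version, which the summary check alone would miss.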

Expansion to all of enwp

Do we have any sense of how large a workload this will be? At least initially, the med-only scope and manual assessments took a substantial part of one editor's time (mine). I expect that we're looking at a very substantial multiplier on that if the tools aren't greatly advanced first. Phased scale-up of the scope would help, but we'll need more than a few people looking at the reports. Tools will need to automate flagging suspected sources: Was there something in the addition that resembles a citation? Is the suspected source still online? Was the matched string previously seen in WP? Are other page edits matching the same suspected source? Which of multiple suspected sources has the longest match to the added text? (If it is a copyvio, this will be the source.) When it is, we will need one-click tools to notify the editor, undo the edit, and comment the undo. When it is not, we will need a one-click way to revise the blacklist. Let's walk before attempting to win the marathon. LeadSongDog come howl! 05:38, 18 December 2014 (UTC)
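The "longest match" check suggested above could be prototyped roughly as follows. This is only a sketch under assumptions: the function names and the URL-to-text mapping are made up for the example, the candidate source texts are assumed to be already fetched, and the longest common substring is used as a simple heuristic, not as proof of which source was copied.

```python
from difflib import SequenceMatcher


def longest_match_length(added_text, source_text):
    """Length of the longest contiguous run shared by the two texts."""
    matcher = SequenceMatcher(None, added_text, source_text, autojunk=False)
    match = matcher.find_longest_match(0, len(added_text), 0, len(source_text))
    return match.size


def rank_sources(added_text, sources):
    """Rank suspected sources by longest shared run, longest first.

    sources: dict mapping a source URL to its text (hypothetical shape).
    Returns a list of (url, match_length) pairs.
    """
    scored = [
        (url, longest_match_length(added_text, text))
        for url, text in sources.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Character-level matching like this is slow on long pages. A production tool would more likely compare word n-grams or rely on the match data the plagiarism service already returns.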

We estimate that 3000 to 4000 edits will be sent through the API and that 375 will be flagged per day. I guess the question is: what is happening to all the copyright violations we are currently not detecting? We have a number of people interested in volunteering. This will help drive improvements to the underlying software, I hope. I would check more of them, but you keep beating me to it, Lead :-) Doc James (talk · contribs · email) 05:52, 18 December 2014 (UTC)

Some details

This set of guidelines is very helpful for getting a clearer idea of how this would work. I have a few further questions:

What will the user page disclosure say? What suggestions will it offer for somebody who has questions, concerns, or wants to pitch in?
Regarding point #6, which do you anticipate being more common -- will this user primarily engage with discussion, or will you?
What templates do you plan to leave on user talk pages? Are they existing warnings, etc. or will you be composing new/specialized ones?
How do you plan to measure the level of success at the end of the 4-month pilot? What will be significant information to you: the rate of retention of the edits? The number of complaints/concerns raised by community members? The number of instructors who engage on behalf of their students? I'm just throwing these ideas out off the top of my head. What do you consider the important findings you'll have? -Pete (talk) 17:25, 18 December 2014 (UTC)
- It could say "This user is being paid by User:Doc James to follow up and fix copyright issues on Wikipedia. If you have questions, concerns or are interested in becoming involved please contact User:Doc James"
- The user will respond to verify that they have received and read the concern. Likely I will primarily engage.
- People usually do not get blocked the first time copyright issues are raised. We will want to look at how the message left on their talk page affects the chance of them plagiarizing again or continuing to edit Wikipedia.
- The continuity of this person's position will depend partly on whether they can accurately differentiate between what is and isn't actually copyright infringement.
- It is always difficult to predict concerns beforehand. It will depend not on the number of complaints but on their substance. One serious concern may end things. Doc James (talk · contribs · email) 19:44, 18 December 2014 (UTC)