Wikipedia:Bots/Requests for approval/OAbot 3
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.
Operator: Nemo_bis (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search) for this task; Pintoch (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search) as main owner and author of the bot
Time filed: 13:52, Thursday, July 25, 2019 (UTC)
Function overview: Add and maintain supported identifiers to citation templates (mostly {{cite journal}}), including related metadata such as access level, but excluding the |url= parameter.
Automatic: A queue of edits is created automatically (manually triggered); its contents are then cursorily reviewed by hand to exclude anomalies, and selected items are moved to a queue for the bot to perform automatically. Edits are then sampled for manual checks, and in the few hours or days following a bot run the operators manually fix the pages which ended up in Category:CS1 maintenance (typically fewer than one in a thousand).
Programming language(s): Python
Source code available: https://github.com/dissemin/oabot / phabricator:tag/oabot/ (relying on https://github.com/dissemin/dissemin/ and https://github.com/Impactstory/oadoi )
Links to relevant discussions (where appropriate): Wikipedia talk:OABOT, Help_talk:Citation_Style_1#RfC_on_linking_title_to_PMC and similar for the desirability of identifiers and precise information on them.
Edit period(s): Once every few weeks or months.
Estimated number of pages affected: Less than 20k for the first steps; more than 300k overall considering all articles with DOIs.
Namespace(s): 0
Exclusion compliant (Yes/No): Yes
Adminbot (Yes/No): No
Function details: Following the success of task OAbot 2, we propose to extend the functionality of the bot to all supported identifiers. The addition of arxiv and PMC identifiers (about 25k edits) went well: few mistakes were encountered, and the bot has been made more robust in response (for instance, we now match publications more strictly).
The first step will be to add |hdl= identifiers and |hdl-access= status on about 2k articles. Those handles typically point to an institutional repository like https://ntrs.nasa.gov/ or https://deepblue.lib.umich.edu/ (the most common in the queue is https://quod.lib.umich.edu/ for now). Citation bot is also able to add such identifiers, but does so more slowly and does not (yet) set access status, while we now do (T228632): example edit [1].
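For illustration only, a minimal sketch of what adding |hdl= and |hdl-access= to a citation template involves. The function name `add_hdl` and the naive trailing-`}}` handling are assumptions for this sketch; the bot's actual implementation (in the linked repositories) parses wikitext templates properly and handles nested templates.

```python
import re

def add_hdl(template: str, hdl: str, free_to_read: bool = True) -> str:
    """Insert |hdl= (and |hdl-access=free) before the closing braces of a
    citation template, unless an hdl parameter is already present."""
    if re.search(r"\|\s*hdl\s*=", template):
        return template  # already present; leave the citation untouched
    params = "|hdl=" + hdl + ("|hdl-access=free" if free_to_read else "")
    # simplifying assumption: the template ends with '}}' and contains
    # no nested templates
    return template[:-2] + params + "}}"

cite = "{{cite journal|title=Example|journal=Icarus|doi=10.1000/xyz}}"
print(add_hdl(cite, "2027.42/12345"))
```

Running the same function twice is a no-op, which matters for a bot that may revisit pages across runs.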
After this is done, other identifiers will be handled depending on demand and volumes. The most consequential work will be to eventually add |doi-access=free to all relevant citations (an estimated 200k DOIs): this functionality was part of the original request (and not challenged by anybody) but was later dropped when the bot became a user-triggered tool, as the number of required edits is incompatible with human editing.
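Since the bot relies on oadoi (Unpaywall) data, here is a minimal sketch of how a |doi-access=free decision could be made from an Unpaywall-style record. The field names (`is_oa`, `best_oa_location`, `host_type`) follow the public Unpaywall response schema; the decision rule shown — requiring the free copy to be hosted by the publisher, not merely a repository — is an assumption for this sketch, and the bot's actual logic lives in the linked repositories.

```python
def doi_access_free(record: dict) -> bool:
    """Decide whether |doi-access=free applies: the DOI itself must lead to a
    free-to-read copy, i.e. the open location is hosted by the publisher
    (gold/hybrid OA), not just a repository copy reachable via other ids."""
    loc = record.get("best_oa_location") or {}
    return bool(record.get("is_oa")) and loc.get("host_type") == "publisher"

gold = {"is_oa": True, "best_oa_location": {"host_type": "publisher"}}
green = {"is_oa": True, "best_oa_location": {"host_type": "repository"}}
closed = {"is_oa": False, "best_oa_location": None}
print(doi_access_free(gold), doi_access_free(green), doi_access_free(closed))
```

A green-OA record would instead be a candidate for a repository identifier (hdl, PMC, arxiv) rather than |doi-access=free.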
Expected improvements in the near future, if this task is approved, include:
- maintenance of existing identifiers, e.g. to remove or report on broken identifiers (e.g. CiteSeerX records which may have been taken down);
- avoiding more publisher URLs even in manual mode, instead adding DOIs or DOI access data where relevant (to avoid creating more work for Citation bot, which removes redundant URLs);
- giving the community full prior control over which identifiers are added by the bot, by adding a subpage to Wikipedia:OABOT where users would be able to blacklist individual URLs (and therefore identifiers) they consider undesirable for whatever reason, including suspected errors in open access repositories (mismatch between record and DOI, files mistakenly open for download, etc.), even if such cases are a minuscule minority.
Discussion
- This might not be relevant yet, but I take it that the bot won't add identifiers without some kind of procedure to reject unsuitable identifiers? This bot has had some copyright issues in the past. It also won't replace already existing URLs? Because that might be problematic under WP:SAYWHEREYOUGOTIT. Jo-Jo Eumerus (talk, contributions) 16:07, 25 July 2019 (UTC)[reply]
- The current procedure to reject unwanted identifiers is to either blacklist the bot on the specific page with {{bots}} or comment out the identifier in the specific citation template. The proposed additional procedure is to let any user blacklist an identifier by means of linking it on a central subpage, so that it's no longer added to any other page: this will allow users to reject one, ten or a thousand identifiers with a single edit and have the community decide it by consensus.
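The central-subpage blacklist described above could work roughly as follows — a sketch under stated assumptions: the page format (one URL per bullet) and the helper names `parse_blacklist`/`filter_candidates` are hypothetical, invented for illustration.

```python
import re

def parse_blacklist(wikitext: str) -> set:
    """Collect URLs listed one per bullet on a hypothetical
    Wikipedia:OABOT blacklist subpage."""
    return set(re.findall(r"^\*\s*(\S+)", wikitext, flags=re.MULTILINE))

def filter_candidates(candidates, blacklist):
    """Drop any proposed identifier whose landing URL has been blacklisted,
    so a single edit to the subpage suppresses it on every article."""
    return [c for c in candidates if c["url"] not in blacklist]

page = "* https://example.org/bad-record\n* https://example.org/mismatch"
proposed = [
    {"id": "hdl:2027.42/1", "url": "https://example.org/good-record"},
    {"id": "hdl:2027.42/2", "url": "https://example.org/bad-record"},
]
print([c["id"] for c in filter_candidates(proposed, parse_blacklist(page))])
```

This is what makes it possible to reject one, ten or a thousand identifiers with a single edit, as described above.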
- This task proposes that no edits are made to the
|url=
parameter at all using the bot account. I'll note however that WP:SAYWHERE specifically states that «You do not have to specify how you obtained and read it. So long as you are confident that you read a true and accurate copy, it does not matter [...]». Nemo 16:24, 25 July 2019 (UTC)[reply]
- I have no objection to adding hdl identifiers. But I am currently seeing huge numbers of OABot edits on my watchlist, making it difficult to find any other changes and impossible to manually check them for accuracy, and would be interested in knowing whether there are any plans for throttling the bot to a more reasonable rate of updates. Also, if the "other identifiers" to be added are to be included in this BRFA, they need to be specified explicitly. For instance, I would be opposed to automatically adding citeseerx identifiers, for all the previously-discussed reasons, and wouldn't want this BRFA to be taken as sidestepping that discussion. —David Eppstein (talk) 17:57, 25 July 2019 (UTC)[reply]
- On the first point, I agree we need a frank conversation on the scope of the task; I just suggest avoiding the same conversation over and over for each new identifier. On the second, as far as I can see the bot has respected the typical rate limit of 12 edits per minute, but it would not be a problem to reduce the speed. Nemo 20:06, 25 July 2019 (UTC)[reply]
- I support this task, but I'll let someone else from the BAG do a review here. I'll note here that WP:BOTREQUIRE suggests 6 EPM for non-urgent tasks, however. Headbomb {t · c · p · b} 14:10, 30 July 2019 (UTC)[reply]
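The rate limits discussed above (12 EPM observed, 6 EPM suggested for non-urgent tasks) amount to a minimum spacing between saves. A minimal sketch of the arithmetic — the helper `next_allowed` is hypothetical, not the bot's actual throttling code:

```python
def next_allowed(last_edit_time: float, epm: int = 6) -> float:
    """Earliest timestamp (in seconds) at which the next edit may be
    saved, given a maximum rate of `epm` edits per minute."""
    return last_edit_time + 60.0 / epm

print(next_allowed(0.0))        # 10-second gap between edits at 6 EPM
print(next_allowed(100.0, 12))  # 5-second gap at the older 12 EPM rate
```

In practice the bot would sleep until `next_allowed(...)` before each save.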
- One thing I would like to see is that zenodo support is added to CS1 templates. Headbomb {t · c · p · b} 14:13, 30 July 2019 (UTC)[reply]
- Actually, Zenodo links were the reason why I did ask whether
the bot won't add identifiers without some kind of procedure to reject unsuitable identifiers
as we've had copyright problems and disputes about them. I am not sure if the problem was resolved, though. Jo-Jo Eumerus (talk, contributions) 14:20, 30 July 2019 (UTC)[reply]
- This request, rebus sic stantibus, would not produce any addition of links to Zenodo, as there is no identifier for it. As for the existing identifier parameters, which evidently were added because the target websites are considered good resources rather than systematic copyright infringement rackets, the proposal is the blacklist of specific URLs above. The discussions you linked were often focused on hypothetical or apodictic statements, impossible to discuss constructively; if users can instead focus on explaining which URLs are bad and for which reasons, a consensus will be easier to find. Nemo 19:56, 30 July 2019 (UTC)[reply]
- I pushed a change to reduce the editing speed. Nemo 19:56, 30 July 2019 (UTC)[reply]
- Back to the CiteSeerX issue, to rekindle the discussion: in my opinion it falls squarely under Wikipedia:Copyrights#Linking_to_copyrighted_works "It is currently acceptable to link to internet archives such as the Wayback Machine, which host unmodified archived copies of webpages taken at various points in time" for the cached PDFs, while the rest of the functions (citation graphs etc.) are uncontroversially helpful and unproblematic. Therefore the current policies support an automatic addition and we should only handle the rare exceptions where a link would be problematic: a blacklist is a possible technical solution, but we could consider other ideas. Nemo 17:08, 24 August 2019 (UTC)[reply]
Sounds great, thanks for the update. – SJ + 18:38, 16 August 2019 (UTC)[reply]
- If this is purely about adding already-supported identifiers (sans CiteSeerX) and converting existing URLs to identifiers (e.g. |url=http://citeseerx.ist.psu.edu/viewdoc/summary?doi=... → |citeseerx=... is fine), then I see little that is objectionable. So let's see a trial at least, with an explicit list of identifiers covered, and we'll have a better idea what's in store. Approved for trial (10 edits per identifier). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Headbomb {t · c · p · b} 17:59, 22 September 2019 (UTC)[reply]
- Trial complete. [2] with a syntax error which I've now fixed. I'm requesting another trial to test the correction. Nemo 08:27, 25 September 2019 (UTC)[reply]
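The URL-to-identifier conversion mentioned above can be sketched as a single substitution. The function name `url_to_citeseerx` and the example identifier are invented for illustration, and the regex only covers the common `viewdoc/summary` and `viewdoc/download` URL forms; the real conversion would need to handle more variants.

```python
import re

CITESEERX = re.compile(
    r"\|\s*url\s*=\s*https?://citeseerx\.ist\.psu\.edu/viewdoc/"
    r"(?:summary|download)\?doi=([0-9.]+)[^|}]*"
)

def url_to_citeseerx(template: str) -> str:
    """Replace a CiteSeerX |url= with the equivalent |citeseerx= identifier,
    leaving every other parameter of the citation untouched."""
    return CITESEERX.sub(r"|citeseerx=\1", template)

cite = ("{{cite journal|title=Example"
        "|url=http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.42.9999}}")
print(url_to_citeseerx(cite))
```

Templates without a CiteSeerX URL pass through unchanged, so the substitution is safe to run over whole pages.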
- Approved for extended trial (again 10 per identifier). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Are you planning on only adding handle links with this one? Headbomb {t · c · p · b} 13:08, 25 September 2019 (UTC)[reply]
- Thanks. I don't have much to work on apart from Handle and CiteSeerX identifiers at the moment. If the bot is approved, I'll write the code to add doi-access=free, which would make up the bulk of the edits, and any other identifiers for which demand happened to arise in the future. For instance, biorxiv.org content does not seem to be in big demand right now, but that might change in the future (there's plenty!), in which case adding it will be trivial. Nemo 19:49, 25 September 2019 (UTC)[reply]
- Trial complete. [3] looks ok to me. Nemo 20:23, 26 September 2019 (UTC)[reply]
- Trial complete. also for the doi-access=free addition. [4] Nemo 16:22, 27 September 2019 (UTC)[reply]
- This sounds an excellent idea. I wish it well. Support (if that's appropriate). Chiswick Chap (talk) 13:33, 28 September 2019 (UTC)[reply]
- @Nemo bis: in [5], for example, it correctly adds one |doi-access=, but it misses several more [6]. Any way to catch/report those? Headbomb {t · c · p · b} 15:14, 28 September 2019 (UTC)[reply]
- Indeed in this run it was only editing one template per page, even when it identified more. I was just writing a patch to make sure it did everything in a single edit rather than multiple passes. Nemo 17:25, 28 September 2019 (UTC)[reply]
- Then Approved for extended trial (25 edits, with the 'do everything it can at once' logic). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Or however many edits you need to have at least 5 instances of multiple doi/hdl added/flagged in the same edit. Headbomb {t · c · p · b} 18:09, 28 September 2019 (UTC)[reply]
- Trial complete. 25 larger edits with doi-access, hdl, hdl-access. If there are still doubts about this I'd suggest to do a larger trial run, for instance 5000 edits, so we can collect more feedback. Nemo 14:17, 30 September 2019 (UTC)[reply]
- I don't want to be pushy with BAG members already busy on other pages, but maybe one of @TheSandDoctor, SQL, and MBisanz: has suggestions on how to proceed? Nemo 12:22, 1 October 2019 (UTC)[reply]
- Approved. After having a read of this, and given the lack of opposition after 12 days and that the intended edits appear to be made correctly, I'm approving this.—CYBERPOWER (Chat) 20:11, 11 October 2019 (UTC)[reply]
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.