Wikipedia:Bots/Requests for approval/ScannerBot

The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at Wikipedia:Bots/Noticeboard. The result of the discussion was Approved.

ScannerBot

Operator: 0xDeadbeef

Time filed: 01:48, Thursday, May 5, 2022 (UTC)

Function overview: Removes tracker tags in Twitter links.

Automatic, Supervised, or Manual: Automatic

Programming language(s): Python

Source code available: gist

Links to relevant discussions (where appropriate):

Edit period(s): One time run

Estimated number of pages affected: <3000 per this query

Namespace(s): Mainspace

Exclusion compliant (Yes/No): Yes

Function details: Finds twitter.com URLs and removes the query parameters named s, t, or cxt.

Discussion

Comments before task change

Comment: if a bot account is needed, I will probably use ScannerBot. 0xDEADBEEF (T C) 01:51, 5 May 2022 (UTC)

Note: This bot appears to have edited since this BRFA was filed. Bots may not edit outside their own or their operator's userspace unless approved or approved for trial. AnomieBOT ⚡ 10:53, 5 May 2022 (UTC) — AnomieBOT (talk • contribs) has made few or no other edits outside this topic.
Note: This bot has edited its own BRFA page. Bot policy states that the bot account is only for edits on approved tasks or trials approved by BAG; the operator must log into their normal account to make any non-bot edits. AnomieBOT⚡ 11:40, 5 May 2022 (UTC)
- 0xDEADBEEF (T C) 11:43, 5 May 2022 (UTC)
I'm not entirely sure how much I want to be commenting with my BAG hat on, but based on previous tasks that were approved I am not convinced that as a bot task this is fully formed yet. Based on the supposed list of URLs where this tracking is located, the scanner isn't working right either, because there are a few false positives that I know exist out there that are not on the list. If 0xDeadbeef wants to use JWB on their main account they are welcome to and do not require BAG approval. On that note, though, I have moved this BRFA to the bot's page to make it officially a BRFA. Primefac (talk) 14:41, 7 May 2022 (UTC)
And, on a minor note, this has prompted me to run Task 17 again... Primefac (talk) 14:49, 7 May 2022 (UTC)

I didn't have a method for determining that they are actually parameters of a URL. I tested with a Python script that just matched on keywords within the source. I didn't know that there were previous tasks. I will take a look at those and perhaps amend the regex to match more parameters. 0xDEADBEEF (T C) 02:30, 8 May 2022 (UTC)
\??(?:&?(?:fbclid|yclid|tracking_referrer|referrer(?:_access_token)?|gs_l|dclid|_ga|_gl|fb_(?:source|ref)|ref_)=[^&\s\]\|]*?)+(?=<|}|]|\s|\|)|(?<=\?)(?:&?(?:fbclid|yclid|tracking_referrer|referrer(?:_access_token)?|gs_l|dclid|_ga|_gl|fb_(?:source|ref)|ref_)=[^&\s\]\|]*)+&|(?<=&)(?:&?(?:fbclid|yclid|tracking_referrer|referrer(?:_access_token)?|gs_l|dclid|_ga|_gl|fb_(?:source|ref)|ref_)=[^&\s\]\|]*)+& 0xDEADBEEF (T C) 02:40, 8 May 2022 (UTC)

"Based on the supposed list of URLs where this tracking is located, the scanner isn't working right either": For the record, I didn't know that CirrusSearch allowed regex searching, so I used pywikibot. Now I will probably use insource:/.../ to generate the list of articles to fix with JWB. 0xDEADBEEF (T C) 04:06, 8 May 2022 (UTC)

Note: The functionality and the scope of the bot were made more specific. See page history for more details. 0xDeadbeef (T C) 06:28, 14 May 2022 (UTC)

Regex? Primefac (talk) 15:13, 14 May 2022 (UTC)

@Primefac: You can look at the gist I linked. https://twitter\.com/\w+/status/\d+\?[^\s}<|]+ is used to match the URL; urllib is then used to parse it and remove the parameters. 0xDeadbeef (T C) 15:19, 14 May 2022 (UTC)
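For readers without access to the gist, the match-then-parse approach described in this comment could look roughly like the sketch below. It is not the bot's actual source: the names `strip_tracking` and `clean_wikitext` are hypothetical, and only the regex and the parameter names (s, t, cxt) come from this page.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Pattern quoted in the discussion: a tweet URL that carries a query string.
TWEET_URL = re.compile(r"https://twitter\.com/\w+/status/\d+\?[^\s}<|]+")

# Parameter names listed under "Function details".
TRACKING_PARAMS = {"s", "t", "cxt"}


def strip_tracking(url: str) -> str:
    """Return the URL with the tracking parameters removed (hypothetical helper)."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    # urlencode re-encodes whatever is left; an empty query drops the "?".
    return urlunsplit(parts._replace(query=urlencode(kept)))


def clean_wikitext(text: str) -> str:
    """Apply strip_tracking to every matching tweet URL in the text."""
    return TWEET_URL.sub(lambda match: strip_tracking(match.group(0)), text)
```

Parsing the query string instead of deleting substrings means a parameter such as `lang=en` survives untouched, whichever position it occupies.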

You'll likely want https:\/\/twitter\.com\/\w+\/status\/\d+\?[^\s}<|]+ for regex, to escape the / characters. (Same for below). Headbomb {t · c · p · b} 01:13, 17 May 2022 (UTC)

I embedded the regex as a Python raw string which does not need to escape forward slashes. 0xDeadbeef (T C) 01:17, 17 May 2022 (UTC)

But dots still need escaping? Headbomb {t · c · p · b} 01:56, 17 May 2022 (UTC)

Yes because . and \. have different meanings in regex. 0xDeadbeef (T C) 02:30, 17 May 2022 (UTC)

I know. Just surprised one needs escaping and the other doesn't. Not important, if the code works, it works. Headbomb {t · c · p · b} 10:24, 17 May 2022 (UTC)

@Headbomb, for what it's worth, I believe it's because some non-Python regex is enclosed in / . . . /, so / needs to be escaped, but in Python a regex is just given as a string ' . . . ' ― Qwerfjkl talk 14:22, 29 May 2022 (UTC)
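The escaping point in this exchange can be checked directly with Python's `re` module. The sketch below is illustrative only; the variable names and the look-alike host are made up for the demonstration.

```python
import re

# In a Python raw string, "/" is an ordinary character, so both spellings work.
plain = re.compile(r"https://twitter\.com/\w+/status/\d+")
escaped = re.compile(r"https:\/\/twitter\.com\/\w+\/status\/\d+")

# An unescaped "." matches any character, so this variant is too permissive.
unescaped_dot = re.compile(r"https://twitter.com/\w+/status/\d+")

url = "https://twitter.com/user/status/123"
lookalike = "https://twitterXcom/user/status/123"  # made-up host for the demo

same_result = bool(plain.fullmatch(url)) and bool(escaped.fullmatch(url))
strict_rejects = plain.fullmatch(lookalike) is None
loose_accepts = unescaped_dot.fullmatch(lookalike) is not None
```

Escaping "/" is harmless but unnecessary here, while leaving "." unescaped changes what the pattern accepts.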

You'll want to detect primary URLs, or skip archive URLs, since changing those will break them. There are 20+ types of archive URL, so it's probably easiest to detect whether the twitter URL is preceded by "/" (example in Brandon Clarke). -- GreenC 16:15, 14 May 2022 (UTC)

Yeah, I should probably match [^/] or [\s=>] for it to be primary. 0xDeadbeef (T C) 02:07, 15 May 2022 (UTC)

Great, thanks. Also WebCite, like https://www.webcitation.org/6d0sXMyOT?url=https://twitter.com; a couple of others use ?url= rather than "/" as the break point. -- GreenC 03:12, 15 May 2022 (UTC)

@GreenC: Hmm, then it would be hard to distinguish a template parameter from a URL parameter in a URL...

{{Foo|1=https://twitter.com}}

https://www.webcitation.org/6d0sXMyOT?url=https://twitter.com 0xDeadbeef (T C) 04:03, 15 May 2022 (UTC)

Right, I can't say what the regex would be. One method is to match every string "/https?://twitter" and convert it to "__hidestring__" (same with "?url=") - and when done, convert those hidden strings back before saving the article. The "__hidestring__" might be "__hidestring-fs-http__" or "__hidestring-fs-https__" so you know how to revert back. Or really best, save the literal string in a table and use the table identifier as the hidden string so it can be restored. That way it can match on "/https?://(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\\-]*[a-zA-Z0-9])[.])*twitter" which will capture all hostname(s) such as "/http://beta.twitter" -- GreenC 17:33, 15 May 2022 (UTC)
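The table-based variant of this hide-and-restore idea might be sketched as follows. This is an illustration under assumptions, not code from either participant: the helper names `hide` and `restore`, the placeholder format, and the simplified hostname pattern are all hypothetical.

```python
import re

# Embedded (archived) Twitter URLs: preceded by "/" or "?url=".
# The hostname part is a simplified stand-in for the pattern quoted above.
ARCHIVED = re.compile(
    r"(?:/|\?url=)https?://(?:[A-Za-z0-9-]+\.)*twitter\.com[^\s}<|\]]*"
)
PLACEHOLDER = re.compile(r"__hidestring-(\d+)__")


def hide(text: str) -> tuple[str, list[str]]:
    """Swap each embedded URL for a numbered placeholder; return text and table."""
    table: list[str] = []

    def stash(match: re.Match) -> str:
        table.append(match.group(0))
        return f"__hidestring-{len(table) - 1}__"

    return ARCHIVED.sub(stash, text), table


def restore(text: str, table: list[str]) -> str:
    """Put the original strings back using the table index in each placeholder."""
    return PLACEHOLDER.sub(lambda match: table[int(match.group(1))], text)
```

With the archive copies hidden, any later clean-up pass only ever sees primary URLs, and `restore` is run just before the page is saved.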

Okay I used a negative lookbehind and you can look at the tests here: https://regexr.com/6lmgl 0xDeadbeef (T C) 23:18, 15 May 2022 (UTC)

(?<!\?url=|/|cache:)https://twitter\.com/\w+/status/\d+/?\?[^\s}<|]+ 0xDeadbeef (T C) 04:25, 16 May 2022 (UTC)
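One caveat if this pattern is used with Python's built-in `re` module (an assumption, since the gist is not reproduced on this page): `re` requires every alternative inside a lookbehind to have the same width, so the combined `(?<!\?url=|/|cache:)` is rejected at compile time. Splitting it into three consecutive lookbehinds is an equivalent form that `re` accepts. The names below are hypothetical.

```python
import re

ORIGINAL = r"(?<!\?url=|/|cache:)https://twitter\.com/\w+/status/\d+/?\?[^\s}<|]+"

# Same conditions, one fixed-width lookbehind per excluded prefix.
SPLIT = (
    r"(?<!\?url=)(?<!/)(?<!cache:)"
    r"https://twitter\.com/\w+/status/\d+/?\?[^\s}<|]+"
)


def compiles(pattern: str) -> bool:
    """Report whether the standard re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


PRIMARY_TWEET = re.compile(SPLIT)


def is_primary(text: str) -> bool:
    """True if the text contains a tweet URL that is not embedded in another URL."""
    return PRIMARY_TWEET.search(text) is not None
```

Engines that allow variable-width lookbehind, such as the JavaScript engine behind regexr, accept the original form unchanged.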

Nice. Very rarely there are also protocol-relative URLs (WP:PRURL), e.g. {{cite web |url=//twitter.com}}. They are so uncommon, and can be so tricky, that it would probably be OK to skip or log them if they don't fit the regex. -- GreenC 05:21, 16 May 2022 (UTC)

A quick search seems to show that it is fine. I've fixed all three that appeared from that search. 0xDeadbeef (T C) 06:52, 16 May 2022 (UTC)

Note: number of pages affected has been lowered following a quick search with insource:. 0xDeadbeef (T C) 04:23, 21 May 2022 (UTC)

{{BAG assistance needed}} Requesting BAG assistance due to stale BRFA. 0xDeadbeef (T C) 05:08, 27 May 2022 (UTC)

To be clear: This BRFA has been inactive for some time. Primefac told me that they wanted input from other BAG members first. I would like to know if this is declined or approved for trial. Thanks. 0xDeadbeef 07:43, 28 May 2022 (UTC)

Looks fine to me for trial. All issues raised above appear addressed anyway. -- GreenC 19:05, 29 May 2022 (UTC)

Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Let's give it a try. — The Earwig (talk) 21:18, 30 May 2022 (UTC)

Trial complete. [1] 0xDeadbeef 04:57, 31 May 2022 (UTC)

Deadbeef, I checked one edit and noticed the Wayback link actually works with the tracker removed. Who knew. After all that above :) Wayback magic. But I can't say this holds true for every link; it's the kind of thing you would have to verify with a header check on the Wayback link with tracking removed. It would be like an added feature to the bot, only if you wanted to try. - GreenC 06:18, 31 May 2022 (UTC)

So I tried querying the Wayback Machine API to fix archive.org URLs: [2] Looking at the preview of the bot's edits, it looks fine. Perhaps it needs an extended trial? 0xDeadbeef 08:01, 31 May 2022 (UTC)

(@The Earwig) 0xDeadbeef 11:52, 4 June 2022 (UTC)

That's great, as it checks there is a copy in the API, it should be good to go. - GreenC 15:35, 4 June 2022 (UTC)
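The discussion does not say which Wayback Machine endpoint the bot queries. As one possibility, the sketch below assumes the public availability endpoint at archive.org/wayback/available, which returns JSON describing the closest snapshot of a URL. The function names are hypothetical, and the network call is kept apart from the parsing so the parsing can be checked offline.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; the bot's actual API usage is not shown on this page.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def closest_snapshot(payload: dict) -> Optional[str]:
    """Return the snapshot URL from an availability response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available") and closest.get("status") == "200":
        return closest.get("url")
    return None


def lookup(url: str, timestamp: Optional[str] = None) -> Optional[str]:
    """Ask the availability endpoint whether a copy of the cleaned URL exists."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp  # YYYYMMDDhhmmss, as in Wayback URLs
    with urlopen(f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}", timeout=30) as response:
        return closest_snapshot(json.load(response))
```

An archive link would then be rewritten only when `lookup` returns a snapshot for the tracker-free URL, and otherwise left as it is.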

{{BAG assistance needed}} 0xDeadbeef 05:28, 12 June 2022 (UTC)

Approved. @0xDeadbeef: Thanks for your patience. Edits look good. I am fine with the expanded functionality for Wayback links and don't see a need for an extra trial provided you monitor these changes. — The Earwig (talk) 02:35, 13 June 2022 (UTC)

The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at Wikipedia:Bots/Noticeboard.