Wikipedia:Bots/Requests for approval/KolbertBot 4
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.
Operator: Jon Kolbert (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 12:28, Tuesday, November 20, 2018 (UTC)
Function overview: Removing the Facebook tracking/referral string appended on external links
Automatic, Supervised, or Manual: Automatic
Programming language(s): Python / AWB
Source code available: Using standard pywikipedia
Links to relevant discussions (where appropriate): I've started a discussion at the VP, although I'm sure there is a general consensus to strip referrer data when possible.
Edit period(s): Continuous
Estimated number of pages affected: Initial run, a few thousand. Maybe only 100-200 a day after that.
Namespace(s): Mainspace, File
Exclusion compliant (Yes/No): No, don't see a particular reason to make this exclusion compliant, if one is presented I'm willing to make it so.
Function details: Simply removing the appended string from external links, nothing less, nothing more.
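As an illustration of the function described above, the following is a minimal sketch of stripping the `fbclid` parameter from a single URL while leaving the rest of the string untouched. It is not the bot's actual code, which is not shown in this request; the function name `strip_fbclid` and the regular expression are assumptions made for the example.

```python
import re

# Hypothetical pattern (not taken from the bot): an fbclid parameter together
# with the delimiter before it and, if present, the "&" that follows it.
_FBCLID = re.compile(r'([?&])fbclid=[^&#\s]*(&?)')


def strip_fbclid(url: str) -> str:
    """Return url with any fbclid=... query parameter removed."""
    def repl(match: re.Match) -> str:
        # If another parameter follows, keep the leading "?" or "&" for it;
        # otherwise drop the delimiter along with the parameter.
        return match.group(1) if match.group(2) else ''
    return _FBCLID.sub(repl, url)


if __name__ == "__main__":
    print(strip_fbclid("https://example.com/article/?fbclid=IwAR0abc"))
    print(strip_fbclid("https://example.com/article?id=7&fbclid=IwAR0abc"))
```

Working on the raw string rather than re-encoding the query keeps every other character of the link unchanged, which matches the "nothing less, nothing more" scope stated above.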
Discussion
- Approved for trial (25 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Go ahead and trial; list your results here when done. — xaosflux Talk 12:34, 20 November 2018 (UTC)
- Trial complete. Results. A look at the diffs seems to show everything is in working order and the bot is functioning as intended. Jon Kolbert (talk) 12:45, 20 November 2018 (UTC)
- Anytime we make changes to URLs, the archiveurl has to be taken into account, as changing the archiveurl will break it; thus simple search/replace won't work. The easiest way I have found to avoid this problem is to search on ([/]|[?]url[=])?https?<restof url> then check the match for ^([/]|[?]url[=]) and if it exists leave that URL alone. In this case I doubt there will be many but it should be checked. This is not an ideal solution as it can result in the |url= and |archiveurl= being different but at least it won't break things. To do it properly requires a more sophisticated bot that can work with templates at the argument level. In this case I would just log cases of archive url matches then manually fix them if not too many. -- GreenC 19:18, 20 November 2018 (UTC)
- This appended string seems to have just been added in October 2018. I feel that if we handle this proactively we should not encounter many archive URL situations, as the string will likely be removed before any archiving. Jon Kolbert (talk) 19:39, 20 November 2018 (UTC)
- @GreenC: Just tried it on a page with an archiveurl, removing the string from both didn't break anything. Jon Kolbert (talk) 19:43, 20 November 2018 (UTC)
- Well it's only working accidentally. The version of the URL with the tracking code is a different snapshot than the version without - it was just fortunate that both were archived. That can't be assumed or guaranteed. -- GreenC 19:57, 20 November 2018 (UTC)
- No, but there's a high chance that links less than a month old will still be active, nearly eliminating the possibility of a dead link. Not completely, but with a high degree of certainty. Jon Kolbert (talk) 21:05, 20 November 2018 (UTC)
- Right, editors are proactively putting the archiveurl in place for the day the main link goes dead. But a broken archiveurl is still broken. When the main URL goes dead years later, the archive won't work. -- GreenC 21:21, 20 November 2018 (UTC)
- It appears as if the "base URL" is archived before any queries, for example, https://web.archive.org/web/20181028213236/https://variety.com/2018/film/box-office/suspiria-top-screen-average-2018-1203006613/ was archived a day before https://web.archive.org/web/20181029003954/https://variety.com/2018/film/box-office/suspiria-top-screen-average-2018-1203006613/?fbclid=IwAR01YiHprhT-j2bwzeULa4Ulhnx61voixSNUATkdXZFzl1encP7feZEIupA. Are you able to provide an example where a base URL wasn't archived, but a URL with a query attached was? The modification using the regex did not produce a broken archiveurl, as alluded to. Jon Kolbert (talk) 21:43, 20 November 2018 (UTC)
- There are two archives because the URLs were found at separate places, such as on a different Wikipedia page or somewhere else on the Internet that didn't have the fbclid. To answer your question, from Foreigner (band): https://web.archive.org/web/*/https://www.broadwayworld.com/bwwmusic/article/Foreigner-Announces-Then-and-Now-Concerts-With-All-Original-And-Current-Members-20180806?fbclid=IwAR04AHlL-H1JRbecxnP6t5M52O57sYW3TwWyQO94om0e1xgwoQTPQg6WMNc with https://web.archive.org/web/*/https://www.broadwayworld.com/bwwmusic/article/Foreigner-Announces-Then-and-Now-Concerts-With-All-Original-And-Current-Members-20180806 .. the former works, the latter doesn't. Another solution: whenever you come across an archive url, make sure to trigger a 'save page now' for the updated URL so as to ensure the archiveurl is not broken. For example in this case issue a GET with https://web.archive.org/save/https://www.broadwayworld.com/bwwmusic/article/Foreigner-Announces-Then-and-Now-Concerts-With-All-Original-And-Current-Members-20180806 and that will create the archive. -- GreenC 22:18, 20 November 2018 (UTC)
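The two suggestions in this thread (leave a URL alone when it is embedded in an archive link, and request a fresh snapshot of the cleaned URL) can be sketched as follows. This is an illustrative reading of the search-then-check approach described above, not code from either bot; the function names `replace_url_outside_archives` and `save_page_now_url` are hypothetical, and the sketch only builds the 'save page now' address without issuing the GET request.

```python
import re


def replace_url_outside_archives(wikitext: str, old_url: str, new_url: str):
    """Replace old_url with new_url, except where old_url is embedded in
    another URL (preceded by "/" or "?url="), as in an archive link.

    Returns the new text and the number of embedded occurrences left alone,
    so those cases can be logged and reviewed by hand.
    """
    # The optional prefix group mirrors ([/]|[?]url[=])? from the discussion.
    pattern = re.compile(r'(/|\?url=)?' + re.escape(old_url))
    skipped = 0

    def repl(match: re.Match) -> str:
        nonlocal skipped
        if match.group(1):
            # Embedded in an archive-style URL: keep it exactly as written.
            skipped += 1
            return match.group(0)
        return new_url

    return pattern.sub(repl, wikitext), skipped


def save_page_now_url(url: str) -> str:
    """Build the address that asks the Wayback Machine to archive url,
    following the https://web.archive.org/save/<url> form quoted above."""
    return "https://web.archive.org/save/" + url


if __name__ == "__main__":
    old = "https://example.com/a/?fbclid=XYZ"
    new = "https://example.com/a/"
    text = ("{{cite web |url=" + old +
            " |archiveurl=https://web.archive.org/web/20181029003954/" + old + "}}")
    cleaned, skipped = replace_url_outside_archives(text, old, new)
    print(cleaned)
    print("archive matches left alone:", skipped)
    print(save_page_now_url(new))
```

As noted in the discussion, this leaves |url= and |archiveurl= pointing at different strings; handling the two parameters together would need template-level parsing, which this sketch does not attempt.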
- @Primefac: how are you handling this with PrimeBOT 17 today? (also @Jon Kolbert: it won't hurt to have multiple bots doing this task - but if PrimeBot wants to do this along with its other tracking removals do you still also want to do it?) — xaosflux Talk 14:59, 21 November 2018 (UTC)
- @Xaosflux: Either way works for me, I'd be willing to expand the task on KolbertBot at a later date to cover some of the YouTube referral data as well. I'm pretty sure I have that regex saved somewhere. Jon Kolbert (talk) 15:08, 21 November 2018 (UTC)
- I haven't done anything today with the bot other than add in a request to have this tracking param added to the expanded list of terms. Primefac (talk) 21:56, 21 November 2018 (UTC)
{{BAG assistance needed}} Any updates on this? The village pump discussion looks dead, and there seems to be a consensus that these changes are desired. I've done some semi-automatic edits removing the referral data so that archive links don't pick them up as well, and there doesn't appear to be an issue with the regex. From the VP discussion, I think we have determined that the removal of referral data doesn't change the page content. For example, I've added some regex designed to remove referral data from YouTube links. What I would like to determine is whether this BRFA or PrimeBot's can be considered as a determined consensus to remove referral data from external links. Jon Kolbert (talk) 18:19, 28 November 2018 (UTC)
- I've unarchived the discussion for you. SQLQuery me! 23:17, 3 December 2018 (UTC)
- @Jon Kolbert and Primefac: FYI: Special:PermaLink/872353182#Removing_Facebook_tracking_parameter_from_external_links is closed as showing support. — xaosflux Talk 20:35, 6 December 2018 (UTC)
- @SQL: I've closed the RfC. — xaosflux Talk 20:36, 6 December 2018 (UTC)
- @Xaosflux: Great, thank you. Should I consider this task approved? Jon Kolbert (talk) 00:31, 7 December 2018 (UTC)
- {{BAGAssistanceNeeded}} (since I closed the RfC, hoping someone else on BAG can review this for final approval). — xaosflux Talk 02:41, 7 December 2018 (UTC)
- Approved. SQLQuery me! 04:52, 7 December 2018 (UTC)
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.