
Wikipedia talk:Link rot/Archive 5

From Wikipedia, the free encyclopedia

Some sources can't be archived properly because they redirect, and are therefore prone to link rot. Are there any tags that can be used to mark such citations so that they can be replaced with a more archive-friendly link before they become dead links? Hariboneagle927 (talk) 05:28, 21 January 2022 (UTC)[reply]

How to archive YouTube videos on Wikipedia articles


There are several articles on Wikipedia which use YouTube videos as a reference. Since YouTube videos get blocked and/or deleted frequently, I tried to archive some YouTube videos using Wayback Machine and archive.today, but Wayback doesn't let me view the video on the archived webpage and archive.today fails to archive a YouTube video every time. Is there any other way to archive YouTube videos to prevent link rot on Wikipedia articles? -- Preceding unsigned comment added by Tech2009Girl (talkcontribs) 10:56, 29 January 2022 (UTC)[reply]

@Tech2009Girl: If Wayback doesn't work try https://ghostarchive.org Rlink2 (talk) 14:15, 29 January 2022 (UTC)[reply]
"Wayback doesn't let me view the video" can you give an example? -- GreenC 16:34, 29 January 2022 (UTC)[reply]
@Rlink2: Thank you for the help -- User:Tech2009Girl
@GreenC: "Doesn't let me view" means that I'm unable to play the video. I'm not saying that the video doesn't even appear. -- User:Tech2009Girl
User:Tech2009Girl, can you provide an example? I can report it to IA and they will try to fix it. It's a fairly new thing and they need feedback, such as example links that are not working. -- GreenC 16:50, 30 January 2022 (UTC)[reply]

Alexa.com


Hello, there is a note on alexa.com that it will be retiring on 1 May 2022 See https://support.alexa.com/hc/en-us/articles/4410503838999. We currently have just over 800 links to the site. Keith D (talk) 12:45, 12 March 2022 (UTC)[reply]

RfC: Wikipedia:Reliable_sources/Noticeboard#RfC:_Alexa_Internet -- GreenC 15:32, 12 March 2022 (UTC)[reply]

Finding all articles linking to a dead site?


I found a bunch of references to one dead site, but after poking around, I've found out that all the content is still there, just under a slightly tweaked website name. It's even retained the exact same URL structure as before, it's literally just the precise details of the website's name that's changed -- update that, and the links spring back to life. Is there any way to collect the articles which still cite the old URL, so I can correct them en masse using AWB? Searching the regular way shows me there are about 2.6k articles still using it (although some may have valid archives, which I'd leave untouched), but no easy way to convert that into one grand list for inputting into AWB. Buttons to Push Buttons (talk | contribs) 17:15, 18 May 2022 (UTC)[reply]

Does advanced search using insource: help in some way? For example, something like this. --Kompik (talk) 21:45, 6 January 2023 (UTC)[reply]
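One way to turn an insource: search into a flat list of titles for AWB is to page through the MediaWiki Action API's list=search module. The following is only a sketch: the domain old-example.org is a hypothetical placeholder for the dead site's name, the helper names are invented here, and the snippet only builds request URLs and reads already-fetched JSON, so the actual fetching (and respecting rate limits) is left to whatever HTTP client you prefer.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def insource_query_url(old_domain, offset=0, limit=50):
    """Build a list=search request for articles whose wikitext contains old_domain."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f'insource:"{old_domain}"',  # same syntax as the search box
        "srnamespace": 0,      # articles only
        "srlimit": limit,
        "sroffset": offset,    # advance this to fetch the next page of results
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


def titles_from_response(response):
    """Extract page titles from a decoded JSON response."""
    return [hit["title"] for hit in response.get("query", {}).get("search", [])]


def next_offset(response):
    """Return the offset for the next request, or None when there are no more results."""
    return response.get("continue", {}).get("sroffset")


# old-example.org is a hypothetical stand-in for the old website name.
print(insource_query_url("old-example.org"))
```

Writing the collected titles to a text file, one per line, gives a list that can be loaded into AWB. Articles that already carry a valid archive would still need to be filtered out by hand or by a further check.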

I am not sure what would be the right place to ask about this - I have tried here. As far as I can tell, CiteSeerX changed the scheme they're using. For example, Template:CiteSeerX offers 10.1.1.34.2426 as an example. The link is dead, but looking in the Wayback Machine, we can see that it was the paper William D. Harvey , Matthew L. Ginsberg: Limited Discrepancy Search. In the new scheme, pid/efa56b710ff3c6d8b2666971d07c311eeb6c5b40 or pid/d8b76a9af36448b775997ef0a960e4b0fa585beb seem like the most likely candidate. Is there any chance to fix the old links in some other way than checking them one-by-one and replacing them by the new links manually? Is somebody aware of some announcement from CiteSeerX containing some details about the old identifiers and the new ones? --Kompik (talk) 10:27, 6 January 2023 (UTC)[reply]


At 2019 Military World Games, clicking the archive link in the following citation loads a blank archive.org page. Bad snapshot maybe?

<ref>{{Cite web |url=https://results.wuhan2019mwg.cn/index.htm#/organisation |title=Archived copy |access-date=2019-10-19 |archive-url=https://web.archive.org/web/20191028234735/https://results.wuhan2019mwg.cn/index.htm#/organisation |archive-date=2019-10-28 |url-status=dead }}</ref>

What should be done? Is there a way to repair this? Should it be deleted? Marked with something? Thanks. –Novem Linguae (talk) 21:54, 4 April 2023 (UTC)[reply]

@Novem Linguae there's a hashtag in the link, and the Wayback Machine does not support hashtags. Notrealname1234 (talk) 00:31, 29 May 2023 (UTC)[reply]
The only way (I think) to repair this is to put the link in another web archiving service (like archive.is). Notrealname1234 (talk) 00:32, 29 May 2023 (UTC)[reply]
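To illustrate the point about the "#" part: everything after the # in a URL is a fragment, which browsers keep to themselves and do not send to the server. A crawler requesting the cited link therefore only ever asks for index.htm, and the "/organisation" view is presumably built afterwards by JavaScript on the page (that last part is an assumption about this particular site). A minimal sketch using Python's standard library to separate the two parts:

```python
from urllib.parse import urldefrag


def split_fragment(url):
    """Return (url_without_fragment, fragment) for a URL."""
    base, fragment = urldefrag(url)
    return base, fragment


cited = "https://results.wuhan2019mwg.cn/index.htm#/organisation"
base, fragment = split_fragment(cited)
print(base)      # the part a server (or an archive crawler) is actually asked for
print(fragment)  # the part handled only inside the browser
```

Checking for a non-empty fragment that starts with "/" is one rough way to spot citations whose archives are likely to show only a shell page.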

Archiving hundreds of healthy (live) sources


After asking at the wrong place, I ask here: is there any need for this type of edit? Shouldn't we just archive dead or unfit sources? Unlike the former, this edit actually makes sense because it did rescue sources. What I usually do is to rescue sources manually, this prevents outdated archives, i.e. archived pages that present (very) old information compared to live sources (e.g. a page showing information from 2023 and another showing information from 2015.) SLBedit (talk) 16:13, 7 June 2023 (UTC)[reply]

Archiving at the time of usage guarantees (in most cases) that we have an archive of the source from when it was used and cited. Once it's been archived once, Archive.org will probably continue to archive it. The worst-case scenario is a source that is used but not archived, and then disappears before it can be archived. Mackensen (talk) 16:44, 7 June 2023 (UTC)[reply]
The worst case scenario I've seen is where the earliest archive of a source postdates the URL being redirected / usurped. Then there's the appearance of a source with an archive, when in fact there is neither.
I ran IABot on Monica Macovei yesterday, after spending hours manually finding archives for dead links that had been damaged in script-assisted editing, just to make sure I had found them all. It tagged zero as dead, so it's the same category edit as the one User:SLBedit linked above. On articles with hundreds of references to web sources, it's tedious to go through and check each manually, and there's no guarantee Internet Archive bot will find any dead URLs, but I always instruct it to archive sources just in case there's no archive yet. Folly Mox (talk) 16:55, 7 June 2023 (UTC)[reply]
  • WaybackMachine (archive.org) is dynamic, not a static database. Archives move and disappear, for many reasons. Because of this I wrote a bot called WP:WAYBACKMEDIC. It verifies that archive URLs still work. Example: Special:Diff/952770913/972737706. The problem with this bot is that it's very resource intensive (on the WaybackMachine), so it runs slowly and is semi-manual; thus I don't run it very often. My opinion is that archives should only be added into Wikipedia when the link is dead. In terms of adding archives into the WaybackMachine, that is already done automatically by a back-end process (not IABot). A script monitors every edit on all 300+ language wikis (including Enwiki) and when it detects a URL it makes sure this URL is added to the WaybackMachine. Nothing needs to be done with this part of the process; it's being taken care of already (mostly) by invisible bots. It's a massive load on WaybackMachine disk space and bandwidth, which is being donated free of charge to Wikipedia by the Internet Archive. -- GreenC 20:06, 7 June 2023 (UTC)[reply]

Worst-case scenario in adding archives automatically: A source is archived several times throughout the years; in 2000, 2010, and 2020. A user adds that source to an article in 2023 and a bot adds a link to the 2000 archive. Source becomes dead in 2024 and the article points the reader to an outdated source. Conclusion: After an automatic archive, someone needs to make sure the article links to the proper archive, as there may be different versions of the original source. SLBedit (talk) 21:08, 8 June 2023 (UTC)[reply]

In my experience, the archival bots usually add the most recent archive version, which is actually frequently less reliable than earlier archives, since in many cases the target site will have restructured over the years and the newer archives point to empty content.
The ideal situation would probably be if bots could read the access-date= parameter and add the most recent archived version that does not postdate the access-date. Folly Mox (talk) 21:38, 8 June 2023 (UTC)[reply]
My understanding is that bots do read the access-date and do as you suggest, although I'm not entirely certain of that. Dhtwiki (talk) 23:17, 11 June 2023 (UTC)[reply]
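The selection rule described above (the most recent snapshot that does not postdate access-date=) can be written down in a few lines. This is only a sketch of the rule as stated in this thread, not a description of how IABot or any other bot actually chooses; the function name and the snapshot timestamps are made up to match the 2000 / 2010 / 2020 example.

```python
from datetime import date, datetime


def pick_snapshot(timestamps, access_date):
    """Return the latest 14-digit Wayback timestamp dated on or before access_date.

    Returns None when every snapshot postdates the access date.
    """
    candidates = [
        ts for ts in timestamps
        if datetime.strptime(ts, "%Y%m%d%H%M%S").date() <= access_date
    ]
    # Fixed-width digit strings sort chronologically, so max() gives the newest.
    return max(candidates) if candidates else None


# Hypothetical snapshots of one source, archived in 2000, 2010 and 2020.
snapshots = ["20000115120000", "20100115120000", "20200115120000"]
print(pick_snapshot(snapshots, date(2023, 6, 8)))   # source cited in 2023
print(pick_snapshot(snapshots, date(2012, 3, 1)))   # source cited in 2012
```

A None result corresponds to the worst case mentioned earlier, where the earliest archive postdates the time the source was cited and deserves a manual check.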
I'm someone who reverts the addition of archive links en masse, especially when the byte count is high (>10k), when the original links are live, and when the editor has shown no interest in curating the page otherwise; and I receive plenty of questions and push back from even the most experienced editors. GreenC has explained why adding such links isn't actually making the archiving happen. When IABot runs on its own, it only adds links to archives *when the original is determined to have died*. That is sensible, and I don't know why editors shouldn't be encouraged to check that option (assuming that it is an available option). Also, when links die, it's a good time to check for website reorganization, and reset the original link, as well as check for citation relevance (e.g. has scientific, economic, or census data been superseded?). I keep thinking that there should be an RfC on limiting the addition of archive links, but, as I said, I've received too much push back, from editors thinking that they're doing good, to think that such an RfC would easily pass. Dhtwiki (talk) 23:29, 11 June 2023 (UTC) (edited 00:48, 22 September 2023 (UTC))[reply]

I frequently find myself manually finding archived content from a citation in order to verify a claim, but I don't always have time to update the citation accordingly to aid future readers. Is there a tool that automates adding the necessary three(?) parameters to a citation if I already have the archive link in hand? I would prefer this over using a bot, for the reasons given above. Orange Suede Sofa (talk) 00:20, 8 August 2023 (UTC)[reply]

I think that activating the WP:IABOT would do what you are looking for. According to the documentation, you can run the bot on demand, but a low-effort approach would be to mark a link as dead, as the highest priority task appears to be to look for dead-tagged links and replace them with archived links. There are also instructions on activating the bot here and here. Personally, I prefer to do the replacement manually as I try to take into account the url access date when selecting a particular archived version to use as the accessible URL, but that is certainly not a necessary thing and not for someone who wants to get 'er done quickly. --User:Ceyockey (talk to me) 00:43, 8 August 2023 (UTC)[reply]
To your point about preferring the manual approach, that's exactly what I'm trying to account for— I've usually already done the work of selecting the appropriate archive URL, now I just need the citation updated. If I understand the docs right, running the bot wholesale does everything from scratch; I'm looking for a way to automate the step of adding the appropriate links to the citation if I already have the archive link. In short, I've done the first part manually (finding the best archive URL) and want to automate the second half (updating the citation). Orange Suede Sofa (talk) 00:58, 8 August 2023 (UTC)[reply]
Yes, the bot does everything from scratch. Hmm -- you might try Wikipedia:AutoWikiBrowser. I have not used it for this particular use case, but it might be applicable. This would be a "semi-manual" approach, but you could quickly run through dozens of edits faster than via the main editing interface, I think. Take a look and see what you think. One tricksy thing - you will need to create a bot account with a bot password for yourself; the documentation notes this, but it's not obvious on a quick read, I think. --User:Ceyockey (talk to me) 01:24, 8 August 2023 (UTC)[reply]
Thanks; I didn't think of that. I've had AWB perms in the past, so maybe once I have a plan I can whip up something quick and apply. Regards, Orange Suede Sofa (talk) 01:35, 8 August 2023 (UTC)[reply]
I use AutoHotKey. You can set it up so that whenever you type the letters "3archive", it will automatically replace it with "|archive-url= |archive-date= |url-status". I also use it to generate empty cite webs, books, journals .. various things like that, saves a lot of typing. -- GreenC 01:49, 8 August 2023 (UTC)[reply]
Client-side scripting is a good approach too; I don't have Windows so I can't use AutoHotKey, but that page has given me enough pointers to go dig around. Thanks! Orange Suede Sofa (talk) 01:59, 8 August 2023 (UTC)[reply]

To wrap up the question for myself, and in case this helps anyone else, I created an AppleScript to take a Wayback URL from the clipboard, parse the URL for the date, and then automatically add the |archive-URL= and |archive-date= parameters to an existing citation, pre-filled and with no additional typing needed. More info here. Orange Suede Sofa (talk) 02:40, 15 August 2023 (UTC)[reply]
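For anyone not on macOS, the same idea (take a Wayback URL, read the date out of it, and emit the citation parameters) can be sketched in Python. This is not the AppleScript mentioned above, just an illustration under stated assumptions: the regular expression assumes the usual web.archive.org/web/<14-digit timestamp>/<original URL> layout, the function name is invented here, and the ISO date format is one choice among several that articles use.

```python
import re
from datetime import datetime

# 14-digit timestamp, optionally followed by a two-letter modifier such as "id_".
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})(?:[a-z]{2}_)?/(.+)$")


def archive_params(wayback_url, url_status="dead"):
    """Build |archive-url=, |archive-date= and |url-status= from a Wayback URL."""
    match = WAYBACK_RE.match(wayback_url)
    if match is None:
        raise ValueError(f"not a recognised Wayback Machine URL: {wayback_url}")
    snapshot_time = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return (
        f"|archive-url={wayback_url} "
        f"|archive-date={snapshot_time:%Y-%m-%d} "
        f"|url-status={url_status}"
    )


print(archive_params(
    "https://web.archive.org/web/20220526152815/https://www.jstor.org/stable/24432812",
    url_status="live",
))
```

The output string can be pasted before the closing braces of an existing citation template; the date format should be adjusted to match the article's established style.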


This discussion emerges from those on Billjones94's talk page, my talk page, and other previous discussions (1 again @ Billjones94, 2 again @ Billjones94, 3 @ Wikipedia:Bots, 4 @ Wikipedia talk:Link rot, 5 @ Village pump). Tags: Billjones94, Rhododendrites, Scyrme, Novem Linguae, ActivelyDisinterested, Izno, Kuzma, GreenC, Folly Mox, Dhtwiki, DMacks, and Cyberpower678. Please tag others.

I propose an addition to this page as follows:

After the words "in general, do not" in the second paragraph of the lede, insert "(with automated tools or otherwise) add archive links for live websites or".

The paragraph would then read In general, do not (with automated tools or otherwise) add archive links for live websites or delete cited information solely because the URL to the source does not work any longer.

My understanding, for the record, is that links cited on the English Wikipedia are automatically archived. Hitting the check mark in the IA Bot to add those archived links for live sites does not archive anything. It does not actually archive those pages nor does it update those archives. It just adds the links themselves to the article text. Moreover, archive links are automatically substituted for links that become dead.

What these archive links for live websites do is profoundly clutter the editor. This makes it very difficult for humans to parse. An example of this is the old version of Julius Caesar. This was a single citation (for a half of a sentence about Caesar's wife) therein:

<ref>Suetonius, ''Julius'' [https://penelope.uchicago.edu/Thayer/E/Roman/Texts/Suetonius/12Caesars/Julius*.html#1 1] {{Webarchive|url=https://archive.today/20120530163202/http://penelope.uchicago.edu/Thayer/E/Roman/Texts/Suetonius/12Caesars/Julius*.html#1 |date=30 May 2012 }}; Plutarch, ''Caesar'' [https://penelope.uchicago.edu/Thayer/E/Roman/Texts/Plutarch/Lives/Caesar*.html#1 1] {{Webarchive|url=http://webarchive.loc.gov/all/20180213130122/http://penelope.uchicago.edu/Thayer/e/roman/texts/plutarch/lives/caesar%2A.html#1 |date=13 February 2018 }}; Velleius Paterculus, ''Roman History'' [https://penelope.uchicago.edu/Thayer/E/Roman/Texts/Velleius_Paterculus/2B*.html#41 2.41] {{Webarchive|url=https://web.archive.org/web/20220731043323/https://penelope.uchicago.edu/Thayer/E/Roman/Texts/Velleius_Paterculus/2B%2A.html#41 |date=31 July 2022 }}</ref>

When I removed these archive links en masse, the page shortened by over 28,000 characters (probably upward of 35,000 after including all of my edits). Again, these additions are not necessary to preserve the text of the cited source. This is a live website. And if it became dead the archive URL would be automatically inserted. The costs are, however, substantial for active editors. Just finding real article text, as opposed to background mark up, in articles packed with these archive links becomes difficult.

Moreover, removing these archive links is significantly more difficult than adding them. It is almost trivial for someone to add unnecessary archive URLs. Not to pick on Billjones94 (the selection is merely because this series of discussions emerges from an edit on Roman Republic),[a] the following edits were all done within a single hour:

Billjones94 contribs log, excerpt

03:01, 16 April 2022 +5,778 Mohammedan SC (Dhaka) (Rescuing 28 sources and tagging 0 as dead.) #IABot (v2.0.8.7) Tag: IABotManagementConsole [1.2]

02:58, 16 April 2022 +2,468 Churchill Brothers FC Goa (Rescuing 14 sources and tagging 0 as dead.) #IABot (v2.0.8.7) Tag: IABotManagementConsole [1.2]

02:54, 16 April 2022 +15,029 Pune FC (Rescuing 78 sources and tagging 0 as dead.) #IABot (v2.0.8.7) Tag: IABotManagementConsole [1.2]

02:49, 16 April 2022 +8,917 Salgaocar FC (Rescuing 48 sources and tagging 0 as dead.) #IABot (v2.0.8.7) Tag: IABotManagementConsole [1.2]

02:46, 16 April 2022 +425 Sreenidi Deccan FC (Rescuing 2 sources and tagging 0 as dead.) #IABot (v2.0.8.7) Tag: IABotManagementConsole [1.2]

02:45, 16 April 2022 +1,984 Moinuddin Khan (footballer) (Rescuing 9 sources and tagging 0 as dead.) #IABot (v2.0.8.7) Tag: IABotManagementConsole [1.2]

02:44, 16 April 2022 +1,246 Punjab FC (Rescuing 6 sources and tagging 0 as dead.) #IABot (v2.0.8.7) Tag: IABotManagementConsole [1.2]

02:41, 16 April 2022 +1,937 Sudeva Delhi FC (Rescuing 10 sources and tagging 0 as dead.) #IABot (v2.0.8.7) Tag: IABotManagementConsole [1.2]

02:39, 16 April 2022 +8,071 Sporting Clube de Goa (Rescuing 42 sources and tagging 0 as dead.) #IABot (v2.0.8.7) Tag: IABotManagementConsole [1.2]

02:36, 16 April 2022 +1,841 FC Kerala (Rescuing 9 sources and tagging 0 as dead.) #IABot (v2.0.8.7) Tag: IABotManagementConsole [1.2]

02:32, 16 April 2022 +3,159 Mohammed Rahmatullah (Rescuing 16 sources and tagging 0 as dead.) #IABot (v2.0.8.7) Tag: IABotManagementConsole [1.2]

02:30, 16 April 2022 +7,003 Kerala United FC (Rescuing 36 sources and tagging 0 as dead.) #IABot (v2.0.8.7) Tag: IABotManagementConsole [1.2]

02:27, 16 April 2022 +5,427 Peerless SC (Rescuing 26 sources and tagging 0 as dead.) #IABot (v2.0.8.7) Tag: IABotManagementConsole [1.2]

02:25, 16 April 2022 +3,337 NEROCA FC (Rescuing 16 sources and tagging 0 as dead.) #IABot (v2.0.8.7) Tag: IABotManagementConsole [1.2]

02:21, 16 April 2022 +2,285 Aizawl FC (Rescuing 12 sources and tagging 0 as dead.) #IABot (v2.0.8.7) Tag: IABotManagementConsole [1.2]

02:19, 16 April 2022 +1,720 TRAU FC (Rescuing 9 sources and tagging 0 as dead.) #IABot (v2.0.8.7) Tag: IABotManagementConsole [1.2]

02:16, 16 April 2022 +1,783 FC Kochin (Rescuing 9 sources and tagging 0 as dead.) #IABot (v2.0.8.7) Tag: IABotManagementConsole [1.2]

02:14, 16 April 2022 +11,674 Dempo SC (Rescuing 63 sources and tagging 0 as dead.) #IABot (v2.0.8.7) Tag: IABotManagementConsole [1.2]

02:10, 16 April 2022 +4,811 Mahindra United FC (Rescuing 27 sources and tagging 0 as dead.) #IABot (v2.0.8.7) Tag: IABotManagementConsole [1.2]

02:07, 16 April 2022 +3,616 South United FC (Rescuing 19 sources and tagging 0 as dead.) #IABot (v2.0.8.7) Tag: IABotManagementConsole [1.2]

02:05, 16 April 2022 +3,532 Hindustan Aeronautics Limited SC (Rescuing 18 sources and tagging 0 as dead.) #IABot (v2.0.8.7) Tag: IABotManagementConsole [1.2]

02:02, 16 April 2022 +4,435 ONGC FC (Rescuing 25 sources and tagging 0 as dead.) #IABot (v2.0.8.7) Tag: IABotManagementConsole [1.2]

Not a single source was tagged as dead. The average edit added 4,567 bytes of text and in total this single hour of triggering IA Bot added 100,478 bytes to Wikipedia's servers.[b] I also firmly believe that these edits fall within the scope of WP:MEATBOT and WP:FAITACCOMPLI. Undoing them one by one after intervening edits is extremely difficult; Billjones94 has been repeatedly tagged and informed that these mass additions are controversial, with absolutely no response beyond "Thanks" on talk page edits. Nor do I believe for a second that anyone can review 11,674 bytes of additions – Dempo SC; around 3,000 bytes reviewed per minute – in the four minutes that elapsed since the previous edit.
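For anyone who wants to check the arithmetic, the byte counts from the log excerpt can be summarised with Python's standard library. The list below is simply the 22 "+N" values copied from the excerpt in the order shown; the variable names are arbitrary.

```python
from statistics import mean, median

# Bytes added per edit, copied from the contribs log excerpt (newest first).
bytes_added = [
    5778, 2468, 15029, 8917, 425, 1984, 1246, 1937, 8071, 1841, 3159,
    7003, 5427, 3337, 2285, 1720, 1783, 11674, 4811, 3616, 3532, 4435,
]

print("n      =", len(bytes_added))
print("total  =", sum(bytes_added))
print("mean   =", round(mean(bytes_added), 2))
print("median =", median(bytes_added))
print("max    =", max(bytes_added))
print("min    =", min(bytes_added))
```

The results agree with the descriptive statistics given in note b (n 22, mean 4567.18, median 3434.5, max 15029, min 425) and with the stated total of 100,478 bytes.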

Moreover, the archive links are also generated for paywalled sources hosted on Jstor (other services like Cambridge Core or Oxford Academic suffer similarly). For example, at Roman Republic:

<ref>{{Cite journal |last=Steel |first=Catherine |date=2014 |title=The Roman senate and the post-Sullan "res publica" |url=https://www.jstor.org/stable/24432812 |journal=Historia: Zeitschrift für Alte Geschichte |volume=63 |issue=3 |pages=323–339 |doi=10.25162/historia-2014-0018 |jstor=24432812 |s2cid=151289863 |issn=0018-2311 |access-date=26 May 2022 |archive-date=26 May 2022 |archive-url=https://web.archive.org/web/20220526152815/https://www.jstor.org/stable/24432812 |url-status=live }}</ref>

In those cases, the archive links do not preserve anything at all. Going to the archive URL loads a single front page of the article. On my computer the image thereof does not even load, leaving a blank page with the citation at the right. Given the stability of Jstor, there are functionally no benefits to these paywalled archive links. The costs to the editability of these articles remain, however. Inasmuch as nothing is added for readers, editing ought to take priority.

Concluding, I want to emphasise three things. First, link rot is a semi-solved problem in which these WP:MEATBOT-esque additions do not help. Second, the enormous volume and rapidity of these WP:FAITACCOMPLI additions make them both harmful to actual content contribution and difficult to remove. Third, many times these archive links add nothing between the website still being live and paywalled sources' archives still being paywalled. We should edit the guidelines to reflect these facts and require adding archive links for live URLs to be justified instead of accepted by default.[c][d] Ifly6 (talk) 16:11, 21 September 2023 (UTC)[reply]

Notes

  1. ^ The edit in question was this one. It tagged zero sources as dead and added 6,000 bytes to the article. Triggering it took probably like 15 seconds. Doing nothing and removing it later would have taken hours.
  2. ^ [OP] Descriptive statistics of those edits: n is 22, mean is 4567.18, median is 3434.5, max is 15029, min is 425.
  3. ^ To be clear, I [Ifly6] am not against archive links for live URLs in all cases, I think the following are examples of reasonable justifications: reasonable expectation of the source imminently becoming dead, actual evidence that the archive bypasses paywalls or the GDPR, or actual evidence of the source actually changing.
  4. ^ My [Ifly6] interpretation of In general, do not (with automated tools or otherwise) add archive links for live websites is that it should not be done without justification and addition of archive links for live websites would then become an affirmatively justified burden.
I don't think it follows we should have an RfC to add an archive link to every URL on Wikipedia (with some exceptions). That probably will fail. The feature in question here with selective articles has some utility, the question is should we continue to have this feature on enwiki and if so under what conditions - anyone can run it anytime, only certain users, only x times a day, etc.. what are the guidelines for this feature? Right now there are none, other than it has to be initiated manually which slows the user down some. -- GreenC 16:50, 21 September 2023 (UTC)[reply]
  • Not sure I totally agree; most readers are not editors, so having a functioning link is useful. However, archiving Google Books or archive.org links seems rather redundant, and URLs to Jstor also seem completely redundant when |jstor= exists. -- LCU ActivelyDisinterested transmissions °co-ords° 16:41, 21 September 2023 (UTC)[reply]
    A bot to extract Jstor URLs to the Jstor parameter was something I floated in the discussion @ WP:Bots (policy seems to say requirements or goals should be specified first etc; if anything it reminds me of how annoying government contracting can be). I also raised a discussion on MediaWiki re changing Citoid to use the Jstor parameter but it doesn't seem that went anywhere because it would have to be a parameter on like all Wikipedias. Fixing it after the fact for CS templates seems easier to implement. I would also be fine with guidance that Jstor URLs shouldn't appear on Jstor articles but that is a separate issue. Ifly6 (talk) 17:35, 21 September 2023 (UTC)[reply]
  • Some nuance would be better. There are multiple issues here. Working backwards, archives to an online version of printed material on a stable platform, like jstor or gbooks, are indeed essentially useless. I don't think such citations should get archive-urls, nor access-dates, and since we have a bespoke |jstor= parameter, citations to jstor in particular should not get urls either, since that encourages the addition of crufty archives and access dates.
    I don't really buy the argument that addition of archive urls to live links requires human review, except for urls whose contents change frequently, in which case the archived version may not match the cited version. An online edition of some translation of an ancient or classical text, as used in the example, is not likely to change, so any date of the archive is probably correct, although currently unnecessary. I notice that none of the example references from the Julius Caesar article mentions the date of composition of the original work, the date of publication of the translation, nor the identity of the translator or the publisher of the translation, all of which are more serious issues than an unnecessary archive.
    Prose can be located within citation syntax with the use of syntax highlighting, and there's at least one gadget that does this. Moving to list-defined references or shortened footnotes also fixes that issue, and while I'm not a fan of adding archives to live links for many domains, I don't think guidance should be changed because the laziest referencing style of just fully defining the reference at its initial point clutters the source. That's more of an argument to encourage better citation styles.
    I also think the exact proposed wording is a bit ambiguous, in one sentence talking about live links and then dead links, without a clear separation. It could be read as discussing cases of adding archives to live websites... where the URL no longer works, which is contradictory. I must stress this is not an argument against the idea, just the proposed text.
    I have, in at least one case, been forced to add archives to live links, due to the way IABot defines "live links". I was cleaning up citations on an article about some museum in Germany. The article cited something like 140 different pages on the museum's site, but they had restructured their domain such that all the links now resolved to a custom 404 page, which automated tools understand as a "live link". After updating the first sixteen or so manually over an hour or two, I despaired and asked IABot to add archives to live links, since it couldn't understand that the links were actually useless.
    Some domains do drop articles with some frequency. Local news sites, sohu.com, obituary sites, etc. When I'm citing a source like this, I'll add the archive to the live site prophylactically, since I deem it unlikely to work within a year or so. Folly Mox (talk) 17:09, 21 September 2023 (UTC)[reply]
    I see some of what I just mentioned is already addressed in footnotes. Sorry; I'm still waking up. Folly Mox (talk) 17:12, 21 September 2023 (UTC)[reply]
    To be clear, I agree that archive URLs for live websites (whether that be IA Bot's idiosyncratic "live" or actually "live" but different) can have some utility. I would, however, oppose their indiscriminate addition. Cluttering up the editor with tens of thousands of bytes is a problem; I think doing so should need justification with facts specific to the circumstances. Ifly6 (talk) 17:44, 21 September 2023 (UTC)[reply]
    Re-reading what I wrote above, perhaps instead word as, prior to the current sentence in paragraph 2 of the lede Do not add archive links for live websites (using automated tools or otherwise) without a justification specific to the circumstances? Ifly6 (talk) 18:00, 21 September 2023 (UTC)[reply]
  • Since I was pinged, I’ll just state that I’m impartial to this particular matter, but will also mention that Ifly6 hasn’t considered the potential for content drift, especially on news-related articles. An archive can preserve the integrity of the original version while a live link may continue to change the content of its page over time. --CYBERPOWER (Around) 18:37, 21 September 2023 (UTC)[reply]
    I don't think that's wholly accurate, my note (discussing good reasons for archiving live pages) which has a last element talking about the possibility of changes to the live website has been present since the original post was made. Ifly6 (talk) 18:47, 21 September 2023 (UTC)[reply]
    If content drift within a source is due to corrections, then we shouldn't stay with a possibly erroneous archive snapshot. This is in line with why it's doubtful to rely on archive snapshots when a dead-URL is the result of website reorganization, where changing the original link is the better solution. Dhtwiki (talk) 04:27, 22 September 2023 (UTC)[reply]
  • With regard to Moreover, archive links are automatically substituted for links that become dead: did you not mean "...are [not] automatically substituted...", which is my experience and is in line with the point you're making? I'm certainly in favor of changing this article's wording to discourage mass archive-link additions, especially since this guideline is often used as justification for such activity. I've also heard people justify mass additions as being required by featured article reviewers, although I've only seen recommendations to that effect. Perhaps we should clarify things with the FAC page. Dhtwiki (talk) 04:35, 22 September 2023 (UTC)[reply]
    The way I read what I wrote – archive links are automatically substituted for links that become dead – is that it reflects this on the opposite (non-talk) side: there is a Wikipedia bot ... that automates fixing link rot. It runs continuously, checking all articles on Wikipedia if a link is dead, adding archives to Wayback Machine (if not yet there), and replacing dead links in the wikitext with an archived version. Ifly6 (talk) 06:10, 22 September 2023 (UTC)[reply]
    Do you know the name of that bot? I've not come across it. -- LCU ActivelyDisinterested transmissions °co-ords° 09:00, 22 September 2023 (UTC)[reply]
    IABot, when it runs by itself, only places archive links when it detects a dead link as well as setting the url-status to "dead". Since that is a sensible approach, by the same tool that the massive adders are using, why isn't that the bot option that is required: then people can run the bot in that configuration, to detect dead-URLs and add archive snapshots only when the original has died, (almost) as often as they want. Dhtwiki (talk) 09:30, 22 September 2023 (UTC)[reply]
    I'm not sure IABot runs automatically, or at least I've come across pages with long dead links that haven't had archives added. -- LCU ActivelyDisinterested transmissions °co-ords° 10:50, 22 September 2023 (UTC)[reply]
    Do you have any examples of this? I'm unaware of such cases. Ifly6 (talk) 15:34, 22 September 2023 (UTC)[reply]
    Many links don't get archived for many reasons. Log into iabot.org and manage domains. If a domain is set to "subscription" or "permalive" it won't be archived. You can change it. Just be careful: things are usually not as simple as they seem, i.e. some links in a domain work and others do not. In those cases you can manage individual URLs the same way. -- GreenC 16:40, 23 September 2023 (UTC)[reply]
    Is this feature to ignore links (like Jstor?) overridden when The Check Box is used? Ifly6 (talk) 03:29, 24 September 2023 (UTC)[reply]
    I don't know. It's easy to check. Create a page in user space, add some citations, run the bot on the page. -- GreenC 20:51, 24 September 2023 (UTC)[reply]
    The Check Box forces addition of useless Jstor archives per test, giving a link to a largely blank page. Ifly6 (talk) 02:44, 25 September 2023 (UTC)[reply]
  • I see 0 reason to prohibit preemptive addition of archive links, and certainly oppose the suggested amendments above. I abhor having to go and find archive links long after a link has gone dead. Preemptive addition of an archive link helps ensure that the archive link actually has provided an archive, since we all know that archive.org does a bad job tracking websites that themselves do a bad job indicating they're no longer serving a resource at a specific URI (which is not archive.org's fault). If you have a specific issue with specific editors editing in a WP:MEATBOT way, you should approach them and/or discuss their cases at the appropriate dispute resolution forum. As for "JSTOR links don't need archiving" (which I agree with), the solution is not "stop adding all links to archive.org"; the solution is to request an adjustment to IABot not to archive such links, and/or to revisit the decision made some years ago to stop Citation bot from removing URIs to links that duplicate identifiers. In general, none of the three issues described here lead to a resolution which is "discourage users in the closest thing to a guideline on the topic". Izno (talk) 20:44, 25 September 2023 (UTC)[reply]
    I don't see how this response engages with the text of the change: you can hit the "include URLs on all" check box if you can produce a specific justification... like actual evidence of content having been moved. The archive bots also already read |access-date= and pick archives prior to or around that date. Ifly6 (talk) 22:07, 25 September 2023 (UTC)[reply]
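The point about |access-date= can be illustrated with a minimal sketch. It assumes a rule of "prefer the latest snapshot on or before the access date, otherwise fall back to the earliest later one"; this is only one reading of "prior to or around that date" and is not taken from IABot's code. The function name `pick_snapshot` and the sample dates are hypothetical.

```python
from datetime import date


def pick_snapshot(access_date, snapshots):
    """Pick an archive snapshot date for a citation (illustrative rule only).

    Assumption: prefer the latest snapshot taken on or before access_date;
    if none exists, fall back to the earliest snapshot taken afterwards.
    """
    if not snapshots:
        return None
    on_or_before = [s for s in snapshots if s <= access_date]
    if on_or_before:
        return max(on_or_before)
    return min(snapshots)


# Hypothetical snapshot dates for one URL
snapshots = [date(2022, 1, 1), date(2023, 8, 15), date(2023, 10, 1)]
chosen = pick_snapshot(date(2023, 9, 1), snapshots)
```

Under this rule a citation accessed on 1 September 2023 would get the 15 August 2023 snapshot, which is the capture most likely to match what the editor actually read.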

Paywalled landing pages should not be archived

Links to some previous discussions

The above discussion (Mass additions of archive links for live sites) discusses why indiscriminate archiving of live links is a problem and a broader mechanism – don't add archive links for live websites unless you have an actual and specific reason – for resolving them.

Per Folly Mox's minimalist framing of the question at Talk:Citation bot, I propose the following (with appropriate wording to be determined):

  • Archive links to paywalled landing pages should not be added.
  • Assuming that marking as permanently alive prevents the IA Bot check box from adding archive links, IA Bot should mark everything on Jstor (and other sites of similar expected permanence like PubMed) to be permanently alive. @GreenC: Does permalive prevent the check box in IA Bot from adding archive URLs?

As to the utility of archive links of paywalled landing pages, they are not useful and they do not provide full text. They are currently being added if you hit the check mark in the IA Bot management console. The resulting links are largely blank. They do not archive anything or trigger anything to archive anything while introducing extremely large amounts of markup with no value. Ifly6 (talk) 05:21, 2 October 2023 (UTC)[reply]

  • Agree / Support / Cosign. Archives of paywall landing pages (the kind I just removed from an article) have negative value. The paywall landing page does not display the content that supports the prose citing it, and there's no way to navigate to that content from the archive. You have to go to the live site to log in or purchase a subscription.
    While they're not useful for verification, they do significantly bloat the wikitext and the references displayed to readers.
    In almost all cases, an archive of a Jstor page does nothing apart from bloat the text. The subset of the 0.1% of readers who click through a reference link and then choose the archive link instead of the original will put a tiny server load on Internet Archive instead of Jstor.
    (In a small minority of cases, the Jstor original will be open access, so the archive will actually support the prose the way the original does. I assume that Jstor occasionally suffers outages, and during those brief windows of time an archive to an open access article will actually have greater value than the non-functional original. I assume it's also possible that Jstor increases access requirements for hosted articles, paywalling previously free ones, in which case an archive could bypass the newly erected paywall. None of these open access cases affect the main thrust of this proposal.)
    Jstor is basically always up, and never changes the parts of its site structure that host content, so setting it to permalive should not affect the availability of its contents at all, which are mostly paywalls as far as Internet Archive is concerned anyway. The "open access case" can be untangled later, if it is deemed important enough to bother about.
    What I'm hoping to get from this discussion is consensus that we shouldn't have archives of paywall landing pages. The permalive for User:IABot is an implementation detail, and how we can go about cleaning up this cruft is also a follow up. Any bot task requires consensus first (unless speedy approved by a BAG member). Folly Mox (talk) 06:52, 2 October 2023 (UTC)[reply]

external links: URLs that were broken due to editing errors


(crosspost from Wikipedia:Teahouse#external links: URLs that were broken due to editing errors)

Hello maintainers. I made a list of ~10000 broken URLs, User:ⵓ/Worklist brocken URLs, with Quarry:query/78127 (feel free to fork it).

The SQL query filters for nonexistent top-level domains, so all URLs in this list are broken. The domain in the list is in reversed order (el_to_domain_index).

Most of the cases are easy to fix (e.g. by removing a whitespace character). In some cases I needed a URL decoder (meyerweb.com/eric/tools/dencoder/). More difficult cases can only be solved with the help of the version history, web archives, or Google search (e.g. https://wiki.riteme.site/w/index.php?title=2018%E2%80%9319_Ukrainian_First_League&diff=prev&oldid=1185545810 ).
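As a rough illustration of the two easy fixes mentioned above (stray whitespace and percent-encoding), here is a minimal Python sketch. It is not the Quarry query: `KNOWN_TLDS` is a tiny illustrative subset rather than a real TLD list, and the function names and example.com URLs are hypothetical. The repairs are only candidates for a human to check, since stripping whitespace or decoding a whole URL can also damage a link that was broken for another reason.

```python
from urllib.parse import unquote, urlsplit

# Illustrative subset only; a real check would use the full IANA TLD list.
KNOWN_TLDS = {"com", "org", "net", "de", "uk"}


def has_known_tld(url, known_tlds=KNOWN_TLDS):
    """Return True if the URL's host ends in a TLD from known_tlds."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return "." in host and tld in known_tlds


def candidate_repairs(url):
    """Suggest repaired URLs: whitespace removed, then percent-decoded."""
    stripped = "".join(url.split())   # drop every whitespace character
    decoded = unquote(stripped)       # e.g. %2F becomes /
    # Deduplicate while keeping order, and skip anything equal to the input.
    return [c for c in dict.fromkeys([stripped, decoded]) if c != url]


broken = ["http://example. com/page", "http://www.example.com%2Fnews"]
fixes = {u: [c for c in candidate_repairs(u) if has_known_tld(c)] for u in broken}
```

Both sample URLs fail the TLD check as written and pass it after one of the candidate repairs, which matches the "easy to fix" cases; anything left with an empty candidate list would need the version history or a web archive.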

I fixed this kind of error on German Wikipedia, so Quarry:query/77794 is clean. But I am not able to do this on English Wikipedia. (talk) 17:57, 17 November 2023 (UTC)[reply]

I am looking for users who specialize in external link / link rot maintenance (talk) 07:27, 18 November 2023 (UTC)[reply]

I thought that this might be of interest to this community.

  • Chapekis, Athena; Bestvater, Samuel; Remy, Emma; Rivero, Gonzalo (2024-05-17). "When Online Content Disappears". Pew Research Center. Retrieved 2024-05-20.

Peaceray (talk) 04:24, 20 May 2024 (UTC)[reply]

This Pew report is surprisingly not very good, given their reputation for authority. The word "soft-404" appears nowhere in the document, yet this is one of the hardest problems in link rot detection, and accounts for a sizeable portion of all link rot. It looks like they simply checked for 404 links. They consider redirects, but these are often correct and not a problem. Many 404s can be made live again by replacing with a new URL (work done at WP:URLREQ). They discuss Wikipedia but don't mention archive URLs; it's unknown whether they are counting links as dead even though they have a live archive URL. They pass over many devils in the details, so I'm not sure how useful this report is other than "many links die", which has been known for 30 years. I hope folks on Wikipedia understand this is an existential problem for our project: it's easy to imagine a wasteland in a few decades where most things are unverifiable, and a massive content deletion project begins to "clean up" per WP:V. -- GreenC 16:10, 20 May 2024 (UTC)[reply]
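To make the soft-404 point concrete: a soft 404 is a page that returns HTTP 200 even though the content is gone, so checking only for status 404 misses it. The sketch below shows the general shape of a heuristic check. The phrases in `NOT_FOUND_HINTS`, the "deep link redirected to the home page" rule, and the function name `classify` are all assumptions chosen for illustration; this is not how Pew, IABot, or any other tool is known to work, and real detection needs far more care.

```python
import re

# Assumed phrases that often appear on "missing page" templates.
NOT_FOUND_HINTS = re.compile(
    r"page not found|no longer available|does not exist", re.IGNORECASE
)


def classify(status, requested_path, final_path, body):
    """Roughly classify a fetched link (illustrative heuristic only)."""
    if status in (404, 410):
        return "hard-404"
    if status != 200:
        return "unknown"
    # A deep link that ends up on the site root has probably lost its content.
    if requested_path not in ("", "/") and final_path in ("", "/"):
        return "soft-404"
    # A 200 response whose body reads like an error page.
    if NOT_FOUND_HINTS.search(body):
        return "soft-404"
    return "live"


results = [
    classify(404, "/a", "/a", ""),
    classify(200, "/news/1", "/", "Welcome to our site"),
    classify(200, "/a", "/a", "Sorry, page not found"),
    classify(200, "/a", "/b", "Full article text"),
]
```

Note that the fourth case, a redirect from /a to /b that still serves the article, is treated as live, in line with the remark that redirects are often correct.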