Wikipedia:Link rot/URL change requests/Archives/2024/July
This is an archive of past discussions on Wikipedia:Link rot. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current main page.
mtv.com
All mtv.com/news links have broken according to https://variety.com/2024/digital/news/mtv-news-website-archives-pulled-offline-1236047163/. Looks like we have several thousand references. --Nintendofan885T&Cs apply 22:52, 24 June 2024 (UTC)
(edit conflict) Variety is reporting that 20 years of MTV News archives have been pulled. A few I've tested seem to support that:
- https://www.mtv.com/news/3r5xfl/dungeons-and-dragons-arena-of-war
- https://www.mtv.com/news/olxhhg/joe-manganiello-dungeons-dragons-movie
- https://www.mtv.com/news/bhktbo/dungeons-and-dragons-online-character-creator-official-released
Thanks! Sariel Xilo (talk) 23:09, 24 June 2024 (UTC)
FYI looks like mtv.com is marked as permalive on IABot so it will ignore the links on other wikis (as GreenC bot only processes enwiki) --Nintendofan885T&Cs apply 23:23, 24 June 2024 (UTC)
- WaybackMedic can edit the IABot database, changing target links to permadead, which then propagate to the other wikis, via IABot. I will also process all MTV links on enwiki as normal, and see what other soft-404 rules might be discovered, which can also be applied to the IABot database. -- GreenC 01:30, 25 June 2024 (UTC)
- 20,263 pages mtv.com/*
- Soft-404 rules: a link is treated as a soft-404 when its redirect target matches one of the regexes below and the original URL does not. Example for rule #1: [1] redirects to [2]
([.]|[/])mtv[.]com([.](tw|br|au))?(/(music|#))?/news/?
([.]|[/])mtv[.]com([.](tw|br|au))?/?$
([.]|[/])paramountshop[.]com/?$
([.]|[/])mtv[.]com([.](tw|br|au))?/[?]xrs=PPM-18-10caf1c/?$
([.]|[/])mtv[.]com([.](tw|br|au))?/category/ffftn1/style/?$
([.]|[/])paramountplus[.]com/brands/mtv/#ftag=PPM-18-10caf1c/?$
([.]|[/])mtvema[.]com(/en-us)?/?$
([.]|[/])mtv[.]com([.](tw|br|au))?/videos/?$
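A minimal Python sketch of how such a rule could be applied, assuming the rule is evaluated per regex (the redirect target matches a regex while the original URL does not match that same regex). The function name `is_soft404` and the shortened rule list are illustrative assumptions, not the bot's actual code.

```python
import re

# First three regexes copied from the list above; the rest would be added the same way.
SOFT404_RULES = [
    r"([.]|[/])mtv[.]com([.](tw|br|au))?(/(music|#))?/news/?",
    r"([.]|[/])mtv[.]com([.](tw|br|au))?/?$",
    r"([.]|[/])paramountshop[.]com/?$",
]


def is_soft404(original_url: str, redirect_url: str) -> bool:
    """Hypothetical helper: True when the redirect target matches a rule
    and the original URL does not match that same rule."""
    for rule in SOFT404_RULES:
        if re.search(rule, redirect_url) and not re.search(rule, original_url):
            return True
    return False
```

Under this reading, a news article that redirects to the bare mtv.com home page is flagged, while a link that already pointed at the home page is not.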
- For Enwiki: Checked 19,690 pages and edited 18,113 pages. Moved 319 links to a new URL. Removed 2 {{dead link}} templates. Added 787 {{dead link}}. Switched 11,106 |url-status=live to dead. Added 27,134 archive URLs (25,337 Wayback).
- More work to do, about 700 pages missed due to Cirrus vs. SQL search incongruities. -- GreenC 01:10, 28 June 2024 (UTC)
- Done. Checked 643 pages and edited 192 pages. Moved 92 links to a new URL. Added 14 {{dead link}}. Switched 4 |url-status=dead to live. Switched 51 |url-status=live to dead. Added 54 archive URLs (44 Wayback). -- GreenC 15:54, 1 July 2024 (UTC)
- For IABot database updates: est. 3-4 days -- GreenC 00:23, 28 June 2024 (UTC)
- Done. Checked about 51,000 unique links. Over 98% are hard and soft dead. Uploaded the information to IABot which will be changing the links across 300+ wikis. -- GreenC 15:57, 1 July 2024 (UTC)
- Done with mtv.com -- GreenC 15:58, 1 July 2024 (UTC)
- @GreenC: Can you also check the following sites found through the certificate (don't match the above regex) that have seemingly also killed their news sections and have similar redirect-to-homepage deadlinks looking at Special:LinkSearch?
- mtv.pl (mtv.pl/newsy)
- mtv.co.uk (mtv.co.uk/news)
Thanks! --Nintendofan885T&Cs apply 19:21, 1 July 2024 (UTC)
- User:Nintendofan885, thanks. New sections below. -- GreenC 14:17, 2 July 2024 (UTC)
kp.by
Looks like there's a soft-redirect from kp.by to kp.ru links, such as "https://www.kp.by/daily/27084/4156223/" in Victory Day (9 May) being dead, but "https://www.kp.ru/daily/27084/4156223/" working GrapesRock (talk) 18:26, 25 June 2024 (UTC)
- 117 pages -- GreenC 16:22, 1 July 2024 (UTC)
- Done - Checked 121 pages and edited 116 pages. Moved 149 links to a new URL. Removed 33 {{dead link}} templates. Added 2 {{dead link}}. Switched 10 |url-status=dead to live. Added 7 archive URLs (4 Wayback). -- GreenC 19:38, 1 July 2024 (UTC)
dailynews.gov.bw
URLs of the form "http://www.dailynews.gov.bw/news-details.php?nid=23359" that I've checked soft-redirect to "https://dailynews.gov.bw/news-detail/23359" (example from Gladys Olebile Masire) GrapesRock (talk) 20:58, 25 June 2024 (UTC)
newurl = <url>
subs("http://", "https://", newurl)
# normal site
# http://www.dailynews.gov.bw/news-details.php?nid=23359
# https://dailynews.gov.bw/news-detail/23359
if newurl ~ "www[.]dailynews[.]gov[.]bw/news-details[.]php[?]nid=":
    subs("www.dailynews.gov.bw", "dailynews.gov.bw", newurl)
    subs("/news-details.php?nid=", "/news-detail/", newurl)
# mobile site
# http://www.dailynews.gov.bw/mobile/news-details.php?nid=10829&flag=
# https://dailynews.gov.bw/news-detail/10829
if newurl ~ "www[.]dailynews[.]gov[.]bw/mobile/news-details[.]php[?]nid=":
    subs("www.dailynews.gov.bw", "dailynews.gov.bw", newurl)
    subs("/mobile/news-details.php?nid=", "/news-detail/", newurl)
    subs("&flag=", "", newurl)
- Soft-404: if redirect URL is https://dailynews.gov.bw/page-not-found
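For readers who want to try the dailynews.gov.bw rules outside the bot, here is a rough Python equivalent. It assumes `subs` replaces the first occurrence only; the function names `rewrite_dailynews` and `is_dailynews_soft404` are made up for this sketch.

```python
import re

SOFT404_TARGET = "https://dailynews.gov.bw/page-not-found"


def rewrite_dailynews(url: str) -> str:
    """Hypothetical port of the rewrite rules quoted above."""
    newurl = url.replace("http://", "https://", 1)
    # normal site
    if re.search(r"www[.]dailynews[.]gov[.]bw/news-details[.]php[?]nid=", newurl):
        newurl = newurl.replace("www.dailynews.gov.bw", "dailynews.gov.bw", 1)
        newurl = newurl.replace("/news-details.php?nid=", "/news-detail/", 1)
    # mobile site
    elif re.search(r"www[.]dailynews[.]gov[.]bw/mobile/news-details[.]php[?]nid=", newurl):
        newurl = newurl.replace("www.dailynews.gov.bw", "dailynews.gov.bw", 1)
        newurl = newurl.replace("/mobile/news-details.php?nid=", "/news-detail/", 1)
        newurl = newurl.replace("&flag=", "", 1)
    return newurl


def is_dailynews_soft404(redirect_url: str) -> bool:
    """True when a request ended up on the site's page-not-found URL."""
    return redirect_url.rstrip("/") == SOFT404_TARGET
```

The rewritten URL would still need a live check, since a result that lands on the page-not-found URL counts as dead.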
- Done - Checked 91 pages and edited 71 pages. Moved 111 links to a new URL. Removed 20 {{dead link}} templates. Added 6 {{dead link}}. Switched 22 |url-status=dead to live. Added 9 archive URLs (7 Wayback). -- GreenC 21:58, 1 July 2024 (UTC)
blackcountryhistory.org
From the page 1980, "http://blackcountryhistory.org/collections/getrecord/GB149_P_915/" soft redirects to "https://www.blackcountryhistory.org/collections/getrecord/GB149_P_915/" (adding a "www." to the start fixes the dead link, and might as well upgrade security). GrapesRock (talk) 17:33, 26 June 2024 (UTC)
- 123 pages -- GreenC 22:52, 1 July 2024 (UTC)
- User:GrapesRock - the website has CloudFlare protection at maximum level ("Click box if you are human"). My bot is unable to verify if the new URL is working. Given the low number of links, and simplicity of the move (adding "www" to the domain), I will go ahead and do a "blind move" ie. without verifying. -- GreenC 23:21, 1 July 2024 (UTC)
- Done - Checked 130 pages and edited 106 pages. Moved 133 links to a new URL. Removed 1 {{dead link}} templates. Switched 28 |url-status=dead to live. Added 5 archive URLs (2 Wayback). -- GreenC 23:38, 1 July 2024 (UTC)
cmt.com
Country Music Television, sister company of MTV. Paramount cost cutting: https://www.savingcountrymusic.com/cmt-mtvs-eradication-of-editorial-content-is-a-catastrophe/
6,278 pages GreenC 02:17, 27 June 2024 (UTC)
- Enwiki - Checked 6,244 pages and edited 3,242 pages. Moved 47 links to a new URL. Removed 2 {{dead link}} templates. Added 310 {{dead link}}. Switched 1 |url-status=dead to live. Switched 616 |url-status=live to dead. Added 5,635 archive URLs (5,327 Wayback). Changed 220 citation metadata fields. -- GreenC 14:05, 3 July 2024 (UTC)
- IABot DB - Checked and updated 6,831 URLs.
Done -- GreenC 22:44, 3 July 2024 (UTC)
cc.com
Comedy Central. More Paramount cost cutting: https://www.hollywoodreporter.com/business/business-news/comedy-central-website-daily-show-clips-wiped-out-1235933345/amp/ -- GreenC 15:03, 27 June 2024 (UTC)
909 pages -- GreenC 15:11, 27 June 2024 (UTC)
Soft-404 when a link redirects to one of these:
([.]|[/])cc[.]com/[?]xrs=PPM-18-10caf1[chd]/?$
([.]|[/])paramountplus[.]com/shows/tosh-0/#ftag=PPM-18-10caf1d/?$
([.]|[/])paramountplus[.]com/shows/the-daily-show/#ftag=PPM-18-10caf1d/?$
([.]|[/])paramountplus[.]com/brands/comedy-central/#ftag=PPM-18-10caf1d/?$
([.]|[/])southpark[.]cc[.]com/seasons/south-park/?$
([.]|[/])cc[.]com/fan-hub/the-daily-show[?]xrs=PPM-18-10caf1d/?$
([.]|[/])southpark[.]cc[.]com/?$
([.]|[/])southpark[.]cc[.]com/wiki/Main/?$
Results:
- Enwiki - Checked 918 pages and edited 782 pages. Moved 135 links to a new URL. Removed 2 {{dead link}} templates. Added 121 {{dead link}}. Switched 18 |url-status=dead to live. Switched 136 |url-status=live to dead. Added 656 archive URLs (638 Wayback). Changed 131 citation metadata fields.
- IABot DB - Checked 2,338 links, of which 1,675 were modified. Changes will propagate to 300+ wikis via IABot.
Done -- GreenC 19:03, 4 July 2024 (UTC)
tvland.com
TV Land. More Paramount cost cutting: https://www.hollywoodreporter.com/business/business-news/comedy-central-website-daily-show-clips-wiped-out-1235933345/amp/ -- GreenC 15:08, 27 June 2024 (UTC)
100 pages -- GreenC 15:12, 27 June 2024 (UTC)
Soft-404 when a link redirects to:
paramountplus[.]com/browse/#ftag=PPM-18-10caf1i/?$
Results:
- Enwiki: Checked 106 pages and edited 42 pages. Moved 17 links to a new URL. Added 1 {{dead link}}. Switched 2 |url-status=live to dead. Added 22 archive URLs (20 Wayback). Changed 8 citation metadata fields.
- IABot DB: Checked 184 links and modified 172
Done -- GreenC 00:41, 5 July 2024 (UTC)
zdnet.com
"it went through major changes that broke many of its old links" - https://www.msn.com/en-us/news/technology/the-internet-s-memory-is-under-threat/ar-BB1oRToN
5,100 pages -- GreenC 19:01, 27 June 2024 (UTC)
Soft-404s are any redirects that match this:
zdnet[.]com/topic/(virtualization|microsoft|enterprise-software|apple|networking|education|storage|computing|hardware|social-media|government|smartphones|innovation|open-source|security|smb|collaboration|google|big-data)/?$
Bot results:
- Enwiki: Checked 5,148 pages and edited 2,442 pages. Moved 2,325 links to a new URL. Removed 2 {{dead link}} templates. Added 52 {{dead link}}. Switched 32 |url-status=dead to live. Switched 116 |url-status=live to dead. Added 639 archive URLs (554 Wayback). Changed 403 citation metadata fields.
- IABot DB: Checked 9,883 links and modified 2,629. Changes will propagate to 300+ wikis.
Done -- GreenC 02:24, 6 July 2024 (UTC)
mtv.pl
Paramount site per #mtv.com: mtv.pl (mtv.pl/newsy) 35 pages -- GreenC 14:14, 2 July 2024 (UTC)
Done - Checked 35 pages and edited 19 pages. Added 3 {{dead link}}. Added 39 archive URLs (39 Wayback). Changed 2 citation metadata fields. -- GreenC 02:46, 6 July 2024 (UTC)
mtv.co.uk
Paramount site per #mtv.com: mtv.co.uk (mtv.co.uk/news) 1,700 pages -- GreenC 14:15, 2 July 2024 (UTC)
Bot results:
- Enwiki - Checked 1,741 pages and edited 1,503 pages. Moved 942 links to a new URL. Removed 1 {{dead link}} templates. Added 22 {{dead link}}. Switched 61 |url-status=dead to live. Switched 228 |url-status=live to dead. Added 755 archive URLs (688 Wayback). Changed 280 citation metadata fields.
- IABot DB - Checked 2,616 unique links and fixed 1,550 which will update on 300+ wikis
Done -- GreenC 21:59, 6 July 2024 (UTC)
catholicnewsagency.com
This page "http://www.catholicnewsagency.com/new.php?n=15190" soft-redirects to "http://www.catholicnewsagency.com/news/15190" which redirects to "https://www.catholicnewsagency.com/news/15190/cardinal-rouco-opens-cause-of-canonization-for-spanish-couple" (example from Opus Dei) GrapesRock (talk) 18:10, 2 July 2024 (UTC)
GrapesRock: In Charles of Sezze I found: http://www.catholicnewsagency.com/saint.php?n=416 .. do you think it soft-redirects? Also http://www.catholicnewsagency.com/document.php?n=147 in Maria Gabriella Sagheddu, and https://www.catholicnewsagency.com/martyrology_entry.php?n=596 in List of Catholic saints, and http://www.catholicnewsagency.com/resource.php?n=409 in Natural marriage, and http://www.catholicnewsagency.com/column.php?n=1360 in Mark Templeton (trombonist). There are probably more. -- GreenC 22:22, 6 July 2024 (UTC)
- The saint one seems to exist here https://www.catholicnewsagency.com/saint/st-charles-of-sezze-416 , so there's still that 416. I don't see an algorithmic way to figure out what belongs between saint/ and 416
- Can't find a newer location for http://www.catholicnewsagency.com/document.php?n=147
- Ditto on the martyrology link
- Ditto on the resource link
- The column link soft-redirects to https://www.catholicnewsagency.com/column/51359
- Seems like adding 49999 to the number works, at least in the example above and http://www.catholicnewsagency.com/column.php?n=2930 (from The Fault in our Stars) soft-redirecting to http://www.catholicnewsagency.com/column/52929
- GrapesRock (talk) 23:05, 6 July 2024 (UTC)
- Thanks. -- GreenC 01:28, 7 July 2024 (UTC)
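The two patterns worked out in this thread (new.php?n=N moving to /news/N, and column.php?n=N moving to /column/N+49999) can be sketched in Python as below. This is only an illustration: the 49999 offset rests on the two column examples given above, the https scheme is an assumption, and `guess_cna_url` is a hypothetical name. The saint, document, martyrology, and resource URLs have no known pattern, so the sketch returns None for them.

```python
import re
from typing import Optional

# Offset observed in only two examples above (1360 -> 51359, 2930 -> 52929).
COLUMN_OFFSET = 49999


def guess_cna_url(url: str) -> Optional[str]:
    """Hypothetical helper: guess the new location of an old
    catholicnewsagency.com new.php or column.php link."""
    m = re.search(r"catholicnewsagency[.]com/(new|column)[.]php[?]n=(\d+)", url)
    if not m:
        return None  # no algorithmic mapping known for other page types
    kind, n = m.group(1), int(m.group(2))
    if kind == "new":
        return f"https://www.catholicnewsagency.com/news/{n}"
    return f"https://www.catholicnewsagency.com/column/{n + COLUMN_OFFSET}"
```

Any guessed URL should still be checked for a live response before a citation is moved.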
Bot results:
- Enwiki - Checked 2,004 pages and edited 1,352 pages. Moved 1,754 links to a new URL. Removed 9 {{dead link}} templates. Added 5 {{dead link}}. Switched 70 |url-status=dead to live. Switched 5 |url-status=live to dead. Added 101 archive URLs (90 Wayback). Changed 75 citation metadata fields. -- GreenC 01:28, 7 July 2024 (UTC)
- IABot DB - Checked 3,445 unique links and changed 432 which will propagate across 300+ wikis.
Done -- GreenC 14:13, 7 July 2024 (UTC)
pac-12.com
Hello. Pacific 12 Conference links are not working. Some can be moved to a new URL while others can't:
- An /article/ link that has a new URL: this is now here for John Ross (American football). /article/ is swapped to /news/, and aspx is added to the end.
- If the day number starts with a 0, the 0 is removed as well, like this is now here for N'Keal Harry.
- However, this does not always work. Changing this to that gives a 404 for 2018–19 Pac-12 Conference men's basketball season. The article is also missing from their archives.
Other broken links that are not under the /article/ format include /content/ and /event/.
Thanks! MrLinkinPark333 (talk) 19:17, 5 July 2024 (UTC)
- Enwiki: Checked 879 pages and edited 825 pages. Moved 736 links to a new URL (per soft-redirect rules given above). Added 72 {{dead link}}. Switched 9 |url-status=dead to live. Switched 45 |url-status=live to dead. Added 1,261 archive URLs (1,235 Wayback). Changed 135 citation metadata fields.
- IABot DB: Checked 1,193 unique links and updated 890. Will propagate across 300+ wikis via IABot.
-- GreenC 04:35, 10 July 2024 (UTC)
- Looking through the remaining links, I see the third link above now redirects here. The date was wrong in the old URL. Therefore, I think this could be revisited later to see if any more links have been moved. I don't think it's worth going through them again now as I've only found that one so far today. MrLinkinPark333 (talk) 22:56, 10 July 2024 (UTC)
- Hmm, that exposes a logic flaw in my bot. If there is a soft-redirect defined, it checks the status and when it fails, it assumes the URL is dead. In this case, the soft-redirect failed but a hard redirect existed. It adds overhead to check again so I never do; I am presuming the original link must be dead. As always, can't presume. The question is whether the overhead is worth it - yes for this site, maybe not for others. It's worth retrying a set of articles to see what happens -- GreenC 01:04, 11 July 2024 (UTC)
- The link wasn't working on Friday but it is today. This makes sense as when I asked Pac-12 about the broken links on Friday, they said not all of the links were redirecting while they're reworking the site. MrLinkinPark333 (talk) 01:40, 11 July 2024 (UTC)
- ω Awaiting site to be reworked -- GreenC 23:34, 11 July 2024 (UTC)
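The logic flaw described in this thread, and the proposed extra check, can be illustrated with a small Python sketch. This is not the bot's code: `resolve_link` is a hypothetical name, and the two callables `is_live` and `hard_redirect` stand in for whatever network checks a real tool would perform.

```python
from typing import Callable, Optional, Tuple


def resolve_link(
    original: str,
    soft_candidate: Optional[str],
    is_live: Callable[[str], bool],
    hard_redirect: Callable[[str], Optional[str]],
) -> Tuple[str, Optional[str]]:
    """Try the rule-derived soft-redirect first; if it fails, fall back to
    any hard redirect on the original URL before declaring the link dead."""
    if soft_candidate and is_live(soft_candidate):
        return ("moved", soft_candidate)
    # Extra step (the added overhead discussed above): re-check the original URL.
    target = hard_redirect(original)
    if target and is_live(target):
        return ("moved", target)
    return ("dead", None)
```

Passing the checks in as functions keeps the sketch free of network calls and makes the cost of the second lookup explicit.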
tampabay.com
Both the /blogs/soundcheck and /blogs/the-buzz-florida-politics path segments seem to be removable from the link to reach the soft-redirect target.
Examples:
- http://www.tampabay.com/blogs/soundcheck/xxxtentacion-announces-free-show-at-the-orpheum-in-tampa-this-saturday/2335772 soft-redirects to https://www.tampabay.com/xxxtentacion-announces-free-show-at-the-orpheum-in-tampa-this-saturday/2335772/ (from XXXTentacion)
- http://www.tampabay.com/blogs/the-buzz-florida-politics/rubio-comes-out-in-support-of-medical-marijuana-but-not-ballot/2190709 soft-redirects to https://www.tampabay.com/rubio-comes-out-in-support-of-medical-marijuana-but-not-ballot/2190709/ (from Marco Rubio)
GrapesRock (talk) 00:58, 7 July 2024 (UTC)
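A short Python sketch of the rewrite suggested by the two examples above, assuming the only changes are dropping the blog path segment, upgrading to https, and adding a trailing slash. The name `strip_blog_prefix` is hypothetical.

```python
import re
from typing import Optional

# The two blog path segments reported above.
BLOG_PREFIXES = ("/blogs/soundcheck", "/blogs/the-buzz-florida-politics")


def strip_blog_prefix(url: str) -> Optional[str]:
    """Hypothetical helper: drop a known /blogs/... segment from a
    tampabay.com URL, or return None when no known segment is present."""
    m = re.match(r"https?://www[.]tampabay[.]com(/.*)$", url)
    if not m:
        return None
    path = m.group(1)
    for prefix in BLOG_PREFIXES:
        if path.startswith(prefix + "/"):
            rest = path[len(prefix):]
            if not rest.endswith("/"):
                rest += "/"
            return "https://www.tampabay.com" + rest
    return None
```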
6,800 pages -- GreenC 16:58, 10 July 2024 (UTC)
Bot results:
- Enwiki: Checked 6,846 pages and edited 2,329 pages. Moved 2,533 links to a new URL. Removed 3 {{dead link}} templates. Added 83 {{dead link}}. Switched 214 |url-status=dead to live. Switched 84 |url-status=live to dead. Added 944 archive URLs (751 Wayback). Changed 164 citation metadata fields.
- IABot DB: Checked 9,980 unique links and updated 3,366 which will propagate across 300+ wikis via IABot
Started a thread about this at User talk:GreenC bot#Tampabay.com. ▶ I am Grorp ◀ 00:43, 12 July 2024 (UTC)
- Done -- GreenC 04:35, 12 July 2024 (UTC)
bet.com
Black Entertainment Television (Paramount) - 2,100 pages -- GreenC 02:01, 7 July 2024 (UTC)
Soft-404 redirect rules:
bet-awards(/(nominees|performers|photos))?/?$
soul-train-awards(/nominees)?/?$
hip-hop-awards(/(nominees|photos|videos))?/?$
shows(/soul-train-awards)?/?$
vertical(/(jo1ilh/celebrity|o2fii9/news))?/?$
topic(/betexperience)?/?$
Bot results:
- Enwiki: Checked 2,143 pages and edited 1,522 pages. Moved 1,190 links to a new URL. Added 57 {{dead link}}. Switched 36 |url-status=dead to live. Switched 71 |url-status=live to dead. Added 620 archive URLs (584 Wayback). Changed 263 citation metadata fields.
- IABot DB: Checked 3,087 unique links and changed 1,164 which will propagate across 300+ wikis via IABot
Done -- GreenC 04:22, 13 July 2024 (UTC)
abclocal.go.com
Soft-redirects that I found:
- kfsn -> abc30.com
- kgo -> abc7news.com
- /story?section=news/local/los_angeles -> abc7news.com
- kabc -> abc7.com
- kgo -> abc7.com
- wabc -> abc7ny.com
- wls -> abc7chicago.com
- wtvd -> abc11.com
- ktrk -> abc13.com
- http://abclocal.go.com/ktrk/story?section=news/local&id=8700136 soft-redirects to https://abc13.com/archive/8700136/ (from BP)
These ABC affiliates don't seem to work: wpvi, wjrt, wtvg
I'm sure that there's other ones that didn't appear in the first two pages of searching abclocal.go.com (and I can test them if you link them). I found the correct domain by searching the thing directly after .com/ (such as ktrk) on the internet and selecting the corresponding ABC affiliate (and found the Elton John one by searching the title given in the WP article).
(and a question: is this page just for moving *dead* links? I think all pages with espn.go.com in the domain now are either dead or redirect normally) GrapesRock (talk) 16:27, 7 July 2024 (UTC)
- Looks like some wpvi links actually do redirect, such as http://abclocal.go.com/wpvi/story?section=news/politics&id=6038619 to https://6abc.com/archive/6038619/ (from Dana Redd).
- Some, however, do not, such as http://abclocal.go.com/wpvi/story?section=entertainment&id=4498224 (from Elton John). GrapesRock (talk) 18:33, 7 July 2024 (UTC)
- To answer your question, the bot can/will also move redirects that it encounters. Assuming they pass any soft-404 rules. -- GreenC 19:24, 7 July 2024 (UTC)
- GrapesRock, good discovery. The bot checked all abclocal.go.com links on enwiki; it didn't find any additional affiliates. It attempted the soft-redirects per rules above. If not live, it added an archive URL. One thing I did not check: when the new soft-redirect URL returns 404, it may have worked previously and thus have an archive URL, but the ones I manually checked either don't exist or end up as soft-404s (redirect to home page), so I didn't look for those. I did check wjrt (abc12.com) and wtvg (13abc.com) for any working soft-redirects. -- GreenC 21:52, 13 July 2024 (UTC)
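The station-to-domain mapping discussed in this section can be sketched as a lookup table in Python. The table includes only the unambiguous pairs listed above plus wpvi (6abc.com) from the follow-up; kgo is left out because it appears with two different targets. The /archive/ID/ path is confirmed above only for ktrk and wpvi, so applying it to the other stations is an assumption, and `guess_abclocal_url` is a hypothetical name.

```python
import re
from typing import Optional

# Station call sign -> new domain, taken from the discussion above.
STATION_DOMAINS = {
    "kfsn": "abc30.com",
    "kabc": "abc7.com",
    "wabc": "abc7ny.com",
    "wls": "abc7chicago.com",
    "wtvd": "abc11.com",
    "ktrk": "abc13.com",
    "wpvi": "6abc.com",
}


def guess_abclocal_url(url: str) -> Optional[str]:
    """Hypothetical helper: map an abclocal.go.com story URL to the
    affiliate's /archive/<id>/ URL, or None if the station is unknown."""
    m = re.match(r"https?://abclocal[.]go[.]com/([a-z]+)/story[?].*\bid=(\d+)", url)
    if not m:
        return None
    domain = STATION_DOMAINS.get(m.group(1))
    if domain is None:
        return None
    return f"https://{domain}/archive/{m.group(2)}/"
```

As noted above, some wpvi links do not redirect, so each guessed URL would need a live check and an archive fallback.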
Bot results:
- Enwiki: Checked 1,857 pages and edited 1,422 pages. Moved 1,389 links to a new URL. Removed 13 {{dead link}} templates. Added 74 {{dead link}}. Switched 650 |url-status=dead to live. Switched 18 |url-status=live to dead. Added 301 archive URLs (184 Wayback). Changed 30 citation metadata fields.
- IABot DB: Checked 3,938 unique links and updated 3,889 which will propagate to 300+ wikis via IABot
- Done -- GreenC 02:32, 14 July 2024 (UTC)
ew.com
Hello, Old URLs for Entertainment Weekly that mainly consist of numbers don't work. These links can be sorted into multiple categories:
- 1) Links that already have an archived copy in the article: this link at Don't Tell Me (Avril Lavigne song) is here.
- 2) Links that can be moved over to a new URL: this should go to that for Buckshot_LeFonque_(album). This has /ew/ removed, while adding date and title in the URL.
- 3) Links at a new URL but not in the date/title format: this is now here for Heavy Competition.
- 4) Links that can't be moved over to a new URL: this link at Fast Times at Barrington High needs an archived copy.
Since the new URLs don't always match the date/title format, I would like the broken links to be focused on first. Then, if any of the archived links have a working URL in the date/title format, they could be converted over. Any links that don't have a matching date/title format can keep the archived URL because I can't predict the new URL.
- http://ew.com/ew/ 18
- https://ew.com/ew/ 188
- http://www.ew.com/ew/ 6000
- https://www.ew.com/ew/ 3154
These numbers include ones that are already fixed, such as the above link at Don't Tell Me. Thanks! MrLinkinPark333 (talk) 18:54, 4 July 2024 (UTC)
For #2 this is one of those dead redirects that eventually leads to the answer. It's a multi step process:
- Convert http://www.ew.com/ew/article/0,,302934,00.html to https://redir.ew.com/ew/article/0,,302934,00.html
- Run this which finds a redirect saved in the Wayback Machine:
wget -q -O- 'https://web.archive.org/cdx/search/cdx?url=https://redir.ew.com/ew/article/0,,302934,00.html&MatchType=prefix' | awk -v u="https://redir.ew.com/ew/article/0,,302934,00.html" '/text\/html 30[12]/{a[++i]=$2}END{print "https://web.archive.org/web/" a[i] "/" u}'
- With the answer from #2 run this:
curl -ILs 'https://web.archive.org/web/20160606115016/https://redir.ew.com/ew/article/0,,302934,00.html' | /usr/local/bin/awk '/^[ ]*[Ll]ocation:/{sub("^[ ]*[Ll]ocation:[ ]*https?://web[.]archive[.]org/web/[0-9]{14}id_/", "", $0); a[++i]=$0}END{print a[i]}'
- Which produces: https://web.archive.org/web/20151030104409/http://www.ew.com/article/1994/07/15/buckshot-lefonque .. from which extract the answer: http://www.ew.com/article/1994/07/15/buckshot-lefonque
For #3 same as #2, although in this example it leads to a soft-404.
For #4 same as #2. There are over 100,000 URLs in the wayback machine to redir.ew.com so there should be a good chance of finding, though not for this example. -- GreenC 22:27, 7 July 2024 (UTC)
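The text-processing parts of the multi-step process above (picking the latest 301/302 capture from CDX output, then pulling the original URL out of a Wayback URL) can be sketched in Python without any network access. The sketch assumes the default CDX field order (urlkey, timestamp, original, mimetype, statuscode, digest, length); the function names are hypothetical and this is not the bot's implementation.

```python
import re
from typing import Optional


def latest_redirect_snapshot(cdx_text: str, url: str) -> Optional[str]:
    """Return a Wayback URL for the last text/html 301/302 capture listed
    in CDX output, mirroring what the awk one-liner above prints."""
    timestamp = None
    for line in cdx_text.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[3] == "text/html" and fields[4] in ("301", "302"):
            timestamp = fields[1]  # keep overwriting so the last match wins
    if timestamp is None:
        return None
    return f"https://web.archive.org/web/{timestamp}/{url}"


def original_from_wayback(wayback_url: str) -> Optional[str]:
    """Strip the web.archive.org prefix and 14-digit timestamp (with an
    optional modifier such as id_) to recover the archived URL."""
    m = re.match(r"https?://web[.]archive[.]org/web/\d{14}(?:[a-z]{2}_)?/(.+)$", wayback_url)
    return m.group(1) if m else None
```

Fetching the CDX listing and following the archived redirect would still be done with tools such as the wget and curl commands shown above.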
- If that's easier than extracting the title/date, that could work. Didn't know this would be a complex change. Maybe more URLs could be fixed than I thought, even the ones that already have archived copies! MrLinkinPark333 (talk) 22:39, 7 July 2024 (UTC)
- Well, like in https://ew.com/ew/article/0,,302934,00.html there is no title/date in the URL string, and the page itself is dead, so the only way is search the Wayback Machine for old redirects. Unless you have another idea. It does appear to be working pretty well so far getting a lot, it's only slow because of the multiple I/O steps, and large number of links to process. -- GreenC 00:55, 8 July 2024 (UTC)
- I mean title/date in the citation would be added to the URL string like at buckshot. Then again, there's not always dates nor does the title always match the URL like for Heavy Competition. Searching https://web.archive.org/web/*/http://www.ew.com/article/2009/04/17/* for Heavy Competition doesn't give an archived updated URL. MrLinkinPark333 (talk) 20:37, 8 July 2024 (UTC)
- Ah yeah that method will be very hit or miss because it depends on the title in Wiki matching exactly the title in the URL. It worked for a two-word title, but I suspect for longer titles the URL drops words and punctuation. Maybe with a lot of experimenting. It also depends on the date being available. For now, I am doing the above old-redirects-in-wayback method which is getting a lot of positive results. -- GreenC 15:13, 9 July 2024 (UTC)
- Yesterday there was a multirack hardware outage at archive.org that may take some time to fully recover, some of the services I need for this work are running slow or intermittent. -- GreenC 14:37, 8 July 2024 (UTC)
- MrLinkinPark333, I need to scale back the number of parallel processes to 2 because it's slow and tying up my rate-limited slots at the archive. This works but it will be running a long time. This way I can move on to other work. It will remain in "working" mode for a while, could be weeks not sure. Diff uploads will be in intermittent batches. -- GreenC 15:13, 9 July 2024 (UTC)
- No worries. It is a big request after all. MrLinkinPark333 (talk) 20:49, 9 July 2024 (UTC)
Bot results:
- NOTE: List of URL discoveries for future reference, or in case anyone needs it.
Enwiki: Done in 5 chunks, total was 8,673 pages
- Checked 1,000 pages and edited 804 pages. Moved 454 links to a new URL. Removed 1 {{dead link}} templates. Added 4 {{dead link}}. Switched 164 |url-status=dead to live. Switched 97 |url-status=live to dead. Added 359 archive URLs (290 Wayback). Changed 106 citation metadata fields.
- Checked 1,000 pages and edited 910 pages. Moved 604 links to a new URL. Removed 1 {{dead link}} templates. Added 5 {{dead link}}. Switched 286 |url-status=dead to live. Switched 78 |url-status=live to dead. Added 333 archive URLs (274 Wayback). Changed 113 citation metadata fields.
- Checked 2,220 pages and edited 1,761 pages. Moved 1,172 links to a new URL. Removed 2 {{dead link}} templates. Switched 457 |url-status=dead to live. Switched 170 |url-status=live to dead. Added 724 archive URLs (581 Wayback). Changed 180 citation metadata fields.
- Checked 2,220 pages and edited 1,731 pages. Moved 1,126 links to a new URL. Removed 1 {{dead link}} templates. Added 3 {{dead link}}. Switched 450 |url-status=dead to live. Switched 145 |url-status=live to dead. Added 736 archive URLs (657 Wayback). Changed 184 citation metadata fields.
- Checked 2,233 pages and edited 1,734 pages. Moved 1,187 links to a new URL. Removed 4 {{dead link}} templates. Added 5 {{dead link}}. Switched 490 |url-status=dead to live. Switched 167 |url-status=live to dead. Added 724 archive URLs (619 Wayback). Changed 198 citation metadata fields.
IABot DB: Checked 14,591 unique links and changed 13,886 which will propagate across 300+ wikis via IABot
Done -- GreenC 05:27, 15 July 2024 (UTC)
ghostarchive.org
Web archive provider GhostArchive is dead as of July 19. insource:ghostarchive insource:/ghostarchive[.]org/ = 66,000 pages to be converted/deleted. -- GreenC 21:20, 19 July 2024 (UTC)
- I just tried two links [3] from England national football team and [4] from YouTube and they both seemed to work? GrapesRock (talk) 03:14, 20 July 2024 (UTC)
- Same, GhostArchive works fine for me... don't know why GreenC is/was having trouble. Nex 🌐 📰 leave a message 04:05, 20 July 2024 (UTC)
- Interesting! https://ghostarchive.org/ doesn't work .. and this admittedly hearsay post that said it is defunct, with a specific reason: Special:Diff/1230523321/1235539197. If anyone finds more information, let me know. I won't do anything until the situation is understood. -- GreenC 05:01, 20 July 2024 (UTC)
- Looks like I'm still able to archive. For instance, I just archived https://example.com here.
- Edit: looks like 10 hours ago people were having similar issues on lostmediawiki forum, though they don't give an explanation. GrapesRock (talk) 06:38, 20 July 2024 (UTC)
- Can you access https://ghostarchive.org/ ? -- GreenC 17:55, 20 July 2024 (UTC)
- Yeah. I'd presume it's still down for you? GrapesRock (talk) 17:57, 20 July 2024 (UTC)
- Yes, down. It appears to be local to my Firefox browser, works on a different browser. The oddity is other people reported an outage. Maybe something with SSL certificate or DNS. -- GreenC 19:45, 20 July 2024 (UTC)
Not done - false alarm -- GreenC 01:48, 21 July 2024 (UTC)
- Note: Ghostarchive keeps going down at random times. I am thisisanusername123 on that forum.
- The website showed "Unknown domain" for me Notrealname1234 (talk) 15:01, 22 July 2024 (UTC)
- It did for me also. I suspect something got corrupted with SSL (https) certificates, or cookies, or DNS cache, and then stuck in the local browser cache. If it happens again try clearing browser cache or a different browser. There are intermittent problems with WebCite and Archive.today also that affect some users and not others. -- GreenC 16:12, 22 July 2024 (UTC)
espn.go.com
The pages redirect to the espn.com domain per https://www.niemanlab.org/2016/08/espn-com-has-finally-replaced-espn-go-com-and-a-tweet-about-google-seo-may-be-part-of-why/ (though not all do).
28,843 pages GrapesRock (talk) 20:52, 7 July 2024 (UTC)
Soft-404 rules: over 100 - contact me for the list.
Bot results:
- Enwiki: done in segments of 3k to 8k pages
- (#1 to 5000) Checked 5,000 pages and edited 4,715 pages. Moved 11,901 links to a new URL. Removed 7 {{dead link}} templates. Added 115 {{dead link}}. Switched 275 |url-status=dead to live. Switched 149 |url-status=live to dead. Added 1,219 archive URLs (887 Wayback). Changed 411 citation metadata fields.
- (#13001 to 16000) Checked 3,000 pages and edited 2,830 pages. Moved 6,940 links to a new URL. Removed 3 {{dead link}} templates. Added 50 {{dead link}}. Switched 166 |url-status=dead to live. Switched 63 |url-status=live to dead. Added 660 archive URLs (415 Wayback). Changed 260 citation metadata fields.
- (#6001 to 13000) Checked 8,000 pages and edited 7,684 pages. Moved 17,056 links to a new URL. Removed 12 {{dead link}} templates. Added 289 {{dead link}}. Switched 465 |url-status=dead to live. Switched 549 |url-status=live to dead. Added 3,661 archive URLs (2,989 Wayback). Changed 596 citation metadata fields.
- (#1 to 5000 + #13001 to 16000) - reprocessed with new code updates: Checked 8,000 pages and edited 1,923 pages. Moved 2 links to a new URL. Switched 16 |url-status=dead to live. Switched 529 |url-status=live to dead. Added 1,791 archive URLs (1,791 Wayback). Changed 546 citation metadata fields.
- (#16001 to 22000) Checked 6,000 pages and edited 5,614 pages. Moved 12,433 links to a new URL. Removed 5 {{dead link}} templates. Added 134 {{dead link}}. Switched 311 |url-status=dead to live. Switched 326 |url-status=live to dead. Added 2,610 archive URLs (2,017 Wayback). Changed 543 citation metadata fields.
- (#1 to 8691) - different set: Checked 8,691 pages and edited 3,534 pages. Moved 2 links to a new URL. Switched 60 |url-status=dead to live. Switched 476 |url-status=live to dead. Added 1,119 archive URLs (1,043 Wayback). Changed 1,633 citation metadata fields.
- (#22001 to 28891) Checked 6,897 pages and edited 6,462 pages. Moved 14,724 links to a new URL. Removed 7 {{dead link}} templates. Added 205 {{dead link}}. Switched 307 |url-status=dead to live. Switched 455 |url-status=live to dead. Added 3,393 archive URLs (1,776 Wayback). Changed 592 citation metadata fields.
- IABot DB: Processed 73,000 unique URLs and updated 16,000 in the database, which will propagate to 300+ wikis via IABot. Note that an additional 30,000 links were not processed due to intractable problems with this domain and the Wayback Machine that are taking too long.
Done - This was a complex domain. It contains redirects, soft-redirects, soft-404s, crunchy-404s, and a lot of content drift (see WP:LINKROT#Glossary). The signals are in the URLs, page titles, and page body. I was able to discover over 100 rules, but many more in the "crunchy" realm went unaddressed; there are probably thousands. ESPN is a massive site. I focused on espn.go.com and it took over 10 days; the espn.com domain is at least 7 times larger, i.e. 70 days of processing (est). It's beyond available resources and time. It would probably require a team working full time for weeks, building a knowledge base of how the ESPN site(s) are structured and change over time, then keeping up with new changes in the future. Unfortunately ESPN is not the only domain "too big to maintain" correctly. -- GreenC 20:22, 24 July 2024 (UTC)
social.techcrunch.com
There are server errors on pages in this domain, and removing "social" from the domain fixes them (based on the archives of the two links below, they've had this server error since at least October 2023).
Examples (from Instagram):
- https://social.techcrunch.com/2021/05/04/instagram-adds-a-captions-option-for-stories-and-soon-reels/
- https://social.techcrunch.com/2020/11/14/this-week-in-apps-conservative-apps-surge-instagram-redesigned-tiktok-gets-ghosted/
1,818 pages GrapesRock (talk) 21:45, 7 July 2024 (UTC)
Done - Checked 1,812 pages and edited 1,809 pages. Moved 2,650 links to a new URL. Removed 336 {{dead link}} templates. Added 3 {{dead link}}. Switched 213 |url-status=dead to live. Added 34 archive URLs (16 Wayback). Changed 1 citation metadata fields. -- GreenC 04:12, 26 July 2024 (UTC)
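The fix described in this request (dropping "social" from the host) can be sketched as a small URL rewrite. This is only an illustrative sketch, not the bot's actual code; the function name `strip_social_subdomain` is hypothetical, and it assumes the path, query, and fragment carry over unchanged.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_social_subdomain(url: str) -> str:
    """Rewrite social.techcrunch.com URLs to techcrunch.com.

    Hypothetical helper: URLs on any other host are returned untouched.
    Assumes no explicit port is present in the URL.
    """
    parts = urlsplit(url)
    if parts.hostname != "social.techcrunch.com":
        return url
    # Keep scheme, path, query and fragment; swap only the host.
    return urlunsplit(
        (parts.scheme, "techcrunch.com", parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(strip_social_subdomain(
        "https://social.techcrunch.com/2021/05/04/"
        "instagram-adds-a-captions-option-for-stories-and-soon-reels/"
    ))
```

A rewritten link would still need a liveness check before being saved, since the request only tested two examples.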
goo.gl
Google has announced it will no longer support goo.gl shortened links after 25 August 2025. We have quite a lot of these in use on the Wiki at present. It may be necessary to look at a project/bot to replace them with the lengthy URLs. Stifle (talk) 07:52, 19 July 2024 (UTC)
- Google URL Shortener is about the service. https://developers.googleblog.com/en/google-url-shortener-links-will-no-longer-be-available/ also says: "Starting August 23, 2024, goo.gl links will start displaying an interstitial page for a percentage of existing links notifying your users that the link will no longer be supported after August 25th, 2025 prior to navigating to the original target page." PrimeHunter (talk) 08:56, 19 July 2024 (UTC)
- Thank you. I'll triage this after ESPN, above, which will be a couple more days at least. URL shortening was supposedly disallowed on Enwiki. The interstitial page could interfere with the bot, I'll start migrating sooner than later. -- GreenC 15:32, 19 July 2024 (UTC)
- meta:Spam blacklist says "\bgoo\.gl\b(?!/maps\b).*", allowing goo.gl/maps. I guess the rationale was that it can only redirect to Google pages which aren't blacklisted so it cannot be used to bypass a blacklist entry, and it was probably assumed that Google would keep it working as long as they keep the target online. I think there was a time where Google itself gave goo.gl/maps links without the user asking for url shortening, so it would have been annoying if goo.gl/maps was blacklisted. PrimeHunter (talk) 20:34, 19 July 2024 (UTC)
- Ah thanks for the background that explains the distribution of links. -- GreenC 05:30, 20 July 2024 (UTC)
- Note from Google: "In the event the interstitial page is disrupting your use cases, you can suppress it by adding the query param “si=1” to existing goo.gl links."
- insource:goo.gl insource:/([.]|\/)goo[.]gl/ = 4,000+ pages.
- Over 90% are to goo.gl/maps, and about 350 to images.google.com where they are incorrectly/uselessly in the |image= field of infoboxen. -- GreenC 17:40, 19 July 2024 (UTC)
PrimeHunter or Stifle: Question: do you know anything about Google shortened URLs https://g.co/kgs/P8bHo7 ? insource:g.co insource:/\/\/g[.]co/ = 303 pages. Any reason not to expand these? -- GreenC 04:40, 25 July 2024 (UTC)
- I don't, but they should be expanded as a general good thing anyway. Stifle (talk) 07:35, 25 July 2024 (UTC)
- Many of them are used inappropriately for Google searches as references, for example:
- Macken, John (2008). "The Autonomy Theme in the Church Dogmatics: Karl Barth and his Critics". Archived from the original on 2024-02-19. Retrieved 2021-04-21.
- The link redirects to https://www.google.com/search?kgmid=/m/0c4k5ns&hl=en-US&q=The+autonomy+theme+in+the+Church+dogmatics+John+Macken&kgs=f8767d431dbb6421&shndl=0&source=sh/x/kp&entrypoint=sh/x/kp. It would be better if somebody is willing to improve them manually but it takes time. PrimeHunter (talk) 10:02, 25 July 2024 (UTC)
- Yeah URL shortening hides the true nature of dodgy links which go unnoticed. I think it's a step forward to expand them so someone can more easily discover them. -- GreenC 20:21, 25 July 2024 (UTC)
Done
- for goo.gl -- Checked 4,024 pages and edited 3,970 pages. Moved 5,198 links to a new URL. Added 11 {{dead link}}. Added 79 archive URLs (12 Wayback).
- There are still some unconverted, insource:goo.gl insource:/([.]|\/)goo[.]gl/ shows 140 pages remain. Reasons include: bot inability to parse; marked with a {{dead link}} (cites should be removed or refactored); embedded in an archive URL.
- for g.co -- Checked 303 pages and edited 295 pages. Added 45 archive URLs (0 Wayback). Changed 3 citation metadata fields.
-- GreenC 00:44, 26 July 2024 (UTC)
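Two points from this thread lend themselves to a short sketch: how the blacklist pattern quoted above lets goo.gl/maps through, and how the "si=1" parameter mentioned in Google's note could be appended. This is a rough illustration only; the function names and the sample link IDs are hypothetical, and Python's `re` is assumed to treat the pattern the same way as the blacklist engine.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Pattern as quoted from meta:Spam blacklist in the discussion above.
BLACKLIST = re.compile(r"\bgoo\.gl\b(?!/maps\b).*")


def is_blacklisted(url: str) -> bool:
    """True if the quoted blacklist pattern matches anywhere in the URL."""
    return BLACKLIST.search(url) is not None


def suppress_interstitial(url: str) -> str:
    """Append si=1 to a goo.gl URL unless an si parameter is already present."""
    parts = urlsplit(url)
    if parts.hostname != "goo.gl":
        return url
    query = parse_qsl(parts.query, keep_blank_values=True)
    if not any(key == "si" for key, _ in query):
        query.append(("si", "1"))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    # "abc123" is a made-up ID used only for illustration.
    print(is_blacklisted("https://goo.gl/maps/abc123"))
    print(is_blacklisted("https://goo.gl/abc123"))
    print(suppress_interstitial("https://goo.gl/maps/abc123"))
```

The negative lookahead `(?!/maps\b)` is what exempts goo.gl/maps: the pattern only matches when "goo.gl" is not immediately followed by "/maps" as a whole path segment.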
popmani.se
Popmani.se has been usurped by a casino advertising/SEO spam operation AlexandraAVX (talk) 17:45, 23 July 2024 (UTC)
- Added to the usurpation queue: Special:Diff/1236087736/1236249622 -- GreenC 17:52, 23 July 2024 (UTC)
Done -- GreenC 00:12, 25 July 2024 (UTC)
blogs.msdn.com
There are ~100 pages with URLs like http://blogs.msdn.com/b/oldnewthing/archive/2004/09/02/224672.aspx. These are dead but can be fixed by extracting the date from the URL and rewriting it as https://devblogs.microsoft.com/oldnewthing/20040902-00 ("-00" is static), which points to a list of blog posts on that day. Often there's only one, in which case it's unambiguous, but occasionally there are two or more and you need to disambiguate them somehow. Comparing the citation title with the article title is one possibility. (The destination URL for that instance is https://devblogs.microsoft.com/oldnewthing/20040902-00/?p=37983 with a numeric ID that can't be found any way I know of). * Pppery * it has begun... 23:02, 8 July 2024 (UTC)
- User:Pppery, I will do this, but the disambiguation based on |title=, probably not. Trying to read and match citation information breaks the model of the bot, it would require a special parser, can be difficult and error prone, and it's probably not that many pages (a percentage of 100 pages), could be done manually by someone faster and more accurately. I could probably parse the HTML to see when there are multiple blog items and log those URLs so we know which articles and URLs need checking. -- GreenC 05:16, 14 July 2024 (UTC)
Done Checked 626 pages and edited 46 pages. Moved 34 links to a new URL. Switched 4 |url-status=dead to live. Switched 3 |url-status=live to dead. Added 13 archive URLs (2 Wayback).
- * Pppery *, the bot converted 34 URLs. Of those, 10 were converted to the index page with multiple blog entries; they can remain as-is or someone can manually convert them to the exact page, listed above. There were another 5 dead links at devblogs.microsoft.com/oldnewthing. There are still some URLs in search, these are prior archive URLs now set live, or the 5 dead ones. -- GreenC 18:13, 26 July 2024 (UTC)
- I've disambiguated the ones that need disambiguating. * Pppery * it has begun... 23:34, 26 July 2024 (UTC)
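The date-based rewrite proposed at the top of this request can be sketched as follows. It is a minimal illustration under assumptions, not the bot's code: the function name is hypothetical, and the pattern is deliberately limited to the oldnewthing blog path shown in the example URL.

```python
import re
from typing import Optional

# Matches the URL shape given in the request:
# http://blogs.msdn.com/b/oldnewthing/archive/YYYY/MM/DD/<id>.aspx
MSDN_OLDNEWTHING = re.compile(
    r"^https?://blogs\.msdn\.com/b/oldnewthing/archive/"
    r"(\d{4})/(\d{2})/(\d{2})/[^/]+\.aspx$",
    re.IGNORECASE,
)


def rewrite_oldnewthing(url: str) -> Optional[str]:
    """Return the devblogs day-index URL, or None if the URL does not fit.

    The "-00" suffix is static, as stated in the request. The result is the
    index of posts for that day, so days with several posts still need
    manual disambiguation.
    """
    match = MSDN_OLDNEWTHING.match(url)
    if match is None:
        return None
    year, month, day = match.groups()
    return f"https://devblogs.microsoft.com/oldnewthing/{year}{month}{day}-00"


if __name__ == "__main__":
    print(rewrite_oldnewthing(
        "http://blogs.msdn.com/b/oldnewthing/archive/2004/09/02/224672.aspx"
    ))
```

Returning None for anything that does not match keeps unrelated blogs.msdn.com links out of the rewrite, which seems the safer default for a batch job.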
billboard.com/bbcom
Hello. Billboard URLs with /bbcom/ are broken. I found this redirects to that for Because You Left. However, the rest of the links I tried do not redirect. The numbers in the URL also don't match so it's not a simple find and replace.
- HTTPS - 300.
- HTTP - ~5600.
- HTTP without www - ~220
- HTTPS without www - 25
These include links not in mainspace and ones that are already archived. Thanks! MrLinkinPark333 (talk) 23:47, 14 July 2024 (UTC)
- I checked WaybackMachine for Ghost redirects, like with ew.com, but no ghosts. I guess the solution will be to archive (and any working redirects). -- GreenC 05:18, 15 July 2024 (UTC)
Done - Checked 1,937 pages and edited 1,163 pages. Moved 26 links to a new URL. Removed 1 {{dead link}} templates. Added 634 {{dead link}}. Switched 8 |url-status=dead to live. Switched 55 |url-status=live to dead. Added 649 archive URLs (639 Wayback). Changed 207 citation metadata fields. -- GreenC 22:12, 26 July 2024 (UTC)
- Many of these URLs first died over 15 years ago, possibly 20 or more, thus there are gaps in archival coverage as evidenced by "Added 634 {{dead link}}". It was pre-Wayback and pre-Archive.today automatic archiving. -- GreenC 22:19, 26 July 2024 (UTC)
pqasb.pqarchiver.com
Hello again. There are a ton of pqarchiver links broken. These articles can be found at ProQuest but fall into 2 categories:
- 1. URLs with /doc/ can be converted into ProQuest URLs. this is now here for Here, My Dear. Some URLs have a different source type, such as converting this goes here for 1916 Michigan Agricultural Aggies football team and ends in Historical Newspapers. Template:ProQuest can help with this. ProQuest 565997669 points to this and will redirect to the right link for the Aggies article.
- 2. URLs that don't have /doc/ can't be converted as the new URLs don't match the number, such as this is now here for 1974–75 Buffalo Sabres season. These need regular archives as I can't predict what the number will be in the new URL.
Please note that not all of these links are in articlespace. Also, some of these already have archived copies. Since it's so much, I don't mind if only the /doc/ URLs are focussed on and the other URLs dealt with later. Thanks again! MrLinkinPark333 (talk) 01:55, 17 July 2024 (UTC)
12,618 pages. Need to process all, to filter on the "/doc/" set, so I'll check the others for redirects, 404s, and https, hopefully not too many problems. If it gets messy will consider just focusing on /doc/ -- GreenC 22:34, 26 July 2024 (UTC)
MrLinkinPark333: I'm not sure which is better: archive or live link. The archive version has more information. They are both paywalls. My instinct is to leave the /doc/ pages as is, and add archives if not already. The entire pqasb.pqarchiver.com is dead. I'll start adding archives, and can go back for the /doc/ pages if decided. -- GreenC 05:06, 27 July 2024 (UTC)
- I think the /doc/ ones should be converted over since live links are available. ProQuest is available via The Wikipedia Library, so they are accessible without an account. The link for Here, My Dear in #1 works without a paywall. I don't know why some are and some are not. If they are converted, then perhaps each link can have added |url-access=subscription |via=ProQuest. MrLinkinPark333 (talk) 13:49, 27 July 2024 (UTC)
- The 'here my dear' gives a paywall, for me, you may be logged in somehow, try incognito. If you really think it should be done I'll do it. I don't want to get into adding new fields because that can get really complicated, I'll leave that for another bot that knows how to deal with conflicts with other fields and so on, like identifiers. Citation bot might have rules for it. -- GreenC 14:42, 27 July 2024 (UTC)
- Well, this may solve itself. They have rate limiting enabled via CloudFlare which is blocking my bot, checking the new link is live. I tried various techniques to get around but none work. I could do a blind move, search-replace the URL and not test, but with 4,000 links, the odds of problems are high, and I don't know what the error rate is. I can provide a list of the URLs, if anyone wants to work on testing headers return status 200. -- GreenC 17:02, 27 July 2024 (UTC)
- What I've found is that any links with /doc/ seems to match up with the new URL. Though I can't predict the source type. Otherwise, URLs without /doc/ don't seem to match up. See for instance this with that for Cultural impact of Mariah Carey. MrLinkinPark333 (talk) 18:28, 27 July 2024 (UTC)
- Source type is added automatically on the server side; only https://www.proquest.com/docview/283783667 is needed, and it works. Why don't we do this: I'll just manually check 50 URLs and see what the error rate is. -- GreenC 18:48, 27 July 2024 (UTC)
- ..checked 50 and error rate is 0%. That gives me more confidence to move forward with a blind move. -- GreenC 18:53, 27 July 2024 (UTC)
- Sounds good. Much easier to let the sourcetype be added automatically in the URL when viewing the link without it. MrLinkinPark333 (talk) 19:09, 27 July 2024 (UTC)
Enwiki in two batches:
- Batch 1: Checked 6,000 pages and edited 4,582 pages. Moved 1,716 links to a new URL ("doc" moves). Removed 58 {{dead link}} templates. Added 754 {{dead link}}. Switched 453 |url-status=dead to live. Switched 443 |url-status=live to dead. Added 4,978 archive URLs (2,501 Wayback).
- Batch 2: Checked 6,638 pages and edited 5,656 pages. Moved 1,708 links to a new URL ("doc" moves). Removed 86 {{dead link}} templates. Added 845 {{dead link}}. Switched 338 |url-status=dead to live. Switched 180 |url-status=live to dead. Added 7,407 archive URLs (3,278 Wayback).
IABot db: Modified 54,000 URLs. Uploaded new archive URLs. Set status to dead. Unable to do the "/doc/" moves because IABot does not currently support URL moves. The changes will propagate to 300+ wikis.
Done -- GreenC 16:15, 30 July 2024 (UTC)
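The "/doc/" move discussed in this thread can be sketched as a blind rewrite to the bare docview URL, leaving the source type to be added by the server. This is a sketch under a stated assumption: the numeric ProQuest ID is taken to be the digits immediately after "/doc/" in the old URL, which the thread implies but does not spell out. The function name and the sample "latimes" path are hypothetical.

```python
import re
from typing import Optional

# Assumption: the ProQuest document ID is the run of digits right after /doc/.
PQARCHIVER_DOC = re.compile(
    r"^https?://pqasb\.pqarchiver\.com/(?:[^/?#]+/)*doc/(\d+)",
    re.IGNORECASE,
)


def to_proquest_docview(url: str) -> Optional[str]:
    """Return the bare ProQuest docview URL, or None for non-/doc/ URLs.

    URLs without /doc/ are skipped because, per the discussion, their
    numbers do not match the new ProQuest IDs and they need archives instead.
    """
    match = PQARCHIVER_DOC.match(url)
    if match is None:
        return None
    return f"https://www.proquest.com/docview/{match.group(1)}"


if __name__ == "__main__":
    # Hypothetical old-style URL; only the ID 283783667 comes from the thread.
    print(to_proquest_docview(
        "http://pqasb.pqarchiver.com/latimes/doc/283783667.html?FMT=ABS"
    ))
```

Because the rate limiting mentioned above prevents automated liveness checks, a rewrite like this would rely on spot checks such as the 50-URL manual sample described in the thread.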
forumromanum.org
WP currently has 291 links beginning with "http://www.forumromanum.org/"; but the site owner deleted the site, and the domain name has fallen into the hands of a scammer. (Do not go to that link! You will get a browser freeze with a scam message telling you to call a phone number etc.)
Fortunately, the site — a very valuable source of classical Latin texts, often not to be found elsewhere online — was backed up at WebArchive, and as far as I can tell all its pages can be reached by a simple substitution in the URL. The global-replace pattern is:
replace
by
I have first-hand experience of the scammer splash screen; and, separately, confirmation of the removal of the site by David Camden the original site owner himself. For privacy, I'm of course not posting his e‑mail or details, but I can be reached by e‑mail to confirm privately if need be.
Bill (talk) 20:29, 22 July 2024 (UTC)
- No problem, have done this before per procedure WP:USURPURL. Sounds like a nasty malware site; I have increased the priority. -- GreenC 20:49, 22 July 2024 (UTC)
- On behalf of many people out there, thank you! Bill (talk) 20:52, 22 July 2024 (UTC)
- https://urlscan.io/result/81c76ba4-c427-49d1-bd4c-e446e3d16cc7/
- Seems like it's a parking page now. Notrealname1234 (talk) 21:15, 22 July 2024 (UTC)
- oh wait, still a malware attack. Notrealname1234 (talk) 21:18, 22 July 2024 (UTC)
Done in a JUDI batch with 27 other domains usurped: Special:Diff/1236485623/1236486118. -- GreenC 00:11, 25 July 2024 (UTC)
- Nifty! and nice detective work on your part, fixing 27 other problems as well! Bill (talk) 14:06, 29 July 2024 (UTC)