
Wikipedia:Link rot/URL change requests/Archives/2019/December

From Wikipedia, the free encyclopedia


www.msnbc.msn.com → www.nbcnews.com

Much content moved from MSNBC to NBCNews.com.

msnbc.com forwards links to their old friend, but somehow msnbc.msn.com fails to do so.

There are too many of these nonforwarding (and forwarding) links for manual operations.

Urgent:

  • Replace links to [http[s]://][www.]msnbc.msn.com/id/9999999[/[foo]][/][#bar] with links to http://www.nbcnews.com/id/9999999/ (where 9999999 is a placeholder for 7 digits, or possibly n digits). A quick search for "msnbc.msn.com/id" finds 287 articles in the [article] namespace with broken link(s) (3,937 pages, counting all namespaces.)

Less urgent:

  • Replace links to [http[s]://][www.]msnbc.com/id/9999999[/[foo]][/][#bar] with links to http://www.nbcnews.com/id/9999999/. A quick search for "msnbc.com/id" finds 31 articles in the [article] namespace with broken link(s) (234 pages, counting all namespaces.)
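The two replacement rules above can be sketched as a single regular expression. This is a minimal, purely textual sketch: the function name `to_nbcnews` and the set of characters assumed to end a URL in wikitext are illustrative assumptions, and it does not check whether the old or new URL is actually live.

```python
import re

# Matches [http[s]://][www.]msnbc.msn.com/id/N... and [http[s]://][www.]msnbc.com/id/N...
# The characters treated as "end of URL" (whitespace, |, ], }, <, >) are an assumption
# about how links appear in wikitext.
_MSNBC_ID_URL = re.compile(
    r"(?<![\w.-])"            # do not start inside a longer hostname (e.g. tv.msnbc.msn.com)
    r"(?:https?://)?"         # optional scheme
    r"(?:www\.)?"             # optional www.
    r"msnbc\.(?:msn\.)?com"   # msnbc.msn.com or msnbc.com
    r"/id/(\d+)"              # numeric story ID, any number of digits (captured)
    r"(?:/[^\s#|\]}<>]*)?"    # optional trailing path such as /foo/
    r"(?:#[^\s|\]}<>]*)?",    # optional fragment such as #bar
    re.IGNORECASE,
)


def to_nbcnews(text: str) -> str:
    """Rewrite matching MSNBC story links in text to the nbcnews.com form."""
    return _MSNBC_ID_URL.sub(r"http://www.nbcnews.com/id/\1/", text)


if __name__ == "__main__":
    sample = "[http://www.msnbc.msn.com/id/1234567/ns/foo/#bar Title]"
    print(to_nbcnews(sample))
```

Other msn.com sub-domains are deliberately left alone, since the pattern only accepts an optional `www.` in front of the host.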

Microsoft has casually stolen linkage and traffic from NBCNews. That makes it especially important to give them back to NBCNews. -A876 (talk) 03:29, 30 October 2019 (UTC)

@A876: thanks for the report. Microsoft hijackers, who knew. This will require checking both old and new URLs for functionality, unwinding or adding any archive URLs and {{dead link}}, changing |publisher= and |work= fields. And typically there are other edge-case problems that show up. This will take some time to program and test out and I'm in the middle of other projects but will get to this. -- GreenC 14:30, 30 October 2019 (UTC)
@A876: - Turns out msn.com (and msnbc.com) is a huge can of worms. Most of the sub-domains are dead and there are 100s of them (eg. sport.be.msn.com). These can't be redirected to NBC as they have no ID but make up 10s of thousands of URLs. I set them to "Blacklisted" in the IABot and ran the bot on those pages (this took me 4 days due to the limit of 5 queues running at once). Ok so then we have the "www" domains as you originally posted above, and some of those have an ID - over 15,000 not the few hundred originally thought. These were converted to NBC where possible, and a large number had a secondary redirect (from NBC to Today.com) and those were discovered and added. Those that couldn't redirect had archive URLs added. There were also soft-404 problems at NBC, those were discovered the best I could because NBC has a tight cookie blocker that triggers a block after viewing more than a few pages in a row. Anyway I have no idea how many URLs I saved but guessing over 30,000 in at least 15,000 articles, possibly more. MSN was a total mess, they seem to have abandoned most of their properties, thanks for bringing this to WP:URL. -- GreenC 18:23, 22 November 2019 (UTC)

Wow. I'm glad I asked and I'm glad you clobbered so many. I wanted to look at the edits you made your bot do, but I didn't know its name. (I found it indirectly, the obvious-in-hindsight User:GreenC bot.) I see that you handled many more cases, such as linking to Archive for pages that show up as "removed" on the new domain. Now I'm going to whine a little. When the old link doesn't work, and I humanly see that it is probably gone forever, I go with "url-status=unfit", not "dead", because {cite} shows even useless "dead" URLs and hides useless "unfit" URLs. Also, I feel compelled to use https://www.nbcnews.com/... rather than http://www.nbcnews.com/... for the same reasons that HTTPS Everywhere was created, even though the first one currently does not rewrite to the second one. The important thing was to rescue them before they all got deleted or archived instead of being live-linked where possible, and that's great. Thanks. - A876 (talk) 00:33, 23 November 2019 (UTC)

A lot of edits were also done by User:InternetArchiveBot with "GreenC" in the edit summary. |url-status=usurped (ie. unfit) is for special cases where the URL is to a spam or otherwise inappropriate site. Normally it should continue displaying the original URL even if dead. This is per the docs; there are probably reasons. The https redirects to http, if lucky; it is safer to use http, since it ends up there anyway. -- GreenC 01:42, 23 November 2019 (UTC)

I think the "docs" on url-status are a little vague. "When the original URL has been usurped for the purposes of spam, advertising, or is otherwise unsuitable, setting |url-status=unfit or |url-status=usurped suppresses display of the original URL ...". Values of "unfit" and "usurped" have identical effect, but they report distinct conditions. Obviously "usurped" means "has been usurped for the purposes of spam [or] advertising", and "unfit" means "is otherwise unsuitable". (Although they vary the order in the description.)

The default value, (url-status not specified), is definitely suitable for a link that is untested and might be live, might be usurped (spam or advertising replacement site), or might be unfit (serving other content, 404 page, or domain not found) – because it shows both links, which is the right thing to do, because the original link might be live. The mere presence of "archive-url" does not say whether "archive-url" was added because the original link is broken or in case the original link becomes broken (preemptively). Any editor should replace (url-status not specified) with "url-status=live", "url-status=dead", "url-status=unfit", or "url-status=usurped", as soon as they notice which one applies.

Unfortunately, "the default value is |url-status=dead". Because of this, specifying "dead" has the same effect as (url-status not specified). That is where they went wrong. If the original link really is dead, it should not be shown. But (url-status not specified) shows both links, because it has to. They should instead make "dead" hide the original link, the same as "unfit" and "usurped"; and let "live" and (url-status not specified) show the original link. They could accomplish this by making the default value of url-status be "unknown" or "" (null string); anything but "dead". With this in effect, everything would be comfortable. Wherever "url-status=dead" or "unfit" or "usurped", the original URL would be hidden; "dead" means that "unfit" or "usurped" would apply, but no determination was made. (Is it even important?) Wherever "url-status=unknown" (or "" or the new default value), it means that although archive-url was added, a determination of "live" or "dead" (or "unfit" or "usurped") was not made. Until the url-status system is made reasonable, I cannot see fit to use the value "dead", because it inappropriately shows the broken link; I am compelled to enter "unfit" or "usurped" (or even "live", if "dead" was previously added in error).

I mixed up a detail. I have seen sites that respond to both http and https without rewriting, but nbcnews.com is not one of them. nbcnews.com rewrites https → http (a rare and somewhat insane reaction), so http is the preferred link; you got it right. - A876 (talk) 07:23, 24 November 2019 (UTC)

@A876: I think where you are at odds is "it inappropriately shows the broken link" when dead. You have not made a case why this is inappropriate, or heard the counter-arguments why it is appropriate. I suggest posting at Help_talk:Citation_Style_1, where it is designed and discussed. Until there is consensus for your position, please do not edit counter to consensus; there may be things you are not aware of, reasons why things are done as they are. -- GreenC 13:18, 24 November 2019 (UTC)

[Preface: I'm raging at the system, not you.] Dare I assume that "dead" means dead? (I cannot do otherwise.) Thus it is beyond obvious that it is inappropriate to show a clickable link for a URL that has been marked "dead". (Unless you distrust every existing "dead" marking, in which case they should have been globally replaced with "maybedead", or similar.) Do you actually propose that I have to "make a case why [having cite templates make every broken URL that is marked "dead" clickable, wasting every reader's time] is inappropriate"? (I'm not editing heavily, so it won't be a big factor.) Have I really been "editing counter to consensus", which "consensus" I might never manage to discover, even searching and reading yards of off-the-beaten-trail talk-pages and talk-page archives that discuss a template I don't even use? I looked at the crude documentation for {{cite web}}. I don't know how you got your interpretation thereof; you imply that some "real" directions are elsewhere. To me, someone came up with a possibly clever scheme, but botched the implementation and/or didn't document it clearly and where needed. I think I deserve a section-specific cite to a clear, concise declaration of the features and how to use them correctly.

If you think I have described a probable deficiency and practicable solution(s), then maybe you can advise where and how I might suggest it. (After I study any advice you have linked.) (If you agree strongly, consider digesting it into something that fits the status quo and posting it yourself. ("Take" "my" "idea", please.)) I am clear in my understanding that any usurped URL may be marked "usurped", and any unfit URL may be marked "unfit". A "dead" URL is one that no longer returns the content that it did when it was added – therefore every "dead" URL is one that has either been "usurped" (typosquatting, etc.) or is otherwise "unfit" for linking (domain-not-found, 404, reused for unrelated content, etc.). Therefore I see no problem replacing "dead" with "usurped" or "unfit" (as the case may be). Normally, I would feel minimal urgency to disambiguate "dead", because (so far) it matters little to me why a URL is "dead". (Replacing "live", "dead", or "unfit" with "usurped" can be informative. It can usefully be done en masse by a bot whenever a domain gets blacklisted on Wikipedia and is confirmed for such tagging.) However, given that "dead" is handled most incorrectly (it hyperlinks dead URLs! unacceptable!), I currently feel some urgency to [legitimately] replace "dead" with "usurped" or "unfit", in order to avoid the pointless sharing of dead links that results from using "dead". If "dead" were handled correctly (the same as "usurped" or "unfit"), that would remove my urgency to reclassify "dead" to "usurped" or "unfit". The easiest way to do that is to treat links that have url-status of [blank] (or other unrecognized value) as if their url-status is "unknown" and show both links; and make "dead" hide the dead link just like "usurped" and "unfit" do.
That should work – unless someone has globally replaced url-status of [unspecified] with "dead", in which case they have destroyed the meaning of "dead", having mismarked thousands of links that should be "live" as "dead"; in which case I don't know what to say; major damage has been done that is only easily fixable by redefining "dead" here to mean "maybe dead maybe live", in which case no one should ever enter "dead" for any dead link; they must enter either "usurped" or "unfit", to make sure that the dead URL is not shared as a link.

Alternate: If we regard "dead" to only mean "domain not found", then it is reasonable to link both URLs, because the "outage" might be only temporary. This interpretation would make the use of "dead" for URLs that are known to be working but unfit (404, etc.) unacceptable.

(This situation is a little better than over at Quora, where anonymous unreachable moderators incorrectly enforce secret policies. For example, an edit done in agreement with their published policies and recommendations gets accepted by one moderator, and then the same edit is reverted by another editor citing "policy", and there is no place to reply.) - A876 (talk) 19:40, 25 November 2019 (UTC)

@GreenC: (Sorry for belatedly pinging you with this 2nd reply here.) I have refocused on a serious situation. (Maybe I should post it elsewhere, but I don't know where.) I don't know whether you parsed my previous reply above that states it somewhat convolutedly among other topics. Just now I wrote a possibly clearer fresh statement at User talk:John B123#dead/unfit/usurped URLs, but that's a little long too. (Due to the crappy documentation at {{cite web}}, another editor disagrees as to what "dead"/"usurped"/"unfit" mean and how they should be used. Based on my reading of {{cite web}}, I changed "url-status=dead" to "url-status=unfit" because "unfit" is more specific and, most importantly, "unfit" hides the worthless broken URL. But someone changed it back, as if showing broken URLs is somehow correct, preferred, or beneficial. I am NOT looking for side-taking on this. It got me thinking again about the bigger issue of hiding broken URLs.)

This fresh summary seems more concise:

I think that {cite web} etc. should change "url-status=dead" to hide the broken URL, the same as "url-status=usurped" and "url-status=unfit" do. That change would hide a million broken URLs, a huge benefit to Wikipedia users. But that creates a new problem. When "url-status=" (or not specified at all), templates {cite web} etc. all act the same as for "url-status=dead". So their default response must also be changed. When "url-status=" (or not specified at all), we don't know whether the editor should have specified it as "live" or "dead"/"usurped"/"unfit", so the templates must also be changed to act as if they see "url-status=unknown" or similar (a new "default" value), and show both URLs. That seems a complete fix, EXCEPT: If anyone specified "url-status=dead" for URLs that could be live, or expecting the "dead" URL to be shown as a link, then they have created a possibly huge number of mis-tagged URLs. This seems unlikely, but it is conceivable that some humans or bots replaced "url-status=" (or not specified at all) with "url-status=dead" because it is the default (despite "dead" not being the original intent); in that case errant edits should be tracked down and corrected. (If there is no way to find and correct all such errors, a decision must be made whether to accept the errors (a few thousand live URLs hidden) or abandon this approach.)

If the above fix is not workable, it comes with acceptance of a huge degradation of Wikipedia data. In this case, "url-status=dead" must always show both URLs (millions "live", millions broken). The only way to hide the broken URLs would be to disambiguate each "dead" to "live", "usurped" (indicating spam or worse), or "unfit" (404, other content, domain-not-found). The documentation would have to clarify that "dead" is not preferred, must not be used for new entries, and should be replaced with "live", "usurped", or "unfit". (Then it would be optional to mass-rework thousands of bot-edits that could have specified "unfit" rather than "dead".) (And it is simply ugly that "dead" effectively means "unknown" or "unspecified". Maybe another mass-edit could replace every "dead" with "unspecified".) - A876 (talk) 05:29, 8 December 2019 (UTC)
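The difference between the behaviour described in this comment and the proposed change can be modelled in a few lines. This is only a sketch of the rules as stated in the discussion, not the citation templates' actual code; the function name `original_url_shown` and the `proposed` flag are hypothetical, and it assumes an archive URL is present.

```python
from typing import Optional

# Values that, per the discussion, suppress display of the original URL today.
HIDING_STATUSES = {"unfit", "usurped"}


def original_url_shown(url_status: Optional[str], proposed: bool = False) -> bool:
    """Return True if the original URL would be displayed next to the archive URL."""
    status = (url_status or "").strip().lower()
    if not proposed:
        # Behaviour as described in the thread: a blank value defaults to "dead",
        # and "dead" still shows the original URL.
        if status == "":
            status = "dead"
        return status not in HIDING_STATUSES
    # Proposed behaviour: a blank value is treated as "unknown" (shown),
    # and "dead" hides the original URL like "unfit" and "usurped".
    if status == "":
        status = "unknown"
    return status not in HIDING_STATUSES | {"dead"}


if __name__ == "__main__":
    for value in (None, "live", "dead", "unfit", "usurped"):
        print(value, original_url_shown(value), original_url_shown(value, proposed=True))
```

Under this model the only value whose display changes is "dead"; blank and "live" keep showing the original URL in both schemes.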

A876, no offense but I'm not sure why you are lobbying me, I have nothing to do with it nor any control over it nor any desire to make a change. I already told you where the appropriate forum is to discuss this. -- GreenC 16:22, 8 December 2019 (UTC)

@A876: In the Wizard of Oz (1939 film), speaking of the death of the Wicked Witch of the East, the coroner averred: I thoroughly examined her: And she's not only merely dead: She's really, most sincerely dead!
Reaching the conclusion that a URL is really dead is difficult from a single observation. With "usurped" and "unfit", we can conclude that some page was served up, and the page that was served wasn't appropriate. We have a good deal more confidence that this is really dead. Once you've got the archive link in there (along with the "dead" status), things don't change much, except people aren't checking the original url very much.
Please explain again what is the big problem with the existing scheme that makes it so evil? Fabrickator (talk) 13:00, 9 December 2019 (UTC)

Change fancyclopedia.wikidot.com to fancyclopedia.org

Until now Fancyclopedia has been accessible using http://fancyclopedia.org and the deprecated http://fancyclopedia.wikidot.com, but now that we've ported it to MediaWiki, the latter form will stop working. Could someone write a bot to change all references of 'fancyclopedia.wikidot.com' to 'fancyclopedia.org' across the site? I don't want to learn how to write Wikipedia bots myself. Vicarage (talk) 13:14, 18 December 2019 (UTC)
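The host swap requested here can be sketched with the standard library. This is a minimal illustration, not a bot: the function name `move_host` is hypothetical, and it assumes page paths carry over unchanged to the new host, which the request does not confirm.

```python
from urllib.parse import urlsplit, urlunsplit

OLD_HOST = "fancyclopedia.wikidot.com"
NEW_HOST = "fancyclopedia.org"


def move_host(url: str) -> str:
    """Return url with the old Fancyclopedia host replaced; other URLs are unchanged."""
    parts = urlsplit(url)
    if (parts.hostname or "").lower() != OLD_HOST:
        return url
    # Keep scheme, path, query and fragment; replace only the network location.
    return urlunsplit(parts._replace(netloc=NEW_HOST))


if __name__ == "__main__":
    # "some-page" is a made-up path used only for illustration.
    print(move_host("http://fancyclopedia.wikidot.com/some-page"))
```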

You can also ask for inclusion in the m:Interwiki map, so the next time we only need to change the URL centrally. Nemo 14:29, 18 December 2019 (UTC)

@Vicarage: Unless I am missing something it only appears in 8 articles.[1] -- GreenC 01:12, 19 December 2019 (UTC)

I saw more, but I think that was the problem of searching for urls. I will fix manually. Vicarage (talk) 13:47, 20 December 2019 (UTC)