User talk:EarwigBot/Copyvios/Exclusions

Protected edit request on 30 October 2014

This edit request has been answered. Set the |answered= or |ans= parameter to no to reactivate your request.

Please replace all <tt>...</tt> with <code>...</code> or <kbd>...</kbd> as is appropriate on a case-by-case basis.
Please replace all of the http:// in the internal Wikipedia sites section to https?:// since Wikipedia uses a secure server.

Thank you. — Technical 13 (e • t • c) 17:02, 30 October 2014 (UTC)[reply]

Should probably ping The Earwig here as well. Thanks again. — Technical 13 (e • t • c) 17:04, 30 October 2014 (UTC)[reply]

Re 1: Done. I just replaced all the tt tags with code tags... not sure what else you would have wanted?
Re 2: "Exceptions are protocol-insensitive (so the rule http://wiki.riteme.site will match https://wiki.riteme.site/wiki/Foo)", so this is not necessary and would break the rules since they are not regular expressions without the re: prefix. — Earwig ^talk 18:53, 30 October 2014 (UTC)[reply]
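For readers unfamiliar with the rule format, the behaviour described in the reply above can be sketched in Python. This is an illustration only, not EarwigBot's code: the helper names `strip_scheme` and `rule_matches` are hypothetical, and treating a plain rule as a literal prefix of the scheme-less URL is an assumption made for the sketch.

```python
import re

def strip_scheme(url: str) -> str:
    """Drop a leading http:// or https:// so comparisons ignore the protocol."""
    return re.sub(r"^https?://", "", url, flags=re.IGNORECASE)

def rule_matches(rule: str, url: str) -> bool:
    """Return True if an exclusion rule covers the URL (hypothetical helper)."""
    target = strip_scheme(url)
    if rule.startswith("re:"):
        # Only rules carrying the re: prefix are read as regular expressions.
        # Searching the scheme-less URL is an assumption of this sketch.
        return re.search(rule[3:].strip(), target) is not None
    # Plain rules are compared literally (assumed here: as a prefix), so
    # regex syntax such as "https?://" is taken as literal characters.
    return target.startswith(strip_scheme(rule))

print(rule_matches("http://wiki.riteme.site", "https://wiki.riteme.site/wiki/Foo"))    # True
print(rule_matches("https?://wiki.riteme.site", "https://wiki.riteme.site/wiki/Foo"))  # False
```

The second call shows why the requested change would break plain rules: without the re: prefix, `https?://` is not a pattern, so nothing matches it.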

common blacklist?

Hi! I noticed that you're maintaining a blacklist for EarwigBot and the copyvio tool on Labs, and there's another, User:EranBot/Copyright/Blacklist, being maintained for User:EranBot, which User:ערן and User:Doc James have been working on lately. Would it be feasible to work from a common blacklist? I noticed a bunch of mirrors (covered by EranBot's blacklist) coming up when I tried out the copyvio tool.--ragesoss (talk) 00:20, 12 December 2014 (UTC)[reply]

So does the blacklist for EarwigBot use the same format, and is it also a collection of mirrors of Wikipedia? Doc James (talk · contribs · email) 01:10, 12 December 2014 (UTC)[reply]

User:Doc James: The format is a little different (the user page of this Talk page is it), but they are both essentially just lists of regexes, so it might be simple to modify EarwigBot to use the much larger list of mirrors we've compiled so far for EranBot. (I anticipate using the same mirror list for Wiki Ed's plagiarism prevention system.)--Sage (Wiki Ed) (talk) 01:14, 12 December 2014 (UTC)[reply]

This is a good idea; I wasn't aware there was another list of mirrors (and I should really watch this page more often!). I've created an issue for it which I'll handle soon. — Earwig ^talk 18:02, 23 January 2015 (UTC)[reply]

Done now. The bot uses User:EranBot/Copyright/Blacklist too. — Earwig ^talk 00:19, 27 January 2015 (UTC)[reply]

Add http://www.reference.com/

E.g. http://www.reference.com/browse/Ganesha --Redtigerxyz ^Talk 10:52, 26 December 2014 (UTC)[reply]

Done. — Earwig ^talk 18:02, 23 January 2015 (UTC)[reply]

+

Please change quickiwiki.com/en to quickiwiki.com, as the site also supports other-language Wikipedias, e.g. 1, 2

— Revi 06:49, 23 January 2015 (UTC)[reply]

@The Earwig: Ping. — Revi 06:49, 23 January 2015 (UTC)[reply]

Done. — Earwig ^talk 18:02, 23 January 2015 (UTC)[reply]

Add http://www.donehealth.com/

--Redtigerxyz ^Talk 18:15, 7 February 2015 (UTC)[reply]

Add http://www.questpedia.org

Thanks! --AlessioMela (talk) 11:39, 2 March 2015 (UTC)[reply]

Edit request

Add "url = http://*.gpo.gov". This is the United States Government Printing Office, which prints official versions of US documents; it's all freely licensed as US government publications. Thanks! Kharkiv07 ^Talk 21:35, 3 April 2015 (UTC)[reply]

Done Rjd0060 (talk) 23:26, 25 April 2015 (UTC)[reply]

Edit request

Please add "url = http://*.usgs.gov". As with gpo.gov, above, USGS content consists primarily of works of the US government and is in the public domain.

This article creation was flagged as a potential copyvio. TJRC (talk) 00:40, 26 June 2015 (UTC)[reply]

Added — Martin (MSGJ · talk) 10:14, 2 July 2015 (UTC)[reply]

New exclusions

Russian Wikipedia mirrors:

rfwiki.org
gruzdoff.ru
dic.academic.ru/dic.nsf/ruwiki/*
www.wikiwand.com

Thanks in advance! --Fastboy (talk) 11:10, 15 August 2015 (UTC)[reply]

Done Nakon 22:51, 15 August 2015 (UTC)[reply]

More exclusions

Freely licensed, available under CC 4.0 (https://creativecommons.org/licenses/by/4.0/deed.ru):

Clones:

--Fastboy (talk) 10:23, 18 August 2015 (UTC)[reply]

Done, thank you! — Earwig ^talk 03:12, 19 August 2015 (UTC)[reply]

Excluding project Gutenberg

We should probably exclude gutenberg.org as a public domain source as well (as I've seen it show up in false positives). Kaldari (talk) 23:18, 24 August 2015 (UTC)[reply]

Done; I added gutenberg.org itself only and no subdomains. — Earwig ^talk 00:18, 25 August 2015 (UTC)[reply]

rfwiki.org

2 Exclusions. --Pessimist 10:00, 9 September 2015 (UTC)

[1]

Done. — Earwig ^talk 19:26, 13 September 2015 (UTC)[reply]

Exclusion

http://wreferat.baza-referat.ru Copies articles from Russian Wikipedia.--Arbnos (talk) 18:08, 15 September 2015 (UTC)[reply]

Done. — Earwig ^talk 18:32, 15 September 2015 (UTC)[reply]

Add http://enzyklo.de

(Search engine) -- FriedhelmW (talk) 18:40, 1 October 2015 (UTC)[reply]

Done — Earwig ^talk 20:25, 1 October 2015 (UTC)[reply]

Exclusion

http://ensiklopedia.ru/wiki/* Copies articles from Russian Wikipedia.--Arbnos (talk) 00:02, 8 November 2015 (UTC)[reply]

Done. — Earwig ^talk 00:28, 8 November 2015 (UTC)[reply]

Add http://encyclo.co.uk/

Thanks! -- FriedhelmW (talk) 19:01, 8 November 2015 (UTC)[reply]

Done. — Earwig ^talk 21:36, 8 November 2015 (UTC)[reply]

Add http://research.omicsgroup.org/

This website seems to copy Wikipedia articles in the format hxxp://research.omicsgroup.org/index.php/ARTICLENAME. Example: http://research.omicsgroup.org/index.php/Statue_of_liberty for Statue of Liberty. epic genius (talk) 02:25, 12 November 2015 (UTC)[reply]

Done. — Earwig ^talk 10:00, 12 November 2015 (UTC)[reply]

Add http://worldebooklibrary.net/articles/

Seems to be found a lot lately. Collect (talk) 23:24, 22 November 2015 (UTC)[reply]

Done. — Earwig ^talk 10:23, 23 November 2015 (UTC)[reply]

Exclusion

http://wikigraff.ru/ Copies articles from Russian Wikipedia.--Arbnos (talk) 11:08, 27 November 2015 (UTC)[reply]

Done. — Earwig ^talk 14:53, 27 November 2015 (UTC)[reply]

Add http://nosmut.com/

This website seems to copy articles from Wikipedia... I think, although the home page is a bit weird. Anyway, http://nosmut.com/New_York_City_Subway.html, for example, duplicates New York City Subway. epic genius (talk) 04:25, 1 December 2015 (UTC)[reply]

Done. A strange site indeed. — Earwig ^talk 04:29, 1 December 2015 (UTC)[reply]

richestcelebrities.org

Appears to use Wikipedia for some material - alas. Compare Death of Antonio Calvo to http://richestcelebrities.org/richest-actors/antonio-calco-net-worth/ . Collect (talk) 16:18, 22 December 2015 (UTC)[reply]

Also add libreriauniversitaria.it for similar use of Wikipedia articles. Thanks. Collect (talk) 16:22, 22 December 2015 (UTC)[reply]

First is done. @Collect: can you give an example of the second copying Wikipedia content? I can't find any. — Earwig ^talk 21:52, 22 December 2015 (UTC)[reply]

http://www.libreriauniversitaria.it/tom-riall-betascript-publishing/book/9786133017016 (appears to use material from Betascript Publishing, a repackager of Wikipedia - perhaps the filter should simply look for "betascript"?) should suffice. Collect (talk) 22:24, 22 December 2015 (UTC)[reply]

Done. — Earwig ^talk 23:14, 22 December 2015 (UTC)[reply]

Excluding some sites

This is probably me not knowing how to use the Copyvio % tool. I am having difficulty with Our Lady of the Good Success; it contains material virtually identical to http://[add www.]fisheaters.com/forums/index.php?topic=3468895.0, but the latter site copies Wikipedia and puts, without comment, https://wiki.riteme.site/wiki/Nuestra_Se%C3%B1ora_del_Buen_Suceso_de_Para%C3%B1aque at the bottom. I tried to add the site to Wikipedia:Mirrors and forks/Mno (section), but this was not accepted as fisheaters.com is blacklisted. So the question I ask, which is due either to me missing something obvious, a bit of documentation that needs adding, or even an actual limitation of the Copyvio tool: How can specific sites be excluded, based on user requirements?

I have recently started using this tool; a scenario I find is that a site which is copied from Wikipedia obviously comes up as the prime suspect, obviously shouldn't be considered, and masks true copyedits from other sites. I would like to be able to click a found site (without going through the procedure of adding it to a permanent list) and for it then to be ignored in subsequent searches. Is this possible somehow? Also, it might make sense to exclude from the search (perhaps controlled by an option) sites which are blacklisted by Wikipedia.

Apologies for taking your time with what is probably user inexperience of the tool, best wishes, and congratulations on a clever and much-needed tool, Pol098 (talk) 12:45, 24 December 2015 (UTC) P.S. I can't save this message with a valid link to fisheaters. com in it! Surely this should be OK on a Talk page?[reply]

@Pol098: this will happen frequently for pages that have an established history, where they are widely copied around the Internet. The tool works best on new pages. In this case, I don't usually like adding mirrors to this list that only mirror a single page, because then the list would become too large and unmaintainable. You should be able to simply ignore the mirror result and review any other matches the tool finds, using the direct compare option. — Earwig ^talk 19:22, 24 December 2015 (UTC)[reply]

Thanks for the response. I take your point about new pages, which isn't what I've been looking at. I did add one site to your list; I got the impression that it had a lot of material from Wikipedia, not just a one-off (if I remember rightly, I was editing several articles); you may wish to remove it, in which case I apologise for the unwanted addition. As a comment (no need to respond), a page with a 95% match because it's a copy from Wikipedia is a nuisance, and can mask others. The comparison often reports to the effect that no more pages will be compared because there are already a lot of hits. What I would like to see, but may not be of general use or sensible to implement, is a way to implement a one-off temporary list of sites not to be checked in this particular case, and/or a way to tag listed sites with matches so that they are excluded from a later run. I've found an awful lot of long-standing pages with many edits that are crammed with swathes of copied text. I haven't been seriously using the tool for long, and may be talking nonsense: if so, ignore. Best wishes, Pol098 (talk) 19:48, 24 December 2015 (UTC)[reply]

hearplanet.com

http://www.hearplanet.com/article/930747 uses Wikipedia as a source Collect (talk) 09:39, 31 December 2015 (UTC)[reply]

Done. — Earwig ^talk 09:47, 31 December 2015 (UTC)[reply]

Also note a bunch of users on youtube.com quote Wikipedia a lot - and as youtube is not a reliable source in any event - probably should be excluded. Collect (talk) 09:40, 31 December 2015 (UTC)[reply]

I don't think this is a good blanket exclusion. Being a reliable source is mostly irrelevant (ELPEREN aside); it's more about whether the site has the potential for people to copy from it, and I think the answer there is yes. You're probably right that the reverse is much more common, but I'd rather people need to wade through some false positives than get a false negative. — Earwig ^talk 09:47, 31 December 2015 (UTC)[reply]

Wondering - could one simply add "wikipedia" to be a marker for not showing a result? Many of these do have "wikipedia" somewhere in their source code <g>. Collect (talk) 09:43, 31 December 2015 (UTC)[reply]

Right now it automatically excludes pages that link back to Wikipedia—I skipped just a text search to avoid rare false negatives—but that might be worth looking into more... — Earwig ^talk 09:47, 31 December 2015 (UTC)[reply]

Hold on, I noticed hearplanet.com does that already. That's a mistake... will take a look in the morning. — Earwig ^talk 09:48, 31 December 2015 (UTC)[reply]

Long morning. Fixed now; auto-excluding pages that link directly back to the article being searched. Still a bit strict, but should help somewhat. — Earwig ^talk 10:26, 15 January 2016 (UTC)[reply]
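The rule described above (skip a result page only when it links directly back to the article being searched) can be sketched roughly as follows. This is a hedged illustration, not the tool's actual code: `links_back_to_article` is a hypothetical helper, the regex-based link extraction is a simplification of real HTML parsing, and the `<lang>.wikipedia.org/wiki/<Title>` URL shape is an assumption.

```python
import re
from urllib.parse import unquote

# Simplified href extraction; a real crawler would use an HTML parser.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def links_back_to_article(html: str, lang: str, title: str) -> bool:
    """True if the page links to the exact article being checked (hypothetical)."""
    wanted = f"{lang}.wikipedia.org/wiki/{title.replace(' ', '_')}"
    for href in HREF_RE.findall(html):
        # Normalise the link: drop the scheme, fragment and query string,
        # then decode percent-escapes so non-ASCII titles compare equal.
        link = re.sub(r"^(https?:)?//", "", href, flags=re.IGNORECASE)
        link = unquote(link.split("#")[0].split("?")[0])
        if link == wanted:
            return True
    return False

page = '<p>Source: <a href="https://en.wikipedia.org/wiki/Brooklyn_Navy_Yard">Wikipedia</a></p>'
print(links_back_to_article(page, "en", "Brooklyn Navy Yard"))    # True
print(links_back_to_article(page, "en", "Brooklyn Bridge Park"))  # False
```

Requiring a link to the specific article, rather than any mention of "wikipedia", is what keeps this check strict: a page that merely cites some other Wikipedia article is still reported.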

turnitin

Alas - gives all the false positives which this list tries to avoid - can its results be tweaked to avoid long lists of "99% of words copied"? In fact maybe the folks there should be given the suggestion that once "wikipedia" is referenced in the source, that it not be listed as a violation separately from the Wikipedia violation? Collect (talk) 16:47, 22 January 2016 (UTC)[reply]

I'll look into this; I really want to change the way the turnitin output is shown. Ideally we just use it as a source for URLs like the search engine, but the reason the WMF chose not to do that when submitting their patch is that many turnitin results are behind paywalls. — Earwig ^talk 20:14, 22 January 2016 (UTC)[reply]

datab.us

http://datab.us/i/Wichita,%20Kansas is very suspiciously like the Wikipedia article (like about 100%) - and gives no attribution. It does use commons images - meaning I have no doubt this is an unattributed copy of Wikipedia. Thanks. Collect (talk) 21:59, 4 February 2016 (UTC)[reply]

Added. — Earwig ^talk 00:19, 5 February 2016 (UTC)[reply]

wikitree.com

In fact, can you do a general exclusion of all sites beginning with "wiki" at all? There appear to be a bunch of them, and it might save some time in the exclusion process. Thanks. Collect (talk) 13:06, 14 February 2016 (UTC)[reply]

I don't know. Would need to do further research on how often sites with "wiki" in their name are not mirroring Wikipedia, because it could reasonably happen, though because the tool reports what sites were excluded now it's less likely to be an issue. — Earwig ^talk 20:34, 14 February 2016 (UTC)[reply]

my-definitions.com

Please add http://my-definitions.com/fr/definition/, which copies a lot of French WP articles (e.g. [2] and analysis [3]). Thank you --Framawiki (talk) 14:52, 1 April 2016 (UTC)[reply]

Done. — Ǝɐɹʍıƃ ^ʇɐlʞ 21:15, 1 April 2016 (UTC)[reply]

Add http://fr.academic.ru/dic.nsf/frwiki/*

Can you add this website to the exclusions: http://fr.academic.ru/dic.nsf/frwiki/* ? Thanks! --Bastenbas (talk) 13:39, 2 April 2016 (UTC)[reply]

Done. — Earwig ^talk 15:14, 2 April 2016 (UTC)[reply]

Add http://lanimalchat.com/index.html

Hello, this website uses content from French Wikipedia --Bastenbas (talk) 12:41, 3 April 2016 (UTC)[reply]

Done. — Earwig ^talk 19:46, 3 April 2016 (UTC)[reply]

Gutenberg.us

Uses "World Heritage Encyclopedia" which is Wikipedia as a source. Considered a "sham encyclopedia". Collect (talk) 22:54, 21 May 2016 (UTC)[reply]

Done. — Earwig ^talk 04:22, 22 May 2016 (UTC)[reply]

livingnewdeal.org/projects/ariel-rios-federal-building-murals-washington-dc/

Uses Wikipedia - but as a URL which might not necessarily be caught in "source Notes". Is the new check for "Wikipedia" on a page going to pick these URLs up? Thanks! Collect (talk) 23:27, 21 May 2016 (UTC)[reply]

I prefer to leave these cases for manual review, but maybe I'll try something more greedy. — Earwig ^talk 04:23, 22 May 2016 (UTC)[reply]

nekropole.info/en/Cary-Grant

Presumably uses a lot more - the phrase "Creative Commons" might actually be better than a simple look for "Wikipedia" on such sites, as this one only uses that phrase.

www.rusc.com/old-time-radio/Cary-Grant.aspx?t=256 actually credits Wikipedia. Collect (talk) 12:54, 2 June 2016 (UTC)[reply]

allstarpics.famousfix.com/pictures/chloe-madeley

Credits "wiki.riteme.site". Collect (talk) 14:45, 3 June 2016 (UTC)[reply]

Add youtube

Should youtube be added? It seems quite unlikely that an article would be copied from a youtube description or comment (but many Youtube videos, e.g. trailers and such, use Wikipedia articles in their descriptions) Intelligentsium 13:26, 30 June 2016 (UTC)[reply]

I don't think so, per my comments above in #hearplanet.com. Such a match generally warrants investigation. Exclusions are best suited for things that are always mirrors. — Earwig ^talk 13:27, 30 June 2016 (UTC)[reply]

Special:Diff/802116377

Will you be able to resolve this false result in which the two links are copying the Wikipedia article? -- 1989 17:14, 24 September 2017 (UTC)[reply]

add wikimapia?

http://wikimapia.org/terms_reference.html — Preceding unsigned comment added by Sergkarman (talk • contribs) 16:46, 19 February 2018 (UTC)[reply]

https://kids.kiddle.co/Brooklyn_Navy_Yard

@The Earwig: This website copied an earlier version of Brooklyn Navy Yard and actually credits Wikipedia. I'm not sure if there are other articles on the same site that copy from Wikipedia as well. epicgenius (talk) 18:30, 20 October 2018 (UTC)[reply]

Thanks, looks like a lot of WP-based articles, added. — Earwig ^talk 18:39, 20 October 2018 (UTC)[reply]

Add https://www.govinfo.gov/

Can this be added to the exclusion list? — Preceding unsigned comment added by Pdxdoglover (talk • contribs) 00:03, 21 January 2019 (UTC)[reply]

Pdxdoglover (talk) 21:23, 21 January 2019 (UTC)[reply]

vk.com

It's a Russian Facebook; users freely copy information from Wikipedia, which skews the Copyvio rate. For example, when I wrote this article in August 2017 https://tools.wmflabs.org/copyvios/?lang=ru&project=wikipedia&title=Пудовкин,_Денис_Евгеньевич , the rate of confidence was 14.5%, but then it rose to 67% after some vk.com user quoted the article extensively in one of her posts in September 2018. Could you add vk.com to the exclusion list, please? Arbeite19 (talk) 12:07, 9 April 2019 (UTC)[reply]

https://www.toursandtravel.app/en/points-of-interests/new-york/brooklyn-bridge-park/85

@The Earwig: I think the above page is almost entirely copy-pasted from an early version of Brooklyn Bridge Park without attribution. Compare the article version from 2015 and the above linked website. The only thing the other website did is to remove the "History" section. epicgenius (talk) 01:17, 12 July 2019 (UTC)[reply]

Yep, it's copying multiple articles. Added, thanks. — Earwig ^talk 04:15, 12 July 2019 (UTC)[reply]

https://www.cruisebe.com/morningside-park-new-york-city-ny

@The Earwig: this looks like it was sloppily copied from a previous version of Morningside Park (Manhattan). It even says at the bottom: Source: https://wiki.riteme.site/wiki/Morningside_Park_(New_York_City) epicgenius (talk) 20:09, 30 July 2019 (UTC)[reply]

Added, thanks! — Earwig ^talk 00:48, 31 July 2019 (UTC)[reply]

http://www.nyc-architecture.com/HAR/HAR002.htm and related pages

@The Earwig: This website has likely copied old versions of Wikipedia pages without attribution. For instance,

Cathedral of St. John the Divine - nyc-architecture, versus our article in 2007, comparison seen here. The nyc-architecture website copied the footnote number but removed any maintenance tags. There was a 98% match.
Grand Central Terminal - nyc-architecture versus our article in 2007, comparison seen here. The nyc-architecture website still has the reference numbers and "citation needed" tag. There was a 98% match (again).
St. Patrick's Cathedral (Manhattan) - nyc-architecture versus our article in 2007, comparison seen here. The nyc-architecture website still has the "citation needed" tag. There was a 98% match (again).
- More examples of this sort can be found by Google search: https://www.google.com/search?q=%5Bcitation+needed+site%3Anyc-architecture.com&oq=%5Bcitation+needed+site%3Anyc-architecture.com&aqs=chrome..69i57.5797j0j1&sourceid=chrome&ie=UTF-8

It's very likely that the other website copied from Wikipedia, since very few other websites have a need to use the citation needed tag, and since there is such similarity between each of the pages from 2007. Granted, this website still has original content, but I am more concerned about the false positives from Wikipedia. epicgenius (talk) 05:40, 1 December 2019 (UTC)[reply]

Added; thanks for your investigation! — Earwig ^talk 06:06, 1 December 2019 (UTC)[reply]

http://worddisk.com/wiki/

This appears to be a mirror of Wikipedia without attribution: it even has our main page at http://worddisk.com/wiki/search Caeciliusinhorto-public (talk) 14:49, 16 January 2020 (UTC)[reply]

Done Darylgolden^(talk) Ping when replying 04:21, 8 April 2020 (UTC)[reply]

Onlineradiobox

This site copied content from Udaya Geetham, and should therefore be excluded from EarwigBot. --Kailash29792 (talk) 17:21, 9 February 2020 (UTC)[reply]

Not done. That site is copying from a copyrighted source, so neither should be excluded. Crow^Caw 15:14, 30 April 2020 (UTC)[reply]

British and Irish Legal Information Institute (BAILII)

Earwig is flagging articles that quote from Irish Supreme Court case decisions hosted on BAILII. BAILII (here) and the Irish Courts Service (here) allow for direct quotation. British decisions can also be quoted. Could BAILII be removed from Earwig? AugusteBlanqui (talk) 13:42, 30 April 2020 (UTC)[reply]

Is there a good subdomain of the site for just the court decisions? In addition to allowing quotes of the court decisions, it also states: "The copyright in the text of legislation and judgments displayed on BAILII's website may belong to courts, other government bodies, judges, and/or to commercial publishers. BAILII cannot authorize any copying of such material." So if we whitelist the whole BAILII site we may miss catching some of those other cases. Crow^Caw 14:10, 30 April 2020 (UTC)[reply]

@Crow: This subdomain is safe to whitelist: https://www.bailii.org/ie/cases/IESC/ and this one: https://www.bailii.org/ie/cases/IEHC/ Thanks! AugusteBlanqui (talk) 14:21, 30 April 2020 (UTC)[reply]

Added. I note that their re-use policy just says "re-use", which is a little ambiguous. So to avoid any issues, please always quote the text (rather than incorporating it directly) and cite the web sites. But this should stop the Earwig matches. To other CopyPatrol users, this will not stop EranBot from flagging these, which is probably a good thing. Crow^Caw 15:11, 30 April 2020 (UTC)[reply]

Thanks @Crow:. We will cite/quote from BAILII. A question, if you don't mind, on how subdomains work for Earwig. So https://www.bailii.org/ie/cases/IESC/ is the landing page for Irish decisions. All the Irish decisions have web addresses that start after the IESC, for example https://www.bailii.org/ie/cases/IESC/2007/S28.html . Is that what it means to whitelist a subdomain? The pages 'below' that IESC address are included (technology not my strongest domain). AugusteBlanqui (talk) 16:43, 30 April 2020 (UTC)[reply]

Yes that entry should whitelist everything after the trailing / in the url. If it doesn't, let me know. Thanks! Crow^Caw 16:47, 30 April 2020 (UTC)[reply]

Historic American Engineering Record articles hosted on nycsubway.org

@The Earwig: The following pages on nycsubway.org copy from the Historic American Engineering Record, a public domain source, and may bring up false positives.

May I request that only these specific pages be added to the exclusion list? Epicgenius (talk) 18:24, 30 December 2020 (UTC)[reply]

Done — The Earwig ^talk 19:15, 30 December 2020 (UTC)[reply]

Please add https://handwiki.org

It's a mirror site. Sudonet (talk) 09:13, 7 January 2021 (UTC)[reply]

Done — The Earwig ^talk 05:02, 8 January 2021 (UTC)[reply]

google-info.org

I thought I'd added google-info.org with this edit 12 April, but amp.en.google-info.org is still sullying the copyvio results (odd that the mighty corporate hand of Google hasn't yet come down to smite them). Should I have done something differently? BlackcurrantTea (talk) 16:19, 18 April 2021 (UTC)[reply]

@BlackcurrantTea: I fixed this last week, but didn't notice you had brought it up here. Bug on my end. — The Earwig (talk) 02:24, 14 May 2021 (UTC)[reply]

Exclusion

http://wikiorg.ru/wiki/*, because it's a clone of Russian Wiki. 78.37.129.71 (talk) 19:43, 13 May 2021 (UTC)[reply]

Done. — The Earwig (talk) 02:24, 14 May 2021 (UTC)[reply]

please add https://wordsimilarity.com

Could someone please add https://wordsimilarity.com/? It appears to be using Wikipedia directly, in, for example, https://wordsimilarity.com/en/avolition, messing with the Copyvio detector. Thanks!

EDIT: I also found https://eng.ichacha.net/zaoju/ , which seems to be sourcing text from Wikipedia for at least some of its pages, as well as https://en.glosbe.com/, which seems to often source from something called "WikiMatrix" (I haven't really looked into it).

Yitz (talk) 18:33, 14 May 2021 (UTC)[reply]

Done, added the three. — The Earwig _alt (talk) 18:48, 14 May 2021 (UTC)[reply]

Add spellchecker.net

It's not a mirror, but it seems to scrape random chunks of text from articles. Cheers, Estheim (talk) 22:47, 19 September 2021 (UTC)[reply]

Added, thanks. — The Earwig (talk) 01:55, 20 September 2021 (UTC)[reply]

Wikipedia:Mirrors and forks

Please add all of the WP mirrors listed under Wikipedia:Mirrors and forks. Thank you Jamplevia (talk) 23:52, 2 November 2021 (UTC)[reply]

@Jamplevia, this is already done: read the first line of the exclusions page. If there's a particular mirror you're still getting results for, it may be getting parsed incorrectly by the tool; if so, please indicate which. — The Earwig (talk) 02:52, 3 November 2021 (UTC)[reply]

nina.az

Hi,

I've got a question about "url = http://wikipedia.*.nina.az/". The tool still includes URLs starting with wikipedia.de.nina.az (see for example Copyvios Gioia) but the regular expression is supposed to ensure that these URLs are excluded. I didn't add the regex, but can anybody figure out why it doesn't work? I've tried matching the regex and the URL with re.match() in Python and there it worked. Thanks in advance. --CaroFraTyskland (talk) 09:30, 7 November 2021 (UTC)[reply]
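One possible explanation, which nobody in this thread confirms: the first section above notes that rules are not regular expressions without the re: prefix, so a `*` in the middle of a hostname may simply be compared literally. The sketch below contrasts three readings of the same rule. Which reading the tool actually applies is an assumption here, not something stated on this page.

```python
import re
from fnmatch import fnmatchcase

rule = "http://wikipedia.*.nina.az/"
url = "http://wikipedia.de.nina.az/Gioia"

# 1. Read as a regular expression (what re.match() does): "." matches any
#    character and ".*" matches any run of characters, so the rule matches.
as_regex = re.match(rule, url) is not None

# 2. Read as a literal prefix: the "*" would have to appear verbatim in the
#    URL, so the rule does not match.
as_literal = url.startswith(rule)

# 3. Read as a shell-style wildcard applied to the hostname only.
as_wildcard = fnmatchcase("wikipedia.de.nina.az", "wikipedia.*.nina.az")

print(as_regex, as_literal, as_wildcard)  # True False True
```

If the literal reading is the one in effect, a successful `re.match()` test would not predict the tool's behaviour, and the rule would need the re: prefix (with the dots escaped) to act as a pattern.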

Please add https://hmong.ru/

It's another mirror site. Thanks, SamWilson989 (talk) 23:12, 29 May 2022 (UTC)[reply]

And https://wiki2.net too 92.242.69.182 (talk) 19:01, 29 June 2022 (UTC)[reply]

Additions

If I spot sites that seem to be plagiarising Wikipedia without attribution should I just add them to the list and forget about them or should they be reported somewhere else for the Wikimedia Foundation to lean on?

For context, I was cleaning up Nick Weir and I found:

Both of these seem to be (badly) processed from old versions of our articles, possibly by an AI. Which category should those go in? They are not exactly mirrors. DanielRigal (talk) 15:21, 20 August 2022 (UTC)[reply]

Please add http://ikonysrebrnegoekranu.blogspot.com/

This blog (http://ikonysrebrnegoekranu.blogspot.com/2017/01/) contains a nearly plagiarized version of the article from Polish-language Wikipedia (https://pl.wikipedia.org/wiki/Popi%C3%B3%C5%82_i_diament_(film)), which was expanded back in 2012/2013; meanwhile, Copyvio returns a false 96% plagiarism score on the Wikipedia side. Ironupiwada (talk) 12:10, 5 September 2022 (UTC)[reply]

Please add https://frwiki.wiki/

This is another mirror site of Wikipedia. Ironupiwada (talk) 12:21, 5 September 2022 (UTC)[reply]

Please add https://timenote.info/

Another fork of Wikipedia, it even mentions Wikipedia as a source, which leads to false positive Copyvio results (like here). Ironupiwada (talk) 12:58, 5 September 2022 (UTC)[reply]

Please add https://wiki.edu.vn/wiki25/

Fork, it mentions wikipedia as a source. Friniate^talk 15:30, 12 October 2022 (UTC)[reply]

Please add latitude.to

Please add latitude.to to the exclusion list. It's not exactly a mirror, more of a Wikipedia link farm, but it still gave me a false positive. It's already on the link spam blacklist, as I found when I tried to add a link to this comment as an example :) Apocheir (talk) 03:13, 8 April 2023 (UTC)[reply]

Scrapes of wiki pages

It seems that many of the sites in the blacklist are there because they are scraping the wiki. As these appear and disappear frequently, I suspect this leads to a lot of maintenance workload. I'm wondering if there is not another solution to the problem that is based on back-testing?

The example that led me here is this one for toroidal solenoid:

Earwig's Copyvio Detector indicates a similarity to an article on Zeta (fusion reactor) in Hellenicaworld. Can we be certain that the Toroidal solenoid article predates the Hellenicaworld article? Nolabob (talk) 12:10, 28 July 2023 (UTC)[reply]

That Zeta article is a copy of the one I wrote here on the Wiki some years ago. My new article does indeed have bits in common with Zeta, and that is entirely deliberate, both are early UK fusion systems. The copyvio between the two pages here on the wiki is of course suppressed, but not the one with this 3rd party scrape.

It would seem that this could be avoided by testing to see if the external hit is a scrape. In this case, it would match to some very high degree, and thus be "likely a wikipedia scrape". This would require two matches on each possible hit, and I'm not sure what that would do to the performance, but I think it might avoid a lot of false positives? Maury Markowitz (talk) 17:58, 28 July 2023 (UTC)[reply]
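The second comparison proposed above could look roughly like this. It is a minimal sketch under stated assumptions: `likely_wikipedia_scrape` is a hypothetical helper, word-level `difflib` similarity stands in for whatever comparison the tool really uses, the 0.9 threshold is arbitrary, and how the candidate Wikipedia texts would be chosen is left open.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Word-level similarity in [0, 1]; autojunk is off so long texts are not skewed."""
    return SequenceMatcher(None, a.split(), b.split(), autojunk=False).ratio()

def likely_wikipedia_scrape(external_text, wiki_texts, threshold=0.9):
    """Second pass: does the external hit almost entirely match some Wikipedia page?"""
    return any(similarity(external_text, wiki) >= threshold for wiki in wiki_texts)

# Made-up stand-ins for an existing Wikipedia article and two external hits.
wiki_zeta = "ZETA was a major experiment in the early history of fusion power research"
scraped_copy = wiki_zeta  # a verbatim copy hosted elsewhere
unrelated = "The quick brown fox jumps over the lazy dog"

print(likely_wikipedia_scrape(scraped_copy, [wiki_zeta]))  # True
print(likely_wikipedia_scrape(unrelated, [wiki_zeta]))     # False
```

As the comment notes, this doubles the number of comparisons per hit, so the performance cost would need measuring before anything like it were adopted.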

Good idea, perhaps at least having a leaderboard with the sorted list of domains by match would be a good start to discover easily new mirrors. Thanks, Framawiki (please notify me when you reply) 19:31, 28 July 2023 (UTC)[reply]

@Framawiki: Oh, yes, that might be a great intermediate solution. Maury Markowitz (talk) 21:00, 28 July 2023 (UTC)[reply]
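The leaderboard idea can be sketched in a few lines. This is only an illustration: `domain_leaderboard` is a hypothetical helper, the `(url, confidence)` result shape and the 0.9 threshold are assumptions, and the sample domains are made up.

```python
from collections import Counter
from urllib.parse import urlsplit

def domain_leaderboard(results, threshold=0.9):
    """Count high-confidence matches per domain, most frequent first."""
    counts = Counter()
    for url, confidence in results:
        if confidence >= threshold:
            host = (urlsplit(url).hostname or "").lower()
            if host.startswith("www."):
                host = host[4:]  # treat www.example and example as one domain
            counts[host] += 1
    return counts.most_common()

# Made-up results: (matched URL, match confidence between 0 and 1).
results = [
    ("https://mirror.example/wiki/A", 0.98),
    ("http://www.mirror.example/wiki/B", 0.95),
    ("https://news.example/story", 0.35),
    ("https://scraper.example/x", 0.97),
]
print(domain_leaderboard(results))  # [('mirror.example', 2), ('scraper.example', 1)]
```

Domains that repeatedly appear near the top with very high confidence would be candidates for manual review, not automatic exclusion.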

Add https://www.populartimelines.com/

Says on the second search bar that it uses wikipedia as a source. I have also found this article https://medium.com/@populartimelines/timelines-of-famous-people-events-companies-and-more-726de9cb8950, but I'm unsure how reliable this website is. 2001:8F8:1123:D698:493A:CC2:EDDC:5AED (talk) 08:28, 5 November 2023 (UTC)[reply]

Done. LittlePuppers (talk) 16:36, 5 November 2023 (UTC)[reply]

Add vintageisthenewold.com

Noticed here during a DYK nom. Seems to copy information verbatim from sites including Wikipedia. IceWelder [✉] 09:45, 14 November 2023 (UTC)[reply]

@IceWelder: some of their content is definitely copied from WP, but there's also a lot that is not; I'm a bit hesitant to put it on the list because I can't see a good way to isolate just the copied-from-Wikipedia pages. LittlePuppers (talk) 18:38, 14 November 2023 (UTC)[reply]

There is no original writing, so if it does contain something from a third-party source that happens to be infringed on, surely the tool would also find the original source? IceWelder [✉] 00:59, 15 November 2023 (UTC)[reply]

Should we add https://www.gbif.org/ ?

There's a website, https://www.gbif.org/, that publishes free information. It even uses Wikipedia articles on URLs beginning with https://gbif.org/species/. Myrealnamm's Alternate Account (talk) 19:23, 10 December 2024 (UTC)[reply]