User:Novem Linguae/Essays/Copyvio detectors
This is a summary of enwiki's various copyright violation detector bots and tools.
Detection via Google searches
Earwig copyvio detector
- https://copyvios.toolforge.org/
- maintainer: The Earwig, Chlod
- source code: https://github.com/earwig/copyvios
- last commit: 3 years ago
- tech: Python
- uses the Google Search API and the WMF EranBot Turnitin API
- Google Search API
  - WMF pays for credits
    - no discount (NPerry (WMF) used to work on Wikimedia's partnership with Google, maybe this is something worth bringing up?)
  - hard daily limit (the maximum for any user of this API) of 10,000 queries per day
    - costs US$50 per day
  - makes up to 8 queries per page
    - 2,000ish checks per day (not all checks use all 8 queries)
  - as of August 2024, the quota is being exhausted around hour 12 of the 24-hour day
    - AI scraping bots may be to blame for this higher-than-normal usage
    - to counter this, there are plans to require login / implement OAuth
  - Google has the best breadth of search coverage
    - Bing might be a reasonable backup, but is not as good
    - the tool used Yahoo until Yahoo ended its free service
    - Yandex has been looked into, but its English coverage isn't great
    - someone had the idea of adding The Wikipedia Library / EBSCO as another search backend, but discussions with EBSCO stalled
- has issues with concurrent queries
- uptime report: https://stats.uptimerobot.com/BN16RUOP5/784331770
- false positive handling via a community-maintained exclusion list at User:EarwigBot/Copyvios/Exclusions
- previous WMF contacts: Kaldari, Runab WMF, DTankersley (WMF)
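The Google Search API quota figures listed earlier in this section can be sanity-checked with some quick arithmetic. This is only a back-of-the-envelope sketch: the input numbers come from the notes above, but the function names are made up for illustration, the "average queries per check" figure assumes the roughly 2,000 daily checks consume the whole quota, and the "implied demand" figure assumes traffic is spread evenly across the day. Neither assumption is confirmed by the tool's maintainers.

```python
# Inputs taken from the notes above.
DAILY_QUERY_LIMIT = 10_000       # hard daily limit on Google Search API queries
DAILY_COST_USD = 50              # cost per day at that limit
MAX_QUERIES_PER_CHECK = 8        # the tool makes up to 8 queries per page
APPROX_CHECKS_PER_DAY = 2_000    # "2,000ish" checks per day
QUOTA_EXHAUSTED_AT_HOUR = 12     # as of August 2024


def cost_per_thousand_queries(limit: int, cost_usd: float) -> float:
    """US$ per 1,000 queries implied by the daily limit and daily cost."""
    return cost_usd * 1000 / limit


def worst_case_checks(limit: int, max_queries: int) -> int:
    """Checks per day if every check used the maximum number of queries."""
    return limit // max_queries


def average_queries_per_check(limit: int, checks: int) -> float:
    """Average queries per check, ASSUMING the checks use up the whole quota."""
    return limit / checks


def implied_daily_demand(limit: int, exhausted_at_hour: int) -> float:
    """Queries per day that would be needed, ASSUMING evenly spread traffic."""
    return limit * 24 / exhausted_at_hour


if __name__ == "__main__":
    print(cost_per_thousand_queries(DAILY_QUERY_LIMIT, DAILY_COST_USD))      # 5.0
    print(worst_case_checks(DAILY_QUERY_LIMIT, MAX_QUERIES_PER_CHECK))       # 1250
    print(average_queries_per_check(DAILY_QUERY_LIMIT, APPROX_CHECKS_PER_DAY))  # 5.0
    print(implied_daily_demand(DAILY_QUERY_LIMIT, QUOTA_EXHAUSTED_AT_HOUR))  # 20000.0
```

Under those assumptions, running out of quota at hour 12 would mean demand is roughly double what the hard limit allows, which fits with the plan to restrict access via login / OAuth rather than simply buying more credits.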
Google API Proxy
- used by Earwig copyvio detector to access the Google API
- https://openstack-browser.toolforge.org/project/google-api-proxy
- wikitech:Nova_Resource:Google-api-proxy
- maintainer: MusikAnimal
- purpose: this proxy uses a static IP, and there appears to be an IP whitelist on the Google API side, so I guess this proxy increases security?
Detection via Turnitin
CopyPatrol (rewrite)
Frontend
- https://copypatrol.wmcloud.org/en
- maintainer: WMF Community Tech team (most active recent committer: MusikAnimal)
- source code: https://github.com/wikimedia/CopyPatrol
- last commit: 2 months ago
- tech: Symfony (PHP)
- replaced https://copypatrol.toolforge.org/en
- is mostly a viewer for an SQL database that the copyright detection bot(s) below writes to
- users can mark pages/revisions as fixed or as requiring no action (however, this information is not reflected on enwiki)
- there is a "compare" feature in the CopyPatrol interface; clicking it sends an API query to the Earwig tool above
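To make the "compare" hand-off more concrete, here is a rough sketch of how a client might build a comparison request to the Earwig tool. I have not checked this against CopyPatrol's source. The endpoint path (`/api.json`) and the parameter names (`version`, `action`, `project`, `lang`, `title`, `url`) are assumptions based on my recollection of the Earwig tool's public API, and `build_compare_url` is a made-up helper name. The sketch only builds the URL and does not send any request.

```python
from urllib.parse import urlencode

# ASSUMPTION: endpoint path of the Earwig copyvio detector's JSON API.
API_ENDPOINT = "https://copyvios.toolforge.org/api.json"


def build_compare_url(title: str, source_url: str,
                      lang: str = "en", project: str = "wikipedia") -> str:
    """Build a URL asking the Earwig tool to compare a wiki page to one source URL.

    All parameter names below are assumptions, not confirmed from the source code.
    """
    params = {
        "version": 1,          # assumed API version parameter
        "action": "compare",   # assumed action name for a page-vs-URL comparison
        "project": project,
        "lang": lang,
        "title": title,
        "url": source_url,
    }
    # urlencode percent-encodes the values (spaces become "+").
    return f"{API_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_compare_url("Example article", "https://example.com/page"))
```

Note that a comparison against a single known URL should not need a search engine query, so (if that is right) the "compare" feature would not eat into the Google quota described above.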
Backend
- bot name: CopyPatrolBot
- BRFA: Wikipedia:Bots/Requests for approval/CopyPatrolBot
- maintainer: JJMC89
- source code: https://github.com/JJMC89/copypatrol-backend
- last commit: 2 months ago
- tech: Python
- rewrite of EranBot's copyright tasks
CopyPatrol (original; undeployed)
- frontend: wikimedia-slimapp
- backend: EranBot
See also
- phab:T330435 - I read this and added its contents to this essay
- Wikipedia:Turnitin
- Wikipedia:Village pump (idea lab)#Brainstorming a COPYVIO-hunter bot - I read this and added its contents to this essay
- Wikipedia:WikiProject Articles for creation/AfC Process Improvement May 2018
- Wikipedia:WikiProject Articles for creation/AfC Process Improvement May 2018/Copyvio solutions comparison report
- Wikipedia:Village pump (WMF)/Archive 7#Copyright tool - I read this and added its contents to this essay