
User talk:Rich Smith/Archive89

From Wikipedia, the free encyclopedia


The Signpost: 31 October 2021


ClueBot NG on SqWiki

Hey Rich!

I'm a crat from SqWiki. A user recently showed me ClueBot NG when I asked him for advice on fighting vandalism. Would it be possible to make ClueBot NG work on wikis other than EnWiki? We (and, I believe, a lot of other wikis as well) would be really grateful to benefit from it if that were possible. - Klein Muçi (talk) 00:49, 1 November 2021 (UTC)

@Klein Muçi: it can, but it needs a lot of training data. Pinging @DamianZaremba: to see if he can provide more input on what is required - RichT|C|E-Mail 07:06, 1 November 2021 (UTC)
Yeah, I understand that, because I read how it works. I was thinking of keeping it in a kind of "simulation" mode while it learned (maybe just not giving it the bot flag yet?) and then unleashing it at full power. - Klein Muçi (talk) 11:32, 1 November 2021 (UTC)
I don't think it quite works like that, the bot flag is irrelevant. @Cobi: could maybe assist as well? - RichT|C|E-Mail 11:33, 1 November 2021 (UTC)
At the very least, the bot needs several tens of thousands of randomly sampled main-space edits categorized as good or bad to even have a chance of being reasonably accurate, and ideally more. I also do not speak Albanian, so I couldn't reasonably offer support for false positives or anything like that. The bot itself is open source, and most of the tooling should be in the repo.
It seems that DamianZaremba's been reworking some of the training tooling, but the original training tooling is mostly here. It's a bit of a mess since it is mostly a snapshot of some of our working directories back when we were originally training the bot. The basic idea was there was a MySQL database called EditDB, and it had a table called editset.
Tools like editClassificationToEditDB.php took data in on stdin in the form of "123456 V" or "234567 C" to mark revid 123456 as vandalism and revid 234567 as constructive. Tools like generateXML.php would then emit XML suitable for training the bot's core from the edits in the EditDB. Tools like autodatasetgen.go were built to find other ways of generating classifications, such as checking whether real-world edits were later reverted. This was not as effective as the smaller (but still large) hand-curated datasets.
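For illustration, here is a minimal Python sketch of what that stdin format looks like. The revision IDs and labels below are made-up examples, not real enwiki revisions; the only thing taken from the description above is the "revid, space, V-or-C" line format.

```python
# Sketch of the classification input consumed by editClassificationToEditDB.php:
# one "<revid> <label>" pair per line, where V marks vandalism and
# C marks a constructive edit. The revision IDs here are hypothetical.
labels = {
    123456: "V",  # vandalism
    234567: "C",  # constructive
}

# Build one line per classified revision, in insertion order.
lines = [f"{revid} {label}" for revid, label in labels.items()]

# This output would be piped into editClassificationToEditDB.php on stdin.
print("\n".join(lines))
```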
Finally, after using generateXML.php to generate train.xml, trial.xml, and bayestrain.xml in the editsets directory (we used limit clauses to split the files, with 0-16000 in bayestrain.xml, 16000-60000 in train.xml, and the rest in trial.xml), we then ran trainandtrial.sh to train the bot and get metrics on its efficacy. There are also tools like autotraintrial.php, which attempts to explore reasonable ANN parameters; these are stored in localtoolconfig, along with what we believe to be reasonable values for training datasets of between 50,000 and 100,000 edits.
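The three-way split described above can be sketched in Python. The real tooling used SQL LIMIT clauses against the EditDB table; this just mirrors the same index ranges (0-16000, 16000-60000, remainder), with a 70,000-row list standing in for the editset.

```python
# Hedged sketch of splitting an ordered editset into the three files
# the text describes: rows 0-16000 for Bayesian training, 16000-60000
# for the ANN training set, and everything after 60000 for trials.
def split_editset(edits):
    bayestrain = edits[:16000]       # LIMIT 0, 16000      -> bayestrain.xml
    train = edits[16000:60000]       # LIMIT 16000, 44000  -> train.xml
    trial = edits[60000:]            # the rest            -> trial.xml
    return bayestrain, train, trial

edits = list(range(70000))  # stand-in for 70,000 classified edit rows
bayestrain, train, trial = split_editset(edits)
print(len(bayestrain), len(train), len(trial))  # 16000 44000 10000
```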
If any of that made some sort of sense, you may wish to give it a go. If not, maybe find a bot dev on SqWiki that has time and desire to curate and run a SqWiki version? -- Cobi(t|c|b) 03:54, 5 November 2021 (UTC)
@Cobi, thanks a lot for taking the time to explain the details! I followed every provided link along with your explanations. I saw that there had been almost no changes in the last decade, so I understand it may feel like an "old project" to you. I have a naïve question I couldn't answer from your explanation, though: you say the bot needs around 50k edits (just an example), divided into C and V types, to start its training, which then gets supplemented by reverts and other signals. You also mention "hand-curated datasets". Should I understand that those initial 50k edits (again, just an example) were divided into C and V types manually? If I'm misunderstanding that, how was that initial division made?
The reason I ask is that if there's one thing we (and all the small wikis) lack, it's a large active userbase. We struggle so much with keeping an active working force that this is actually what brought me here. Even after setting up strict edit filters and trying to block vandals fast, the number of pages and changes pending review is so large that it's unmanageable for us. (We lowered it to 0 some time ago, but still...) It's therefore unfortunately very common for changes to await review for months, if not years, before someone actually gets to them. Lately we have been attacked by IP vandals who change just small, trivial details in articles, for example the city where someone was born, the date someone died, or the number of works someone has published. These edits are undetectable by the filters, and the vandals are unblockable for long periods because they're IPs (more than one, and not on the same IP range). This not only lowers the project's overall integrity but also increases the workload for the already nearly non-existent patrollers, which starts a vicious cycle: new patrollers/reviewers become interested in helping, see the extremely large number of pending changes, feel their work won't matter, and leave, which only makes the number grow more. When I asked for help here in dealing with this situation, Xaosflux showed me your bot. It is crucial for us in automating vandalism fighting so we have a chance at reviewing the remaining constructive edits, which may or may not be acceptable by SqWiki standards.
Currently I'm the only person actively doing bot development on SqWiki. I run a bot myself which operates on SqWiki, SqQuote, and LaWiki, but it's a rather simple one built on the Pywikibot framework, plus the occasional AWB changes. I haven't had a chance to work on GitHub yet, even though I have an account there, if I'm not wrong. I can try starting that journey (even though I'm a self-taught coder), but I'd need a lot of guidance along the way. To be honest, what I was expecting was to work on some localization "tables", as I've done with other imported bots in the past (most notably, perhaps, IABot), not to duplicate the code. I fully expected ClueBot's functionality to have been requested by many wikis over its lifetime and i18n infrastructure to already be in place. I was surprised to learn that I may be one of the few (if my understanding is correct) users making a request like this. - Klein Muçi (talk) 10:52, 5 November 2021 (UTC)
@Klein Muçi The links I posted are to the original versions of the files since the original training hasn't changed in the decade or so. The bot itself has been updated more regularly in the bot repo and the core repo. But, yeah, we collected and categorized some of the edits ourselves, and some had been collected by open research projects that have analyzed vandalism on enwiki, and some were crowd-sourced by using a web-interface that let others we trusted categorize edits.
Essentially, at a high level, the bot takes each edit, generates hundreds of statistics about it, and then compares them against the statistics of known good and known bad edits using an Artificial Neural Network. If the edit looks more like the good edits than the bad ones, the bot leaves it alone; otherwise it reverts it. This is essentially what machine learning is.
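The idea Cobi describes can be illustrated with a toy classifier: reduce each edit to a vector of numeric statistics, score it with a trained model, and revert only when the score crosses a vandalism threshold. The feature names, weights, and threshold below are invented for illustration; the real bot uses hundreds of statistics and a genuine neural network rather than this linear stand-in.

```python
# Toy stand-in for ClueBot NG's classification step. Everything here
# (features, weights, threshold) is hypothetical.
FEATURES = ["chars_added", "profanity_hits", "refs_removed"]
WEIGHTS = [-0.001, 2.0, 1.5]   # pretend "learned" weights
THRESHOLD = 0.9                # pretend revert threshold

def vandalism_score(stats):
    # Linear stand-in for the ANN's forward pass over the edit statistics.
    return sum(w * stats[f] for w, f in zip(WEIGHTS, FEATURES))

def should_revert(stats):
    # Revert only when the score looks more like known-bad edits.
    return vandalism_score(stats) > THRESHOLD

good_edit = {"chars_added": 400, "profanity_hits": 0, "refs_removed": 0}
bad_edit = {"chars_added": 20, "profanity_hits": 1, "refs_removed": 1}
print(should_revert(good_edit), should_revert(bad_edit))  # False True
```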
This does, of course, explain why the bot hasn't been localized yet. It needs a completely new data-set for each new wiki it operates on, and no one has taken on that challenge. It's also why ClueBot NG does not operate on English wikis other than en.wikipedia: the data-set needs to be made for the wiki in question, not just the language. For example, an article on the English Wikipedia would look totally different from one on the English Wikinews or Wiktionary, and because the bot works by looking at an edit and trying to determine whether or not it belongs based on its data-set, it would notice the differences. The string tables used for the bot's messages are trivial to localize in comparison with the data-set.
Other projects have asked for ClueBot NG before, but not that often. I've told them essentially what I've told you: The bot is open source, but you have to collect a data-set for it to work. There is also the old version of ClueBot that could potentially be used and updated, but its functionality was limited and largely eclipsed by the Edit Filter, and much less effective than the machine learning approach that ClueBot NG uses. -- Cobi(t|c|b) 13:53, 5 November 2021 (UTC)
@Cobi, I see now. My initial expectation was that you could "load it in the background", like a third-party app, and it would collect information from our community's reviews (what we accepted and reverted), eventually constructing the needed dataset, and once it was sufficiently trained we could release it into the wild. You say that does happen, but first you need to feed it a lot of pre-made datasets before you can reach that phase. How wrong am I? - Klein Muçi (talk) 01:15, 6 November 2021 (UTC)
For reference, the work I did basically consumes the reviewed edit set (which right now only includes historical entries), effectively meaning there is not nearly enough data to actually re-train the bot, let alone verify it's within tolerance. As far as I know, we do not have the original training set used for the production datasets, and the review interface effectively died, so a rewrite was started in a form that could run on Toolforge. You can see the current training logic on GitHub, the output of which is calculated each day under trained-datasets. Given the current (historical) community interest in reviewing reported edits, I don't foresee being in a position to re-train en.wiki without substantial work, let alone support another wiki. - Damian Zaremba (talkcontribs) 17:02, 15 November 2021 (UTC)
@DamianZaremba, I see... Well, if nothing else, this has been informative. Even though I wasn't able to get the results I hoped for, thanks for taking the time to reply to my questions. :) - Klein Muçi (talk) 18:22, 15 November 2021 (UTC)


Thanks for removing the template. I usually do that, but this one escaped my attention. I appreciate your help. Eddie Blick (talk) 03:23, 21 November 2021 (UTC)


ArbCom 2021 Elections voter message

Hello! Voting in the 2021 Arbitration Committee elections is now open until 23:59 (UTC) on Monday, 6 December 2021. All eligible users are allowed to vote. Users with alternate accounts may only vote once.

The Arbitration Committee is the panel of editors responsible for conducting the Wikipedia arbitration process. It has the authority to impose binding solutions to disputes between editors, primarily for serious conduct disputes the community has been unable to resolve. This includes the authority to impose site bans, topic bans, editing restrictions, and other measures needed to maintain our editing environment. The arbitration policy describes the Committee's roles and responsibilities in greater detail.

If you wish to participate in the 2021 election, please review the candidates and submit your choices on the voting page. If you no longer wish to receive these messages, you may add {{NoACEMM}} to your user talk page. MediaWiki message delivery (talk) 00:27, 23 November 2021 (UTC)

The Signpost: 29 November 2021

Articles you might like to edit, from SuggestBot

Note: All columns in this table are sortable, allowing you to rearrange the table so the articles most interesting to you are shown at the top. All images have mouse-over popups with more information. For more information about the columns and categories, please consult the documentation and please get in touch on SuggestBot's talk page with any questions you might have.

Views/Day | Quality | Assessed → Predicted | Title | Tagged with…
348 | Medium | C → B | Perfectionism (psychology) (talk) | Add sources
1,492 | Medium | C → B | Reservation in India (talk) | Add sources
102 | Low | Start → Start | Suryakantham (actress) (talk) | Add sources
12 | Medium | NA → C | JCW Tag Team Championship (talk) | Add sources
4 | High | C → GA | Bloodymania IV (talk) | Add sources
590 | High | Start → GA | Mohammad Hafeez (talk) | Add sources
9 | Medium | C → C | EN World (talk) | Cleanup
129 | Medium | Start → C | Metalanguage (talk) | Cleanup
30 | Medium | C → C | St. Stephen's & St. Agnes School (talk) | Cleanup
6 | Low | Stub → Start | Red Arrow, Black Shield (talk) | Expand
15 | Low | Stub → Start | Minimum Essential Emergency Communications Network (talk) | Expand
5 | Low | Stub → Start | Dark Sun Campaign Setting, Expanded and Revised (talk) | Expand
266 | Medium | C → C | Idukki district (talk) | Unencyclopaedic
14 | Medium | Start → C | Eddie Lawrence (talk) | Unencyclopaedic
138 | Medium | C → B | Cinema of Sri Lanka (talk) | Unencyclopaedic
589 | High | C → GA | Sleigh Bells (band) (talk) | Merge
22 | Medium | Start → C | Object language (talk) | Merge
15 | Medium | Start → C | Hitachi Construction Machinery (Europe) (talk) | Merge
50 | Medium | Start → C | Lufia: Curse of the Sinistrals (talk) | Wikify
46 | Medium | Start → C | Beithir (talk) | Wikify
2 | Medium | Start → C | Cell culturing in open microfluidics (talk) | Wikify
2 | Medium | Start → C | Maria Léa Salgado-Labouriau (talk) | Orphan
2 | Low | Stub → Start | Khodadad Jalali (talk) | Orphan
6 | High | NA → GA | UV-Vis absorption spectroelectrochemistry (talk) | Orphan
4 | Low | Stub → Start | Kosmos 23 (talk) | Stub
4 | Low | Stub → Start | Newberg School District (talk) | Stub
200 | Medium | Start → C | Semantic Scholar (talk) | Stub
2 | Low | Stub → Start | Kosmos 93 (talk) | Stub
38 | Low | Stub → Start | Vimic (talk) | Stub
3 | Low | Stub → Start | Kosmos 129 (talk) | Stub

SuggestBot picks articles in a number of ways based on other articles you've edited, including straight text similarity, following wikilinks, and matching your editing patterns against those of other Wikipedians. It tries to recommend only articles that other Wikipedians have marked as needing work. We appreciate that you have signed up to receive suggestions regularly; your contributions make Wikipedia better — thanks for helping!

If you have feedback on how to make SuggestBot better, please let us know on SuggestBot's talk page. -- SuggestBot (talk) 11:24, 29 November 2021 (UTC)