
Wikipedia talk:WikiProject History Merge/Archive 1

From Wikipedia, the free encyclopedia
Archive 1

New histmerge list


The pages where User:Conversion script created both the redirect and the target page *are* cut-and-paste moves. As it says at Wikipedia:Usemod article histories, the very last edit made to a page when Wikipedia used UseModWiki and the conversion by Conversion script are rolled up into one edit. If the last edit prior to the conversion was done before 20 December 2001, you can find out the history of it at Nostalgia Wikipedia, which contains a complete database snapshot from that date. In the case of AbadanIran, here's the diff of changing the page to a redirect and here's the revision where it was moved to Abadan, Iran. Here's the diff showing the cut-and-paste move, which fixed a CamelCase title. I usually history merge examples like these, but I can't think of one off-hand at the moment.

When Conversion script is listed as the first editor to a page, it means that there was exactly one edit to the page before the conversion from UseModWiki. For example, see this edit to netball on the English Wikipedia and the page history of netball at the Nostalgia Wikipedia.

Hope this makes some sense. Graham87 05:42, 19 June 2009 (UTC)

I've gone ahead and history merged the above-mentioned pages. Graham87 17:14, 24 June 2009 (UTC)

Automate?

  • I suspect that the histmerging could be automated in reasonably undoubted elementary cases, when:
    • The "pretext" is the last edit of page A, or only followed by one or a few redirect edits.
    • The "posttext" is the first edit of page B, or only preceded by one or a few redirect edits.
    • Neither page already has any deleted edits.
    • The page names are similar.
    • If pages Talk:A and Talk:B both exist, rename Talk:A as Talk:B/Archive <some number>
    Anthony Appleyard (talk) 20:51, 16 June 2009 (UTC)
    Anthony...believe me, if/when I become an admin, I'll gladly help you out with these. Matt (talk) 00:39, 17 June 2009 (UTC)
  • It would be useful if your bot, when setting up User:Mikaey/Possible cut-and-paste moves, could on each line add the date and time of each pretext and posttext edit mentioned. That would let the user distinguish recent cut-and-pastes, which to my mind would have more priority than the mass of very old cut-and-pastes made in old times when that was the only way to move a page because the modern move-a-page facility had not yet come. Anthony Appleyard (talk) 04:57, 17 June 2009 (UTC)
    Hmmm...I suppose I could do that. Matt (talk) 04:59, 17 June 2009 (UTC)
  • Cut-and-pastes could likely be split into 3 date-classes:
    1. The old mass before [move] was brought in.
    2. The tail-off while people were getting used to the [move].
    3. The recent period.
    Anthony Appleyard (talk) 05:02, 17 June 2009 (UTC)
    I'm rather reluctant right now to make that sort of change to the code, just because that would require a fair amount of rewrite and extra processing time to implement, and I just don't feel like doing it at the moment (the bot would have to compile the entire list of histmerge candidates first, then sort and separate them). Not to say that I wouldn't ever do it, but not right now. However, I agree with you that a bot could be very useful in dealing with this list. I wouldn't necessarily be opposed to writing/operating such a bot; however, I see it as something that could potentially stir up some controversy, especially since the bot would need the admin bit. I know one question that would need to be answered (and you may be able to answer this better than anyone) is "what should be done with the talk pages?" Perhaps we should start up a bot request for approval discussion on the bot owner's noticeboard to see what people think... Matt (talk) 05:29, 17 June 2009 (UTC)
  • Perhaps the bot could write into 3 lists instead of one as it finds the suspected cut-and-pastes:
    1. User:Mikaey/Possible cut-and-paste moves/old, for suspected cut-and-pastes made before the move-a-page ability came in; mark which of these were made by User:Conversion script .
    2. User:Mikaey/Possible cut-and-paste moves/tailoff period, for the tailoff period.
    3. User:Mikaey/Possible cut-and-paste moves/recent, for later suspected cut-and-pastes.
    Anthony Appleyard (talk) 05:40, 17 June 2009 (UTC)
    If we go forward with this and get both a BRfA and an RfA approved for this bot, then yes, I will make those changes. Matt (talk) 05:45, 17 June 2009 (UTC)
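The three-list split proposed above could be sketched roughly as follows. This is only an illustration: the two cutoff dates are placeholders (the thread never gives exact dates for the arrival of the move feature or the end of the tail-off), and `date_class` and `split_into_lists` are hypothetical names, not part of the bot.

```python
from datetime import datetime

# Placeholder cutoffs, NOT taken from the discussion; adjust to the real dates.
MOVE_FEATURE_INTRODUCED = datetime(2002, 3, 1)
TAILOFF_ENDED = datetime(2004, 1, 1)


def date_class(posttext_time,
               move_introduced=MOVE_FEATURE_INTRODUCED,
               tailoff_ended=TAILOFF_ENDED):
    """Return the name of the sub-list a suspected cut-and-paste belongs to."""
    if posttext_time < move_introduced:
        return "old"
    if posttext_time < tailoff_ended:
        return "tailoff period"
    return "recent"


def split_into_lists(candidates):
    """Group (source, target, posttext_time) tuples into the three lists."""
    lists = {"old": [], "tailoff period": [], "recent": []}
    for source, target, when in candidates:
        lists[date_class(when)].append((source, target, when))
    return lists
```

Classifying each hit as it is found would avoid compiling and sorting the whole list first, which was the extra processing cost mentioned above.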

Talk pages

  • what should be done with the talk pages?:
    • If Talk:B exists, and Talk:A does not exist; and no subpages Talk:A/... exist, nothing to do
    • If Talk:A exists, and Talk:B does not exist; and no subpages Talk:B/... exist, move Talk:A to Talk:B , and all Talk:A/... subpages with it
    • If Talk:A and Talk:B both exist, and neither has subpages, move Talk:A to Talk:B/Archive 1 and insert {{archive}} at its start; insert {{archives}} at the start of Talk:B
    • Else, call for help.
    I am tempted to suggest leaving no redirects for moves of talk pages and talk page subpages.
    Anthony Appleyard (talk) 06:07, 17 June 2009 (UTC)
  • Also, check if Talk:A was cut-and-pasted to Talk:B at the same time. Anthony Appleyard (talk) 05:51, 18 June 2009 (UTC)
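The talk-page rules above can be written as a small decision function. This is a sketch that follows the listed rules literally; the function name and the returned strings are made up for illustration, and the check for a simultaneous cut-and-paste of Talk:A to Talk:B is left out.

```python
def talk_page_action(a_exists, b_exists, a_has_subpages, b_has_subpages):
    """Decide what to do with Talk:A and Talk:B after histmerging A into B."""
    if b_exists and not a_exists and not a_has_subpages:
        return "nothing to do"
    if a_exists and not b_exists and not b_has_subpages:
        return "move Talk:A (and its subpages) to Talk:B"
    if a_exists and b_exists and not a_has_subpages and not b_has_subpages:
        return "move Talk:A to Talk:B/Archive 1 and add archive templates"
    # Everything else, including the case where neither talk page exists,
    # falls through to a human, exactly as the rules are written.
    return "call for help"
```

For example, `talk_page_action(True, True, False, False)` selects the archive route, while any combination involving unexpected subpages is handed to a human.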

Priority by year of incident?

Other topics

Deletion trouble

Beware of foreign relations articles

The foreign relations articles about countries were created by cut-and-paste moves, but the second edit of the page would add even more content pasted in from an article usually entitled "Transnational issues of <foo>". Therefore, I don't think it's appropriate to history merge foreign relations articles. See this diff, which sums up the situation nicely. Graham87 07:22, 19 June 2009 (UTC)

How to flag pages so that they don't show up on the list next time

Someone had mentioned at one point, somewhere, that there should be a way to flag false positives so that they don't get picked up on the list the next time around. Well, here's the solution -- tag the redirect with:

{{nahmc|<destination page>}} ("<destination page>" is the page that the bot suggested that the page should be histmerged into...sometimes, the bot suggests that one page should be histmerged into more than one page)

Before the next run, I'll modify the bot so that it excludes matches that match what's in this template.

For the curious, "nahmc" means "not a histmerge candidate".

Enjoy, Matt (talk) 16:20, 19 June 2009 (UTC)
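One way the exclusion could work is sketched below. It assumes the bot has the redirect's wikitext available as a string; the regular expression and the helper names (`nahmc_targets`, `is_excluded`, `normalize`) are illustrative guesses rather than the bot's actual code.

```python
import re

# Matches {{nahmc|Destination page}} and captures the destination title.
NAHMC_RE = re.compile(r"\{\{\s*nahmc\s*\|\s*([^}|]+?)\s*\}\}", re.IGNORECASE)


def normalize(title):
    """Treat underscores as spaces and ignore the case of the first letter."""
    title = title.replace("_", " ").strip()
    return title[:1].upper() + title[1:]


def nahmc_targets(redirect_wikitext):
    """Return every destination page named in a nahmc tag on the redirect."""
    return {normalize(t) for t in NAHMC_RE.findall(redirect_wikitext)}


def is_excluded(redirect_wikitext, suggested_target):
    """True if the suggested histmerge target has been flagged as a false positive."""
    return normalize(suggested_target) in nahmc_targets(redirect_wikitext)
```

Because the tag names a specific destination, a redirect flagged for one target would still be reported if the bot later suggests a different target.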

Automating these history-merges?

  • I decided to embark on a little project, and wrote a program to try and figure out just how often cut-and-paste moves happen. The answer seems to be "OH MY GOD! THAT OFTEN?!?!". The program is working its way through the most recent database dump, and as of this writing, it's 6% of the way through, and it has registered over 3,700 hits. I've been in touch with User:Anthony Appleyard, the only admin who performs history merges on a regular basis. Both of us agree that this is way more than what he can handle, and I, not being an admin, can't do anything to help him.
    So, with that, any admins who are willing to help should take a look at User:Mikaey/Possible cut-and-paste moves.
  • Thanks! Matt (talk) 05:46, 18 June 2009 (UTC)
  • Sounds like fun. I do histmerges a lot. I'll look at it in the morning. –xenotalk 05:48, 18 June 2009 (UTC)
Good lord, that looks bad. I'll keep it on the list of things to do to kill time, but not gonna dive in just yet. And, not to change the subject, but bravo to "Double A" for his dedication to cut/paste move fixing - he was the first one that came to mind when I saw the title of this section. JPG-GR (talk) 05:58, 18 June 2009 (UTC)
Looks like a nice little place to hide and be productive, two things I enjoy doing. I'll look at some as time permits. Keegan (talk) 07:13, 18 June 2009 (UTC)
Thanks for the work. I was surprised to see that this is not a new problem, with articles going back to 2004! Vegaswikian (talk) 07:22, 18 June 2009 (UTC)
I may be able to help as I've done some hairy history merges (some quite recently). It may be useful to create a hidden category and place these articles in it. Then, if you can run your program regularly and tag possible articles, it can be listed on the admin backlogs so more people will see it. Perhaps something like Category:Possible cut-and-paste moves. There should also be a way to indicate an article is not a cut and paste move so if it gets removed from the category it isn't placed back there again when the program is run again. ···日本穣? · Talk to Nihonjoe 08:28, 18 June 2009 (UTC)
I'm probably going to have to decline "tagging with a category tag" for right now. The bot doesn't have bot approval at the moment -- I shouldn't need it since it's not making any changes to the wiki -- but tagging pages with category tags would require bot approval. Matt (talk) 08:35, 18 June 2009 (UTC)
Nihonjoe is right though, it would be helpful if it did that. Since the task seems fairly easy, you might want to request bot-flagging to add such a feature. Regards SoWhy 09:00, 18 June 2009 (UTC)
(To SoWhy) Well, ok...however, I'll probably wait for the next run before I go through with this. Matt (talk) 12:56, 18 June 2009 (UTC)
  • See User:Mikaey/Possible cut-and-paste moves for technical terms used hereinunder. Anthony Appleyard (talk) 09:40, 18 June 2009 (UTC)
  • Could a bot be written to do this histmerging? I suspect that this bulk histmerging could be automated in reasonably undoubted elementary cases, when:
    • The "pretext" is the last edit of page A, or only followed by one or a few redirect edits.
    • The "posttext" is the first edit of page B, or only preceded by one or a few redirect edits.
    • Neither page already has any deleted edits.
    • The page names are similar.
    • After histmerging the articles, what to do with their talk pages?:-
    • If Talk:B exists, and Talk:A does not exist; and no subpages Talk:A/... exist, nothing to do
    • If Talk:A exists, and Talk:B does not exist; and no subpages Talk:B/... exist, move Talk:A to Talk:B , and all Talk:A/... subpages with it
    • If Talk:A and Talk:B both exist, and neither has subpages, move Talk:A to Talk:B/Archive 1 and insert {{archive}} at its start; insert {{archives}} at the start of Talk:B
    • Else, the bot should call for help from a human.
      • Also, check if Talk:A was cut-and-pasted to Talk:B at the same time as A was cut-and-pasted to B.
      • I am tempted to suggest leaving no redirects for these moves of talk pages and talk page subpages, to avoid "leaving litter"; and not to include in the histmerged page any redirect edits in A after the pretext, or any redirect edits in B before the posttext. Anthony Appleyard (talk) 10:37, 18 June 2009 (UTC)
  • When the bot obeys these requests, it should check first if each request that it obeys is still valid: someone may have obeyed it first, or someone may have done work on its origin page and/or its target page, or moved something, or deleted something, or whatever. Anthony Appleyard (talk) 21:24, 18 June 2009 (UTC)
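The "elementary case" test described in this section might look something like the sketch below. The limit of three stray redirects (for "one or a few") and the 0.8 name-similarity threshold are assumptions, as are the `Candidate` fields; the thread does not define any of them precisely.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

MAX_STRAY_REDIRECTS = 3    # assumed reading of "one or a few"
MIN_NAME_SIMILARITY = 0.8  # assumed threshold for "page names are similar"


@dataclass
class Candidate:
    title_a: str
    title_b: str
    edits_after_pretext: list    # one bool per later edit in A: True if a redirect
    edits_before_posttext: list  # one bool per earlier edit in B: True if a redirect
    a_has_deleted_edits: bool
    b_has_deleted_edits: bool


def only_few_redirects(edits_are_redirects, limit=MAX_STRAY_REDIRECTS):
    """True if there are at most `limit` edits and every one is a redirect."""
    return len(edits_are_redirects) <= limit and all(edits_are_redirects)


def names_similar(a, b, threshold=MIN_NAME_SIMILARITY):
    """Rough title similarity using difflib's ratio (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def is_elementary_case(c):
    """Apply the four criteria for an automatable histmerge."""
    return (only_few_redirects(c.edits_after_pretext)
            and only_few_redirects(c.edits_before_posttext)
            and not c.a_has_deleted_edits
            and not c.b_has_deleted_edits
            and names_similar(c.title_a, c.title_b))
```

As noted in the last point above, a real bot would also have to re-fetch both histories immediately before acting, since the request may have gone stale; that step is not shown here.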

Bot request

  • Please note this bot request (made by Mikaey/Mike on June 18) and this category (created by Mikaey/Mike on June 6). ···日本穣? · Talk to Nihonjoe 16:39, 18 June 2009 (UTC)
    This category is a predecessor of sorts to the current list. The criteria for that category were much more crude and far less accurate -- specifically, "the article is a redirect but its talk page isn't". I was checking them by hand, and finding pages needing histmerges in about 10% of them. So, now that I have a much more accurate list, I'm trying to get that category emptied so that it can be used for the items from this list. Matt (talk) 18:32, 18 June 2009 (UTC)
  • The correct link on the bot request page is now here, due to the thread being renamed. Just wanted to leave this note. Killiondude (talk) 22:22, 18 June 2009 (UTC)
  • MZMcBride raised an interesting point at the Botreq, that there's an internal mediawiki feature (presently deactivated) that can be used to more easily merge histories... –xenotalk 12:44, 19 June 2009 (UTC)

A note on new ones

Note that CSBot tags new cut-and-paste copies and lists them at WP:SCV. I've noted, however, that people often think those were false positives and let them slide (despite the instructions on the page).

I can trivially add found cut-and-pastes on an additional page where you guys can keep an eye on things. Holler if you think that would be useful. — Coren (talk) 14:55, 18 June 2009 (UTC)

Which method?

  • If page B was cut-and-pasted to page A, two ways have been mentioned to perform the histmerge:
    1. Delete A, move B to A, undelete A
    2. Delete B, move A to B, undelete B, move B to A
    (1) would be quicker than (2) if B had many more edits than A, as with an old article that was copy-&-pasted recently.
    (2) would be quicker than (1) if A had many more edits than B, as with an article that was copy-&-pasted several years ago.
    To get rid of trash before the histmerge (redirs after the pretext in B; redirs before the posttext in A), delete the page, then undelete the wanted edits. That seems to be a long way round. Is there any privilege level that can let a user directly delete some edits of a page, rather than deleting all edits and then undeleting a few edits? If so, it would be a much quicker way to delete a few edits from a long edit history. Anthony Appleyard (talk) 22:10, 21 June 2009 (UTC)
  • See RevisionDelete, which lets users do selective deletion. At the moment it's only enabled for oversighters, but it's supposed to be enabled for admins eventually. Also see Wikipedia talk:Selective deletion. Graham87 03:45, 22 June 2009 (UTC)
    • I wouldn't worry too much about the "garbage edits", just stuff that's already in the deleted revisions for other reasons such as BLP. –xenotalk 04:58, 22 June 2009 (UTC)
    • When oversighters can selectively delete edits, does this mean ordinarily moving edits to that page's deleted-edits list, as I can do, or only completely destroying an edit, as oversighters can? 05:13, 22 June 2009 User:Anthony Appleyard
  • I am referring to what I have often found: stray redirect edits in A a long time before the posttext edit (e.g. someone realized that A was an alternate name for B but redirected instead of moving), and stray redirect edits in B a long time after the pretext (commonly adding "{{R from other capitalisation}}" and suchlike, and re-redirecting). In both cases, such stray junk gets shuffled in time order with edits from the other contributing page. Another type of such junk is: after someone cut-and-paste moves B to A, someone else or some bot thinks that B has been vandalised and reverts it to full text, and perhaps after that again to a redirect. Anthony Appleyard (talk) 05:12, 22 June 2009 (UTC)
      • When oversighters delete edits using RevisionDelete, the edits become invisible to admins for now. When the RevDelete permission is assigned to admins, they will be visible to administrators, unless the oversighter marks an edit as being completely hidden, which they rarely do.
    • Yes I delete those garbage edits as often as I can. They make diffs more confusing and generally clutter up the history. On a related note, it's currently impossible to separate edits with exactly the same timestamp, which can happen with page moves or many ancient edits that are said to be from 25 February 2002 because of an old database glitch. Therefore there are some cases where even admins can't properly reverse history merges, which will be solved with the RevDelete function. Graham87 15:56, 22 June 2009 (UTC)
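The choice between the two methods at the top of this section comes down to deleting whichever page has fewer edits. A minimal sketch, assuming the edit counts of both pages are known; the function name is hypothetical, and breaking ties in favour of method (1) because it has fewer steps is an assumption, not something stated above.

```python
def histmerge_steps(edits_in_a, edits_in_b):
    """Pick the cheaper step sequence when page B was cut-and-pasted to page A."""
    if edits_in_b >= edits_in_a:
        # Method (1): A is the smaller page, so delete and undelete A.
        return ["delete A", "move B to A", "undelete A"]
    # Method (2): B is the smaller page, so delete and undelete B instead.
    return ["delete B", "move A to B", "undelete B", "move B to A"]
```

For an old article with 500 edits that was recently pasted into a new title with 3 edits, `histmerge_steps(3, 500)` returns the three-step method (1).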

Pace slowing down

False positive: ECG

The entry for ECG was most likely a false positive because the indicated revision of ECG is blank. Also see the edit summary of the first edit to electrocardiography. This seems to be an error in the script, so it would be better to fix the script rather than adding a tag to the ECG redirect. Graham87 16:09, 20 June 2009 (UTC)

I've reprogrammed the bot to ignore empty revisions. I think that should fix the problem. Matt (talk) 22:25, 21 June 2009 (UTC)

Tidying

  • To User:Mikaey: When this run of your bot is finished:
    • How long would it take for you to run your bot again, this time only checking those pretext-and-posttext pairs which are already listed? That will eliminate requests which have been obeyed, and requests which have become invalidated because of later editing actions; also it would change all requests to the latest format. Anthony Appleyard (talk) 05:20, 22 June 2009 (UTC)
      • Erm...a while. I don't really have it set up to check anything other than the data in the database dumps, and I don't really have a list of matches that the program could parse very easily -- all I have is the output files that I uploaded to the wiki. I guess my advice would be "feel free to skip anything that's not in the new format, and it will be in the new format after the next run". Matt (talk) 05:32, 22 June 2009 (UTC)
    • How long would it take for you to run your bot again, this time only checking edits which have dates and times after date and time X, when making an addendum list for new cut-and-pastes? Anthony Appleyard (talk) 05:24, 22 June 2009 (UTC)
      • Probably about the same amount of time that it takes to run now. One of the procedures that is high up on the list is "fetch date/time of revision where article was turned from an article into a redirect". That requires looking at the history of the article, and the database dumps don't have any history -- they just have the current versions. The history still has to be requested from the wiki. Therefore, it wouldn't make things go any more quickly. Matt (talk) 05:32, 22 June 2009 (UTC)

Finished

The bot run seems to be finished, and it seems to have found 17197 hits (minus those that have been obeyed so far). Anthony Appleyard (talk) 16:34, 6 July 2009 (UTC)

Calculating diff score in terms of words rather than lines?

Would it be feasible to calculate the diff score in terms of words rather than lines, to catch cut and paste moves like this one? I only found it while looking up something on Google and stumbling upon Mencken's biography. Graham87 12:18, 7 July 2009 (UTC)

Why did I not notice this post earlier? Oh well. Anywho, here's the story -- my computer froze in the middle of AarghBot's most recent run. Nothing's gone, except for the progress it had made. However, I've been thinking lately that it would be nice if I weren't running so many bots on my computer; after all, the computer I run them on is my main computer. So, with that thought in mind, I began work on porting AarghBot to run on a machine I have sitting in my basement that is otherwise doing nothing. It wasn't easy -- the first incarnation of the bot was written in C#, to run on Windows; the machine sitting in my basement is running on Linux, so I had to use C++ and Qt to write it. However, the coding is done, and now I'm into the testing/debugging phase. Now, one of the problems I had in writing this is that there's not really a library available for C++ that will do diffs for you. So, I had to investigate alternatives -- and in fact, just in writing this post, I think I've come up with a pretty good alternative: Levenshtein distances.
For those of you unfamiliar with Levenshtein distances, they are basically a measure of "how many insertions, deletions, or changes need to be done to change string A into string B". It's measured on a per-element basis, usually with an element being a single character -- so, it would essentially be "how many characters need to be inserted, deleted, or changed in string A to transform it into string B." The previous method was, in fact, a Levenshtein distance (although it wasn't called that), but it was done on a line-by-line basis -- which worked well for detecting the addition of a line or two in an otherwise large article, but it would reject matches such as the one you provided, where the article was only one line, and only a few characters in the one line changed. So, after learning about Levenshtein distance (and thank god the article had enough pseudocode for me to actually code it), my initial thought was that I would use that, and compare the pretext and posttext on a character-by-character basis. This would solve exactly this problem, where only a few characters in one line had changed between the two versions, but it wouldn't work so well if one long line had been added to an article that consisted of 10-20 short lines of text. Then, as I was writing this post, it hit me -- why not do both? So, the new method is to compute the Levenshtein distance on both strings -- first on a character-by-character basis, then on a line-by-line basis -- and use the smaller result of the two.
So, since I have the majority of the coding work done on AarghBot Linux, I'm going to focus most of my time and energy on that, instead of splitting it between the Linux and Windows versions. What this means is that, while I work all the bugs out, you may not see any new reports posted for a few days (or possibly even a few weeks)...however, I think that, with the matches that the bot DID get (I posted the first 7500 hits before this incident), combined with what's left over from the last run, we should have plenty of material to hold us over until then. Mikaey, Devil's advocate 06:49, 13 July 2009 (UTC)
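The "do both and take the smaller" idea described above can be sketched in a few lines. This is an illustration only, written in Python rather than the bot's C++, and it makes one assumption the post does not spell out: each distance is divided by the length of the longer input so that the character-level and line-level results are comparable percentages.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings, or lists of lines)."""
    if len(a) < len(b):
        a, b = b, a  # keep the working row as short as possible
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def diff_score(pretext, posttext):
    """Smaller of the character-level and line-level normalized distances."""
    def ratio(a, b):
        longest = max(len(a), len(b))
        return levenshtein(a, b) / longest if longest else 0.0

    by_char = ratio(pretext, posttext)
    by_line = ratio(pretext.splitlines(), posttext.splitlines())
    return min(by_char, by_line)
```

A one-line article with a single changed character scores low on the character pass, while a short article that gained one long line scores low on the line pass, so both kinds of near-copy stay under a 10% threshold. The character pass takes time proportional to the product of the two text lengths, which could matter for long articles.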

Ignoring articles tagged with Template:Nahmc

Cases tagged with Template:Nahmc but should be histmerged

  • I have just found this histmergeable case which had been tagged nahmc: from http://wiki.riteme.site/w/index.php?title=Homozygote&oldid=91781611 (13:24, 3 December 2006, followed only by 5 redirects) to http://wiki.riteme.site/w/index.php?title=Zygosity&oldid=97629561 (23:49, 31 December 2006, creation edit). This case seems to have been failed because of the rule "If the predate and the postdate are not within 24 hours of each other"; but it is a valid case, and it seems that User:Dr d12 cut-and-paste moved a page which had not been edited in the previous 24 hours. I suspect that all currently nahmc-tagged cases need to be re-examined. Anthony Appleyard (talk) 09:01, 19 July 2009 (UTC)
  • Just remove the {{nahmc}} and the bot will pick it up again on its next go-around. I'll take a look through some of those and see if I can fix them. Mikaey, Devil's advocate 09:17, 19 July 2009 (UTC)
  • That was the first nahmc that I looked at. Likely many other of the nahmc's are like that. The rule "If the predate and the postdate are not within 24 hours of each other" may have to go. Anthony Appleyard (talk) 13:00, 19 July 2009 (UTC)
    • I tagged that page with Template:Nahmc. The reason should be clear by looking at these four edits to zygosity, which show that the Zygosity article was made by cutting and pasting three articles, one after the other. Therefore the history merge was thoroughly inappropriate, and I plan to undo it tomorrow if no-one else gets to it before me. I tagged most pages with Template:Nahmc for similar reasons; the tags were never added automatically as far as I know. Graham87 16:03, 19 July 2009 (UTC)
      • I disagree that it needs to be undone. Just because it's a merge doesn't mean that cut-and-paste didn't happen or doesn't apply. The first content the article had was cut-and-pasted from Homozygote, so I would argue that a histmerge is still appropriate for that article. If someone were creating an article by merging content in from other articles, I would fully expect the author to move the first page in the list to the name of the new article first, then cut-and-paste content from the other pages of the list into it. The histmerge process is supposed to fix cases where someone cut-and-pasted content from one article into another instead of using the "move" function, and this is precisely what happened. However, I'll agree with you that the other three articles on the list don't need to be histmerged into this one. Mikaey, Devil's advocate 20:57, 19 July 2009 (UTC)
        • I disagree with that assessment, since the order that the articles were cut and pasted was arbitrary. If one of the other articles (e.g. heterozygote) happened to be the first article to be cut and pasted, it would have come up in the history merge list. The only reason we're discussing this case is because the person cut and pasted the entire article contents of three articles into one, which is normally not how merges are done. Frankly I would have expected the author to use the preview button when cutting and pasting the articles; this revision is just bizarre and I would argue that this revision represents the actual creation of the zygosity page, as a merger of three articles. When doing a merger like the zygosity one, I would simply expect the first edit to have "merged from A, B, and C" in the edit summary and a note on the talk page. I know of one page whose early history was merged from *three* different articles, Share taxi, which I disagree with because it makes the page history difficult to read. I'll hold off splitting the zygosity page for a little while in case further comments come in. Graham87 01:20, 20 July 2009 (UTC)
          It was arbitrary, but I would still expect the user creating the article, if they were creating an article entirely by merging in several other articles, to start with an existing article, and to move that first article into place before merging in the other articles.
          Think about it this way -- one of the main reasons that we do histmerges is so that we keep as much of the article's history as possible in one place, so that we can attribute the contributions of each user properly. The article clearly started out as the work of other WP editors. Does not histmerging Homozygote into Zygosity serve that purpose? (Now, admittedly, only histmerging one article into the destination is not the best solution, but unfortunately, trying to histmerge multiple pages in, without some way to mark which edits came from where, would only make the edit history that much more confusing.) Mikaey, Devil's advocate 03:28, 20 July 2009 (UTC)
          Yes, the purpose of history merging is to make sure that as many authors as possible can be attributed through the page history. But I see history merging as an all or nothing deal: either history merge *everything*, which isn't ideal if two pages were merged, or history merge nothing at all. The instructions at Help:Merging say nothing about history merging; they just say that it's mandatory to note the merged articles in the edit summaries. I also don't see the point of having two articles about two different topics in the page history, or having the talk page for homozygote separated from the revisions that used to be at the title "homozygote". Graham87 04:54, 20 July 2009 (UTC)
          • I think it's important to establish what your starting point was. In this case, Homozygote was his starting point. Had Hemizygote been cut-and-pasted to create the first version of the page, then I would have histmerged that one instead. I don't consider the first version of this page to be a merge, because the article didn't exist to begin with, and the text was nothing more than an exact copy of another article. Had the user cut-and-pasted all the articles at one time and saved them all as one revision, then the situation would be different; but that's not what happened. As an interesting sidenote, the user gave the wrong article titles in his edit summaries; all the articles he noted are, and always have been, redirects...so even if he had performed the entire merge in one edit, the attribution requirements still wouldn't have been met.
          • I agree that, in an ideal world, we should be able to histmerge all the articles together, but the MediaWiki software lacks the ability to tell us where individual revisions came from, so that's not a feasible option at the moment. I prefer to do as much as I can, while keeping things coherent. Since homozygote was used as the starting point, it makes things more coherent to have that text that was used as the starting point in the history.
          • Also, we have articles all the time that are moved and expanded to cover more than just the original topic -- for example, people covered by WP:BLP1E are generally moved and expanded to cover the event. Mikaey, Devil's advocate 05:49, 20 July 2009 (UTC)
            • You're right that Dr d12 gave the wrong article titles when he noted where the text came from. A suboptimal solution would be to make a dummy edit noting where all the history is. The page Talk:Zygosity has plenty of discussion about re-splitting the articles, just to make things even more interesting. If they're re-split, I think the homozygote history should go back to where it was before. :-) But for now, it's not a huge deal. Graham87 13:14, 20 July 2009 (UTC)

More histmerge candidates in old pages

As I said at the end of this section, pages where Conversion script made both the last edit to the pretext and the first edit to the posttext *are* history merge candidates. Also, there are cases where a human editor clearly made the first edit to the posttext, but Conversion script appears to have made the redirect edit in the pretext; these are also history merge candidates. For an example, see this edit to Siberian Husky (the first edit of that diff was at SiberianHusky), and this edit to SiberianHusky. Graham87 12:03, 22 July 2009 (UTC)

I'll take that filter out before the bot's next run.
There are a couple of variables that can be fine-tuned here. Do we want to adjust them at all?
  • Maximum time difference between source being turned into a redirect and destination being turned into an article is currently set at 24 hours.
  • Maximum diff score between the pretext and posttext is currently set at 10%.
Mikaey, Devil's advocate 17:33, 22 July 2009 (UTC)
I think that the maximum time difference between the source being turned into a redirect and the destination being turned into an article should be set at 24 hours, unless one of the dates is before March 2002. After February 2002, even a time difference of 24 hours would be a worry; I'd expect it to be say fifteen minutes, but maybe that's just me. :-) Perhaps increasing the maximum diff score a little bit, to say 15% or 20%, would help, but calculating the diff score by words is probably sufficient. Graham87 00:28, 23 July 2009 (UTC)
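The two tunable values, together with the suggested exemption for dates before March 2002, could be combined as in the sketch below. The constant and function names are hypothetical; the 24-hour and 10% values are the current settings quoted above, and dropping the time limit when either date predates March 2002 is one reading of the suggestion, not a confirmed rule.

```python
from datetime import datetime, timedelta

MAX_TIME_GAP = timedelta(hours=24)   # current setting quoted in the thread
MAX_DIFF_SCORE = 0.10                # current setting quoted in the thread
OLD_SOFTWARE_CUTOFF = datetime(2002, 3, 1)


def passes_filters(redirect_time, article_time, diff_score,
                   max_gap=MAX_TIME_GAP, max_score=MAX_DIFF_SCORE,
                   cutoff=OLD_SOFTWARE_CUTOFF):
    """Decide whether a pretext/posttext pair should be listed as a candidate."""
    if diff_score > max_score:
        return False
    if redirect_time < cutoff or article_time < cutoff:
        return True  # suggested exemption: no time limit for very old edits
    return abs(article_time - redirect_time) <= max_gap
```

Raising the diff threshold to 15% or 20%, or tightening the gap to fifteen minutes, would only mean passing different `max_score` or `max_gap` arguments.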

Why didn't the bot catch this cut-and-paste move?

Why didn't the bot catch this cut and paste move? The edit on the left was originally at the title Gnome desktop. It seems to have met all the criteria, as far as I could tell. I'm going through the deleted contributions of Conversion script, and I'm finding some interesting things when narrowing the search down to the talk namespace. Graham87 12:43, 23 July 2009 (UTC)

Yeesh, I had to stare at that one for a while before I figured it out. The reason would probably be because the bot identified this edit as the posttext, meaning that the two edits would have been more than 24 hours apart. Mikaey, Devil's advocate 06:36, 24 July 2009 (UTC)
And I would undelete that edit, but the full revision history of the Gnome article isn't available in the English Wikipedia database. It's all in the version of GNOME on the Nostalgia Wikipedia, however. Graham87 06:59, 25 July 2009 (UTC)

Beware of chain cut-and-pastes

I know that I've made it my normal practice (and I'm sure a few of you have as well) to ignore results where the source page, all of its revisions, and the cut-and-paste move were all performed by the same user, but I just encountered a situation that has probably convinced me not to do that ever again.

Back in the first half of the month, I did a histmerge from Long distance trails in the United States to Long-distance trails in the United States. However, as a result of this histmerge, a new result popped up in a later run: Long distance footpaths in the United States to Long-distance trails in the United States. I think what happened here is that Long distance trails in the United States was C&P'd to Long distance footpaths in the United States, which was later C&P'd to Long-distance trails in the United States -- a chain cut-and-paste. I don't know how often this happens, but apparently it does happen. Mikaey, Devil's advocate 06:56, 25 July 2009 (UTC)

It's rare, but it can happen. It happened at Condoleezza Rice. I've encountered chains of four pages before, such as at administrative counties of England. Some chained history merges can get quite convoluted, like the one I described here. Graham87 07:14, 25 July 2009 (UTC)

Maybe the 24-hour rule should be removed

Perhaps the limit of 24 hours between when a redirect is created and when a page is moved should be changed or removed altogether. I recently did a history merge on House Un-American Activities Committee; the history around July 2005 is a bit hard to read because of overlapping edits. Anyway, here's the revision of the page being moved and the revision where the redirect was created. Just to make things even more interesting, the *redirect* was moved in July 2007. Graham87 08:55, 27 July 2009 (UTC)

Huh. Well, I can add that onto the list of changes for the next run. Originally, that was intended to filter through what would have otherwise been a long list of hits, but that was before I had the idea of computing the diff score between the two texts, so I guess that's not entirely necessary anymore. Mikaey, Devil's advocate 09:09, 27 July 2009 (UTC)

Weird history merges

I found out that Salamanca needed a history merge purely by accident while checking out some ancient contributions; here's the diff between the pretext and the *real* post-text - the edit on the left was previously at the title "Salamanca, Spain". I have a hunch that it wasn't listed on the histmerge list because of some intervening edits, which can now be found here at the history of Salamanca (disambiguation). I think it'd be a good idea if AarghBot would check the history of the target page more carefully. Maybe it could keep on checking the history of the target page until the size of the revision is at least 90% of the size of the pretext.
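
The 90% idea could look something like the sketch below. It is only an illustration: `find_posttext_candidate` and its arguments are hypothetical names, and sizes are assumed to be byte counts listed from the earliest revision of the target page onwards.

```python
def find_posttext_candidate(revision_sizes, pretext_size, percent=90):
    """Return the index of the earliest target-page revision whose size is
    at least `percent` percent of the pretext's size, or None if none is."""
    for index, size in enumerate(revision_sizes):
        # Integer comparison avoids any floating-point rounding.
        if size * 100 >= pretext_size * percent:
            return index
    return None
```

For example, with target revisions of 30, 45, 5200 and 5300 bytes and a 5500-byte pretext, the third revision (index 2) is the first one large enough to be compared against the pretext.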

Also, there seems to have been a weird bug in July 2002 with the move page feature, where it would only move one edit to the new name while leaving the rest of the history at the old name. It might have only happened at pages with a colon in their title. I think I've dealt with all of them: 2001: A Space Odyssey (film), 2010: Odyssey Two, and several Star Trek pages, but there may be more. Graham87 11:39, 12 August 2009 (UTC)


But in the case of techno, which I just history merged, the edit that matched the pretext was the 27th-earliest edit, which might have taken a while to find. That's a bizarre case, though. I found it while checking out the deleted contributions of User:Justfred. Graham87 16:15, 12 August 2009 (UTC)
Hmmmm...I worry that we'd get a lot of false positives doing it that way. In fact, I'm worried about taking out the 24-hour rule for the same reason. The bot was designed to be able to find the more obvious cases -- I think what you're suggesting would take us into the more obscure cases, when we still have over 20,000 of the more "obvious" cases to fix. Mikaey, Devil's advocate 21:55, 12 August 2009 (UTC)
Yeah, you're probably right, seeing how CPU-intensive the Levenshtein Distance filter is. I just love trying to rescue old edits and history merge pages on popular topics, so I suppose that's reflected in how I approach history merging. Graham87 06:36, 13 August 2009 (UTC)

Next run

Ok folks, the bot completed its last run today. I've made the following changes to the code, and I'm about to start the bot up again:

  • The filter that removes cut-and-pastes made by User:Conversion script has been removed.
  • The maximum difference between the predate and the postdate has been relaxed to 72 hours. Graham, I know you said you'd like this removed, but here's the problem -- this filter removes a lot of pages that would have otherwise made it to the Levenshtein Distance filter. That filter is the most time-consuming and CPU-intensive of all the filters in the program (that filter sometimes spends a few hours on one page). That said, the more pages that can be saved from being run through that filter, the better. However, to respect your request, I think that relaxing the time limit should be a good compromise.
  • The previous methodology for the Levenshtein Distance filter was to run it on a line-by-line basis first, and then, if the distance was over the threshold, to then compute it on a character-by-character basis. The character-by-character comparison has been replaced with a word-by-word comparison. (For those of you familiar with regular expressions, the string is split by the expression "\b".)
  • The threshold for the Levenshtein Distance filter has been relaxed to 25%.
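
For readers who want to picture the word-by-word comparison, here is a minimal Python sketch. It is not the bot's code: the function names are made up, and dividing the distance by the longer token count to get a percentage is an assumption, since the normalisation isn't spelled out above. It needs Python 3.7 or later, where `re.split` accepts a pattern that matches the empty string.

```python
import re

def tokenize(text):
    # Split at word boundaries (the regular expression "\b"), dropping empty pieces.
    return [token for token in re.split(r"\b", text) if token]

def levenshtein(a, b):
    # Standard dynamic-programming edit distance over two token sequences.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]

def diff_score(pretext, posttext):
    # Assumed normalisation: distance divided by the longer token count.
    a, b = tokenize(pretext), tokenize(posttext)
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

def passes_filter(pretext, posttext, threshold=0.25):
    return diff_score(pretext, posttext) <= threshold
```

Note that splitting on `\b` keeps the whitespace and punctuation between words as tokens of their own, so they count towards the score too.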

Wheee!! Mikaey, Devil's advocate 03:00, 13 August 2009 (UTC)

Yikes! I didn't realise that the Levenshtein Distance filter was so CPU-intensive. Fair enough then ... 72 hours is a good compromise. The dates can be all muddled up for edits before February 2002, but I think that your removal of the restriction about Conversion script will help solve that problem. Graham87 06:32, 13 August 2009 (UTC)
Yeah, the algorithm basically involves comparing every "element" in one string to every other element in the other string. One of the variables is the definition of "element" -- an element could be a character, a word, or an entire line. If we're talking a character-by-character comparison of two pages that are 100k each, that's 10 billion comparisons right there. This can be mitigated somewhat by using either lines or words as elements, but sometimes you get pages like List of Turkish films by name: A (see pretext and posttext), where there's so many line breaks that it bungles up all the aforementioned comparison methods, and on a machine with a 440MHz SPARC processor, that's going to take quite a bit of time. Mikaey, Devil's advocate 06:46, 15 August 2009 (UTC)
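
To put numbers on the above, this short Python sketch counts the element pairs for the three granularities. The sample page is invented for illustration (100,000 characters, 20,000 words, 2,000 lines), and `comparisons` is a hypothetical helper.

```python
def comparisons(pretext, posttext, split):
    # Every element of one text is compared with every element of the other.
    return len(split(pretext)) * len(split(posttext))

# A made-up page: 2,000 lines of 10 five-character words each.
line = "word " * 9 + "word\n"
page = line * 2000

by_char = comparisons(page, page, list)            # 10,000,000,000
by_word = comparisons(page, page, str.split)       # 400,000,000
by_line = comparisons(page, page, str.splitlines)  # 4,000,000
```

A page made up almost entirely of very short lines loses most of the benefit of comparing by line, because the number of lines gets close to the number of words.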

WikiProject

So, here's the deal -- I think we should turn this into a WikiProject. I also think we should have a cool name. Yea or nay? Suggestions on the name? Mikaey, Devil's advocate 05:03, 13 August 2009 (UTC)

Yeah, I think turning it into a WikiProject would be a good idea. It already is like a WikiProject in spirit; it's similar to other maintenance WikiProjects such as WikiProject Abandoned Articles except only admins can participate. A cool name? Hmmm ... "WikiProject Abandoned edits" or "WikiProject Edit rescue" are the only ones that come to mind right now, but maybe we need something more original than that. Graham87 06:43, 13 August 2009 (UTC)
I was hoping for something a little less bland and a little more spiffy (no disrespect to your suggestions, however). I see things like Esperanza and WikiProject Conceptual Jungle, and I think, "yeah, those are awesome names". However, I'm not having any more luck than you are coming up with such a name. Perhaps I'll think of something after sleeping on it. Mikaey, Devil's advocate 07:31, 13 August 2009 (UTC)
"WikiProject Conceptual Jungle"? Now that's a cool name! The appeal of the name Esperanza has probably worn off on me because I never liked the actual project much and was glad when it was shut down. There was also Wikipedia:Concordia which wasn't successful either. WikiProject Defenestration sounds good but I don't think that will ever happen. :-) How about a name that signifies that we fix jumbled edits, or some weird name in relation to locomotion? I still can't think of anything relevant. Graham87 12:10, 13 August 2009 (UTC)
Locomotion, eh? "WikiProject History Train" maybe? Mikaey, Devil's advocate 21:08, 13 August 2009 (UTC)
Yeah, something along those lines. We're fixing edits that have gone off the rails, so to speak. Graham87 00:41, 14 August 2009 (UTC)
I'm up for the splice service. Let me know when things get off the ground. Dekimasuよ! 14:03, 21 August 2009 (UTC)
Splice service? Wha? Mikaey, Devil's advocate 23:55, 21 August 2009 (UTC)

User script

Just to let anyone watching this page know, I hope to write a user script in the next few days that will make it much easier to perform a histmerge... think 2 or 3 clicks. I'll let you know when I have a "finished copy" of the script. –Drilnoth (T • C • L) 02:26, 22 September 2009 (UTC)

I still have this on my to-do list; just haven't gotten to it yet. :/ –Drilnoth (T • C • L) 02:19, 3 November 2009 (UTC)

Be careful of seemingly blank revisions

Be careful of seemingly blank revisions from early 2005 caused by bug 20757. They appear blank because they are in an old compression format, and it's probably not a good idea to history merge them at the moment because it might make them unfixable. An example is in the history of Eurydice (mythology) between the first edit by Erolos from January 2005 and the edit by DreamGuy in April 2005; here's the relevant diff from the histmerge page. Graham87 07:01, 10 December 2009 (UTC)

The bug has been fixed, and the above-mentioned page has been history merged successfully. Graham87 05:10, 13 February 2010 (UTC)

Bug discovered and fixed

If you were working the list recently, and you noticed some of the results the bot was finding had way more differences than what the diff score would indicate -- I know about it, it's been fixed, and I'm starting to upload new pages of the report with the fixed diff scores. Sorry for the mixup. Mikaey, Devil's advocate 06:34, 10 March 2010 (UTC)

Shoulder surfing

A cut and paste move was done, with the old page being converted to a dab. See [2] and [3]. --Joshua Issac (talk) 19:50, 24 January 2011 (UTC)

Thanks for the notification, but the correct place for this message would have been Wikipedia:Cut and paste move repair holding pen. However, it is acceptable to split disambiguation pages by using cut-and-paste moves, and a history merge isn't advisable per this edit to "shoulder surfing", which incorporates two articles in one page. Graham87 03:30, 25 January 2011 (UTC)

History merging deleted edits

This might be a somewhat eccentric practice, but I often history merge deleted pages when I encounter them, in case the article needs to be restored later for some reason. Occasionally, history merges of deleted pages reveal other issues as well. In particular, see what happened at "History of religious Jewish music". Graham87 02:58, 17 February 2011 (UTC)

Special case with only 1 editor pre- and early post-copy?

Western Kenosha Country Transit and Western Kenosha County Transit is probably one of many special-cases where a single editor edited the old version and created the first edit of the new one.

A bot that would identify these articles and email or talk-page-notify the editor and ask permission to merge the histories then mark it ready-to-merge or mark it not-a-merge-candidate would clear out a large chunk of the backlog. Even better if the bot or a companion bot had the necessary user-rights to make the merge happen, but that would invite all of the problems inherent in getting an admin-bot approved. The 15%-similarity threshold could even be loosened up since it was that very editor who created the redirect. davidwr/(talk)/(contribs)/(e-mail) 18:44, 17 February 2011 (UTC)

I've done the above-mentioned history merge. I don't think we need permission in these cases ... the number of times a user edited a page can occasionally be important, so I can't think of a reason *not* to provide a complete picture of the evolution of the article. However I do agree with you that the threshold could be reduced in similar cases. Graham87 06:23, 18 February 2011 (UTC)
I don't think I was clear: If the user gave the go-ahead then there wouldn't be any need for further human decision-making - either an admin-bot could do the job or an admin could do it without having to go through any thought process of "am I doing the right thing here?" which would allow him to work much faster. In cases where the user said "no" then it could be removed from further consideration. Obviously, in cases where the user didn't respond we would proceed as we do today. The goal is to push some of the decision-making from admins back to the original editors, who pretty much by definition know better than anyone if the intent was that the two articles be treated as one. davidwr/(talk)/(contribs)/(e-mail) 19:50, 18 February 2011 (UTC)
OK, I see where you're coming from now. It would be a good idea, but I don't see why it couldn't be extended to other cut-and-paste moves ... but in those cases, ask the person who made the C&P move what their intentions were. I'm not optimistic about response rates though. Generally, making a decision about whether a cut-and-paste move should be performed doesn't take too long for me, except in rare borderline cases. It's a matter of clicking on the diff link provided by the diffscore field and checking that there aren't any overlapping edits in the relevant histories like stray redirects. Graham87 06:30, 19 February 2011 (UTC)

The new list

Whoa, it has way too many false positives! For example see the first couple of entries in the first list. Graham87 05:02, 28 October 2011 (UTC)

Just to be clear, by "the new list" you mean the list that is over a year old? I have run into very few false positives. ErikHaugen (talk | contribs) 17:16, 28 October 2011 (UTC)
Nope, I mean the list that was created yesterday. See the link in my first message. Graham87 07:19, 29 October 2011 (UTC)
It's all better now. Graham87 05:43, 30 October 2011 (UTC)
Sorry about that! When comparing versions, the algorithm involved can take a considerable amount of time for larger articles, so my bot has a number of speed-ups in it -- particularly, there's "bail-out" scenarios where the bot can conclude that there's too many differences between the two versions and give up early. In this situation, it was figuring out "how many differences equates to the threshold (15%)", and then adding 1 to it. This seemed to work fine on the SPARC machine (where I originally wrote the code), but I moved the program over to an x86-based server, and apparently the math didn't work quite as well. The short story is, I fixed it (I upped the "bail-out" value to its highest possible maximum), and the results should be more or less accurate now. Mikaey, Devil's advocate 05:55, 9 November 2011 (UTC)
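
For anyone curious what a "bail-out" can look like, here is a generic Python sketch of an edit-distance computation that gives up early. It illustrates the general technique only and is not the bot's code; the names `max_differences` and `bounded_levenshtein` are hypothetical, and integer arithmetic is used for the limit simply to keep the sketch deterministic.

```python
def max_differences(len_a, len_b, percent=15):
    # Largest number of differences still within the threshold (integer maths).
    return (percent * max(len_a, len_b)) // 100

def bounded_levenshtein(a, b, limit):
    """Return the edit distance between sequences a and b,
    or None as soon as it is certain to exceed `limit`."""
    # The distance is at least the difference in length.
    if abs(len(a) - len(b)) > limit:
        return None
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        # The final distance can never be smaller than the smallest value in
        # any row, so once a whole row is over the limit we can stop.
        if min(cur) > limit:
            return None
        prev = cur
    return prev[-1] if prev[-1] <= limit else None
```

Setting `limit` to the largest possible value (the length of the longer sequence) effectively switches the bail-out off, at the cost of always running the full comparison.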

Does merger lose anything and/or interleave?

Please excuse me if these sound like they have obvious answers:

  • Does merging histories lose any diffs or edit summaries?
  • If two pages were edited at the same time and then are merged, can a visitor tell which diff comes after which diff, since they would no longer be consecutive?

Hypothetical example: An article has been live since at least Monday. A draft on the same subject is begun on Monday. Contents are combined into the live article by a copy-and-paste on the next Friday. In the meantime:

  • The live article is edited on Tuesday and Thursday.
  • The draft is edited on Wednesday.

If a history merge is performed afterwards, the result should show diffs for Monday, Tuesday, Wednesday, Thursday, and Friday but a visitor should be able to figure out that the live article had a certain content on Monday and was edited only on Tuesday, Thursday, and Friday while the draft was created on Monday and was edited only on Wednesday and Friday.

Is this the result or should we adapt to something different than this? Thanks. Nick Levinson (talk) 16:26, 6 May 2013 (UTC) (Clarified: 16:32, 6 May 2013 (UTC))

History merging should only be done when one article has been directly cut and pasted into another, with no overlapping edits. Anything else leads to madness. There are occasional exceptions to this rule, when the overlapping edits aren't very significant, but these are rare. When this rule is followed, no information will be lost during a history merge. Graham87 02:03, 7 May 2013 (UTC)
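
The hypothetical Monday-to-Friday example can be modelled with a small Python sketch. It assumes that a merged history is simply displayed in timestamp order and that the list itself doesn't say which page each revision originally belonged to; the data layout and function names are made up for illustration.

```python
# (day number, day name, page the revision was originally made on)
live = [(1, "Mon", "live"), (2, "Tue", "live"), (4, "Thu", "live"), (5, "Fri", "live")]
draft = [(1, "Mon", "draft"), (3, "Wed", "draft")]

def merged_history(source, target):
    # After a merge, revisions from both pages appear in one list ordered by time.
    return sorted(target + source, key=lambda rev: rev[0])

def has_overlap(source, target):
    # A clean cut-and-paste has every source edit before the first target edit.
    return max(rev[0] for rev in source) > min(rev[0] for rev in target)

# What a visitor would see: one interleaved list of days.
visible = [day for _, day, _ in merged_history(draft, live)]
# visible == ["Mon", "Mon", "Tue", "Wed", "Thu", "Fri"]
```

Here `has_overlap(draft, live)` is true, so consecutive entries in the merged list would be diffs between unrelated texts, which is why a merge is not advisable in this example.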

Fix request

Hi, I just came across this: Northwestern Mutual which was copy/paste moved from Northwestern_Mutual_Financial_Network about a year ago, but there is a rich history in the previous name; OTOH, the history of the Northwestern Mutual article previous to the move is rather boring. As such, I think it's a good candidate for a history merge.

There is also an issue with the talk page, which was not moved over; again, the old talk page should simply be moved to the new title and the new talk page deleted, as there is very little there of use or value for now. Can someone with a broom take a look at this please? thanks!--Obi-Wan Kenobi (talk) 21:12, 14 February 2014 (UTC)

Thanks for the note. I've done the history merge, but I've moved the old history at "Northwestern Mutual" to Talk:Northwestern Mutual/Old history because the old content there was merged into the current article. I've also amalgamated the two talk pages. Graham87 08:15, 15 February 2014 (UTC)

Comment on the WikiProject X proposal

Hello there! As you may already know, most WikiProjects here on Wikipedia struggle to stay active after they've been founded. I believe there is a lot of potential for WikiProjects to facilitate collaboration across subject areas, so I have submitted a grant proposal with the Wikimedia Foundation for the "WikiProject X" project. WikiProject X will study what makes WikiProjects succeed in retaining editors and then design a prototype WikiProject system that will recruit contributors to WikiProjects and help them run effectively. Please review the proposal here and leave feedback. If you have any questions, you can ask on the proposal page or leave a message on my talk page. Thank you for your time! (Also, sorry about the posting mistake earlier. If someone already moved my message to the talk page, feel free to remove this posting.) Harej (talk) 22:47, 1 October 2014 (UTC)

WikiProject X is live!

Hello everyone!

You may have received a message from me earlier asking you to comment on my WikiProject X proposal. The good news is that WikiProject X is now live! In our first phase, we are focusing on research. At this time, we are looking for people to share their experiences with WikiProjects: good, bad, or neutral. We are also looking for WikiProjects that may be interested in trying out new tools and layouts that will make participating easier and projects easier to maintain. If you or your WikiProject are interested, check us out! Note that this is an opt-in program; no WikiProject will be required to change anything against its wishes. Please let me know if you have any questions. Thank you!

Note: To receive additional notifications about WikiProject X on this talk page, please add this page to Wikipedia:WikiProject X/Newsletter. Otherwise, this will be the last notification sent about WikiProject X.

Harej (talk) 16:57, 14 January 2015 (UTC)

Indication

How are we (non-admins) who request histmerge using {{histmerge}} supposed to indicate this on the reports? SD0001 (talk) 16:11, 30 March 2015 (UTC)

Delete the entry once the history merge is done, I guess. Graham87 09:37, 31 March 2015 (UTC)
But that has never been done. Because all the reports contain 500 items (as far as I have looked). So I will just mark a small "Done" next to the serial number, using the code <br><small>Done</small> immediately next to the serial number, as I have done to three entries on WP:WikiProject History Merge/18. SD0001 (talk) 13:26, 31 March 2015 (UTC)
Yes it has ... like this. The lists do get updated from time to time, and the new lists always contain 500 items (except the last list in the series). Graham87 13:42, 31 March 2015 (UTC)
Well, removing entries does not look like good practice. It is dangerous since anyone can just remove entries without actually doing any histmerge. This sort of vandalism is unlikely to be ever detected. SD0001 (talk) 14:56, 2 April 2015 (UTC)
"This sort of vandalism" isn't particularly serious, given the length of these bot-generated lists, and would be "detected" by the bot on its next run, which would simply re-add non-fixed pages to the lists. Wbm1058 (talk) 14:43, 8 September 2015 (UTC)

Revamp

What is the use of a WikiProject without any to-do list or even a list of participants? The creator has certainly misused the term WikiProject for what was more appropriately called the New histmerge list. A WikiProject is a group of people organised to do a particular task. So, I am going to add a section here promoting the use of {{histmerge}} and the Cut-and-paste-move repair holding pen, as well as create an elementary to-do list and a list of members. Only then would this truly become a WikiProject. SD0001 (talk) 13:35, 31 March 2015 (UTC)

Go for it! They are good ideas. Graham87 13:43, 31 March 2015 (UTC)

WikiProject category

I have put together a number of pages into Category:WikiProject History Merge. It is requested that the category page be created. 103.6.156.167 (talk) 19:39, 30 April 2015 (UTC)

Category page created by User:Anthony Appleyard. Thanks. 103.6.156.167 (talk) 18:38, 1 May 2015 (UTC)

WikiProject activity

This talk page (and thus presumably the project itself) appears to have been fairly active until 2009. What really happened after that? 103.6.156.167 (talk) 18:38, 1 May 2015 (UTC)

There was a brief spurt of interest at the start, but it slowly died away. I for one am most interested in the very early history of Wikipedia, so I found the history merge lists from that time period to be fascinating. Graham87 12:03, 2 May 2015 (UTC)

Discussion regarding parallel history moves undergoing at Wikipedia talk:How to fix cut-and-paste moves

There is currently a discussion regarding how to title parallel history page moves at Wikipedia talk:How to fix cut-and-paste moves#Discussion regarding titling standards for moving parallel histories. Members of this WikiProject may be interested in participating, if you have not done so yet. Steel1943 (talk) 18:25, 3 September 2015 (UTC)

Page movers proposal

It has been suggested that the to-be-created Page mover group be granted the ability to perform history merges through Special:MergeHistory. Participate in the discussion at WT:Page_mover#Discussion:_Include_MergeHistory_tool. 103.6.159.86 (talk) 10:41, 16 May 2016 (UTC)

Bot for category history merges

I have just requested a bot to fix the thousands of cut-and-paste moves performed by Cydebot in the days when category moves were not supported. See Wikipedia:Bot_requests#Bot_for_category_history_merges. 103.6.159.72 (talk) 10:56, 18 January 2017 (UTC)