User:Mill 1/Project Chaining back the Years
"This is going to take me ten years," I thought. In the end it took only six.
Preface
The articles that list the recent deaths consistently rank among the most popular on Wikipedia.[1] However, it must have been in the summer of 2018 that I first got interested in the older versions of them. At the time the dead were listed per month ('dpm's') and per year ('dpy's').[2] I noticed wild differences between them in formatting, guidelines, coverage and sourcing. One explanation is that the dpm of the current month is edited intensively as the month progresses, and a lot of watchers make sure the guidelines are followed during and after the running month. However, this was not the case for pages listing the deceased of the pre-Wikipedia era; they were put in dpy's afterwards.[3] Annoyed by the discrepancies between these types of pages, I set out to standardize the formatting of the dpy's first.[4]
During that time I noticed something else which would become the main motivation for the initial phase of this endeavour: days were missing! There seemed to be days on which nobody had died. This could not be, and my OCD tendencies immediately kicked in. An idea formed in my head: why not create dpm's for all months going back to 1995? It would solve the issue of the dpy's becoming very long and I could add missing days when processing a month. I would take the year 2005 as a starting point because I noticed that from 2006 onwards at least one deceased is listed for every day until the present.[5] I remember flinching at the idea when I realised I had to process more than 4000 days. "This is going to take me ten years," I thought. In the end it took only six.
These pages try to give an overview of the activities involved in the project that I dubbed Chaining back the Years.[6] They also list some interesting milestones and statistics.[7]
Caveat lector
This documentation is a personal account of my time spent on the project. Consequently it uses some self-invented jargon[8] and covers subjects that will only be of interest to me. For anybody looking for information on the context and history of the deaths lists, as well as tooling and considerations when editing them, these pages can be useful, but don't say that I didn't warn you.
Three rounds
In hindsight, improving the deaths lists fell into three separate rounds of activities, during each of which every existing dpm and dpy was processed.[9]
- Round 1: Breaking up the Deaths in Years (September 2018 – October 2020)
- Round 2: Adding NYTimes references (November 2020 – October 2021)
- Round 3: New rules: let's process every day (again) (November 2021 – November 2024)
You can find information on the initial versions of the dpm's per round here.
Round 1: Breaking up the Deaths in Years
Period: September 2018 – October 2020
Articles: Deaths in January 1997 – Deaths in December 2005
The first phase started by making the dpy's even longer before forking them into twelve separate dpm's. For every month I needed to perform checks, find missing notable deceased for the list ('entries') and compile the wikitext that I could paste into a dpm. Obviously this was way too much work to accomplish by hand.
So before beginning I extended the functionality of the Excel application that I had already used for several other projects. It would prove to be indispensable when processing a month:
Processing a month using the Excel application
1. Dpm checks
Before entries were added or updated, the month at hand would be checked. Existing entries in the list would be cross-referenced with their corresponding bio's to look for discrepancies:
- Are the existing entries in the correct day sub section?[10]
- Do the existing entries link to a valid biography article?
- Do the corresponding bio's contain the correct "[YEAR] deaths" and "[YEAR] births" categories?
- Does every day sub section contain at least two (later three) entries?
2. Process specific days of the dpm
After the initial checks the actual work on the article could commence. At first I focused on filling the gaps in the days of death, but soon I decided every day should contain at least two entries.[11] Processing a specific date started by clicking the 'Chk' button in the 'Death per date' worksheet. The following tasks would then be executed:
- Resolve the list of bio's whose subject had died on a specific date.[12] More info can be found here.
- Show the list alphabetized and, per bio, display whether it is a stub or has any 'problem flags' like {{multiple issues}}, {{Notability}}, {{Unreferenced}}, {{mcn}}, {{One source}}.
- Apply custom filtering to the list. More info on that here.
- Per bio, try to resolve the following parts of an entry by analyzing the bio's wikitext:
Result filtering
From the start it was also clear to me that some inclusion filtering needed to be applied to the newly found (and existing) entries. On some days more than 30 persons with a bio died. Listing them all would make the dpm's unwieldy and error-prone, and lesser figures (often stubs and virtual orphans) distract from more notable entries.
So I experimented with conditions like not being a stub or not having problem flags. That did not work. However, the tool looked for the date of death (DoD) only in the infobox of the person's bio. As a consequence, a biography having an infobox acted as a first filter. I also made the application look at the bio's text size (excluding the text in the infobox and the stated categories: the 'net size'). This was the second filter. I settled on 4000 characters as the minimum 'net size' of a biography. This first attempt at grading WP:N worked, but it never sat well with me (or with others). It was one of the reasons to initiate round 3.
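The two filters described above can be sketched in a few lines. This is only an illustration in Python, not the author's Excel code: the function names, the brace-matching approach and the way categories are stripped are assumptions; only the infobox requirement and the 4000-character minimum come from the text.

```python
import re

MIN_NET_SIZE = 4000  # minimum 'net size' in characters, as stated in the text


def strip_template(wikitext, name):
    """Remove the first {{name ...}} template, honouring nested braces.

    Returns (remaining text, whether the template was found).
    Hypothetical helper, written for this sketch only.
    """
    match = re.search(r"\{\{\s*" + re.escape(name), wikitext, re.IGNORECASE)
    if match is None:
        return wikitext, False
    depth, pos = 0, match.start()
    while pos < len(wikitext):
        if wikitext.startswith("{{", pos):
            depth += 1
            pos += 2
        elif wikitext.startswith("}}", pos):
            depth -= 1
            pos += 2
            if depth == 0:
                break  # closing braces of the infobox reached
        else:
            pos += 1
    return wikitext[:match.start()] + wikitext[pos:], True


def passes_filters(wikitext):
    """First filter: an infobox must exist. Second filter: net size >= 4000."""
    body, has_infobox = strip_template(wikitext, "Infobox")
    if not has_infobox:
        return False
    body = re.sub(r"\[\[Category:[^\]]*\]\]", "", body)  # categories do not count
    return len(body.strip()) >= MIN_NET_SIZE
```

A bio without an infobox is rejected regardless of its length, which is exactly the weakness that round 3 later set out to fix.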
3. Concluding processing a month
Processing a month would be concluded by two manual activities:
- Search for additional causes of death regarding the entries.[15]
- Are there any 'reason for notability' descriptions in the entries that need trimming?
Chronology of activities
Work started on 1 September 2018 by applying the same format to the sections and list entries of all the dpy's.[4] During the first couple of months I worked on the 24 existing dpm's of 2004 and 2005, processing each month assisted by the Excel tool as described. At the same time I was also in the process of finishing another project.
On 2 February 2019 I standardized the guidelines and day sub sections of all dpm's between 2004 and 2015. Applying those changes finalized the first round of improvements regarding the dpm's of 2004 and 2005.
I could now focus on the Deaths in Year pages. The following list shows when an entire year was completed, after which it could be split up into 12 dpm's, finalizing their first round of improvements:
- 2003: 10 February 2019
- 2002: 17 February 2019
- 2001: 17 February 2019
- 2000: 23 February 2019
- 1999: 19 May 2019
- 1998: 9 November 2019
- 1997: 4 October 2020
Regarding 1998 and 1997 (and 1996) a new dpm was created right after a month had been processed. The 12 dpm's were not created simultaneously anymore as is explained here. Processing of the years 1993-1998 was done in this processing page which would be initialized every time after a dpm was completed.[16][17]
Round 1 saw one final improvement. From the beginning I had noticed that the dpm's lacked references citing the deceased's date (and cause) of death. I had started adding some citations to entries but it seemed to be a drop in the bucket. That's why I introduced a new, feasible requirement: at least one reference per day sub section.[18] Around June 2020 I first started thinking about automating citations. The archive API of The New York Times in particular offered great possibilities. So I wrote some code to experiment with the NYTimes API, retrieving obituary data and creating citations from it. I pasted the output in another processing user page: /References/The New York Times. The results were spectacular. I could now use this list of generated references as a source. So after processing a day I would also manually add citations of matching entries to the day sub section of the dpm. The first month I processed this way was September 1997. I worked my way back to January 1997, improving and bugfixing the code.
Eventually the software evolved into the WikipediaReferences application. You can read more about it here. On November 14, 2020 (I learned from the GitHub commit) the application was finally able to add NYTimes references to the corresponding entries of an entire dpm automatically. I decided to reprocess all the existing dpm's (1997 – 2005) so that their number of stated references would increase considerably. Work started with January 1997 on the same day, heralding the start of the next round.
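To give an idea of what "creating citations from obituary data" can look like, here is a minimal Python sketch. It is not the WikipediaReferences code. The endpoint pattern and the field names (`type_of_material`, `headline.main`, `web_url`, `pub_date`) are assumptions about the shape of the NYTimes Archive API response, the sample document used to try it out should be made up, and the `{{cite news}}` layout is just one reasonable choice.

```python
from datetime import datetime

# Assumed endpoint pattern of the NYTimes Archive API (one call per month).
ARCHIVE_URL = "https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={key}"


def archive_url(year, month, api_key):
    """Build the request URL for one month of archive metadata."""
    return ARCHIVE_URL.format(year=year, month=month, key=api_key)


def is_obituary(doc):
    """Keep only documents whose (assumed) type field mentions an obituary."""
    return "obituary" in (doc.get("type_of_material") or "").lower()


def to_citation(doc, access_date):
    """Turn one archive document into a {{cite news}} reference."""
    published = datetime.strptime(doc["pub_date"][:10], "%Y-%m-%d")
    return (
        "<ref>{{cite news "
        f"|title={doc['headline']['main']} "
        f"|url={doc['web_url']} "
        "|work=The New York Times "
        f"|date={published.day} {published:%B %Y} "
        f"|access-date={access_date}"
        "}}</ref>"
    )
```

The sketch deliberately leaves out the network call and the matching of an obituary to a Wikipedia biography, which is where most of the real work sits.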
Milestones
- 1 September 2018: the first edit is made
- 10 February 2019: the first dpm is created
- 10 February 2019: the first dpy is nuked
- 2 March 2019: all dates since the start of the millennium to date have been accounted for
- 11 July 2020: the first day sub section is processed including generated NYTimes references
After 25 months round 1 was concluded by creating the last dpm of 1997.[19] By this time I must already have decided to extend the 'chaining back' period to January 1990.
Round 2: Adding NYTimes references
Period: November 2020 – October 2021
Articles: Deaths in January 1995 – Deaths in December 2005
As already described at the end of the previous section, the success of the WikipediaReferences application prompted me to re-process all dpm's that existed at that time (November 2020). Automatically adding NYTimes references using the tool would also become the third activity when wrapping up a month (see 3. Concluding processing a month in Round 1 for the other two activities).
Processing a month using the WikipediaReferences application
Processing a particular dpm usually consisted of these steps:
- After the regular processing of a dpm was concluded and the last entries were added/updated I would run the software to evaluate a dpm. See screenshot: I would select 'p', followed by some input to tell the application which month and which Wikipedia source page to process.
- First the app would perform initial checks like looking for duplicate entries. The process is aborted if any issues are encountered.
- Once the initial issues are resolved, the month in question is evaluated by comparing the NYTimes obit data with the entries in the dpm. If nothing else needs attention, the app offers to generate the wikitext, including the added/updated references. However, this was seldom the case. In most cases other actions were required first, after which the evaluation was run again. Two types of actions exist:
  - If NYTimes obituary data exists for a listed entry, then the resolved death date in the obituary is compared with the date of death in the entry's corresponding bio. Very often discrepancies would exist. One reason is that the death date stated in the bio is wrong.[20] These discrepancies had to be corrected first.
  - The software would also spot potential entries: for the particular month, NYTimes obit data would exist for bio's that were not present in the dpm. In fact, so many potential entries were suggested that I applied a notability filter to them.[21] I would add most of the suggested entries manually to the dpm source page.
- After the correction/additions step I would re-run option 'Print month of death', sometimes several times, until no more issues were encountered by the application.
- After successful evaluation of the dpm I would instruct the app to generate the wikicode in a text file.
- Processing a specific dpm is concluded by pasting the contents of the text file in the source page of the dpm and checking the result.
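The evaluation step in the list above boils down to two comparisons: death dates that disagree, and obituaries without a matching entry. A minimal Python sketch, assuming both sides have already been reduced to a mapping from article name to an ISO date string (the function name and data shapes are hypothetical, not taken from the application):

```python
def evaluate_month(entries, obituaries):
    """Compare dpm entries with obituary data for one month.

    entries:    {article name: date of death stated in the bio}
    obituaries: {article name: date of death resolved from the obituary}
    Returns (discrepancies, potential_entries).
    """
    # Listed entries whose bio and obituary disagree on the date of death.
    discrepancies = {
        name: (entries[name], obit_dod)
        for name, obit_dod in obituaries.items()
        if name in entries and entries[name] != obit_dod
    }
    # Obituaries for which the dpm has no entry yet.
    potential_entries = sorted(name for name in obituaries if name not in entries)
    return discrepancies, potential_entries
```

Only when `discrepancies` comes back empty would it make sense to generate the wikitext; the potential entries still need a human decision (and, as described, a notability filter).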
Chronology of activities
Right after I uploaded the last code changes I started using the software on the existing dpm's. I really hit the ground running, processing the years 1997 – 2000 within 6 weeks and adding or updating over a thousand citations (as well as adding quite a few entries suggested by the application).
By September 2021 I had processed all existing dpm's, increasing the number of references on a page considerably.[22] I could now resume my efforts in the processing page where I prepared brand new dpm's, starting with Deaths in December 1996. By now the software was firmly embedded in the way of working.
1995
However, work was interrupted by another job. An editor had forked Deaths in 1995 into 12 dpm's without any regard for the different style and format, after which he added many entries. It took me a sh*tload of time bringing the new dpm's up to par.[23] The task involved a lot of corrections by hand as well: adding causes of death and shortening entry descriptions, meanwhile battling this lunatic. When cleaning up 1995 I also identified many non-notable entries, many of whom didn't even have an enwiki bio. By this time I had already decided to reprocess all the days of the existing dpm's, partly to apply the new notability algorithm to entries. This would mean that many 1995 entries would be cleansed from the lists. That's why I decided it would be a huge waste of time to apply the WikipediaReferences tool to the 1995 entries; it would take a lot of effort correcting entries that would be removed at a later stage anyway. This is the reason why (although chronologically incorrect) this was actually a Round 1 job.[24]
Milestones
- 14 November 2020: the first dpm is processed using the fully functional WikipediaReferences application
- 12 September 2021: processing the first new dpm using the tool
I was still using the wiki_client Excel tool when Round 2 came to an end on 31 October 2021 with the creation of Deaths in September 1996.
More details on the progress regarding Round 2 can be found here. In the table, click on the title 'Round 2' to sort on the date when the processing of a dpm was finished.
Round 3: New rules: let's process every day (again)
Period: November 2021 – November 2024
Articles: Deaths in January 1990 – Deaths in December 2005
By now I had been at it for a couple of years, and during that period two issues started bugging me more and more:
- The notability algorithm is faulty; I'm adding entries whose bio's are semi-orphans. At the same time I miss notable entries because their bio's don't have infoboxes.
- Most entries do not have citations. After completing Round 2 this was improved somewhat but many dpm's now contain references that almost exclusively point to The New York Times as a source.
Wikidata
During my activities I had come across Wikidata when inspecting bio's. At some point I must have noticed that the data stored in a human Wikidata item could serve my purposes, especially these data properties:
- Item's description (= reason for notability regarding humans)
- Date of death (DoD) statement
- Date of birth (DoB) statement (needed to resolve an entry's age)
- Cause of death statement
- Number of wiki's in which the human is present
Investigating the Wikidata query capabilities made me realise that using Wikidata as a source offered huge advantages over using an entry's corresponding Wikipedia page. It would help with both issues, resolve the cause of death automatically and offer an alternative source for generating the description part of an entry.[25] There was one final perk of using Wikidata as a source: the death date statement of many items contained references supporting the claim. This information could be used to generate references for entries automatically when processing a dpm. These were all great improvements. I realised that I had to re-process every day between 1990 and 2005 AGAIN. But since it was clear that it would hugely increase the quality and reliability of the dpm's, I decided in a heartbeat that I would do it. I still had to create the software though, which would ultimately become the WikipediaDeathsPages web application.
At the heart of the app would be the query that would fetch the Wikidata data regarding a specific date of death. Unfortunately I am unfamiliar with the SPARQL query language. Luckily Wikidata:Request a query exists. With the help of volunteers over the course of a couple of months I was finally able to define the query. As input it would only require the date of death. The output is shown below as a table. As you can see it contains the basic data (alphabetized by article name!) I needed to generate the entries for a specific day (in this case 25 August 2001):[26]
item | articlename | itemLabel | itemDescription | sl[27] | dob | dod | dod_refs[28] | cod[29] | mod[30] |
---|---|---|---|---|---|---|---|---|---|
Q11617 | Aaliyah | Aaliyah | American singer and actress (1979–2001) | 69 | 1979-01-16T00:00:00Z | 2001-08-25T00:00:00Z | stated in: Nederlandse Top 40~!stated in: Find a Grave~!Find a Grave memorial ID: 5727911~!retrieved: 2017-10-09T00:00:00Z~!retrieved: 2017-10-09T00:00:00Z~!subject named as: Aaliyah~!subject named as: Aaliyah Dana Haughton~!Nederlandse Top 40 artist ID: aaliyah~!stated in: Integrated Authority File~!retrieved: 2014-04-09T00:00:00Z | aviation accident | accidental death |
Q3298163 | Madge Adam | Madge Adam | English solar astronomer (1912-2001) | 15 | 1912-03-06T00:00:00Z | 2001-08-25T00:00:00Z | stated in: Who's Who~!Who's Who UK ID: U4983~!imported from Wikimedia project: English Wikipedia | ||
Q6779010 | Mary Barnard | Mary Barnard | American poet and translator (1909-2001) | 3 | 1909-12-06T00:00:00Z | 2001-08-25T00:00:00Z | stated in: SNAC~!stated in: Find a Grave~!Find a Grave memorial ID: 6318601~!retrieved: 2017-10-09T00:00:00Z~!retrieved: 2017-10-09T00:00:00Z~!subject named as: Mary Barnard~!subject named as: Mary Ethel Barnard~!SNAC ARK ID: w60s047j | ||
Q1037163 | Carl Brewer (ice hockey) | Carl Brewer | Canadian ice hockey player (1938-2001) | 9 | 1938-10-21T00:00:00Z | 2001-08-25T00:00:00Z | stated in: SNAC~!retrieved: 2017-10-09T00:00:00Z~!subject named as: Carl Brewer~!SNAC ARK ID: w6f76nsq~!stated in: Find a Grave~!Find a Grave memorial ID: 8466339~!retrieved: 2017-10-09T00:00:00Z~!subject named as: Carl Thomas Brewer | ||
Q10294559 | Helmut Bruck | Helmut Bruck | German officer and Knight's Cross recipient | 3 | 1913-02-16T00:00:00Z | 2001-08-25T00:00:00Z | imported from Wikimedia project: English Wikipedia | ||
Q93784 | John Chambers (make-up artist) | John Chambers | American make-up artist and prosthetic makeup expert | 12 | 1923-09-12T00:00:00Z | 2001-08-25T00:00:00Z | imported from Wikimedia project: Italian Wikipedia | ||
Q8079499 | Üzeyir Garih | Üzeyir Garih | Turkish businessman | 4 | 1929-01-01T00:00:00Z | 2001-08-25T00:00:00Z | |||
Q3547943 | Diana Golden (skier) | Diana Golden | American alpine skier (1963-2001) | 6 | 1963-03-20T00:00:00Z | 2001-08-25T00:00:00Z | imported from Wikimedia project: English Wikipedia | breast cancer | natural causes |
Q6033955 | Inigo Jackson | Inigo Jackson | actor (1933-2001) | 1 | 1933-07-19T00:00:00Z | 2001-08-25T00:00:00Z | imported from Wikimedia project: English Wikipedia | ||
Q155493 | Philippe Léotard | Philippe Léotard | French singer and actor (1940-2001) | 21 | 1940-08-28T00:00:00Z | 2001-08-25T00:00:00Z | GND ID: 119002469~!stated in: Roglo~!stated in: Integrated Authority File~!stated in: GeneaStar~!stated in: Who's Who in France~!stated in: Find a Grave~!Find a Grave memorial ID: 5860980~!retrieved: 2015-10-18T00:00:00Z~!retrieved: 2017-10-09T00:00:00Z~!retrieved: 2017-10-09T00:00:00Z~!subject named as: Philippe Leotard~!Who's Who in France biography ID: 25159~!Roglo person ID: p=philippe;n=leotard~!GeneaStar person ID: leotardp~!stated in: filmportal.de~!stated in: BnF authorities~!retrieved: 2017-10-09T00:00:00Z~!retrieved: 2015-10-10T00:00:00Z~!reference URL: http://data.bnf.fr/ark:/12148/cb12070631t ~!subject named as: Philippe Léotard~!Filmportal ID: 0216ac0cf8fb4ce3a3e417812c4a5a72 | respiratory failure | natural causes |
Q3764794 | Ginzō Matsuo | Ginzō Matsuo | Japanese actor, voice actor and narrator | 8 | 1951-12-26T00:00:00Z | 2001-08-25T00:00:00Z | imported from Wikimedia project: English Wikipedia | ||
Q6243659 | John L. Nelson | John L. Nelson | American jazz musician, songwriter, father of Prince | 6 | 1916-06-29T00:00:00Z | 2001-08-25T00:00:00Z | imported from Wikimedia project: English Wikipedia | ||
Q862381 | Bill Pratney | Bill Pratney | New Zealand cyclist (1909-2001) | 2 | 1909-05-20T00:00:00Z | 2001-08-25T00:00:00Z | |||
Q5671841 | Harry Ramberg | Harry Ramberg | Swedish tennis player | 4 | 1909-04-06T00:00:00Z | 2001-08-25T00:00:00Z | imported from Wikimedia project: Swedish Wikipedia | ||
Q4807036 | Asit Sen (director) | Asit Sen | film director | 6 | 1922-09-24T00:00:00Z | 2001-08-25T00:00:00Z | imported from Wikimedia project: English Wikipedia | ||
Q106222009 | Ben Oumar Sy | Ben Oumar Sy | Guinean footballer and manager | 1 | 1926-01-08T00:00:00Z | 2001-08-25T00:00:00Z | |||
Q173413 | Ken Tyrrell | Ken Tyrrell | Racing driver and Formula one team owner (1924-2001) | 18 | 1924-05-03T00:00:00Z | 2001-08-25T00:00:00Z | imported from Wikimedia project: Russian Wikipedia~!stated in: Encyclopædia Britannica Online~!retrieved: 2017-10-09T00:00:00Z~!Encyclopædia Britannica Online ID: biography/Ken-Tyrrell~!subject named as: Ken Tyrrell | pancreatic cancer | natural causes |
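Two columns of this output need some unpacking before they can be used in an entry: `dod_refs` packs all reference claims into one string separated by `~!`, and the age has to be derived from `dob` and `dod`. The Python sketch below shows one way to do both; it is an illustration based only on the table above, and the function names are hypothetical.

```python
from datetime import date


def parse_dod_refs(raw):
    """Split a dod_refs cell into (property, value) pairs.

    Claims are separated by '~!'; each claim looks like 'property: value'.
    Only the first ': ' is used as the separator, so URLs stay intact.
    """
    pairs = []
    for part in raw.split("~!"):
        if not part.strip():
            continue
        key, _, value = part.partition(": ")
        pairs.append((key.strip(), value.strip()))
    return pairs


def age_at_death(dob, dod):
    """Age in whole years from two timestamps like '1979-01-16T00:00:00Z'."""
    born = date.fromisoformat(dob[:10])
    died = date.fromisoformat(dod[:10])
    had_no_birthday_yet = (died.month, died.day) < (born.month, born.day)
    return died.year - born.year - had_no_birthday_yet
```

For the first row, `age_at_death("1979-01-16T00:00:00Z", "2001-08-25T00:00:00Z")` gives 22. Note that a `dob` of 1 January may in reality be a year-only statement, so the computed age is only as precise as the Wikidata value.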
Rethinking notability
As already explained, the algorithm that decided if a deceased should be listed was flawed. I had already noticed that more relevant people appear on more wiki's (winner). I also came to believe that more links to a bio suggest greater notability. The Wikidata query returned the number of site links per entry. The Wikipedia link count API could resolve the number of incoming links. At some point I came up with the concept of the "notability score" of a potential entry. This score is expressed as the product of the two aforementioned data points. For instance, take John Chambers (make-up artist):
Number of site links: 12 (see column 'sl' in above table)
Number of pages linking to the bio: 237 (Link Count tool result, API result)
Hence John's notability score would be 12 × 237 = 2,844.
After much experimenting I settled on a minimum score of 48[31] for an entry to be listed. Although still not perfect, it worked way better than the previous algorithm, with this as the end result.
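Expressed as code, the score is a one-liner. The Python sketch below only restates the rule from the text (site links times incoming links, with 48 as the minimum); the function names are made up for the illustration.

```python
MIN_NOTABILITY_SCORE = 48  # minimum score for an entry to be listed


def notability_score(site_links, incoming_links):
    """Product of the number of wikis and the number of pages linking to the bio."""
    return site_links * incoming_links


def is_listed(site_links, incoming_links, minimum=MIN_NOTABILITY_SCORE):
    """True if the potential entry reaches the minimum notability score."""
    return notability_score(site_links, incoming_links) >= minimum


print(notability_score(12, 237))  # John Chambers (make-up artist): 2844
```

Because it is a product, a bio that exists on a single wiki needs at least 48 incoming links, while a bio on 12 wikis already qualifies with 4.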
References, revisited
Wikidata references
When I was building the Wikidata query I had noticed that some online sources were stated quite often as references for death date statements for humans. Because of the structured way this information was stored I could use it to generate citations for my entries. Obviously the online source is checked for existence and its contents are searched for the date of death (DoD) before the information is used to create a reference.
The following sources are evaluated, in this specific order:
- Encyclopædia Britannica
- The Guardian
- The Independent
- Internet Broadway Database
- DB~e
- Biografisch Portaal
- FemBio
- filmportal.de
- Fichier des personnes décédées
This is an example of a generated reference based on the Wikidata DoD statement claims of José Craveirinha:
<ref>{{cite web |last1= |first1= |title=José Craveirinha |url=https://www.britannica.com/biography/Jose-Craveirinha |website=britannica.com |publisher=Encyclopædia Britannica Online |access-date=24 December 2023 |language= |date=}}</ref>
[32]
Sports sites references
During the implementation of this I discovered an alternative way of automatically utilizing online sources. Websites use specific URL patterns to identify resources on the host. Some websites use name-based patterns. For instance, the site Cycling statistics uses the following URL to identify rider Jacques Anquetil:
https://www.procyclingstats.com/rider/jacques-anquetil
Knowing the specific pattern I could 'guess' URLs using the label name of an entry. When processing DoD November 2, 2004, for instance, rider Gerrie Knetemann would be one of the deceased returned by the Wikidata query.
The software would send https://www.procyclingstats.com/rider/gerrie-knetemann as a request. If the web page exists, its HTML is searched for the DoD.[33] If the DoD is found, the web page can act as a citation and the following web reference is generated:
<ref>{{cite web |last1= |first1= |title=Gerrie Knetemann |url=https://www.procyclingstats.com/rider/gerrie-knetemann |website=procyclingstats.com |publisher= |access-date=16 December 2023 |language= |date=}}</ref>
[34]
This way of looking for citation sources is used when no Wikidata DoD references were encountered. The mechanism was applied to the following (sports) web sites, in this order:
- baseball-reference.com
- pro-football-reference.com
- basketball-reference.com
- hockey-reference.com
- olympedia.org
- worldfootball.net
- procyclingstats.com
- where2golf.com[35]
Note: To decrease the number of HTTP requests per entry I first looked in the entry's bio to determine if the person was known for any of the sports being evaluated. Only then would the URL be compiled and called.
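The URL-guessing mechanism for a name-based site can be sketched as follows. This is a Python illustration, not the code of the web application. The accent stripping in `slugify` and the date notations searched for in the HTML are assumptions; only the procyclingstats URL pattern itself is taken from the examples above. The actual HTTP request is left out.

```python
import re
import unicodedata
from datetime import date

RIDER_URL = "https://www.procyclingstats.com/rider/"  # pattern shown above


def slugify(label):
    """'Gerrie Knetemann' -> 'gerrie-knetemann' (accents dropped: an assumption)."""
    ascii_label = (
        unicodedata.normalize("NFKD", label).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_label.lower()).strip("-")


def guess_url(label):
    """Guess the rider page from the Wikidata label of the entry."""
    return RIDER_URL + slugify(label)


def mentions_dod(html, dod):
    """Check whether the page text contains the date of death in a few notations."""
    candidates = (
        dod.isoformat(),                     # 2004-11-02
        f"{dod.day} {dod:%B %Y}",            # 2 November 2004
        f"{dod:%B} {dod.day}, {dod.year}",   # November 2, 2004
    )
    return any(candidate in html for candidate in candidates)
```

Only if the guessed page exists and `mentions_dod` succeeds would the URL be turned into a `{{cite web}}` reference; a page for a namesake with a different date of death is rejected this way.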
Second tier Wikidata references
If no sports site reference could be resolved, the following Wikidata reference sources are evaluated (in that order):
Since these sources are stated very often as Wikidata DoD claims they now appear in abundance as references in the dpm's:
<ref>{{cite web |last1= |first1= |title=Jeanne Stuart - Social Networks and Archival Context |url=https://snaccooperative.org/ark:/99166/w6qp9q9c |website=snaccooperative.org |publisher= |access-date=24 December 2023 |language= |date=}}</ref>
[36]
I had finally established an acceptable way of resolving notability and generating citations. Now I only had to cast it into a user-friendly (the user being me) solution.
TODO: refer to the statistics on the generated refs.
Wikipedia Deaths Pages
From the start it was clear the solution was to be a web application. Because of the amount of text a console app would not be suitable, and by then I had enough experience with the web application framework Angular that I felt comfortable creating a single-page application to meet my front-end needs.
I cannot determine when I started developing the web site. Fact is that the new software was first used on 16 November 2021 (see Milestones). A lot of tweaking of the code followed in the weeks after. I remember expanding the citations functionality and bugfixing the Wikidata query.
When the first version was released the site contained all the functionality to process a dpm the way the Excel tool did, but with the implemented improvements.
To achieve this, functionality present in the Excel tool had to be programmed again, for instance:
- Initial dpm checks
- Resolving data in the entry's bio, for instance the entry's description
- Numerous text manipulation functions
More in-depth information on the app can be found here. But how was the web site used when processing a dpm?
Processing a month using the Web application
A dpm article would be updated by following these steps:
- Perform the initial dpm checks. Consult #1. Dpm checks in Round 1 for specifics. Additional checks looked for article redirects and named references. See the screenshot for an example of the check results.
- Any issues found have to be solved first, e.g. moving an entry to the correct day sub section in the dpm, fixing redirects, removing nowiki-entries, adding categories or correcting the DoD in the entry's biography.
A special kind of issue was the following. Wikidata was the source for the entries of a specific date. Wikidata items were initially based on the corresponding Wikipedia articles, at least regarding biographies. Also, Wikidata is not updated automatically when the bio changes. Therefore, over time, discrepancies would (and will) arise between the statements in the bio and the corresponding Wikidata item. Since I processed dpm's per date, the discrepancy that hurt me the most was a differing Wikidata statement on the date of death (P570). My web app would spot these differences, after which I was left with two choices: change (the DoD in) Wikipedia or Wikidata. The thousands of edits I made to both biographies and Wikidata show that this issue was quite common.
It also often occurred that, at the time the Wikidata item was created, the Wikipedia bio only stated the year of death of the deceased (which would become the P570 statement). Later the bio would be updated with the actual DoD and end up in a dpm. As a result the issue would be spotted when processing a death date and I would have to correct the P570 statement in Wikidata, changing the YoD to the DoD. I just now realised that I would miss out on notable entries if the date of death was not stored as a real date in Wikidata AND the person was not listed as an existing entry. It looks to me as if this situation acts as a filter for notability on its own.
- If all issues are solved, processing the days in the dpm can commence.
Code excerpt
Example of the C# code handling a piece of the challenge to determine the description part of the entry (which denotes the reason for a person being notable).
public string ResolveDescription(string wikiText)
{
    wikiText = RemoveReferences(wikiText);

    string description = GetInitialDescription(wikiText);
    if (description == null)
        return null;

    description = description.Replace("U.S. ", "American ", StringComparison.OrdinalIgnoreCase); // because of the end candidate '.'
    description = description.Replace("United States ", "American ", StringComparison.OrdinalIgnoreCase);

    // Truncate string; [,] [perhaps/probably] [best] known [mostly] for .. etc.
    string[] endCandidates = new string[] { "Infobox", "infobox", "{|", "{{", " who ", " whose ", " notable ", " noted ", " known ", " better ", " spanning ", " originally ", " widely ", " responsible ", " remembered ", " best ", " most ", " perhaps ", " reputed ", " born ", " considered ", " particularly ", "." };

    int posEnd = GetPositionDescriptionEnd(description, endCandidates);

    // No end candidate found: the opening sentence of the article cannot be parsed.
    if (posEnd == InitialPosEnd)
        throw new InvalidWikipediaPageException($"None of the {endCandidates.Length} 'description end' candidates found (including '.') within {InitialMaxLengthDescription} chars from 'description start'. Change the opening sentence of the article. Description: \r\n{description}");

    description = description.Substring(0, posEnd);

    return RemoveWikiLinks(description);
}

private string GetInitialDescription(string wikiText)
{
    string[] descriptionStarts = new string[] { " was a ", " was an ", " was the ", " was one of ", " was " }; // " was " LAST!
    int pos = GetPositioninWikiText(wikiText, descriptionStarts);

    if (pos == -1)
        return null;

    return wikiText.Substring(pos, Math.Min(InitialMaxLengthDescription, wikiText.Length - pos));
}
Chronology of activities
Round 3 was kicked off on November 16, 2021 when Deaths in August 1996#1 was generated using the new software. I started using the web application when I was in the middle of processing the year 1996. This made the statistics on 1996 somewhat confusing, which is explained on the 1996 Statistics page. After I completed 1996 in December 2021 I went on to process 1994. Processing 1994 was interrupted by the 1995 business, as explained. When 1995 was sorted I could resume processing the next year: 1997. After I finished 1997, to keep things interesting for myself, I started alternating the years of dpm's to process; after completing one or more dpm's regarding a year that was not processed before (years 1990-1993)[37] I would re-process a year with existing dpm's. I continued working this way until processing 1991 in September 2022. By that time I felt it was time to get to the bottom end of the processing period: January 1990. I noticed something regarding these early years: it was getting increasingly difficult to compile proper death day sections. I would often struggle to find four entries with the required minimum of two references per day.[38] But on 5 February 2023, 4.5 years after starting this endeavour, this edit completed 1990. I could now start reprocessing the remaining years 1999-2005, which would take me another 21 months. As you can see in the chart,[39] starting with April 1999 (dpm 112) I re-processed the deaths lists in chronological order.
Another tool: WikidataEditor
Generating references automatically using Wikidata worked like a charm. But I was still adding many citations manually using the RefToolbar. Over time I realized that I could avoid these tedious tasks by using Wikidata; see the section on #Wikidata_references: if I added a reference to the date of death statement ('P570') of the Wikidata item of the deceased being listed, then the citation would be generated automatically when processing a day using the Web application. Requirement: the reference URL in the statement had to point to one of the reference sources supported by the Web application (like the NYTimes or The Guardian). This actually worked! The metadata of the citation, like title, author name and date of publication, was automatically scraped from the HTML when following the reference URL stated in the P570. This data would then be put in the generated reference. The downside was that I had to wait a bit for Wikidata to process the changes to the Wikidata item. Sometimes I had to wait more than two minutes. Another type of Wikidata edit I made frequently was adding or correcting the date of death itself. Often an outdated DoD or only the year of death was stated, which needed correction before I could let the Web app process the particular date.[40]
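The flow described above can be sketched roughly as follows. This is a minimal illustration and not the actual code of the Web application: the class `CitationSketch`, the `SupportedHosts` allow-list and the choice of the `{{cite news}}` template are assumptions made for the example, and the metadata is passed in directly instead of being scraped from the HTML.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class CitationSketch
{
    // Hypothetical allow-list; the real application supports its own set of sources.
    private static readonly HashSet<string> SupportedHosts =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "www.nytimes.com",
            "www.theguardian.com"
        };

    // True when the reference URL of the P570 statement points to a supported source.
    public static bool IsSupportedSource(string referenceUrl)
    {
        return Uri.TryCreate(referenceUrl, UriKind.Absolute, out Uri uri)
            && SupportedHosts.Contains(uri.Host);
    }

    // Builds the wikitext of a citation from metadata that is assumed to be scraped already.
    // Returns null when the source is not supported.
    public static string BuildCitation(string referenceUrl, string title, string author,
                                       DateTime published, DateTime accessed)
    {
        if (!IsSupportedSource(referenceUrl))
            return null;

        const string dateFormat = "d MMMM yyyy";

        return "<ref>{{cite news"
            + " |url=" + referenceUrl
            + " |title=" + title
            + " |author=" + author
            + " |date=" + published.ToString(dateFormat, CultureInfo.InvariantCulture)
            + " |access-date=" + accessed.ToString(dateFormat, CultureInfo.InvariantCulture)
            + "}}</ref>";
    }
}
```

A URL on an unsupported host simply yields no citation in this sketch, which mirrors the requirement that the reference URL has to point to a supported source.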
This was all great but after 2k manual edits in Wikidata I started thinking about automating this process too. The result was the WikidataEditor. From what I can see in the GitHub repo I started working on the solution towards the end of July 2023. It was a lot of fun developing this tool, which took a couple of months. I even uncovered a major bug in the Wikidata REST API during the process! After the bugfix the editor worked great (despite the somewhat crude GUI). All you had to do was enter the DoD and reference URL, click 'update' and the P570 statement was changed in Wikidata! After waiting a couple of minutes the Wikipedia Deaths Pages application would be able to generate a citation based on the added Wikidata data! In the end the time I had to wait for Wikidata to process the changes proved to be too long. I think I only made about 200 edits to Wikidata using my editor. From June 2004 onwards I resorted again to changing Wikidata items manually.
One last change to the notability rules
In October 2023 I was well into processing the year 2002 when I decided to address one final thing that had been bugging me for a long time: the part of the software that determined the number of links to a particular biography (the 'link count') was faulty. As a result entries ended up in my lists that were not notable enough. I already suspected the reason for this phenomenon but encountering Kudrat Singh was the final straw: I asked for help to deal with this issue. The consequence was that I had to change the software this late in the project. It could potentially have a significant effect on the results. Take for instance John Chambers (make-up artist). Initially the link count was 257. Based on the new algorithm it would only be 46 (in the web app I would parse the API result[41] by the way)! Luckily work since then indicated that the results were not as dramatic as initially feared. I regret not addressing this issue sooner because a lot of unnotable entries ended up in the dpm's because of this. Oh well, the fact that a person was listed in a transcluded template should have accounted for some notability using the original link count algorithm. One thing was sure: I was not going to re-process all the dpm's again. That would have meant removing a lot of entries from the lists and having more discussions like this. It would have been worse if I had missed notable entries instead of the other way around. In that sense I'm glad that I didn't incorporate pageviews in the algorithm; the volatility of that indicator could have resulted in missing entries.
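Parsing such an API result could look something like the sketch below. It assumes that the link count is read from the `totalhits` field of a MediaWiki search API response in JSON format; which search query the web app actually sent is not stated here, and the class `LinkCountSketch` is hypothetical.

```csharp
using System;
using System.Text.Json;

public static class LinkCountSketch
{
    // Reads query.searchinfo.totalhits from a MediaWiki search response (format=json).
    // Assumption: this field is what serves as the 'link count'.
    public static int GetTotalHits(string json)
    {
        using JsonDocument doc = JsonDocument.Parse(json);

        return doc.RootElement
            .GetProperty("query")
            .GetProperty("searchinfo")
            .GetProperty("totalhits")
            .GetInt32();
    }
}
```

Reading the total from `searchinfo` means the count does not depend on how many individual results are returned in the `search` array.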
Digital magnet fishing
During the course of Round 3 I increasingly appreciated the way the automatic creation of references worked. Although the new notability rules were the prime motivator for this last round, I quickly realised that the major gains lay in increasing the reference density[42] of the dpm's. In a way I saw the whole exercise as digital magnet fishing: I threw a bunch of new and existing entries in my tool and waited to see how many references clung to them.[43] In the end I must have generated 20k citations this way. Adding references had one big additional benefit: it camouflaged the fact that I removed numerous unnotable entries that plagued the dpm's. I usually made edits per death date. In such an edit I would remove unnotable entries, add my own and add citations. As a result the content size of the dpm would grow in almost all cases. This shows up in green in the Page History of a dpm. Just removing entries (which shows up in red) would have led to much hassle from wikipedians watching these pages.
The final sprint
From the start of 2024 the project began to feel more like an obligation and I was looking forward more and more to completing it. I created the first documentation of the project with user page User:Mill 1/Deaths in Month article inits. It served the purpose of trying to figure out what I had been doing the past few years but also to provide a list of remaining dpm's that I could tick off once I had processed them. In the meantime I reluctantly fixed November and December 1989 (long story) but I was adamant not to let myself be sucked into processing the eighties as well. I did apply the WikipediaReferences application to some eighties dpm's compiled by Braintic/Bryan Krippner but that was it.[44] I had some really interesting discussions by the way with Briantic like this one just before he got himself banned (and banned again as Bryan Krippner).
Finally, with this edit on 2 November 2024 it was finished. I spent the next few days adding references to death dates that fell below the minimum reference density[42] of 30%.
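The check behind those last edits comes down to simple arithmetic: the reference density of a day is its number of references divided by its number of entries. A minimal sketch, assuming the per-day counts are already available; the class `ReferenceDensitySketch` and its method names are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReferenceDensitySketch
{
    // Reference density = number of refs / number of entries.
    public static double Density(int references, int entries)
    {
        return entries == 0 ? 0.0 : (double)references / entries;
    }

    // Returns the days of a month whose density falls below the minimum (30% by default).
    // Key: day of the month; value: the entry and reference counts of that day.
    public static List<int> DaysBelowMinimum(
        IDictionary<int, (int Entries, int References)> days, double minimum = 0.30)
    {
        return days
            .Where(d => Density(d.Value.References, d.Value.Entries) < minimum)
            .Select(d => d.Key)
            .OrderBy(day => day)
            .ToList();
    }
}
```

Applied to the project totals, `Density(27268, 42765)` comes to roughly 0.6376, the overall 63.76% listed in the statistics.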
The WikipediaChecks application
To determine death dates that fell below 30% reference density I made use of yet another tool I created for the project: the WikipediaChecks application. This web tool generates statistics of dpm's in terms of number of entries, references and article sizes (in bytes) per dpm or processed year. It proved indispensable in locating areas of improvement when analyzing the dpm years. It also was crucial when compiling the project statistics when I finalized the documentation of the project. I copied all the HTML tables into a Microsoft Excel file where I did my thing. After that I created a VBA script that generated wikitable text (including heat map colors!) based on the tables in Excel. The resulting wikitext could subsequently be pasted in the year pages of the statistics, like /Statistics/1997.
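The idea of generating wikitable text with heat map colors can be illustrated as follows. Note that the original script was written in VBA and worked on the Excel tables; the sketch below is in C#, and its column layout, the white-to-green color scale and the class `WikitableSketch` are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class WikitableSketch
{
    // Maps a density in [0, 1] to a color between white (0) and green (1).
    // Hypothetical scale; the real heat map may use different colors.
    public static string HeatColor(double density)
    {
        double clamped = Math.Max(0.0, Math.Min(1.0, density));
        int fade = (int)Math.Round(255 * (1.0 - clamped));
        string hex = fade.ToString("X2", CultureInfo.InvariantCulture);

        return "#" + hex + "FF" + hex;
    }

    // Builds wikitable markup with one row per dpm and a colored density cell.
    public static string BuildTable(IEnumerable<(string Month, int Entries, int References)> rows)
    {
        var sb = new StringBuilder();
        sb.Append("{| class=\"wikitable\"\n");
        sb.Append("! Month !! Entries !! References !! Density\n");

        foreach (var row in rows)
        {
            double density = row.Entries == 0 ? 0.0 : (double)row.References / row.Entries;
            string percentage = (density * 100).ToString("F1", CultureInfo.InvariantCulture) + "%";

            sb.Append("|-\n");
            sb.Append("| " + row.Month + " || " + row.Entries + " || " + row.References);
            sb.Append(" || style=\"background:" + HeatColor(density) + "\" | " + percentage + "\n");
        }

        sb.Append("|}");
        return sb.ToString();
    }
}
```

The returned string can be pasted into a wiki page as-is; each density cell carries its own background color, so no separate styling step is needed.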
The last edit
The last official edit I made as part of the project was done on November 6, 2024 at 22:15, concluding project Chaining back the Years. All that was left was documenting the project, which turned out to be a mammoth task in its own right.
Milestones
- 16 November 2021 The first day was generated using the new software.
- 5 February 2023 1990 is completed.
- 2 November 2024 The last dpm is processed using the new rules.
- 6 November 2024 The last edit was made bringing the minimum percentage to 30% reference density[42] regarding all the days of 1990-2005.
Side effects (under construction)
TODO During the entire process I would find many errors in the analyzed bio's. I must have corrected thousands of bio's during the course of this project. The most common fixes to bio's:
- Adding the nationality of a person in the opening sentence
- Adding the date of death of a person[45]
- Correcting the date of death of a person
- Correcting categories regarding the year of death (and birth)
Also:
- Wikidata is not magically updated when Wikipedia content changes. As a consequence I made some 3,000 edits in Wikidata to sync the death (and often birth) data.
- Seven repositories on GitHub containing Wikipedia-related software
- Created articles, for instance Lesley Cunliffe and Kambara Tai
- Created new dpy's for the years Deaths in 1980 – Deaths in 1989 because in 'datum' they were removed from the Year-pages (half of them have already been redirected to dpm's)
- Currently (9 Dec. 2024) I rank number 16,289 among Wikipedians by number of edits (65,485)!
Statistics
Some interesting facts and statistics regarding the project that covered the period 1990-2005:
Counts per 6 November 2024
- Total number of entries: 42,765 (details)
- Total number of references: 27,268 (details)
- Overall reference density[42]: 63.76% (27,268/42,765)
- Total number of added content (approx.): 1,850 pages (A4)[46] (details)
- Total number of views for all dpm's per year (2023): 846,402 (details)
Statistics regarding the project
- Duration: 6 years and 2 months (September 2018 – November 2024)
- Number of death days processed: 5,843
- Number of created dpm's: 170[47]
- Number of added entries (approx.): 21,200 (details)
- Number of added references (approx.): 22,700 (details)
- Total size of added text (approx.): 7.9 megabytes (details)
- Which translates to approx. 1,400 pages (A4)
- Number of Wikipedia edits (approx.): 22,000[48] (details)
- Number of edits on Wikidata (approx.) (manual[49] and automated[50]): 3,000
More detailed statistics can be found here.
Epilogue
One question remains: Why? Why would anyone spend that much time on these trivial lists? Sure, I stumbled across a mess when I was looking for a challenge to help me become a better programmer. And in a way I became a slave to the applications I created; the custom and personal software worked so well that I felt the responsibility of seeing it through. Perhaps I just wanted to leave something behind, albeit insignificant.
Or maybe, as Tony Stark put it: "Everybody needs a hobby."
References
- ^ "Announcing Wikipedia's most popular articles of 2023". Wikimedia Foundation. 5 December 2023. Retrieved 20 January 2024.
- ^ In 2018 dpm's only existed for 2004 and later. Older deceased were organised in dpy's that existed for the years 1995–2003 (most of which were getting very long at the time). The remaining deceased were listed in the year pages like this one (all removed in March 2023 by the way).
- ^ Adding recent deaths has more or less been going on since November 2001 starting with the (red link) addition of Melanie Thornton (strangely in article Deaths in 2003). From December 2003 onwards it took off in earnest, accelerating in the following years.
- ^ a b I wrote some code to help me accomplish the task.
- ^ I wrote some code to check that as well
- ^ Named after Holding Back the Years, a hit song on the first vinyl album I ever bought.
- ^ Probably interesting for me exclusively ("Wikipedia" Activities available; just add meaning.)
- ^ Be prepared for sentences like "The issue would be spotted by the Wikipedia Deaths Pages application when processing a dpm's death date and I would have to correct the P570 statement in the corresponding Wikidata item changing the YoD to the DoD."
- ^ Apart from the three main rounds other smaller improvement iterations were done as well like:
- ^ The cause of these errors is very often that the date of death in corresponding bio's had changed but was not reflected in the list.
- ^ And even later I decided every day sub-section should list a minimum of three entries. After that I did the same regarding the minimum number of references per day and so on.
- ^ Another way would have been to go through everyone listed in the category of deaths of a specific year. However, this would have meant processing the months of an entire year simultaneously. And I still would have had to query the bio's in search of the subject's date of death. Also, as I would find out, many bio's stated incorrect categories regarding the year of death (and birth).
- ^ In a lot of cases the nationality of a person was missing in the opening sentence so I had to fix the bio. Americans especially forget that the English Wikipedia is an international venture.
- ^ Causes of death of a person were suggested by displaying the first sentence in the bio that contained the string literals " murdered", " killed" or " died" (in that order). Although crude, this algorithm worked well and saved me a lot of time.
- ^ I found out that above the age of around 65 the cause of death is often not stated in a bio because, well, people just die of old age and 'natural causes' is not a valid cause of death (approaching the age mentioned made this work a tad confronting at times).
- ^ During the course of the project a whopping total of 5078 edits were made in this page.
- ^ For undisclosed reasons 1993 and 1995 were partially processed in two other pages
- ^ This minimum was increased to two during round 3.
- ^ Actually this round was concluded when dpm Deaths in December 1995 was completed. This is explained here
- ^ It's staggering how many editors confuse the date of publication of a cited source with the date of demise.
- ^ During the course of the project the notability filter was subject to change. First I used the 'net article size filter'. This was later changed to the filter applied in Round 3: the number of incoming links to the corresponding article.
- ^ This corresponds with the gap of 10 months during which no work was done in the processing page.
- ^ I wrote some code to fix the format and some other stuff.
- ^ In hindsight, it would have saved me buckets of time if I just created the 1995 dpm's myself in a processing page and pasted them over the existing ones. Some referenced entries would have been lost though.
- ^ In almost all cases the information in the opening sentence of a bio proved to be more useful than the Wikidata description, however.
- ^ Actually the data was returned by Wikidata as JSON after which it was deserialized to fitting objects.
- ^ Site links; the number of wiki's (including the English Wikipedia) in which the item is present.
- ^ References regarding the DoD (date of death). Data is delimited by the text '~!'
- ^ Cause of death
- ^ Manner of death
- ^ Initially this limit was 50 but soon I changed it to 48 because of its factorization qualities.
- ^ "José Craveirinha". britannica.com. Encyclopædia Britannica Online. Retrieved 24 December 2023.
- ^ The web sites used specific date formats to display the death date. Obviously this had to be taken into account when looking for the date.
- ^ "Gerrie Knetemann". procyclingstats.com. Retrieved 16 December 2023.
- ^ Not very successful, only 10 generated citations in total..
- ^ "Jeanne Stuart - Social Networks and Archival Context". snaccooperative.org. Retrieved 24 December 2023.
- ^ The creation of Deaths in December 1993 is quite funny in that regard.
- ^ The fact that the internet was not widespread in the early nineties did not help.
- ^ The source for this chart was this table, sorted on Round 3 dates.
- ^ Even more discrepancies existed regarding dates of birth. I corrected some in Wikidata but I soon realized I had to stop if I wanted to complete the project within my lifetime.
- ^ Increase parameter 'srlimit' to see more link search results in the JSON response.
- ^ a b c d The reference density is the number of refs / number of entries
- ^ To a lesser extent this was true for automatically adding causes of death to (existing) entries
- ^ I may apply the WikipediaReferences tool to other pre-nineties dpm's in the future but I'm not sure.
- ^ Sometimes the person was still deemed alive in the bio until the correction
- ^ About half of the content consists of citations
- ^ Mill 1 - Pages Created - XTools
- ^ This breaks down to an average of slightly less than one added entry per edit but slightly more than one added reference per edit!
- ^ Wikidata; Preferences for me states 2,817 edits (per 15 Nov 2024)
- ^ The address of the client changes so only a limited set of edits are shown per session. 40 sessions * 5 edits per session = 200 automated edits