
User:SJK/Year in Review database notes: Difference between revisions

Simon_J_Kissane (talk)
okay, now i've downloaded all the neccessary data for the years
Simon_J_Kissane (talk)
preliminary statistics on Year in Review entries

Revision as of 10:53, 9 November 2001

2001-11-09 09:55 UTC: I have just completed downloading all the year entries up to 2001 for "Year in Review". (I will do the date entries later.) I did it using a Perl script; if you want to see it, it is at /yrget perl script. It requires an "ENTRIES" file, which contains a list of all the year entries (I got one by downloading the index and then using sed and grep with appropriate regexps).


It stores the downloaded files in the data/ directory, under the page's title (with spaces replaced by underscores, etc.). Each file contains the page's wiki source (what you see when you edit). It inserts as the first line of each file the following command: "#YEAR ''name of page'' REV=''latest revision of page''". Next I am going to analyse them and try converting them to a database. I am not entirely sure what I will use, though it will probably be some combination of Perl, SICStus or GNU Prolog, and Unix shell utilities.
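As a rough illustration of the storage convention just described, here is a minimal Python sketch (not the Perl script itself). The function names `data_path` and `with_header` are hypothetical, only the space-to-underscore rule is implemented because the other replacements are not spelled out, and the revision is treated as an opaque string.

```python
def data_path(title, data_dir="data"):
    """Return the file path for a page: spaces in the title become underscores."""
    # Only the space rule is implemented; any further replacements are unspecified.
    return f"{data_dir}/{title.replace(' ', '_')}"


def with_header(title, revision, wiki_source):
    """Prepend the #YEAR command line to the page's wiki source."""
    return f"#YEAR {title} REV={revision}\n{wiki_source}"


if __name__ == "__main__":
    print(data_path("Some page title"))
    print(with_header("Some page title", "7", "* placeholder wiki text"))
```

Keeping the header as the first line means later tools can read the page name and revision without consulting the file name.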


These are some preliminary statistics on the structure of the entries:

  • years present: 1032
  • contain "Events" section: 988
  • contain "Births" section: 967
  • contain "Deaths" section: 967
  • none of the abovementioned sections: 65


The above statistics are probably not 100% accurate, but they would be close...
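Counts like these could be gathered along the following lines. This is only a sketch in Python: how a section is recognised is an assumption (a line that equals the section name once surrounding quote, equals, colon and space characters are stripped), and `has_section` and `section_stats` are hypothetical names, not part of the actual tooling.

```python
SECTIONS = ("Events", "Births", "Deaths")


def has_section(source, name):
    """Assumed rule: some line is just the section name wrapped in wiki markup."""
    for line in source.splitlines():
        if line.strip(" '=:").lower() == name.lower():
            return True
    return False


def section_stats(pages):
    """pages maps a page title to its wiki source; returns the counts."""
    stats = {"years present": len(pages), "none": 0}
    for name in SECTIONS:
        stats[name] = 0
    for source in pages.values():
        found = [name for name in SECTIONS if has_section(source, name)]
        for name in found:
            stats[name] += 1
        if not found:
            stats["none"] += 1
    return stats
```

A loose matching rule like this is one reason such figures would only be approximate.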


-- SJK


My goal here: to replace all the hard-to-maintain lists and other organizational features of Wikipedia with databases. I am planning to start with Year in Review and work from there. Comments on how best to do this, and on the approach I propose below, are more than welcome. (But if you object to the proposal in principle, forget about that until later -- once I have written the code then we will of course discuss whether to install it...) I plan to begin writing code after the end of my exams (Nov. 27).


We will use PHP to write this script, so it can be integrated into Magnus' PHP wiki.


The main YIR table in the database will have the following format:

 Year|Month|Day|EventType|Text

Where EventType is one of Birth, Death, Event or a Nobel Prize type, and Text is standard wiki text.

Note that Month or Day may be null.
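One way the table could look, sketched with Python's built-in sqlite3 module purely for illustration (the plan is for PHP, and the actual database is not decided here). The table name YEARINREVIEW is taken from the query further down; the column types are assumptions.

```python
import sqlite3


def create_yir_table(conn):
    """Create the main table; month and day are nullable, as noted above."""
    conn.execute("""
        CREATE TABLE YEARINREVIEW (
            year      INTEGER NOT NULL,
            month     INTEGER,          -- NULL when the month is not known
            day       INTEGER,          -- NULL when the day is not known
            eventtype TEXT NOT NULL,    -- e.g. 'Birth', 'Death', 'Event'
            text      TEXT NOT NULL     -- standard wiki text
        )
    """)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_yir_table(conn)
    # An entry with no month or day, using placeholder text.
    conn.execute(
        "INSERT INTO YEARINREVIEW VALUES (?, ?, ?, ?, ?)",
        (1900, None, None, "Event", "placeholder wiki text"),
    )
    print(conn.execute("SELECT month, day FROM YEARINREVIEW").fetchone())
```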


The Nobel Prize type is of course Nobel Prize Physics, Nobel Prize Chemistry, etc. Maybe this belongs in a separate NOBELPRIZE table:

 Year|Month|Day|Field|Awardee|Comment

Then we can use the NOBELPRIZE table to generate the subpages of Nobel Prize.
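A hedged sketch of how such a table might feed the subpages, again in Python with sqlite3 for illustration only. The table is spelled NOBELPRIZE here, the column types are assumptions, and `subpage_listings` and its "* year - awardee" line format are hypothetical.

```python
import sqlite3
from collections import defaultdict


def create_nobel_table(conn):
    """Create the separate prize table with the columns listed above."""
    conn.execute("""
        CREATE TABLE NOBELPRIZE (
            year    INTEGER NOT NULL,
            month   INTEGER,
            day     INTEGER,
            field   TEXT NOT NULL,
            awardee TEXT NOT NULL,
            comment TEXT
        )
    """)


def subpage_listings(conn):
    """Map each field to the wiki lines for its subpage, oldest year first."""
    pages = defaultdict(list)
    query = "SELECT field, year, awardee FROM NOBELPRIZE ORDER BY field, year"
    for field, year, awardee in conn.execute(query):
        pages[field].append(f"* {year} - {awardee}")
    return dict(pages)
```

Each key of the returned dictionary would correspond to one subpage, such as one for Physics and one for Chemistry.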


Every year and month/day page will have subpages "/Intro" and "/Extra". These subpages will be automatically incorporated into the article at the appropriate points.


Eventually, the "Birth" and "Death" event types will be generated automatically from the Biographical Database.


I will write routines to:

1. download all Year-In-Review entries
2. extract data into database


We will generate (at this stage) two different kinds of reports: a "what happened in that year" report, and a "what happened on that day" report.


We will produce the following output for the "what happened in that year" report:


Centuries: Year in Review ''current-century''


''prev-century'' - ''current-century'' - ''next-century''


Decades:

 for every decade D from last decade of previous century to first decade of next century
   convert D to text form T
   if D is current century, output T
   else output T
   if not last iteration print ' - '
 endfor
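The decade loop might be sketched in Python as follows. Several things here are assumptions: the two output branches read the same in these notes, so the sketch guesses that decades inside the current century are emphasised with wiki bold; a century is taken to run from year C00 to C99; the text form of a decade is taken to be like "1950s"; and `decade_nav` is a hypothetical name. Very early years (before 110) are not handled.

```python
def decade_nav(year):
    """Build the decades line for the century that contains `year`."""
    century_start = (year // 100) * 100
    # From the last decade of the previous century to the first of the next.
    decades = range(century_start - 10, century_start + 101, 10)
    parts = []
    for decade in decades:
        label = f"{decade}s"  # assumed text form of a decade
        in_current_century = century_start <= decade < century_start + 100
        # Assumed distinction: bold inside the current century, plain outside.
        parts.append(f"'''{label}'''" if in_current_century else label)
    # Joining with ' - ' gives a separator after every item except the last.
    return " - ".join(parts)


if __name__ == "__main__":
    print(decade_nav(1955))
```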


 for every year Y from current year-5 to current year+5
   if Y not current year output "Y "
   else output bold "Y "
 endfor
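The year loop translates almost directly. A minimal Python sketch, assuming that "bold" means wiki bold markup (three apostrophes) and using the hypothetical name `year_nav`:

```python
def year_nav(year):
    """List the years from year-5 to year+5, with the current year in bold."""
    parts = []
    for y in range(year - 5, year + 6):
        parts.append(f"'''{y}'''" if y == year else str(y))
    return " ".join(parts)


if __name__ == "__main__":
    print(year_nav(2001))
```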





(script will insert /Intro data here...)


Births


 select year, month, day, eventtype, text from YEARINREVIEW
   where eventtype = Birth and year = current year

 for each row in resultset
   if month, day not null then
     convert month, day to string DateString
     write "*DateString - text"
   else
     write "* text"
 endfor
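For illustration, the query and loop could look like this in Python with sqlite3 (the plan itself is for PHP). The function name `section_lines`, the "March 14" form of DateString and the ORDER BY clause are assumptions not stated in the notes. Passing the event type as a parameter lets the same routine serve Deaths and Events.

```python
import sqlite3

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def section_lines(conn, year, event_type):
    """Return the wiki list lines for one event type in one year."""
    query = ("SELECT month, day, text FROM YEARINREVIEW "
             "WHERE eventtype = ? AND year = ? ORDER BY month, day")
    lines = []
    for month, day, text in conn.execute(query, (event_type, year)):
        if month is not None and day is not None:
            # Assumed DateString form: month name followed by day number.
            lines.append(f"*{MONTHS[month - 1]} {day} - {text}")
        else:
            lines.append(f"* {text}")
    return lines
```

In SQLite, rows with a null month sort before dated rows in ascending order, so undated entries would come first with this ORDER BY.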


Deaths


same code as for births, mutatis mutandis


Events


same code as for births, mutatis mutandis


Nobel Prizes


SQL query to generate output below (too tired to explain in detail, should be obvious to someone less tired):


(I will add support for Nobel Prizes as a special event type...)


(script will insert /Extra info here...)


e.g. Technology or Films sections found in some YIR entries
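Putting the pieces of the year report together in the order given above (navigation, /Intro, the event sections, /Extra) might look like the following Python sketch. `assemble_year_page` is a hypothetical name, and headings are passed in as ready-made strings because their markup is not specified here.

```python
def assemble_year_page(nav, intro, sections, extra):
    """Join the parts of a year page in report order.

    sections is a list of (heading, lines) pairs, e.g. Births, Deaths, Events.
    """
    parts = [nav, intro]
    for heading, lines in sections:
        parts.append(heading)
        parts.extend(lines)
    parts.append(extra)
    # Skip empty pieces, e.g. a year whose /Intro subpage does not exist yet.
    return "\n".join(part for part in parts if part)
```

Because /Extra is appended as plain wiki text, nonstandard sections such as Technology or Films pass through unchanged.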



I will also have a report for each date:

....



And an edit dialogue for year/date report, replacing the standard edit dialogue:


List of Events (list generated by SQL query based on criteria above)

Year|Month|Day|EventType|edit entry
Text


Introductory information:

blah blah blah

edit introductory information -- link to action=edit&id=YEAR/Intro


Additional information:

edit additional information -- link to action=edit&id=YEAR/Extra


Issues:

  • edit conflicts -- edits to intro & extra are easy -- they are pages, and current code can mostly be reused
  • edit conflicts -- individual events -- in principle no different from above, just shorter... the code would have to be different, since we'd be using different SQL tables; but we can probably reuse the principles of the pre-existing code...



Let me point out that the year in review entries are FAR from standard. Several changes have occurred to the format, and the multiple working points mean that there are many, many formats out there! --MichaelTinkler

Well, based on the ones I've looked at, they seem pretty standard. There are a few minor variations, and some sections present in some but not others (e.g. Film, Technology) -- but my planned database will support any nonstandard sections through the "additional information" part, which can contain anything. Can you point to any particular ones which are very nonstandard? -- SJK

There are whole centuries where there are lots of years in which none of the information ABOVE the births/deaths/events info is present yet. The formatting is shifty - some people have been putting 3 centuries in review on the top line (preceding, current, following) and some have been putting only current. Some have been doing

century in review
centuries: preceding current following
decades:
years:

The whole transition of years to listing the current year +/- five, instead of beginning and ending with the current decade only, is far from complete.

There are lots of minor variations like that. I'd say go ahead and do it, because then we'll find out eventually what's not standard and make it so, I suppose. --MichaelTinkler


Those sorts of things were what I was referring to as the "minor variations"... me the optimist :) -- SJK