Wikipedia:U.S. Census Migration
This is an essay. It contains the advice or opinions of one or more Wikipedia contributors. This page is not an encyclopedia article, nor is it one of Wikipedia's policies or guidelines, as it has not been thoroughly vetted by the community. Some essays represent widespread norms; others only represent minority viewpoints.
US Census Migration
This page is for tracking the migration of external links to the US Census database website. The original content of this page was copied from WP:Village pump (technical)#Programming help - US Census links going dark, specifically, this revision.
Programming help - US Census links going dark
The main site for the US Census will be taken offline March 30 (https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml). We have Census links in about 40,000 articles on Enwiki, possibly more. They will be dead soon. There are a number of technical complications that make it challenging to archive and/or move links to the new site. Most of the links are not and cannot be archived at Wayback, but they can be archived at archive.today. Seeking collaborators to help with this project. My bot WaybackMedic can add archive.today links into Enwiki if they already exist at archive.today, and/or it can migrate links to the new form, if there is a translation program. Need one or both of these:
1. Find all factfinder.census.gov links on Enwiki and save them at archive.today. Most of the links are in templates, so they would need to be discovered by scraping HTML or something similar.
2. A command-line program that takes an old link as input and outputs the new link, plus documentation of the old and new link formats. This program can then be called from WaybackMedic, which would do the updates on wiki. This is needed because WaybackMedic would handle archive URLs (adding and removing as needed), {{dead link}} tags, etc.
If you would like to help, let's try to save the Census data from disappearing. Note that #2 could be done any time in the future and is not limited by the March 30 cutoff. #1 will not work after that date since the site will be dead. -- GreenC 17:16, 9 February 2020 (UTC)
- No need to HTML scrape. This looks like it should be relatively straight-forward via the wiki API and parsing the wikitext. I'm taking a look at that right now. I have also emailed cedsci.feedback@census.gov to make sure they are aware of this issue, and invited their participation in this discussion. -- RoySmith (talk) 18:24, 9 February 2020 (UTC)
- Surely Special:Linksearch (or its equivalent API:Exturlusage) can be used to conveniently get all the links? No need to scrape anything or parse wikitext. SD0001 (talk) 18:53, 9 February 2020 (UTC)
- Does that expand templates? Also how to find which template is associated with a URL. -- GreenC 18:54, 9 February 2020 (UTC)
- Yup, API:Exturlusage gets us most of the way there. I'm seeing 18203 external links directly in mainspace, plus a few in Templates:
Template:American Factfinder
Template:American Factfinder2
Template:American Factfinder2/doc
Template:American Factfinder2/sandbox
Template:American Factfinder3
Template:American Factfinder3/doc
Template:American Factfinder/doc
Template:Cite American Factfinder
Template:Cite American Factfinder2
Template:Cite American Factfinder3
Template:Data United States
Template:Editnotices/Page/Spring, Texas
Template:Historical populations
Template:Historical populations/doc
Template:Historical populations/sandbox
Template:Historical populations/testcases
Template:Historical populations/USCensusRef
Template:Infobox ethnic group/testcases
Template:Largest urban areas of Oceania
Template:Middle Eastern American
Template:NYC Chinatowns
-- RoySmith (talk) 19:05, 9 February 2020 (UTC)
- @GreenC and RoySmith: Those tools both query the externallinks table, which doesn't know or care how a link got into the page, just that it exists. Therefore, we can use Quarry to (fairly) easily generate a list of all ~20000 links from the English Wikipedia to http{s}://factfinder.census.gov: https://quarry.wmflabs.org/query/42039. That makes part 1 a bit easier, at least on the enwiki side. As far as figuring out where the links are coming from, https://quarry.wmflabs.org/query/42040 is the list of templatespace pages that link to FactFinder. --AntiCompositeNumber (talk) 19:29, 9 February 2020 (UTC) (edit conflict)
- @AntiCompositeNumber: Wonder what the {{Infobox Settlement}} link is? It is used on half a million pages and could be a lot, I'm unable to find anything. -- GreenC 19:56, 9 February 2020 (UTC)
- GreenC, It's the {{FIPS}} links in the doc. --AntiCompositeNumber (talk) 20:00, 9 February 2020 (UTC)
- Somewhat surprisingly, this is not just an enwiki problem. I'm seeing 7490 in dewiki and 1274 in eswiki. I'm sure that's just the tip of the iceberg. -- RoySmith (talk) 19:09, 9 February 2020 (UTC)
- Just to keep stuff in one place, https://github.com/roysmith/factfinder-migration. Remember to like, share, and subscribe. -- RoySmith (talk) 19:18, 9 February 2020 (UTC)
- @RoySmith: thanks for the help. Ideally could it be three columns: article name, template in the article and URL it equates to; or 2 columns if no template involved. I was informed by User:Fabrikator "there are over 40,000 Wikipedia articles that use these templates" hopefully we are not missing anything. -- GreenC 19:33, 9 February 2020 (UTC)
There are about 30,000 in Template:Historical populations -- GreenC 19:36, 9 February 2020 (UTC)
There is a {{cite web}} in the template /doc page that contains a factfinder URL, but I don't see anything generated by the template code; that is probably what Fabrikator was seeing. -- GreenC 19:42, 9 February 2020 (UTC)
- @RoySmith and GreenC: I did some work on trying to rewrite URLs from AFF to CEDSCI (data.census.gov), and one thing became very apparent: the US Census Bureau put absolutely no thought into the transition between these two systems. We also seem to disagree on what "There are no changes in table IDs associated with the change to a new platform on data.census.gov" [1] is supposed to mean. --AntiCompositeNumber (talk) 05:58, 10 February 2020 (UTC)
- AntiCompositeNumber, nice work. Looking at the FAQ like page 21 it seems like they are not doing a 1:1 transition, there is data being left behind or only available via API. -- GreenC 15:21, 10 February 2020 (UTC)
- My understanding is that nowadays, when you add a URL-based citation, that we have User:InternetArchiveBot that automatically attempts archiving at Wayback Machine or another source. These links should be in place, but you may want to check that. --Masem (t) 19:22, 9 February 2020 (UTC)
- Gah, nevermind I see the issue of auto archiving was addressed :P --Masem (t) 19:24, 9 February 2020 (UTC)
- Yeah WaybackMachine has trouble archiving many of these pages due to JS or whatever they are using. Archive.today seems to do OK on the few I tried manually. -- GreenC 19:25, 9 February 2020 (UTC)
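For reference, the API:Exturlusage approach mentioned in this thread can be sketched roughly as follows. This is a minimal illustration and not the script anyone in the discussion actually ran: the helper names `build_params` and `extract_links` are hypothetical, the sample response is hand-written in the shape the API documentation describes, and no request is sent.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_params(domain, namespace=0, limit=500, cont=None):
    """Build query parameters for list=exturlusage (hypothetical helper)."""
    params = {
        "action": "query",
        "list": "exturlusage",
        "euquery": domain,         # domain to search for, without protocol
        "eunamespace": namespace,  # 0 = mainspace, 10 = Template
        "eulimit": limit,
        "euprop": "title|url",
        "format": "json",
    }
    if cont:
        # Pass back the "continue" object from the previous response.
        params.update(cont)
    return params


def extract_links(response):
    """Return (title, url) pairs and the continuation dict, if any."""
    rows = response.get("query", {}).get("exturlusage", [])
    links = [(row["title"], row["url"]) for row in rows]
    return links, response.get("continue")


# Hand-written sample in the documented response shape (not real data).
sample = {
    "continue": {"eucontinue": "placeholder", "continue": "-||"},
    "query": {
        "exturlusage": [
            {"pageid": 1, "ns": 0, "title": "Example town",
             "url": "http://factfinder.census.gov/bkmk/table/example"},
        ]
    },
}

links, cont = extract_links(sample)
query_url = API_ENDPOINT + "?" + urlencode(
    build_params("factfinder.census.gov", cont=cont)
)
```

Because this reads the same externallinks data as Special:Linksearch, it reports links regardless of whether they were typed directly or produced by a template, which is why a separate pass over the Template namespace (`namespace=10`) is still needed to find where they come from.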
Queries across wikis
@AntiCompositeNumber: is it possible to adjust the query ( https://quarry.wmflabs.org/query/42039 ) to work across all wiki languages and projects (including Wikidata and Commons) or does it require specifying the database? It would be good to try and save everything if possible even if we can't right away edit those other projects, in particular since the #2 option of translation might not be feasible. -- GreenC 16:24, 10 February 2020 (UTC)
- GreenC, Somebody with better db-fu than myself may have a better answer, but as far as I know, there's no way to do queries across multiple wikis. That's what the "USE enwiki_p" line does, selects the enwiki database. -- RoySmith (talk) 16:36, 10 February 2020 (UTC)
- On the other hand, you can discover all the wiki databases with
show databases like '%wiki_p';
From there, I could see writing a script which iterates over each one, executes the query there, and combines the result. Beats me if there's some way to express all that in a single SQL query, however. -- RoySmith (talk) 16:45, 10 February 2020 (UTC)
- OK, I've got that going. A total of 64791 rows from 159 wiki databases (including commons and wikidata). Trivia question of the day: what language wiki links to the US Census more often than enwiki? -- RoySmith (talk) 17:24, 10 February 2020 (UTC)
- Ah you beat me to it (and more elegant). I am using Toolforge with
/usr/bin/mysql --defaults-file=$HOME/replica.my.cnf -h enwiki.analytics.db.svc.eqiad.wmflabs enwiki_p < $HOME/census/test2.sql > $HOME/census/allwikis.txt
which works but still needed a wrapper program to modify the .sql file for each site. I tried running your .py but unknown import of "toolforge" is that something I can find on Toolforge? -- GreenC 17:43, 10 February 2020 (UTC)
- Or if you are on Toolforge and can copy the results to /data/project/botwikiawk/census that would be great. -- GreenC 17:46, 10 February 2020 (UTC)
- It's just a thin wrapper around pymysql. You should be able to do 'pip install toolforge'. Or 'pip install -r requirements.txt' -- RoySmith (talk) 17:49, 10 February 2020 (UTC)
- No write permission on /data/project/botwikiawk/census/, but I put it in /tmp/cross-wiki-links on dev.tools.wmflabs.org. -- RoySmith (talk) 17:59, 10 February 2020 (UTC)
- Ok file copied, thank you! For some weird reason Toolforge doesn't have 'pip'. -- GreenC 18:30, 10 February 2020 (UTC)
- It should be in your venv: /mnt/nfs/labstore-secondary-tools-home/roysmith/factfinder/venv/bin/pip -- RoySmith (talk) 19:05, 10 February 2020 (UTC)
- Ah virtual and python.. never got that far (not a Python programmer). Permission denied on pip. Roy, that program is generally useful. A similar request recently came up at WP:URLREQ a user wanted to track down instances of a domain cross-wiki since it was serving malware. AFAIK there is no easy way to search for URLs cross wiki. -- GreenC 20:04, 10 February 2020 (UTC)
- No, there really isn't. I've been working on a toolforge tool to make it easier off and on for a bit though --AntiCompositeNumber (talk) 23:26, 10 February 2020 (UTC)
- @RoySmith: would it be possible to list the sites with the most factfinder links? Some sites might be fixed with IABot, and for the rest we can notify them so local bot or manual work can be done, and also if AntiCompositeNumber is successful with a transform program. To answer your trivia question, I guess Wikidata is the second highest overall, but for Wikipedia sites I guess Swedish (there was a user who made a bot that created millions of articles). -- GreenC 15:09, 11 February 2020 (UTC)
- @GreenC: That sounds straightforward. I'll take a look sometime in the next day or two. Also, @AntiCompositeNumber:, I could probably get some sort of on-line tool working pretty quickly, but don't want to step on your efforts if you're already into that. Let me know. Oh, and PS, it's Catalan. -- RoySmith (talk) 20:25, 11 February 2020 (UTC)
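The iterate-and-combine approach described in this section (discover the `*wiki_p` databases, run the same query against each, merge the rows) might look roughly like this. It is a sketch under assumptions, not RoySmith's actual script: `run_query` is a hypothetical callable standing in for whatever database connection is used (for example a pymysql cursor on Toolforge), the SQL strings are illustrative, and the rows returned by `fake_run_query` are invented so the example runs without a database. The per-wiki tally at the end corresponds to the request for a list of sites with the most links.

```python
from collections import Counter

# Illustrative SQL only; the column name el_to is an assumption about the
# externallinks schema, and the real queries live in Quarry and the repo.
LIST_DBS_SQL = "SHOW DATABASES LIKE '%wiki_p'"
LINKS_SQL = (
    "SELECT el_to FROM externallinks "
    "WHERE el_to LIKE '%factfinder.census.gov%'"
)


def collect_links(run_query, links_sql=LINKS_SQL):
    """Run links_sql against every wiki database and combine the rows.

    run_query(database, sql) is a hypothetical callable returning a list of
    row tuples; database=None is used for the server-level statement.
    """
    combined = []
    for (db_name,) in run_query(None, LIST_DBS_SQL):
        for (url,) in run_query(db_name, links_sql):
            combined.append((db_name, url))
    return combined


def links_per_wiki(rows):
    """Count rows per database, most-linked wiki first."""
    return Counter(db for db, _ in rows).most_common()


def fake_run_query(database, sql):
    """Stand-in for a real connection; the data here is invented."""
    fake = {
        None: [("cawiki_p",), ("enwiki_p",)],
        "cawiki_p": [
            ("http://factfinder.census.gov/a",),
            ("http://factfinder.census.gov/b",),
        ],
        "enwiki_p": [("http://factfinder.census.gov/a",)],
    }
    return fake[database]


rows = collect_links(fake_run_query)
ranking = links_per_wiki(rows)
```

Keeping the database name alongside each URL is what makes the later per-wiki notification step possible; deduplicating URLs for archiving can be done separately on the second column.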
Creating archives
Now that there is a list of target URLs (thank you RoySmith and AntiCompositeNumber), the next step is to create the archives. I've contacted Wayback to see if they can determine why the pages won't save correctly. I've also contacted archive.today. Pages save correctly there, but given the scale I want to get their permission and find out how they want to manage archivals. -- GreenC 18:51, 10 February 2020 (UTC)
- Archive.today agreed to do archives and they have the list (54k after sort/uniq). -- GreenC 19:33, 10 February 2020 (UTC)
- Somebody should write an edit filter that catches any new links. Or maybe we just need to do a mop-up pass near the end of March to pick up any late additions. RoySmith-Mobile (talk) 22:48, 10 February 2020 (UTC)
- yes.. easiest to run your python program again in late march and compare the old and new lists for additions. I'll send to archive.today or manual if not too many. -- GreenC 01:17, 11 February 2020 (UTC)
- Informed by Archive.today that many links don't work. List here. Click any image for the original URL. Maybe the redirects created by the Census will resolve them. -- GreenC 14:47, 13 February 2020 (UTC)
- I took a look at the first one. It's in Columbia,_Maryland#2000_census, specifically:
{{cite web |url=http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=bkmk |accessdate=24 Dec 2014 |title=Selected Economic Characteristics |url-status=dead |archiveurl=https://web.archive.org/web/20160417040001/http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=bkmk |archivedate=2016-04-17 }}
- So, this was known to be dead a while ago. In theory it was archived at archive.org, but when I try to access the archive URL, it hangs, with a bunch of errors in the javascript console.
- Digging a bit deeper, if you search for "Selected Economic Characteristics site:factfinder.census.gov", you can get to https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml, and if you type "Columbia CDP, Maryland" into the search box, you get to a page which I suspect is what was originally being cited. Pinging User:FlugKerl who I think originally added that citation. The problem is, this is one of those old-style javascript apps which doesn't store all of its state in the URL. There's a bunch of cookies dropped under the census.gov domain. If you delete, for example, the JSESSIONID cookie, the app loses track of the fact that you're looking at Columbia CDP, Maryland. So, the bottom line is that I suspect many (if not most) of these external links were broken from the get-go, and any attempt at archiving is hopeless without a significant amount of work to recreate the original searches. Yuck. -- RoySmith (talk) 16:05, 13 February 2020 (UTC)
- Exactly. Only the factfinder.census.gov/bkmk/ path is known to be stable. The factfinder.census.gov/servlet/ path sometimes contains usable data, sometimes falls back to search results when there's insufficient or outdated information in the URL, and sometimes doesn't work at all. In the "By tool" table, any path marked in black is inaccessible and unrecoverable without additional citation information. --AntiCompositeNumber (talk) 01:45, 15 February 2020 (UTC)
- @GreenC: At this point, the transformation script is about as functional as I can get it. The biggest problem is the lack of data availability: we'll have to rely on archived copies until (unless) the Census Bureau gets around to loading 2000-2010 data. There are instructions in the readme in https://github.com/roysmith/factfinder-migration --AntiComposite (talk) 03:51, 9 March 2020 (UTC)
- Update The links have been saved at archive.today (54k) .. the next step I will be checking IABot database has an entry for each link and if not create it - this is via API and slow. Then upload the archive.today URLs to IABot. Slow. Then fix the links on Enwiki trying AntiComposite's script. -- GreenC 03:22, 25 March 2020 (UTC)
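The mop-up pass suggested earlier in this section (re-run the link query in late March and compare against the earlier list) comes down to a set difference. A minimal sketch, assuming each list is available as an iterable of URL strings, one per line; the function name `new_links` and the whitespace stripping are illustrative choices, and the URLs are made up.

```python
def new_links(old_urls, new_urls):
    """Return URLs present in the newer list but not the older one, sorted."""
    old = {u.strip() for u in old_urls if u.strip()}
    new = {u.strip() for u in new_urls if u.strip()}
    return sorted(new - old)


# Invented example lists, as if read line by line from two text files.
february = [
    "http://factfinder.census.gov/bkmk/table/x\n",
    "http://factfinder.census.gov/bkmk/cf/y\n",
]
march = february + ["http://factfinder.census.gov/bkmk/table/z\n"]

additions = new_links(february, march)
```

Only the entries in `additions` would then need to be sent for archiving, by list or by hand depending on how many there are.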
Transformation
Dataset | CEDSCI availability | Transformation |
---|---|---|
ACS | 2010 and later summary data tables only | Yes |
AHS | No | — |
ASM | Not yet | — |
BES | No | — |
BP | Pre-2012 County Business Patterns not available | No |
CFS | Not yet | — |
COG | Not yet | — |
DEC | 2010 Congressional District 113, CD 115, and Summary File 1 only | Yes |
ECN | Yes | No |
EEO | No | — |
GEP | No | — |
NES | Pre-2012 Nonemployer data unavailable | Yes |
PEP | Not yet | — |
PP | No | — |
SBO | Only SBO Company Summary tables available | Yes |
SGF | No | — |
SLF | No | — |
SSF | No | — |
STC | No | — |
Endpoint | CEDSCI availability | Transformation |
---|---|---|
factfinder.census.gov/bkmk/table/* | Yes | Partial |
factfinder.census.gov/bkmk/cf/* | Zip codes not supported | Yes |
factfinder.census.gov/bf/* | All links from enwiki dead | N/A |
factfinder.census.gov/faces/nav/jsf/pages/* | URL contains no data | N/A |
factfinder.census.gov/faces/affhelp/jsf/pages/geography.xhtml?* | No single page can replace the "About this geography" view | No |
factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=* | Default geographies (US, all available states, first available) assumed | Yes |
factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=* | URL contains no data | N/A |
factfinder.census.gov/help/* | Unknown | ? |
factfinder.census.gov/rest/* | No | No |
factfinder.census.gov/servlet/QTTable?* | Quick Tables no longer available | No |
factfinder.census.gov/servlet/GCTTable?* | Geographic Comparison Tables no longer available | No |
factfinder.census.gov/servlet/DTTable?* | ? Search results | Not yet |
factfinder.census.gov/servlet/MapItDrawServlet?* | Significant sample of links dead | N/A |
factfinder.census.gov/servlet/IPTable?* | All appear to be pre-2010 ACS selected population profiles | No |
factfinder.census.gov/servlet/ADPTable?* | Pre-2010 ACS | No |
factfinder.census.gov/servlet/DTGeoSearchByListServlet?* | URL contains only table name and no default geography | N/A |
factfinder.census.gov/servlet/SAFFFacts?* | Zip codes not supported | Yes |
factfinder.census.gov/servlet/SAFFPopulation?* | Zip codes not supported | Yes |
factfinder.census.gov/servlet/ACCSAFFFacts?* | Zip codes not supported | Yes |
factfinder.census.gov/servlet/ReferenceMapFramesetServlet?* | ? | Not yet |
Marker | CEDSCI availability | Transformation |
---|---|---|
Yes | All corresponding data is available in CEDSCI | All AFF links can be automatically transformed to CEDSCI links |
Partial | Some AFF data is not available in CEDSCI or assumptions must be made | Some AFF links are not currently able to be transformed, but may be later. |
Not yet | The US Census Bureau plans to add this data to CEDSCI, but has not done so | Transformation has not been attempted |
? | This endpoint/feature has not been matched to a CEDSCI feature | — |
— | — | Transformation for this dataset is not possible |
No | Data is not available in CEDSCI at all | Automatic transformation is not possible due to a lack of information or data availability |
N/A | The target dataset can not be derived from the URL | The URL can not be transformed and needs to be replaced based on other citation information |
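The endpoint table can be read as a lookup from URL path to transformation status. The sketch below encodes a handful of rows that way. It is only an illustration of the table, not AntiCompositeNumber's transformation script: the function name `transformation_status`, the decision to match on path prefixes, and the rule that a `pid` parameter takes precedence over `src` are all assumptions, and only a subset of rows is included.

```python
from urllib.parse import parse_qs, urlsplit

# (path prefix, value from the Transformation column); subset of the table.
PREFIX_RULES = [
    ("/bkmk/table/", "Partial"),
    ("/bkmk/cf/", "Yes"),
    ("/bf/", "N/A"),
    ("/faces/nav/jsf/pages/", "N/A"),
    ("/servlet/QTTable", "No"),
    ("/servlet/GCTTable", "No"),
    ("/servlet/SAFFFacts", "Yes"),
]

PRODUCTVIEW_PATH = "/faces/tableservices/jsf/pages/productview.xhtml"


def transformation_status(url):
    """Return the table's Transformation value, or None if not covered."""
    parts = urlsplit(url)
    if parts.hostname != "factfinder.census.gov":
        return None
    if parts.path == PRODUCTVIEW_PATH:
        # The table separates ?pid=* (carries a table ID) from ?src=*
        # (URL contains no data). Checking pid first is an assumption.
        query = parse_qs(parts.query)
        if "pid" in query:
            return "Yes"
        if "src" in query:
            return "N/A"
        return None
    for prefix, status in PREFIX_RULES:
        if parts.path.startswith(prefix):
            return status
    return None


status = transformation_status(
    "http://factfinder.census.gov/faces/tableservices/jsf/pages/"
    "productview.xhtml?src=bkmk"
)
```

Here `status` is "N/A", matching the row for `productview.xhtml?src=*`: the URL carries no table or geography, so it can only be replaced using other information in the citation.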
Discussions with US Census Bureau
I've been in contact with somebody from the Dissemination Outreach Branch, Census Enterprise Dissemination Services and Consumer Innovation (CEDSCI), U.S. Census Bureau. I'm told that starting April 1st, they are going to be working on "deep link redirects from the American Fact Finder to data.census.gov". It's unclear how this impacts our archiving work. Possibly not at all, but I'm trying to obtain more details. -- RoySmith (talk) 20:22, 11 February 2020 (UTC)
- So....we know it's definitely going to break, and then maybe magically start working again. --AntiCompositeNumber (talk) 21:51, 11 February 2020 (UTC)
- Something like this happened with the Dept of Justice website and they were able to redirect maybe half the links as the new site was not a 1:1 map to the old. And the old links were often better than the new, more complete info. Our work to archive the pages while possible is a good idea. -- GreenC 15:06, 12 February 2020 (UTC)
- RoySmith, In your discussions, have you found out when the rest of the planned data will hit CEDSCI? {{FIPS}} references the 2010 Census Demographic Profile DP-1 table, which isn't available yet. There are also a lot of ACS and PEP links that haven't been migrated either. --AntiCompositeNumber (talk) 16:58, 20 February 2020 (UTC)
- AntiCompositeNumber, Unfortunately, "discussions" (plural) is a bit of an overstatement. I had one exchange, and never got a response to my followup. -- RoySmith (talk) 17:17, 20 February 2020 (UTC)
Ready to add the archives
The archive.today links are loaded into IABot and it is ready to run across all wiki languages that use IABot (about the top 20).
- AntiCompositeNumber's Python script is quite remarkable and I'd like to use it. However it will require a headless browser check to verify transformed URLs are working/live due to uncertainties at Census. Given the scale, this is daunting in time and resources. For now I think we should archive them all and re-visit after Census has had time to settle at the new site and work out bugs. Hopefully they start returning 404s for dead pages. WaybackMedic can unwind archives and replace with new primary URLs.
- About 3,000 URLs could not be loaded into IABot because IABot's normalization routines rejected them, so I will process those via WaybackMedic.
- About 2,000 URLs already had archives in the IABot database. Spot checks showed they work, so I left them in place and did not replace them with archive.today.
- About 57,000 URLs had archive.today links added into the IABot database. Not all of them are for Enwiki.
Tomorrow (Saturday) I will set the domain to Blacklisted and begin the IABot queue to process Enwiki, unless there are other thoughts. -- GreenC 12:34, 27 March 2020 (UTC)
As expected, issues have arisen:
- In this diff [2] the source URL is https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=bkmk however this is not the original URL. There are thousands of URLs exactly like this one, all broken and likely unrecoverable.
- There is another domain, factfinder2.census.gov, with over 7,000 links. None of them are working as far as I can tell, so they will all be dead unless an incidental archive exists.
- The IABot database treats http and https as separate records. Even if it has a record for http, it won't have one for https. This is causing trouble where it can't find one or the other. It's not too frequent and WaybackMedic should be able to resolve most of those with time.
- Most of the existing archives in the IABot database were not working, so I overwrote those with the archive.today links where possible. Some could not be overwritten because IABot refused the archive, so I will have to fix those with WaybackMedic.
-- GreenC 22:19, 28 March 2020 (UTC)
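One way to work around the http/https split described above is to try both scheme variants of a URL when looking up a record. A minimal sketch, assuming a plain dictionary stands in for the archive database; `scheme_variants` and `find_archive` are hypothetical names, the archive URL is a placeholder, and this is not how IABot or WaybackMedic is implemented.

```python
from urllib.parse import urlsplit, urlunsplit


def scheme_variants(url):
    """Return the URL followed by its http/https counterpart."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return [url]
    other = "https" if parts.scheme == "http" else "http"
    return [url, urlunsplit(parts._replace(scheme=other))]


def find_archive(url, records):
    """Look up url in records, falling back to the other scheme."""
    for candidate in scheme_variants(url):
        if candidate in records:
            return records[candidate]
    return None


# Invented record: only the http form of the link has an archive entry.
records = {
    "http://factfinder.census.gov/bkmk/cf/example":
        "https://archive.today/placeholder",
}

hit = find_archive("https://factfinder.census.gov/bkmk/cf/example", records)
```

The lookup succeeds here even though the article uses the https form and the stored record uses http.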
After 12 days of plugging away I am done with it. This was a huge job with more twists and turns than are worth documenting. Sample diff, repeat 60k+ times with endless variations. -- GreenC 00:01, 13 April 2020 (UTC)
- GreenC, Thank you for diving into this and running with it. You deserve some kind of "Hero of the bit-rot wars, with archivist clusters" barnstar. -- RoySmith (talk) 00:28, 13 April 2020 (UTC)
United States Census 2000 demographics
This is a bit off subject, but rather important. In the beginning, Wikipedia generated demographic sections for all U.S. census places based upon United States Census 2000 data. This was not repeated for the United States Census 2010. The 2000 demographic sections are now obsolete and need to be removed. Will (and can) new demographic sections be generated for United States Census 2020? Yours aye, Buaidh talk contribs 04:05, 19 December 2020 (UTC)