User talk:Mathglot/Regex
All regular expressions are standard PCRE unless otherwise stated. A few might be Cirrus regexes used by Wikipedia's regex editor; see also Elastic search Cirrus regex syntax.
Regex match
Doc links
Regex flavors:
- Lua: mw:Lua manual#Patterns (also § Ustring patterns), Help:Lua for beginners#Understanding patterns, Help:Lua for beginners#Note on Lua patterns versus regular expressions,
- No alternation, or *very* limited (see mw:Help:CirrusSearch/Logical operators)
- No repetition of multi-char patterns; i.e., no way to match choo, choo-choo or choo-choo-choo
- No min-max repetitions, like {3,5}; e.g., use %d%d%d%d?%d? instead of PCRE [0-9]{3,5}
- AWB uses a .NET flavor of regex; see Wikipedia:AutoWikiBrowser/Regular expression
- Advanced search uses mw:Help:CirrusSearch, including Cirrus regex
- Help:Searching/Regex#Metacharacters
- Help:Searching/Features – stemming, fuzzy, linksto, hastemplate, incategory|a|b, morelike, sorting, boosting,
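The Lua note above about min-max repetitions can be checked outside Lua. The following is a minimal Python sketch, for illustration only: Lua's %d is written as [0-9], and the names BOUNDED, SPELLED_OUT and same_verdict are hypothetical. It compares the {3,5} quantifier with the spelled-out workaround on a few sample strings.

```python
import re

# PCRE-style bounded repetition.
BOUNDED = re.compile(r"[0-9]{3,5}")
# Spelled-out equivalent, mirroring the Lua workaround %d%d%d%d?%d?
SPELLED_OUT = re.compile(r"[0-9][0-9][0-9][0-9]?[0-9]?")


def same_verdict(text: str) -> bool:
    """True when both patterns agree on whether the whole string matches."""
    return bool(BOUNDED.fullmatch(text)) == bool(SPELLED_OUT.fullmatch(text))


samples = ["", "1", "12", "123", "1234", "12345", "123456", "12a45"]
agreement = [same_verdict(s) for s in samples]
print(all(agreement))
```

The spelled-out form grows quickly for wide ranges, which is why it is only practical for small bounds.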
Piped links
- Match piped link in any namespace (e.g., could be 'File:' in first part):
\[\[([^]]+)\|([^]]+)\]\]
- Match piped link in current namespace only (not containing colon in first part):
\[\[([^]:]+)\|([^]]+)\]\]
\[\[([^|\]]*?)\|([^]]+)\]\]
- Piped or unpiped link in current namespace:
\[\[([^|\]]+)\|?([^|\]]+)?\]\]
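As a rough illustration of how these link patterns behave, here is a small Python sketch. The sample sentence and the variable names are made up for the example; the patterns are copied from the list above and run with Python's re module rather than PCRE.

```python
import re

# Patterns copied from the list above.
ANY_NAMESPACE = re.compile(r"\[\[([^]]+)\|([^]]+)\]\]")
CURRENT_NAMESPACE = re.compile(r"\[\[([^]:]+)\|([^]]+)\]\]")
PIPED_OR_UNPIPED = re.compile(r"\[\[([^|\]]+)\|?([^|\]]+)?\]\]")

# Hypothetical sample text.
text = "See [[Paris|the capital]], [[File:Map.png|a map]] and [[Lyon]]."

any_ns = ANY_NAMESPACE.findall(text)          # piped links, any namespace
current_ns = CURRENT_NAMESPACE.findall(text)  # piped links without a colon
either = PIPED_OR_UNPIPED.findall(text)       # piped and unpiped links

print(any_ns)
print(current_ns)
print(either)
```

For an unpiped link the second group does not participate, so findall reports it as an empty string.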
Piped links with possible WP:NOPIPE or WP:NOTBROKEN issues
\[\[([^\|]{6,})([^\|]+)\|\1[^\]]?\]\]
Reference
<ref(\s+name\s*=\s*"?([-\s\w\d]+)?"?)?\s*>([^<]+)</ref>
Reference with lastN but no matching firstN
<ref[^>]*?>[^>]*?\|last(\d)?(?!.*?\|first\1?).*?<\/ref>
Citation template creation
cite book, from Google bibliographic info
- Search:
Title\t(.*)
Author\t(.*)
Publisher\t(.*), (\d\d\d\d)
ISBN\t([^,]+),? ?(\d+)?
Length\t(\d+) pages$
- Replace:
{{cite book |author=\2 |last= |first= |date=\4 |title=\1 |publisher=\3 |isbn=\5 <!--isbn2=\6--> |page= <!--total-pages=\7-->}}
- Search:
Title\t(.*)
Author\t(.*)
Edition\t(.*)
Publisher\t(.*), (\d\d\d\d)
Original from\t.*
Digitized\t.*
ISBN\t([^,]+),? ?(\d+)?
Length\t(\d+) pages$
- Replace:
{{cite book |author=\2 |last= |first= |date=\5 |title=\1 |edition=\3 |publisher=\4 |isbn=\6 <!--isbn2=\7--> |page= <!--total-pages=\8-->}}
- Search:
Title\t(.*)
(.*)
Editors?(.*)
Edition\t(.*)
Publisher\t(.*), (\d\d\d\d)
ISBN\t([^,]+),? ?(\d+)?
Length\t(\d+) pages$
- Replace:
{{cite book |author= |last= |first= |editors=\3 |editor1-last= |editor1-first= |editor2-last= |editor2-first= |editor3-last= |editor3-first= |date=\6 |title=\1 |edition=\4 |publisher=\5 |series=\2 |isbn=\7 <!--isbn2=\8--> |page= <!--total-pages=\9-->}}
- Search:
Title\t(.*)
(.*)
Editors\t(.*)
Edition\t(.*)
Publisher\t(.*), (\d\d\d\d)
ISBN\t([^,]+),? ?(\d+)?
Length\t(\d+) pages$
- Replace:
{{cite book |last= |first= |editors=\3 |editor1-last= |editor1-first= |editor2-last= |editor2-first= |editor3-last= |editor3-first= |date=\6 |title=\1 |edition=\4 |series=\2 |publisher=\5 |isbn=\7 <!--isbn2=\8--> |page= <!--total-pages=\9-->}}
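To show how one of these search/replace pairs is applied, here is a minimal Python sketch of the first variant (Title, Author, Publisher, ISBN, Length). It assumes the search lines are joined by single newlines and that fields are tab-separated, as in the patterns above; the bibliographic sample is invented for the example and is not real Google Books output.

```python
import re

# First search variant from above, one pattern line per input line.
SEARCH = re.compile(
    "\n".join(
        [
            r"Title\t(.*)",
            r"Author\t(.*)",
            r"Publisher\t(.*), (\d\d\d\d)",
            r"ISBN\t([^,]+),? ?(\d+)?",
            r"Length\t(\d+) pages$",
        ]
    )
)
REPLACE = (
    r"{{cite book |author=\2 |last= |first= |date=\4 |title=\1 "
    r"|publisher=\3 |isbn=\5 <!--isbn2=\6--> |page= <!--total-pages=\7-->}}"
)

# Hypothetical input block (made-up book and ISBNs).
sample = (
    "Title\tAn Example Book\n"
    "Author\tJane Doe\n"
    "Publisher\tExample Press, 2001\n"
    "ISBN\t1234567890, 9781234567897\n"
    "Length\t250 pages"
)

citation = SEARCH.sub(REPLACE, sample)
print(citation)
```

In Python 3.5 and later, a group that did not take part in the match (such as the second ISBN) is replaced by an empty string; other engines may treat this differently.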
Citation template tweaking and reordering
Fix MOS:REFPUNCT problems
- Search:
<ref([^<]+)</ref>([.,;?!])
- Replace:
\2<ref\1</ref>
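A short Python sketch of this punctuation fix follows. The sentence is a made-up sample; the pattern and replacement are the ones given above.

```python
import re

# Move punctuation that follows a <ref>...</ref> to the front of it.
SEARCH = re.compile(r"<ref([^<]+)</ref>([.,;?!])")
REPLACE = r"\2<ref\1</ref>"

# Hypothetical wikitext sample.
before = 'The sky is blue<ref>Smith 2001, p. 4</ref>. Water is wet<ref name="w">Jones 1999</ref>;'
after = SEARCH.sub(REPLACE, before)
print(after)
```

Because the body is matched with [^<]+, a reference that contains nested tags will not be matched by this pattern.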
Convert <ref>{{Harvtxt...}} (with page number; sans ref name) to {{sfn}}
- Search:
<ref>{{harvtxt(?:\s*(\|\s*[^|]+)\s*)(.*?)\|\s*(\d\d\d\d)\|(p+=\d+\s*)}}\s*</ref>
- Replace:
{{sfn\1\2|\3|\4}}
Citations with 'author=' to 'last=... first=...'
Assumes a regular CS1 or CS2 citation, with space before vertical bar, and '|author=' present:
- Search:
\|author=([ \w]+)\s+(\w[^\|]+)\s+\|
- Replace:
|last1=\2 |first1=\1 |
Alt (author or author1; name possibly wikilinked):
- Search:
\|author1?=\[?\[?([ -\w]+)\]?\]?\s+(\w[^\|]+)\s+\|
- Replace:
|last1=\2 |first1=\1 |
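The first (non-wikilinked) variant can be tried as follows. This is a minimal Python sketch with an invented citation; it relies on backtracking to put the final word of the name into last1 and everything before it into first1.

```python
import re

# First search/replace pair from above.
SEARCH = re.compile(r"\|author=([ \w]+)\s+(\w[^\|]+)\s+\|")
REPLACE = r"|last1=\2 |first1=\1 |"

# Hypothetical citations.
one_given_name = "{{cite book |author=Jane Doe |title=An Example |year=2001}}"
two_given_names = "{{cite book |author=Mary Jane Doe |title=An Example}}"

converted_one = SEARCH.sub(REPLACE, one_given_name)
converted_two = SEARCH.sub(REPLACE, two_given_names)
print(converted_one)
print(converted_two)
```

Names written surname-first, or single-word names, are outside what this pattern handles and would need manual review.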
Move url to the back
- Search:
\s*\|url=([^|\s]+)([^}]+)}}
- Replace:
\2 |url=\1}}
Possible failure case:
*<!--Chenntouf-->{{cite web |last1=Chenntouf |first=Tayeb |date=1999 |title="La dynamique de la frontière au Maghreb", Des frontières en Afrique du xiie au xxe siècle |url=https://unesdoc.unesco.org/in/documentViewer.xhtml?v=2.1.196&id=p::usmarcdef_0000139816&file=/in/rest/annotationSVC/DownloadWatermarkedAttachment/attach_import_c35456f4-f4da-4b4a-b938-9d61f48fa689?_=139816fre.pdf&locale=fr&multi=true&ark=/ark:/48223/pf0000139816/PDF/139816fre.pdf#%5B%7B%22num%22:605,%22gen%22:0%7D,%7B%22name%22:%22XYZ%22%7D,-250,769,0%5D |access-date=2020-07-17 |website=unesdoc.unesco.org}}
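For the ordinary case, the move can be sketched in Python as below. The citation is a made-up sample with a simple URL; it does not attempt to reproduce the failure case listed above.

```python
import re

# Pattern and replacement from above: capture the URL, then everything up to
# the closing braces, and re-emit them in the opposite order.
SEARCH = re.compile(r"\s*\|url=([^|\s]+)([^}]+)}}")
REPLACE = r"\2 |url=\1}}"

# Hypothetical citation.
before = "{{cite web |url=https://example.org/a |title=Example |date=2020}}"
after = SEARCH.sub(REPLACE, before)
print(after)
```

Since the remainder is matched with [^}]+, a citation containing a nested template between the URL and the end will not match as a whole.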
Swap last with first
- Search:
\|first=([^|]+)\s\|last=([^|]+)\s
- Replace:
|last=\2 |first=\1
Swap editor-last with editor-first
- Search:
\s*\|editor-first=([^|]+)\s*\|editor-last=([^|]+)\s*\|
- Replace:
|editor-last=\2 |editor-first=\1|
Swap editorN-last with editorN-first
- Search:
\s*\|editor(\d)-first\s*=\s*([^|]+)\s*\|editor\1-last\s*=\s*([^|]+)\s*\|
- Replace:
|editor\1-last=\3|editor\1-first=\2|
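The numbered-editor swap depends on the backreference \1 to make sure both parameters carry the same number. A minimal Python sketch, using invented citations:

```python
import re

# Pattern and replacement from above.
SEARCH = re.compile(
    r"\s*\|editor(\d)-first\s*=\s*([^|]+)\s*\|editor\1-last\s*=\s*([^|]+)\s*\|"
)
REPLACE = r"|editor\1-last=\3|editor\1-first=\2|"

# Hypothetical citations: matching numbers, then mismatched numbers.
same_number = "{{cite book |editor1-first=Jane |editor1-last=Doe |title=X}}"
mixed_numbers = "{{cite book |editor1-first=Jane |editor2-last=Doe |title=X}}"

swapped = SEARCH.sub(REPLACE, same_number)
untouched = SEARCH.sub(REPLACE, mixed_numbers)
print(swapped)
print(untouched)
```

Note that the leading \s* consumes the space before the first vertical bar, and the captured values keep their trailing space, so spacing in the result differs slightly from the input.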
Swap lastn with firstn
- Search:
\|title=([^|]+)\s?\|(last\d?)=([^|]+)\s?\|(first\d?)=([^|]+)\s?
- Replace:
|\2=\3 |\4=\5 |title=\1
Move last-first before title
- Search:
\|title=([^|]+)\s*\|last=([^|]+)\s\|first=([^|]+)\s
- Replace:
|last=\2 |first=\3 |title=\1
Move year after first
- Search:
^(.*?)\|first([^|]+)(.*?)\s*\|year=(\d+)(.*?)$
- Replace:
\1|first\2|year=\4 \3\5
Punctuation after citation, to before
Sfn:
- Search:
({{sfn[^}]+}})([-–—,;!\?\.])
- Replace:
\2\1
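A Python sketch of the sfn case follows. It assumes the intended replacement is simply the punctuation followed by the template (groups 2 then 1); braces are backslash-escaped in the Python pattern as a precaution, which matches the same text. The sentence is a made-up sample.

```python
import re

# Same pattern as above, with literal braces escaped.
SEARCH = re.compile(r"(\{\{sfn[^}]+\}\})([-–—,;!\?\.])")
REPLACE = r"\2\1"  # assumed replacement: punctuation first, then the template

# Hypothetical wikitext sample.
before = "A claim{{sfn|Smith|2001|p=4}}. Another{{sfn|Jones|1999|p=7}}, and more."
after = SEARCH.sub(REPLACE, before)
print(after)
```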
Swap |first=X |last=y around so last is first in citation
- Search:
\|first(\d)=([^|]+)\s\|last\1=([^|]+)\s
- Replace: *
|last\1=\3 |first\1=\2
plain refs to cite web
Text sources which don't use {{cite web}} may be transformed by a series of regex replaces, if the format is reasonably standard. For example, this change was produced by the following series (search => replace):
* => * {{cite web |last=
(\, ?)(.*)$ => |first=\2
\.\ +''(.*?)'' => |title=\1
first=([\s\w]+),\s+and\s+([\s\w]+),\s+([\s\w]+) => first=\1 |last2=\2 |first2=\3
\((\d{4})\) => |year=\1
\((\d{4})\)\s+([ \w]+)\. => |year=\1 |publisher=\2
\s+isbn\s+([-\d]{10,17}) => |isbn=\1
$ => |ref=harv }}
See also User:Mathglot/sandbox/Templates/Cite MLA (in progress...)
Updating named refs to template:R
Example: Holocaust denial, revision 843383121. Three steps:
1. change quoted named refs:
<ref name="([^"]+)"\s*\/> -> {{R|"\1"}}
2. change unquoted named refs (with or without trailing blanks before the slash)
<ref name=([^ #"'/=>?\\]+)\s*/> -> {{R|\1}}
3. combine consecutive R's
{{R\|([^}]+)}}\s*{{R\|([^}]+)}} -> {{R|\1|\2}}
(global replace; repeat till done)
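The three steps, including the repeat-until-stable loop for step 3, can be sketched in Python as below. The function name refs_to_r and the sample text are hypothetical; step 1 here emits {{R|\1}} without quotation marks, following the form used in the edit summary, and braces are escaped in the Python patterns as a precaution.

```python
import re

# Step 1: quoted named refs. Step 2: unquoted named refs.
QUOTED = re.compile(r'<ref name="([^"]+)"\s*\/>')
UNQUOTED = re.compile(r"""<ref name=([^ #"'/=>?\\]+)\s*/>""")
# Step 3: two adjacent {{R}} templates.
COMBINE = re.compile(r"\{\{R\|([^}]+)\}\}\s*\{\{R\|([^}]+)\}\}")


def refs_to_r(text: str) -> str:
    """Apply steps 1 and 2 once, then repeat step 3 until nothing changes."""
    text = QUOTED.sub(r"{{R|\1}}", text)
    text = UNQUOTED.sub(r"{{R|\1}}", text)
    while True:
        merged = COMBINE.sub(r"{{R|\1|\2}}", text)
        if merged == text:
            return text
        text = merged


# Hypothetical wikitext sample.
source = 'Text.<ref name="a" /><ref name=b/><ref name="c"/>'
result = refs_to_r(source)
print(result)
```

A single global pass of step 3 merges only non-overlapping pairs, which is why three consecutive templates need a second pass.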
Edit summary:
Minimize visual impact on the wikicode of [[WP:NAMEDREFS|named refs]] using [[Template:R]]. No change to rendered footnote section. Using global regex replace: 1: (change quoted named refs): s!<ref name="([^"]+)"\s*\/>!{{R|\1}}!g 2: (change unquoted named refs): s!<ref name=([^ #"'/=>?\\]+)\s*/>!{{R|\1}}!g 3: (combine consecutive Rs into one): s!{{R\|([^}]+)}}\s*{{R\|([^}]+)}}!{{R|\1|\2}}!g
Other regex replace
Add leading hidden token to ref-named citations as prep for sorting the Bibliography
- Search:
<!--{{sfn\|LAST\|YYYY\|p=}}--> *<ref name="([\(\)\w]+)\s+(\d+)">
- Replace: *
<!--{{sfn|\1|\2|p=}}-->
Alphabetize citations in Bibliography
The technique is 1) add a leading token consisting of the (first) last name, 2) sort, 3) strip out the token. Only step 1 is shown:
- Search:
^\*\s*{{cite(.*?)\|\s*last1?\s*=\s*([^|]+)\s*(.*)$
- Replace: *
*<!--\2-->{{cite\1 |last1=\2\3
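The whole add-token, sort, strip sequence might look like the following Python sketch. Step 1 uses the pattern above (with braces escaped as a precaution); steps 2 and 3 are not given in the text, so the plain line sort and the STRIP_TOKEN pattern here are assumptions, as are the sample entries.

```python
import re

# Step 1 (from above): prefix each bulleted citation with a hidden last-name token.
ADD_TOKEN = re.compile(
    r"^\*\s*\{\{cite(.*?)\|\s*last1?\s*=\s*([^|]+)\s*(.*)$", re.MULTILINE
)
TOKEN_REPLACE = r"*<!--\2-->{{cite\1 |last1=\2\3"
# Step 3 (assumed): remove the hidden token again.
STRIP_TOKEN = re.compile(r"^\*<!--.*?-->", re.MULTILINE)

# Hypothetical bibliography, deliberately out of order.
bibliography = "\n".join(
    [
        "* {{cite book |last=Smith |first=Ann |title=Zebras}}",
        "* {{cite web |last1=Adams |first1=Bo |title=Apples}}",
    ]
)

tokened = ADD_TOKEN.sub(TOKEN_REPLACE, bibliography)
# Step 2 (assumed): a plain, case-sensitive line sort.
sorted_lines = sorted(tokened.splitlines())
result = STRIP_TOKEN.sub("*", "\n".join(sorted_lines))
print(result)
```

The sort compares raw code points, so names with diacritics or lowercase particles may need a locale-aware key instead.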
Article page history to parsed data
Turn article page history into a series of parsed lines:
- 1=ARTICLE_TITLE 2=REVISION 3=HH:MM 4=Month DD, YYYY 5=TOTAL_BYTES 6=BYTE_CHANGE
- Go to article page history page
- Rt-click, Page source
- Select-all, copy, paste
- Apply Search/Replace Regex below, with "dot matches newline"
- Optional step to convert underscore to blank in article titles
SEARCH:
<li.*?index.php\?title=([^&]+)&oldid=(\d+)[^>]+>(\d\d:\d\d),\s(.*?)</a>.*?title="([,\d]+)\sbytes after change of this size">(.?\d+)</span>.*?</li>
To generate the following output, use this replacement:
1=ARTICLE_TITLE 2=REVISION 3=HH:MM 4=Month DD, YYYY 5=TOTAL_BYTES 6=BYTE_CHANGE
REPLACE:
1=\1 2=\2 3=\3 4=\4 5=\5 6=\6
To generate the following sample output, use this replace instead:
- 916661155 Risk_aversion diff 00:29 September 20, 2019; (change:-1b to 31,671 bytes)
REPLACE:
* [[Special:Permalink/\2|\2]] [[\1]] [[Special:Diff/\2|diff]] \3 \4; (change:\6b to \5 bytes)
To generate a six-column table row with this data, including one extra column for remarks, use this:
REPLACE:
|-
| [[\1]] || [[Special:Permalink/\2|\2]] || [[Special:Diff/\2|\6]] || \5 || \3 \4 || any remark here
Followed by optional underscore replacement. (s/_/ /gi).
To generate the following table row examples (table header/footer code added for context):
Article history for Example user
Article | Perm | Diff | Len | Timestamp | Remark |
---|---|---|---|---|---|
Risk aversion | 916661155 | -1 | 31,671 | 00:29 September 20, 2019 | any remark |
History of the provincial electoral map of Quebec | 916660706 | -1 | 26,988 | 00:26 September 20, 2019 | other remark |
User contribution history to parsed data
Turn user contribution history into a series of parsed lines:
- 1=REVISION 2=TITLE 3=TIMESTAMP 4=BYTE_CHANGE 5=EDIT_SUMMARY
- Go to user contrib history page
- Rt-click, Page source
- Select-all, copy, paste
- Find '<h4 class="mw-index-pager-list-header-first' and cut everything above it.
- Find '
- Apply Search/Replace Regex below, with "dot matches newline"
- Optional step to convert underscore to blank in article titles
SEARCH: (options: dot matches newline)
^<li data-mw-revid="(\d+)".*?class="mw-changeslist-date" title="(.*?)">(.*?)</a>.*?size">(.*?)</strong>.*?parentheses">(.*?)</span>.*?</li>$
To generate the following output
1=REVISION 2=TITLE 3=TIMESTAMP 4=BYTE_CHANGE 5=EDIT_SUMMARY
use this replacement:
REPLACE:
1=\1 2=\2 3=\3 4=\4 5=\5
To generate: rev=REVISION title=TITLE time=TIMESTAMP bytes=BYTE_CHANGE summary=EDIT_SUMMARY
REPLACE:
rev=\1 title=\2 time=\3 bytes=\4 summary=\5
To generate: rev=REVISION title=TITLE
SEARCH: (options: dot matches newline)
^<li data-mw-revid="(\d+)".*?title="([^"]+).*?</li>$
REPLACE:
rev=\1 title=\2
Convert glossary anchor to vanchor
SEARCH:
^;\s*{{Anchor\|([^\}]+)}}(?:[-<>\s,:\w\d]+)$
REPLACE:
;{{Vanchor|\1}}
Convert glossary <term> to be in-linkable
SEARCH:
^{{term\s*\|(term\s*=\s*)?([^|{}]+)
REPLACE:
{{term|\1|2={{Vanchor|\2}}
ES: Convert glossary <term>s to be in-linkable via global regex replace s!^{{term\s*\|(term\s*=\s*)?([^|{}]+)!{{term|\1|2={{Vanchor|\2}}!g
Parse wikilinks
Parse wikilinks, excluding colons in order to exclude namespaces (this also excludes wikilinks that have colons in the anchor):
$1 = Target article $2 = Anchor (#-fragments untested):
\[\[([^:\|\]]+)\|?([^:\]]+)?\]\]
This saves the pipe (if there is one) in \2, so can use replace to generate lang-prefixed links, for example, if translating a nav template from en to fr, one could start like this:
- Search:
\[\[([^:\|\]]+)(\|?[^:\]]+)?\]\]
- Replace:
[[:en:\1\2]]
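A minimal Python sketch of the language-prefix replacement follows. The sample sentence is invented; the pattern and replacement are those shown above.

```python
import re

# Pattern from above: group 2 keeps the pipe (when present) with the label.
SEARCH = re.compile(r"\[\[([^:\|\]]+)(\|?[^:\]]+)?\]\]")
REPLACE = r"[[:en:\1\2]]"

# Hypothetical wikitext: unpiped link, piped link, and a namespaced link.
before = "[[Paris]] and [[Lyon|the city]] and [[File:X.png]]"
after = SEARCH.sub(REPLACE, before)
print(after)
```

The namespaced link is left alone because neither group may contain a colon. In Python 3.5 and later the unmatched second group of an unpiped link becomes an empty string in the replacement.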
This adds superscript wikidata links to all wikilinks on a page so they can be easily translated:
- Search:
\[\[([^:\|\]]+)(\|?[^:\]]+)?\]\]('')?
- Replace:
[[\1\2]]\3<sup>[[[d:{{subst:wikidata|label|raw|page=\1}}#sitelinks-wikipedia|wd]]]</sup>
New contribs Translated pages to bullet list
From Special:contribs with 'new' pages box ticked; extracting pages with ContentTranslation tool summary:
- Search:
\) \. \. N\s([^(]*?) \(Created by translating the page "([^"]+)"
- Copy matches
- Replace:
* [[\1]] from [[es:\2]]
Interlanguage template transformation
- Turn {{ca:GEC}} into {{sfn}}:
- Search:
{{GEC\|id=([\d]+)\|nom=([ \w]+).*?}}
- Replace:
{{sfn|GEC|loc=[http://www.enciclopedia.cat/EC-GEC-\1.xml \2]}}
Convert italic markup to lang templates
First, fix the links (two types, depending on where the italic markup is):
- piped (type 1): e.g., ''[[École navale|FOO]]'' ⟶ {{lang|fr|[[École navale|FOO]]}}
- Search:
(?<!')''\[\[([^]|]+)\|([^]]+)\]\]''(?!')
- Replace:
{{lang|fr|[[\1|\2]]}}
- Note: handles the 2-pop case; excludes 3-pop, but also 5-pop; add (?:''')? for that
- piped (type 2): e.g., [[École navale|''FOO'']] ⟶ {{lang|fr|[[École navale|FOO]]}}
- Search:
\[\[([^]|]+)\|(?<!')''([^']+)''(?!')\]\]
- Replace:
{{lang|fr|[[\1|\2]]}}
- unpiped (order matters; must be done after piped links): e.g., ''[[École navale]]'' ⟶ {{lang|fr|[[École navale]]}}
- Search:
''\[\[([^|]+)\]\]''
- Replace:
{{lang|fr|[[\1]]}}
- What's left is unlinked: e.g., ''École navale'' ⟶ {{lang|fr|École navale}}
- Search:
(?<!')''([^']+)''(?!')
- Replace:
{{lang|fr|\1}}
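The ordering (piped, then unpiped, then unlinked) can be tried with the Python sketch below. It rests on two assumptions that should be checked against the intended patterns: the guard after the closing italic markup is taken to be a negative lookahead, (?!'), and the link target is allowed to contain spaces, [^]|]+. The function name and sample text are hypothetical, and only the type 1 piped form is covered.

```python
import re

# Assumed forms of the patterns (see the assumptions stated above).
PIPED = re.compile(r"(?<!')''\[\[([^]|]+)\|([^]]+)\]\]''(?!')")
UNPIPED = re.compile(r"''\[\[([^|]+)\]\]''")
UNLINKED = re.compile(r"(?<!')''([^']+)''(?!')")


def italics_to_lang(text: str, lang: str = "fr") -> str:
    """Apply the three replacements in the order that matters."""
    text = PIPED.sub(r"{{lang|%s|[[\1|\2]]}}" % lang, text)
    text = UNPIPED.sub(r"{{lang|%s|[[\1]]}}" % lang, text)
    text = UNLINKED.sub(r"{{lang|%s|\1}}" % lang, text)
    return text


# Hypothetical wikitext sample; the bold run must stay untouched.
before = "''[[École navale|FOO]]'', ''[[Le Monde]]'', ''bonjour'' and '''bold'''"
after = italics_to_lang(before)
print(after)
```

Running the unlinked step last keeps it from wrapping italic markup that still surrounds a link.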
FR - EN article translation preprocessing
1. <ref>{{(\w\w)}}\s*{{citation\|(.*?)</ref> -> {{efn|"{{lang|\1|\2}}"}}
2. ''{{lang\|de\|(.*?)}}'' -> {{lang|de|\1}}
3. {{citation\|(.*?)}} -> "\1"
4. <ref>\s*{{de}}\s*(.*?)\s*</ref> -> {{efn|{{lang|de|\1}}}}
5. <ref>{{harvsp\|(.*?)}}.</ref> -> {{sfn|\1}}
Substify and unsubstify
- Substify
- Search:
(?<!{){\{(?!\{)
- Replace:
{{ {{{|safesubst:}}}
- Unsubstify
- Search:
\s*{{\s*{{{\s*\|safesubst:\s*}}}
- Replace:
{{
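A round trip through both replacements can be sketched in Python as follows. Braces are backslash-escaped in the Python patterns as a precaution, which matches the same text as the patterns above; the sample wikitext is made up.

```python
import re

# Substify: a double opening brace that is not part of a triple brace.
SUBSTIFY = re.compile(r"(?<!\{)\{\{(?!\{)")
SUBSTIFY_REPLACE = "{{ {{{|safesubst:}}}"
# Unsubstify: the inserted prefix, with optional whitespace.
UNSUBSTIFY = re.compile(r"\s*\{\{\s*\{\{\{\s*\|safesubst:\s*\}\}\}")
UNSUBSTIFY_REPLACE = "{{"

# Hypothetical template source: one template call and one parameter.
original = "{{foo|bar}} uses {{{param}}}"
substified = SUBSTIFY.sub(SUBSTIFY_REPLACE, original)
restored = UNSUBSTIFY.sub(UNSUBSTIFY_REPLACE, substified)
print(substified)
print(restored)
```

The lookarounds leave triple-brace parameters alone. The leading \s* in the unsubstify pattern also removes whitespace in front of the template, so a round trip is exact only when none precedes it.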
Wikilink to {{ill}}
Aimed at Nav template translation, so handles bulleted links, optional pops or bolding, and specific lang prefix:
Unpiped links (e.g., * ''[[:fr:Documents maçonniques]]''):
- Search:
^\*\s*('*)?\[\[:fr:([^]]+)\]\]('*)?
- Replace:
\1{{ill|ENGLISHNAME|fr|\2|v=sup}}\3
Piped links (e.g., * ''[[:fr:Idées (revue, 1941-1944)|Idées]]''):
- Search:
^\*\s*('*)?\[\[:fr:([^]|]+)\|?([^]]+)?\]\]('*)?$
- Replace:
\1{{ill|ENGLISHNAME|fr|\2|lt=\3|v=sup}}\4
For bios or proper names, duplicate the Foreign name in the English article field:
- Search:
^\*\s*('*)?\[\[:fr:([^]|]+)\|?([^]]+)?\]\]('*)?$
- Replace:
* \1{{ill|\2|fr|\2|lt=\3|v=sup}}\4
Examples:
* ''[[:fr:Le Juif et la France]]''
⟶ * ''{{ill|Le Juif et la France|fr|Le Juif et la France|lt=|v=sup}}''
* ''[[:fr:Combats]]''
⟶ * ''{{ill|Combats|fr|Combats|lt=|v=sup}}''
* ''[[:fr:Idées (revue, 1941-1944)|Idées]]''
⟶ * ''{{ill|Idées (revue, 1941-1944)|fr|Idées (revue, 1941-1944)|lt=Idées|v=sup}}''
* [[:fr:Publications antisémites en France]]
⟶ * {{ill|Publications antisémites en France|fr|Publications antisémites en France|lt=|v=sup}}
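The bios/proper-names variant can be checked against the examples above with a short Python sketch. The function name to_ill is hypothetical; the pattern and replacement are the ones given for that variant, applied line by line with the multiline flag.

```python
import re

# Pattern and replacement for the "duplicate the foreign name" variant.
SEARCH = re.compile(
    r"^\*\s*('*)?\[\[:fr:([^]|]+)\|?([^]]+)?\]\]('*)?$", re.MULTILINE
)
REPLACE = r"* \1{{ill|\2|fr|\2|lt=\3|v=sup}}\4"


def to_ill(wikitext: str) -> str:
    """Convert bulleted :fr: links to {{ill}} calls."""
    return SEARCH.sub(REPLACE, wikitext)


piped = to_ill("* ''[[:fr:Idées (revue, 1941-1944)|Idées]]''")
unpiped = to_ill("* ''[[:fr:Combats]]''")
plain = to_ill("* [[:fr:Publications antisémites en France]]")
print(piped)
print(unpiped)
print(plain)
```

For unpiped links the label group is unmatched, so lt= is left empty and can be removed or filled in by hand afterwards.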
Section demote
- Search:
^(={2,5})([^=].*?)\1
- Replace:
=\1\2\1=
Subsection promote
- Search:
^=(={2,5})([^=].*?)\1=
- Replace:
\1\2\1
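Demote and promote are inverses for headings of level 2 to 5, which the Python sketch below illustrates. The function names and the sample page are hypothetical; the patterns are the two shown above, run with the multiline flag so that ^ matches at each line start.

```python
import re

# Patterns and replacements from the two sections above.
DEMOTE = re.compile(r"^(={2,5})([^=].*?)\1", re.MULTILINE)
PROMOTE = re.compile(r"^=(={2,5})([^=].*?)\1=", re.MULTILINE)


def demote(wikitext: str) -> str:
    """Add one '=' to each side of level 2-5 headings."""
    return DEMOTE.sub(r"=\1\2\1=", wikitext)


def promote(wikitext: str) -> str:
    """Remove one '=' from each side of level 3-6 headings."""
    return PROMOTE.sub(r"\1\2\1", wikitext)


# Hypothetical page text.
page = "== History ==\nText\n=== Early years ===\nMore text"
demoted = demote(page)
round_trip = promote(demoted)
promoted = promote(page)
print(demoted)
print(promoted)
```

Promoting leaves level-2 headings unchanged, since the pattern needs at least three equals signs on each side.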
Reflib section from last-first-year
- Search:
^\*\s*(.*?)\|last=([^|]+)\s\|first=([^|]+)\s*\|year=(\d+)(.*?)$
- Replace:
== \2-\4 ==
\1|last=\2 |first=\3|year=\4\5
Regionalize English: AE to BE
Zed to ess (recognize ⟶ recognise)
- Search: /((?:[a-z-[aeiuo]]{0,3}[aeiouy]{1,2}){1,}[a-z-[aeiuo]]{0,3}[iy])z((?:e|ed|es|er|ers|ing)\b)/g
- Replace: $1s$2
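The search above uses character class subtraction ([a-z-[aeiuo]]), which is .NET syntax and is not available in Python's re module. The sketch below is a hedged Python adaptation in which the subtracted class is spelled out as an explicit consonant range; the names CONSONANT, ZED and to_british, and the sample sentence, are hypothetical.

```python
import re

# a-z minus a, e, i, u, o, written out because Python lacks class subtraction.
CONSONANT = "[b-df-hj-np-tv-z]"
ZED = re.compile(
    r"((?:" + CONSONANT + r"{0,3}[aeiouy]{1,2}){1,}"
    + CONSONANT + r"{0,3}[iy])z((?:e|ed|es|er|ers|ing)\b)"
)


def to_british(text: str) -> str:
    """Replace the z of -ize style endings with s (Python uses \\1, not $1)."""
    return ZED.sub(r"\1s\2", text)


# Hypothetical sample; 'size' should be left alone.
converted = to_british("We recognize the size of organized groups.")
print(converted)
```

The pattern requires at least one syllable-like unit before the final i or y, which is what keeps short words such as "size" from being changed.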