
User talk:Mathglot/Regex

From Wikipedia, the free encyclopedia

All regular expressions are standard PCRE unless otherwise stated. A few might be Cirrus regexes used by Wikipedia's regex editor; see also Cirrus regex syntax.

Regex match


Regex flavors:

  1. Match piped link in any namespace (e.g., could be 'File:' in first part)
    • \[\[([^]]+)\|([^]]+)\]\]
  2. Match piped link in current namespace only (not containing colon in first part):
    • \[\[([^]:]+)\|([^]]+)\]\]
    • \[\[([^|\]]*?)\|([^]]+)\]\]
  3. Piped or unpiped link in current namespace:
    • \[\[([^|\]]+)\|?([^|\]]+)?\]\]
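As a rough illustration, pattern 3 above can be tried out in Python's `re` module, whose syntax is close enough to PCRE for this expression. The function name `parse_links` and the sample text are made up for the example.

```python
import re

# Pattern 3 from the list above, unchanged: group 1 is the link target,
# group 2 is the label after the pipe (absent for unpiped links).
LINK = re.compile(r"\[\[([^|\]]+)\|?([^|\]]+)?\]\]")

def parse_links(wikitext):
    """Return (target, label) pairs; label is None for unpiped links."""
    return [(m.group(1), m.group(2)) for m in LINK.finditer(wikitext)]

sample = "See [[Regular expression|regex]] and [[Perl]]."
print(parse_links(sample))
# [('Regular expression', 'regex'), ('Perl', None)]
```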
  • \[\[([^\|]{6,})([^\|]+)\|\1[^\]]?\]\]

Reference

  • <ref(\s+name\s*=\s*"?([-\s\w\d]+)?"?)?\s*>([^<]+)</ref>

Reference with lastN but no matching firstN

  • <ref[^>]*?>[^>]*?\|last(\d)?(?!.*?\|first\1?).*?<\/ref>

Citation template creation


cite book, from Google bibliographic info

  • Search:

Title\t(.*)
Author\t(.*)
Edition\t(.*)
Publisher\t(.*), (\d\d\d\d)
Original from\t.*
Digitized\t.*
ISBN\t([^,]+),? ?(\d+)?
Length\t(\d+) pages$

  • Replace: {{cite book |author=\2 |last= |first= |date=\5 |title=\1 |edition=\3 |publisher=\4 |isbn=\6 <!--isbn2=\7--> |page= <!--total-pages=\8-->}}
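A minimal sketch of the same search and replace in Python, assuming the bibliographic block has been pasted as tab-separated lines in exactly the order shown. The book data is invented, the function name `google_to_cite_book` is hypothetical, and the replacement string here spells the parameter `publisher` and closes the final comment.

```python
import re

# The eight search lines above, joined with newlines. MULTILINE lets the
# final "$" match at the end of the "Length" line.
SEARCH = re.compile(
    r"Title\t(.*)\n"
    r"Author\t(.*)\n"
    r"Edition\t(.*)\n"
    r"Publisher\t(.*), (\d\d\d\d)\n"
    r"Original from\t.*\n"
    r"Digitized\t.*\n"
    r"ISBN\t([^,]+),? ?(\d+)?\n"
    r"Length\t(\d+) pages$",
    re.MULTILINE,
)

REPLACE = (
    r"{{cite book |author=\2 |last= |first= |date=\5 |title=\1 "
    r"|edition=\3 |publisher=\4 |isbn=\6 <!--isbn2=\7--> "
    r"|page= <!--total-pages=\8-->}}"
)

def google_to_cite_book(text):
    """Turn a pasted bibliographic block into a {{cite book}} skeleton."""
    return SEARCH.sub(REPLACE, text)

# Invented sample data; the ISBNs are placeholders, not real identifiers.
sample = "\n".join([
    "Title\tExample Book",
    "Author\tJane Doe",
    "Edition\tillustrated",
    "Publisher\tExample Press, 2001",
    "Original from\tExample University",
    "Digitized\tJan 1, 2010",
    "ISBN\t0123456789, 9780123456786",
    "Length\t321 pages",
])
print(google_to_cite_book(sample))
```

The `last=` and `first=` fields are left empty on purpose, as in the replacement above, so that the author name can be split by hand or by one of the later patterns.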

Citation template tweaking and reordering


Fix MOS:REFPUNCT problems

  • Search: <ref([^<]+)</ref>([.,;?!])
  • Replace: \2<ref\1</ref>
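The pair above can be exercised in Python as follows. This is only a sketch: the function name `fix_refpunct` and the sample sentence are assumptions, and the pattern inherits the limits of the original (a reference whose body contains another `<` is not matched).

```python
import re

# Search pattern from above: a <ref>...</ref> followed by one punctuation mark.
REF_THEN_PUNCT = re.compile(r"<ref([^<]+)</ref>([.,;?!])")

def fix_refpunct(text):
    """Move trailing punctuation in front of the reference."""
    return REF_THEN_PUNCT.sub(r"\2<ref\1</ref>", text)

before = ('The sky is blue<ref name="a">Smith 2001, p. 4</ref>. '
          'It is large<ref>Jones 1999</ref>, too.')
print(fix_refpunct(before))
# The sky is blue.<ref name="a">Smith 2001, p. 4</ref> It is large,<ref>Jones 1999</ref> too.
```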

Citations with 'author=' to 'last=... first=...'


Assumes a regular CS1 or CS2 citation, with space before vertical bar, and '|author=' present:

  • Search: \|author=([ \w]+)\s+(\w[^\|]+)\s+\|
  • Replace: |last1=\2 |first1=\1 |

Alt (author or author1; name possibly wikilinked):

  • Search: \|author1?=\[?\[?([- \w]+)\]?\]?\s+(\w[^\|]+)\s+\|
  • Replace: |last1=\2 |first1=\1 |
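To see how the first search/replace pair splits a name, here is a small Python sketch. The function name `split_author` and the sample citations are hypothetical. Because the first group is greedy, everything up to the final word is treated as the given name(s).

```python
import re

# First pattern from this section: '|author=Given Names Surname |'
AUTHOR = re.compile(r"\|author=([ \w]+)\s+(\w[^\|]+)\s+\|")

def split_author(citation):
    """Rewrite |author=First Last as |last1=Last |first1=First."""
    return AUTHOR.sub(r"|last1=\2 |first1=\1 |", citation)

print(split_author("{{cite book |author=John Smith |title=Foo}}"))
# {{cite book |last1=Smith |first1=John |title=Foo}}
print(split_author("{{cite book |author=Mary Ann Evans |title=Bar}}"))
# {{cite book |last1=Evans |first1=Mary Ann |title=Bar}}
```

Names written surname-first, or surnames of more than one word, come out wrong and need a manual check.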

Move url to the back

  • Search: \s*\|url=([^|\s]+)([^}]+)}}
  • Replace: \2 |url=\1}}

Possible failure case: *<!--Chenntouf-->{{cite web |last1=Chenntouf |first=Tayeb |date=1999 |title="La dynamique de la frontière au Maghreb", Des frontières en Afrique du xiie au xxe siècle |url=https://unesdoc.unesco.org/in/documentViewer.xhtml?v=2.1.196&id=p::usmarcdef_0000139816&file=/in/rest/annotationSVC/DownloadWatermarkedAttachment/attach_import_c35456f4-f4da-4b4a-b938-9d61f48fa689?_=139816fre.pdf&locale=fr&multi=true&ark=/ark:/48223/pf0000139816/PDF/139816fre.pdf#%5B%7B%22num%22:605,%22gen%22:0%7D,%7B%22name%22:%22XYZ%22%7D,-250,769,0%5D |access-date=2020-07-17 |website=unesdoc.unesco.org}}

Swap last with first

  • Search: \|first=([^|]+)\s\|last=([^|]+)\s
  • Replace: |last=\2 |first=\1

Swap editor-last with editor-first

  • Search: \s*\|editor-first=([^|]+)\s*\|editor-last=([^|]+)\s*\|
  • Replace: |editor-last=\2 |editor-first=\1|

Swap editorN-last with editorN-first

  • Search: \s*\|editor(\d)-first\s*=\s*([^|]+)\s*\|editor\1-last\s*=\s*([^|]+)\s*\|
  • Replace: |editor\1-last=\3|editor\1-first=\2|

Move lastN-firstN before title

  • Search: \|title=([^|]+)\s?\|(last\d?)=([^|]+)\s?\|(first\d?)=([^|]+)\s?
  • Replace: |\2=\3 |\4=\5 |title=\1

Move last-first before title

  • Search: \|title=([^|]+)\s*\|last=([^|]+)\s\|first=([^|]+)\s
  • Replace: |last=\2 |first=\3 |title=\1

Move year after first

  • Search: ^(.*?)\|first([^|]+)(.*?)\s*\|year=(\d+)(.*?)$
  • Replace: \1|first\2|year=\4 \3\5

Punctuation after citation, to before


Sfn:

  • Search: ({{sfn[^}]+}})([-–—,;!\?\.])
  • Replace: \2\1
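A short Python sketch of the Sfn case, assuming the intended replacement is simply the punctuation mark followed by the template (`\2\1`). Braces are escaped for readability; the function name `sfn_punct_first` is made up.

```python
import re

# An {{sfn|...}} template followed by one punctuation mark or dash.
SFN_THEN_PUNCT = re.compile(r"(\{\{sfn[^}]+\}\})([-–—,;!?.])")

def sfn_punct_first(text):
    """Put the punctuation mark before the {{sfn}} template."""
    return SFN_THEN_PUNCT.sub(r"\2\1", text)

print(sfn_punct_first("Paris is large{{sfn|Smith|2001|p=4}}. It is old{{sfn|Lee|1999}};"))
# Paris is large.{{sfn|Smith|2001|p=4}} It is old;{{sfn|Lee|1999}}
```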

Swap |first=X |last=y around so last is first in citation

  • Search: \|first(\d)=([^|]+)\s\|last\1=([^|]+)\s
  • Replace: |last\1=\3 |first\1=\2

plain refs to cite web


Plain-text sources that don't use {{cite web}} can be converted by a series of regex replacements, provided their format is reasonably regular. For example, the following series was used for one such change:

^\* => * {{cite web |last=
(\, ?)(.*)$ => |first=\2
\.\ +''(.*?)'' => |title=\1
first=([\s\w]+),\s+and\s+([\s\w]+),\s+([\s\w]+) => first=\1 |last2=\2 |first2=\3
\((\d{4})\) => |year=\1
\((\d{4})\)\s+([ \w]+)\. => |year=\1 |publisher=\2
\s+isbn\s+([-\d]{10,17}) => |isbn=\1
$ => |ref=harv }}

See also User:Mathglot/sandbox/Templates/Cite MLA (in progress...)

Updating named refs to template:R


Example: Holocaust denial, revision 843383121. Three steps:

1. change quoted named refs:
<ref name="([^"]+)"\s*\/> -> {{R|\1}}

2. change unquoted named refs (with or without trailing blanks before the slash)
<ref name=([^ #"'/=>?\\]+)\s*/> -> {{R|\1}}

3. combine consecutive R's
{{R\|([^}]+)}}\s*{{R\|([^}]+)}} -> {{R|\1|\2}} (global; repeat until nothing changes)
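The three steps can be chained in a few lines of Python. This is a sketch under the assumption that step 1 emits the name without quotes, as in the edit summary; the function name `refs_to_r` and the sample text are hypothetical. Step 3 runs in a loop because one global pass merges only non-overlapping pairs.

```python
import re

QUOTED = re.compile(r'<ref name="([^"]+)"\s*/>')
UNQUOTED = re.compile(r'''<ref name=([^ #"'/=>?\\]+)\s*/>''')
ADJACENT = re.compile(r"\{\{R\|([^}]+)\}\}\s*\{\{R\|([^}]+)\}\}")

def refs_to_r(text):
    """Convert self-closing named refs to {{R}} and merge neighbours."""
    text = QUOTED.sub(r"{{R|\1}}", text)      # step 1: quoted names
    text = UNQUOTED.sub(r"{{R|\1}}", text)    # step 2: unquoted names
    while True:                               # step 3: repeat until stable
        merged = ADJACENT.sub(r"{{R|\1|\2}}", text)
        if merged == text:
            return merged
        text = merged

sample = 'Fact.<ref name="Evans 2001" /><ref name=Lipstadt/> <ref name="Shermer"/>'
print(refs_to_r(sample))
# Fact.{{R|Evans 2001|Lipstadt|Shermer}}
```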

Edit summary:

Minimize visual impact on the wikicode of [[WP:NAMEDREFS|named refs]] using [[Template:R]]. No change to rendered footnote section. Using global regex replace: 1: (change quoted named refs): s!<ref name="([^"]+)"\s*\/>!{{R|\1}}!g 2: (change unquoted named refs): s!<ref name=([^ #"'/=>?\\]+)\s*/>!{{R|\1}}!g 3: (combine consecutive Rs into one): s!{{R\|([^}]+)}}\s*{{R\|([^}]+)}}!{{R|\1|\2}}!g

Other regex replace


Add leading hidden token to ref-named citations as prep for sorting the Bibliography

  • Search: <!--{{sfn\|LAST\|YYYY\|p=}}--> *<ref name="([\(\)\w]+)\s+(\d+)">
  • Replace: *<!--{{sfn|\1|\2|p=}}-->

Alphabetize citations in Bibliography


The technique is 1) add a leading token consisting of the (first) last name, 2) sort, 3) strip out the token. Only step 1 is shown:

  • Search: ^\*\s*{{cite(.*?)\|\s*last1?\s*=\s*([^|]+)\s*(.*)$
  • Replace: *<!--\2-->{{cite\1 |last1=\2\3
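For completeness, all three steps (tag, sort, strip) might look like this in Python. It is a sketch only: it assumes the replacement starts with a single `*`, the `STRIP` pattern for step 3 is not from this page, and `alphabetize` is a hypothetical name. Python's default sort compares code points, so capitalisation and diacritics affect the order.

```python
import re

# Step 1 search pattern from above, applied one line at a time.
TAG = re.compile(r"^\*\s*\{\{cite(.*?)\|\s*last1?\s*=\s*([^|]+)\s*(.*)$")
# Step 3 (assumed pattern): drop the leading hidden token again.
STRIP = re.compile(r"^\*<!--.*?-->")

def alphabetize(lines):
    """Tag each citation with its surname, sort, then remove the tags."""
    tagged = [TAG.sub(r"*<!--\2-->{{cite\1 |last1=\2\3", line) for line in lines]
    tagged.sort()                                    # step 2
    return [STRIP.sub("*", line) for line in tagged]

bibliography = [
    "* {{cite book |last=Zola |first=Émile |title=Nana}}",
    "* {{cite journal |last1=Adams |first1=Ann |title=Notes}}",
    "* {{cite web |last=Moore |title=Site}}",
]
for line in alphabetize(bibliography):
    print(line)
```

As a side effect of the replacement, `last=` is normalised to `last1=` and the spacing around the first parameter changes slightly.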

Article page history to parsed data


Turn article page history into a series of parsed lines:

  • 1=ARTICLE_TITLE 2=REVISION 3=HH:MM 4=Month DD, YYYY 5=TOTAL_BYTES 6=BYTE_CHANGE
  1. Go to article page history page
  2. Right-click, View page source
  3. Select-all, copy, paste
  4. Apply Search/Replace Regex below, with "dot matches newline"
  5. Optional step to convert underscore to blank in article titles

SEARCH:
<li.*?index.php\?title=([^&]+)&oldid=(\d+)[^>]+>(\d\d:\d\d),\s(.*?)</a>.*?title="([,\d]+)\sbytes after change of this size">(.?\d+)</span>.*?</li>

To generate the following output, use this replacement:
1=ARTICLE_TITLE 2=REVISION 3=HH:MM 4=Month DD, YYYY 5=TOTAL_BYTES 6=BYTE_CHANGE

REPLACE:
1=\1 2=\2 3=\3 4=\4 5=\5 6=\6

To generate a bulleted list with permalink and diff links instead, use this replacement:

REPLACE:
* [[Special:Permalink/\2|\2]] [[\1]] [[Special:Diff/\2|diff]] \3 \4; (change:\6b to \5 bytes)

To generate a six-column table row with this data, including one extra column for remarks, use this:
REPLACE:
|-
| [[\1]] || [[Special:Permalink/\2|\2]] || [[Special:Diff/\2|\6]] || \5 || \3 \4 || any remark here

Followed by optional underscore replacement. (s/_/ /gi).

To generate the following table row examples (table header/footer code added for context):

{| class="wikitable"
|+ Article history for Example user
! Article !! Perm !! Diff !! Len !! Timestamp !! Remark
|-
| Risk aversion || 916661155 || -1 || 31,671 || 00:29 September 20, 2019 || any remark
|-
| History of the provincial electoral map of Quebec || 916660706 || -1 || 26,988 || 00:26 September 20, 2019 || other remark
|}

User contribution history to parsed data


Turn user contribution history into a series of parsed lines:

  • 1=REVISION 2=TITLE 3=TIMESTAMP 4=BYTE_CHANGE 5=EDIT_SUMMARY
  1. Go to user contrib history page
  2. Right-click, View page source
  3. Select-all, copy, paste
  4. Find '<h4 class="mw-index-pager-list-header-first' and cut everything above it.
  5. Find '
  6. Apply Search/Replace Regex below, with "dot matches newline"
  7. Optional step to convert underscore to blank in article titles

SEARCH: (options: dot matches newline)
^<li data-mw-revid="(\d+)".*?class="mw-changeslist-date" title="(.*?)">(.*?)</a>.*?size">(.*?)</strong>.*?parentheses">(.*?)</span>.*?</li>$

To generate the following output
1=REVISION 2=TITLE 3=TIMESTAMP 4=BYTE_CHANGE 5=EDIT_SUMMARY
use this replacement:

REPLACE:
1=\1 2=\2 3=\3 4=\4 5=\5

To generate: rev=REVISION title=TITLE time=TIMESTAMP bytes=BYTE_CHANGE summary=EDIT_SUMMARY
REPLACE:
rev=\1 title=\2 time=\3 bytes=\4 summary=\5

To generate: rev=REVISION title=TITLE
SEARCH: (options: dot matches newline)
^<li data-mw-revid="(\d+)".*?title="([^"]+).*?</li>$
REPLACE:
rev=\1 title=\2

Convert glossary anchor to vanchor


SEARCH:
^;\s*{{Anchor\|([^\}]+)}}(?:[-<>\s,:\w\d]+)$
REPLACE:
;{{Vanchor|\1}}

Convert glossary <term> to be in-linkable


SEARCH:
^{{term\s*\|(term\s*=\s*)?([^|{}]+)
REPLACE:
{{term|\1|2={{Vanchor|\2}}

ES: Convert glossary <term>s to be in-linkable via global regex replace s!^{{term\s*\|(term\s*=\s*)?([^|{}]+)!{{term|\1|2={{Vanchor|\2}}!g


Parse wikilinks, exclude colons to exclude namespaces (this will exclude wikilinks that have colons in the anchor):

$1 = Target article $2 = Anchor (#-fragments untested):

  • \[\[([^:\|\]]+)\|?([^:\]]+)?\]\]

The variant below keeps the pipe (if there is one) in \2, so the replacement can generate language-prefixed links. For example, when translating a nav template from en to fr, one could start like this:

  • Search: \[\[([^:\|\]]+)(\|?[^:\]]+)?\]\]
  • Replace: [[:en:\1\2]]
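A quick Python check of this pair, as a sketch with a hypothetical function name `prefix_en`. In Python 3.5 and later an unmatched optional group is substituted as an empty string, so unpiped links work with the same replacement. Links containing a colon, such as `File:` links, are left alone.

```python
import re

# Group 1 is the target; group 2 is the optional pipe plus label.
LINK = re.compile(r"\[\[([^:\|\]]+)(\|?[^:\]]+)?\]\]")

def prefix_en(wikitext):
    """Point plain wikilinks at the English Wikipedia."""
    return LINK.sub(r"[[:en:\1\2]]", wikitext)

print(prefix_en("[[Paris]] and [[Lyon|the city]] and [[File:X.png]]"))
# [[:en:Paris]] and [[:en:Lyon|the city]] and [[File:X.png]]
```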

This adds superscript wikidata links to all wikilinks on a page so they can be easily translated:

  • Search: \[\[([^:\|\]]+)(\|?[^:\]]+)?\]\]('')?
  • Replace: [[\1\2]]\3<sup>[[[d:{{subst:wikidata|label|raw|page=\1}}#sitelinks-wikipedia|wd]]]</sup>

New contribs Translated pages to bullet list


From Special:contribs with 'new' pages box ticked; extracting pages with ContentTranslation tool summary:

  1. Search: \)‎ \. \. N\s([^(]*?) ‎ \(Created by translating the page "([^"]+)"
  2. Copy matches
  3. Replace: * [[\1]] from [[es:\2]]

Interlanguage template transformation

  1. Turn {{ca:GEC}} into {{sfn}}:
    • Search: {{GEC\|id=([\d]+)\|nom=([ \w]+).*?}}
    • Replace: {{sfn|GEC|loc=[http://www.enciclopedia.cat/EC-GEC-\1.xml \2]}}

FR - EN article translation preprocessing

1 <ref>{{(\w\w)}}\s*{{citation\|(.*?)</ref> -> {{efn|"{{lang|\1|\2}}"}}
2 ''{{lang\|de\|(.*?)}}'' -> {{lang|de|\1}}
3 {{citation\|(.*?)}} -> "\1"
4 <ref>\s*{{de}}\s*(.*?)\s*</ref> -> {{efn|{{lang|de|\1}}}}
5 <ref>{{harvsp\|(.*?)}}.</ref> -> {{sfn|\1}}

Substify and unsubstify

Substify
  • Search: (?<!{){\{(?!\{)
  • Replace: {{ {{{|safesubst:}}}
Unsubstify
  • Search: \s*{{\s*{{{\s*\|safesubst:\s*}}}
  • Replace: {{
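The two directions can be checked against each other in Python. This is a sketch with hypothetical function names. The lookarounds in the substify pattern are what keep triple-brace parameters such as `{{{1}}}` from being touched.

```python
import re

# "{{" that is neither preceded nor followed by another "{".
SUBSTIFY = re.compile(r"(?<!\{)\{\{(?!\{)")
UNSUBSTIFY = re.compile(r"\s*\{\{\s*\{\{\{\s*\|safesubst:\s*\}\}\}")

def substify(wikitext):
    """Insert the safesubst switch after every template opener."""
    return SUBSTIFY.sub("{{ {{{|safesubst:}}}", wikitext)

def unsubstify(wikitext):
    """Remove the safesubst switch again."""
    return UNSUBSTIFY.sub("{{", wikitext)

source = "{{foo|{{{1}}}}}"
print(substify(source))
# {{ {{{|safesubst:}}}foo|{{{1}}}}}
print(unsubstify(substify(source)) == source)
# True
```

The leading `\s*` in the unsubstify pattern also removes whitespace in front of the template, so the round trip is exact only when no whitespace precedes the opening braces.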

Aimed at Nav template translation, so handles bulleted links, optional pops or bolding, and specific lang prefix:

Unpiped links (e.g., * ''[[:fr:Documents maçonniques]]''):

  • Search: ^\*\s*('*)?\[\[:fr:([^]]+)\]\]('*)?
  • Replace: \1{{ill|ENGLISHNAME|fr|\2|v=sup}}\3

Piped links (e.g., * ''[[:fr:Idées (revue, 1941-1944)|Idées]]''):

  • Search: ^\*\s*('*)?\[\[:fr:([^]|]+)\|?([^]]+)?\]\]('*)?$
  • Replace: \1{{ill|ENGLISHNAME|fr|\2|lt=\3|v=sup}}\4

For bios or proper names, duplicate the Foreign name in the English article field:

  • Search: ^\*\s*('*)?\[\[:fr:([^]|]+)\|?([^]]+)?\]\]('*)?$
  • Replace: * \1{{ill|\2|fr|\2|lt=\3|v=sup}}\4

Examples:

  • * ''[[:fr:Le Juif et la France]]''
    • * ''{{ill|Le Juif et la France|fr|Le Juif et la France|lt=|v=sup}}''
  • * ''[[:fr:Combats]]''
    • * ''{{ill|Combats|fr|Combats|lt=|v=sup}}''
  • * ''[[:fr:Idées (revue, 1941-1944)|Idées]]''
    • * ''{{ill|Idées (revue, 1941-1944)|fr|Idées (revue, 1941-1944)|lt=Idées|v=sup}}''
  • * [[:fr:Publications antisémites en France]]
    • * {{ill|Publications antisémites en France|fr|Publications antisémites en France|lt=|v=sup}}
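The examples can be reproduced with the "bios or proper names" pair in Python, applied one line at a time. This is a sketch and `to_ill` is a hypothetical name. An unmatched label group becomes an empty string, which is why unpiped links end up with an empty `lt=`.

```python
import re

# Bullet, optional italic or bold quotes, a :fr: link with optional label.
FR_LINK = re.compile(r"^\*\s*('*)?\[\[:fr:([^\]|]+)\|?([^\]]+)?\]\]('*)?$")

def to_ill(line):
    """Convert a bulleted :fr: link to an {{ill}} template."""
    return FR_LINK.sub(r"* \1{{ill|\2|fr|\2|lt=\3|v=sup}}\4", line)

for line in [
    "* ''[[:fr:Combats]]''",
    "* ''[[:fr:Idées (revue, 1941-1944)|Idées]]''",
    "* [[:fr:Publications antisémites en France]]",
]:
    print(to_ill(line))
# * ''{{ill|Combats|fr|Combats|lt=|v=sup}}''
# * ''{{ill|Idées (revue, 1941-1944)|fr|Idées (revue, 1941-1944)|lt=Idées|v=sup}}''
# * {{ill|Publications antisémites en France|fr|Publications antisémites en France|lt=|v=sup}}
```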

Section demote

  • Search: ^(={2,5})([^=].*?)\1
  • Replace: =\1\2\1=

Subsection promote

  • Search: ^=(={2,5})([^=].*?)\1=
  • Replace: \1\2\1
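Demote and promote are inverses of each other, which is easy to confirm in Python. This is a sketch with hypothetical function names; `re.MULTILINE` makes `^` match at the start of every line.

```python
import re

DEMOTE = re.compile(r"^(={2,5})([^=].*?)\1", re.MULTILINE)
PROMOTE = re.compile(r"^=(={2,5})([^=].*?)\1=", re.MULTILINE)

def demote(wikitext):
    """Add one '=' on each side of level 2-5 headings."""
    return DEMOTE.sub(r"=\1\2\1=", wikitext)

def promote(wikitext):
    """Remove one '=' from each side of level 3-6 headings."""
    return PROMOTE.sub(r"\1\2\1", wikitext)

page = "== History ==\n=== Early ===\ntext"
print(demote(page))
# === History ===
# ==== Early ====
# text
print(promote(demote(page)) == page)
# True
```

Level-2 headings are not promoted, because the promote pattern needs at least three equals signs.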

Reflib section from last-first-year

  • Search: ^\*\s*(.*?)\|last=([^|]+)\s\|first=([^|]+)\s*\|year=(\d+)(.*?)$
  • Replace:
    == \2-\4 ==
    \1|last=\2 |first=\3|year=\4\5

Regionalize English: AE to BE


Zed to ess (recognize ⟶ recognise)

  • Search: /((?:[a-z-[aeiuo]]{0,3}[aeiouy]{1,2}){1,}[a-z-[aeiuo]]{0,3}[iy])z((?:e|ed|es|er|ers|ing)\b)/g
  • Replace: $1s$2
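The `[a-z-[aeiuo]]` construct is character-class subtraction as found in .NET regular expressions. Python's `re` module has no equivalent, so the sketch below spells out the consonants as `[b-df-hj-np-tv-z]`, which is the same set. The function name `to_british` is hypothetical.

```python
import re

C = "[b-df-hj-np-tv-z]"   # a-z minus a, e, i, o, u (y counts as a consonant)
ZED = re.compile(
    rf"((?:{C}{{0,3}}[aeiouy]{{1,2}}){{1,}}{C}{{0,3}}[iy])z"
    r"((?:e|ed|es|er|ers|ing)\b)"
)

def to_british(text):
    """Replace -ize/-yze style endings with the -ise/-yse spelling."""
    return ZED.sub(r"\1s\2", text)

print(to_british("recognize, organizing, size, citizen"))
# recognise, organising, size, citizen
```

The pattern requires at least one vowel group before the final `i` or `y`, so short words such as "size" and "prize" are left alone. It still produces some false positives, for example "seize" becomes "seise", so the output should be reviewed.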