Help:Searching/Regex

From Wikipedia, the free encyclopedia

To perform a regex search, use the ordinary search box with the syntax insource:/regex/ or intitle:/regex/.

All pages on Wikipedia are scanned and indexed by Wikipedia's own search engine. The entire wiki is treated as one "full text" kept in a separate database (an "index") built just for searching. It's like the index in a book, except that practically every word and every number is indexed, each pointing to every page that contains it.[1]

Since each word in the prebuilt search index already points to the pages that contain it, a keyword search usually corresponds to a single record lookup in the index. (This is also true for phrases, to a certain extent.) "Index searches" take basically no time to execute. They are cheap and plentiful.
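
The idea of a prebuilt index can be sketched in a few lines of Python. This is only a toy illustration, not Wikipedia's actual search engine: the sample pages and the function names build_index and keyword_search are invented for this example.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercase alphanumeric word to the set of page titles containing it."""
    index = defaultdict(set)
    for title, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(title)
    return index

def keyword_search(index, word):
    """A keyword search is one dictionary lookup, not a scan of every page."""
    return sorted(index.get(word.lower(), set()))

# Hypothetical sample pages, invented for illustration.
pages = {
    "Shoe polish": "Shoe polish is a waxy paste used to shine leather shoes.",
    "Poland": "Poland is a country in Central Europe.",
    "Leather": "Leather is made by tanning animal hides.",
}

index = build_index(pages)
print(keyword_search(index, "leather"))  # ['Leather', 'Shoe polish']
print(keyword_search(index, "polish"))   # ['Shoe polish']
```

All the scanning happens once, when the index is built; each later search only reads one entry, which is why index searches are cheap.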

There are separate indexes kept updated for:

  • titles
  • visual content
  • wikitext
  • templates

Any text transcluded from a template is indexed as if it were really present on its target page. (In other words, by default, a keyword search is done on the text of the rendered Wikipedia page, not on the page source itself. However, you can change this by using insource:keyword to search the source markup instead of the rendered page.)

Wikipedia's servers prepare and maintain the search indexes in the background, in near real time. A few seconds after you save a page, you can search for the changes you just made. For templates that are transcluded onto many pages, propagating those changes to all the affected pages in the index might take a while.

The index is based on alphanumeric characters; it stores no information on non-alphanumeric characters. If you type any punctuation or brackets into the search box when doing an indexed search, those characters will be silently discarded.

A basic indexed search

  • searches only article space by default.
  • matches only letters and numbers. This is usually not a problem.
  • returns a lot of search results, so you rely heavily on page ranking rules. You then refine the results based on the topmost pages, using the "not" filter: a minus sign attached to the front of an unwanted word filters out page-hit noise you could not have predicted.
  • is an "aggressive matcher" including as many pages as it can by matching all forms of each word you enter.
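
Two of the behaviours listed above, discarding punctuation and filtering with a minus sign, can be imitated with a small Python sketch. It is a simplification under stated assumptions: the sample pages and the names tokenize and search are hypothetical, stemming ("all forms of each word") and ranking are not modelled, and for brevity the words of each page are collected on the fly rather than read from a prebuilt index.

```python
import re

def tokenize(text):
    """Keep only lowercase runs of letters and digits; punctuation is dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())

def search(pages, query):
    """Return titles of pages that contain every wanted word and no unwanted word."""
    wanted, unwanted = set(), set()
    for term in query.split():
        # A leading minus sign marks a word to filter out.
        target = unwanted if term.startswith("-") else wanted
        target.update(tokenize(term))
    results = []
    for title, text in pages.items():
        words = set(tokenize(text))
        if wanted <= words and not (unwanted & words):
            results.append(title)
    return sorted(results)

# Hypothetical sample pages, invented for illustration.
pages = {
    "Shoe polish": "Shoe polish is a waxy paste used to shine leather shoes.",
    "Nail polish": "Nail polish is a lacquer applied to fingernails.",
    "French polish": "French polish is a wood-finishing technique.",
}

print(search(pages, "polish"))        # all three pages
print(search(pages, "(polish)!"))     # same result: punctuation is ignored
print(search(pages, "polish -nail"))  # the nail polish page is filtered out
```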

Instead of doing a basic indexed search on keywords, you can perform a regex search, which bypasses the index. A regex search scans the text of each page on Wikipedia in real time, character by character, to find pages that match a specific sequence or pattern of characters. Unlike keyword searching, regex searching is by default case-sensitive, does not ignore punctuation, and operates directly on the page source (MediaWiki markup) rather than on the rendered contents of the page.

To perform a regex search, use the ordinary search box with the syntax insource:/regex/ or intitle:/regex/. The expression regex denotes a regular expression in MediaWiki-flavored regular expression syntax.

Use regexes responsibly

Because regex searching scans each page character by character, it is generally much slower than an index search. You can, and should, add further search terms when using insource:/regex/ to reduce the amount of text being processed. For example:

  • polish insource:/polish/ finds pages that match a case-insensitive stemmed keyword search for "polish" (including "polished" or "polishing"); then does a case-sensitive regex search within those pages. Only pages that match both filters are returned.
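  • insource:polish insource:/polish/ is similar, but starts with a case-insensitive keyword search of the source markup instead of the rendered page. It therefore finds occurrences that appear only in the markup (for example, in the target of a link displayed as Poles) and does not find text that reaches the page only through transclusion.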

Adding an index-based search term to reduce the amount of text being scanned is important simply to make your own regex search finish in a reasonable amount of time. Regex searches that take too long will "time out" and return only partial results. Overuse of slow regex searches might cause temporary throttling of the feature for yourself and/or everyone on Wikipedia. (However, you cannot affect the site performance of Wikipedia as a whole simply by abusing regex search.) Remember that a single regex search can take multiple seconds, and there are currently 48,446,138 registered users on Wikipedia. Use regex search responsibly.
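
The benefit of combining a keyword with a regex can be illustrated with a rough Python sketch. Everything here is an assumption made for the example: the sample pages and the name two_stage_search are hypothetical, Python's re module stands in for the MediaWiki regex engine (the pattern used is a plain literal, so the dialects agree), and stemming is not modelled.

```python
import re

def tokenize(text):
    """Lowercase alphanumeric words, as a stand-in for what the index stores."""
    return re.findall(r"[a-z0-9]+", text.lower())

def two_stage_search(pages, keyword, pattern):
    """Filter by keyword first, then run the case-sensitive regex on the survivors.

    Returns the matching titles and the number of pages the regex had to scan.
    """
    # Stage 1: cheap, case-insensitive keyword filter (an index lookup in a real engine).
    candidates = [title for title, source in pages.items()
                  if keyword.lower() in tokenize(source)]
    # Stage 2: expensive character-by-character scan, restricted to the candidates.
    regex = re.compile(pattern)
    hits = sorted(title for title in candidates if regex.search(pages[title]))
    return hits, len(candidates)

# Hypothetical page sources, invented for illustration.
pages = {
    "Shoe polish": "Apply the polish with a soft cloth.",
    "Poland": "The Polish language is spoken in Poland.",
    "Leather": "Leather is made by tanning hides.",
}

hits, scanned = two_stage_search(pages, "polish", "polish")
print(hits)     # ['Shoe polish']
print(scanned)  # 2
```

In this toy data set the regex scans two pages instead of three. On a wiki with millions of pages, the same narrowing is what keeps a regex search from timing out.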

Metacharacters

MediaWiki's regular expression syntax works like this:

  • Most characters represent themselves. For example, insource:/C-3p0/ will search for pages containing the literal string "C-3p0" (case-sensitive).
  • The following metacharacters are treated specially: . + * ? | { [ ] ( ) " \ # @ < ~. Any metacharacter can be escaped by preceding it with a backslash \. Preceding any other character with a backslash is harmless. For example, insource:/yes\.\no/ will search for pages containing the literal string "yes.no" (case-sensitive). Regex experts should note that \n does not mean "newline," \d does not mean "digit," and so on: In MediaWiki syntax, the only use of \ is to escape metacharacters.
  • / is special because it indicates the end of the regex. For example, insource:/yes/no/ is treated the same as insource:/yes/ no (because the keyword search for no/ ignores punctuation). The / character must be backslash-escaped everywhere it appears inside a regex – even inside square brackets or quotation marks.
  • . matches any single character. For example, insource:/yes.no/ is matched by yes/no, yes no, yesuno, etc.
  • ( ) group a sequence of characters into an atomic unit.
  • | goes between two sequences and matches either of them. For example, insource:/a(g|ch)e/ matches either age or ache.
  • + matches the preceding character or group one or more times. For example, insource:/ab+(cd)+/ is matched by abcd, abbbcd, abbcdcd, etc. insource:/a(g|ch)+e/ matches agge, achgchchggche, etc.
  • * matches the preceding character or group any number of times (including zero). For example, insource:/ab*(cd)*/ is matched by a, abbb, acdcd, etc.
  • ? matches the preceding character or group exactly zero or one times. For example, insource:/colou?r/ matches both color and colour.
  • { } match the preceding character or group a fixed number of times. For example, insource:/[a-z]{2}/ matches exactly 2 lowercase letters in a row. insource:/[a-z]{2,4}/ matches any string of 2, 3, or 4 lowercase letters. insource:/[a-z]{2,}/ matches any string of 2 or more lowercase letters.
  • [ ] introduce a character class, which matches a single instance of any of the characters in the class. For example, insource:/[Pp]olish/ matches both Polish and polish. Characters inside square brackets generally don't have to be escaped, although escaping them remains harmless, and / still needs to be escaped everywhere. For example, insource:/[.\/\]\n]/ matches a single instance of ., /, ], or n.
  • Inside a character class, the character ^ (if it appears first of all) represents negation, and the character - (unless it appears first or last) represents a range. For example, insource:/[A-Za-z0-9_]/ matches any alphanumeric character or underscore, and insource:/[^A-Za-z]/ matches any non-alphabetic character.
  • < > stand for numbers treated as numbers, not characters. For example, insource:/AD <476-1453>/ is matched by AD 476, AD 477, ... AD 1452, AD 1453, but not AD 1474. (But it will also match the first six characters of AD 4760.)
  • ~ "looks ahead" and negates the next character or group. For example, insource:/crab~(cake)c/ should match the first five characters of crabclaw but not the first five characters of crabcake.[clarification needed]
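
Several of the operators above (grouping, |, +, *, { } and character classes) behave the same way in many regex dialects, so some of the listed examples can be checked with Python's re module. This is only a stand-in for experimentation, not the MediaWiki engine: the MediaWiki-specific metacharacters < >, ~, @, # and " are not modelled, and fullmatch is used to test which whole strings a pattern accepts, whereas insource: finds the pattern anywhere in a page. The names EXAMPLES and find_failures are invented for this sketch.

```python
import re

# Each entry: (pattern, strings it should accept in full, strings it should reject).
EXAMPLES = [
    (r"a(g|ch)e", ["age", "ache"], ["agche"]),
    (r"ab+(cd)+", ["abcd", "abbbcd", "abbcdcd"], ["acd", "ab"]),
    (r"a(g|ch)+e", ["agge", "achgchchggche"], ["ae"]),
    (r"ab*(cd)*", ["a", "abbb", "acdcd"], ["b"]),
    (r"[a-z]{2,4}", ["ab", "abc", "abcd"], ["a", "abcde"]),
    (r"[Pp]olish", ["Polish", "polish"], ["POLISH"]),
    (r"[^A-Za-z]", ["1", "-"], ["a"]),
]

def find_failures(examples):
    """Return (pattern, string) pairs whose behaviour differs from the expectation."""
    failures = []
    for pattern, accepted, rejected in examples:
        regex = re.compile(pattern)
        for s in accepted:
            if regex.fullmatch(s) is None:
                failures.append((pattern, s))
        for s in rejected:
            if regex.fullmatch(s) is not None:
                failures.append((pattern, s))
    return failures

failures = find_failures(EXAMPLES)
print(failures)  # an empty list means every example behaved as described
```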

There are a few additional quirks of the syntax:

  • The metacharacter @ is a synonym for .* (match any sequence of characters at all).
  • A search insource:/0/ fails, although insource:/1/ and insource:/\0/ both succeed.
  • " " are an escape mechanism, like square brackets or the backslash. For example, insource:/".*"/ means the same thing as insource:/\.\*/.
  • The character # is also a metacharacter and must be escaped.[clarification needed]
  • Regex experts should note that ^ does not mean "beginning of text" and $ does not mean "end of text." Searching from the beginning or end of a Wikipedia page is not generally useful.
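
When the goal is to search for a literal string, the escaping rules above can be applied mechanically. The following Python sketch is one possible helper, with hypothetical names (METACHARACTERS, escape_literal, insource_query); it escapes exactly the metacharacters listed on this page plus the / delimiter, and it does not handle the insource:/0/ quirk.

```python
# Metacharacters as listed on this page, plus "/" which ends the regex.
METACHARACTERS = set('.+*?|{[]()"\\#@<~/')

def escape_literal(text):
    """Backslash-escape every metacharacter so the text matches itself literally."""
    return "".join("\\" + ch if ch in METACHARACTERS else ch for ch in text)

def insource_query(literal):
    """Build an insource:/.../ search string for a literal, case-sensitive match."""
    return "insource:/" + escape_literal(literal) + "/"

print(insource_query("yes.no"))  # insource:/yes\.no/
print(insource_query("yes/no"))  # insource:/yes\/no/
print(insource_query("C-3p0"))   # insource:/C-3p0/
```

Because escaping a non-special character is harmless, a more cautious variant could escape additional punctuation as well; this version stays with the characters the page names.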

Workarounds for some character classes

Although character classes \n, \s, \S are not supported, you may use these workarounds:

PCRE     MediaWiki    Description
\n       [^ -􏿽]       A newline (also matches a tabulation character; see the note below)
[^\n]    [ -􏿽]        Any character except a newline or a tabulation character
\s       [^!-􏿽]       A whitespace character: space, newline, or tabulation
\S       [!-􏿽]        Any character except whitespace

Note: To exclude the tabulation character from the \n workaround as well, copy a tabulation character and add it to the character set.

In these ranges, " " (space) is the character immediately following the control characters, "!" is the character immediately following space, and "􏿽" is U+10FFFF, the last character in Unicode. Thus, the range from " " to "􏿽" includes all characters except for control characters (of which articles may contain newlines and tabulation), while the range from "!" to "􏿽" includes all characters except for control characters and space.
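
The reasoning behind these ranges can be checked with Python's re module, which accepts the same kind of character-class range. This is only a sanity check in a different regex engine, not a test of MediaWiki itself; the variable names and the sample string are invented for the example.

```python
import re

LAST = "\U0010FFFF"  # the last code point in Unicode

# Python equivalents of the workaround classes from the table.
newline_like = re.compile("[^ -" + LAST + "]")     # stands in for \n (also matches tab)
non_whitespace = re.compile("[!-" + LAST + "]")    # stands in for \S
whitespace_like = re.compile("[^!-" + LAST + "]")  # stands in for \s

sample = "a b\tc\nd"

print(newline_like.findall(sample))     # ['\t', '\n']
print(whitespace_like.findall(sample))  # [' ', '\t', '\n']
print(non_whitespace.findall(sample))   # ['a', 'b', 'c', 'd']
```

The first class matches the tab as well as the newline, which is the side effect described in the note under the table.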

Notes

  1. When you do a basic keyword search on Wikipedia, you aren't scanning pages in real time; you are simply looking up an entry in the index. All content is at all times "known" and resides in indexes. So when you read something like "search for pages containing...", you can mentally replace "search for..." with "search the index for..."