User:MarkMYoung/regular expressions

From Wikipedia, the free encyclopedia

Useful Regular Expressions

Incomplete or incorrect regular expressions are scattered across the Internet, and books are reluctant to list them because the author would likely have to publish errata at some point. So I am compiling a list of regular expressions here; these may also be incorrect, but they are at least in one place. One decent source is the Regexp::Common module available from CPAN. However, I was prompted to maintain this page when I discovered that the pattern Regexp::Common::net (v2.120) uses for one decimal unit of an IPv4 address, (?k:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2}), is too permissive: it accepts '05', a leading-zero form that many parsers read as octal. That version of the module also has no IPv6 regular expression.
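The leading-zero problem is easy to reproduce. The sketch below copies the unit pattern quoted above, with Regexp::Common's (?k: marker swapped for a plain (?: group (an assumption, so that the pattern compiles without the module), and compares it with a stricter alternative. The variable names are illustrative only.

```perl
use strict;
use warnings;

# Unit pattern as quoted above; "(?k:" is Regexp::Common syntax, so a plain
# non-capturing group stands in for it here (assumption for illustration).
my $LOOSE_UNIT_REGEXP  = qr/\A(?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})\z/;

# Stricter variant: value 0-255 with no leading zeros.
my $STRICT_UNIT_REGEXP = qr/\A(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\z/;

for my $unit (qw(0 5 05 005 199 255 256)) {
    printf "%-4s loose=%-3s strict=%s\n", $unit,
        ( $unit =~ $LOOSE_UNIT_REGEXP  ? 'yes' : 'no' ),
        ( $unit =~ $STRICT_UNIT_REGEXP ? 'yes' : 'no' );
}
```

Only '05' and '005' should differ between the two columns.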

CSV

my $CSV_REGEXP = qr/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/;
my $tsv = join( "\t", split( $CSV_REGEXP, $csv ));

The split leaves the enclosing double quotes, which are now superfluous, in place around each quoted field.
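A minimal sketch of the split followed by an optional clean-up pass that strips the enclosing quotes. The sample line and the clean-up substitution are assumptions added for illustration; doubled quotes ("") inside a field are not handled.

```perl
use strict;
use warnings;

# Split only on commas followed by an even number of double quotes,
# i.e. commas that lie outside a quoted field.
my $CSV_REGEXP = qr/,(?=(?:[^"]*"[^"]*")*(?![^"]*"))/;

my $csv    = 'alpha,"beta, gamma",delta';
my @fields = split( $CSV_REGEXP, $csv );

# Optional clean-up: drop one pair of enclosing quotes from each field.
s/\A"(.*)"\z/$1/s for @fields;

my $tsv = join( "\t", @fields );
print "$tsv\n";
```

Note that split drops trailing empty fields unless it is given a negative limit as its third argument.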

Decimal Number

my $DECIMAL_REGEXP = qr/^([-+]?(?:(?:\d+\.?\d*)|(?:\d*\.?\d+)))$/;
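The two alternatives exist so that either the integer part or the fractional part may be omitted, but not both. A short sketch that exercises the pattern on a few hand-picked sample strings (the samples are assumptions chosen for illustration):

```perl
use strict;
use warnings;

my $DECIMAL_REGEXP = qr/^([-+]?(?:(?:\d+\.?\d*)|(?:\d*\.?\d+)))$/;

# Returns 1 if the whole string is a signed or unsigned decimal number.
sub is_decimal {
    my ($text) = @_;
    return ( $text =~ $DECIMAL_REGEXP ) ? 1 : 0;
}

for my $candidate ( '42', '-3.14', '+.5', '7.', '.', '1e5', '1.2.3', '' ) {
    my $verdict = is_decimal($candidate) ? 'decimal' : 'not decimal';
    print "'$candidate' => $verdict\n";
}
```

Because the pattern ends in $ rather than \z, a single trailing newline is also tolerated.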

Domain Name

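This regular expression merely ensures that the domain contains only permitted characters and that each dot-separated label is between 1 and 63 characters long. Note that \w also admits the underscore, which host names (as opposed to general DNS names) do not permit.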
my $DOMAINNAME_CHARSET_REGEXP = qr/[\w\-]/;
my $DOMAINNAME_UQ_REGEXP = qr/(?:(?:$DOMAINNAME_CHARSET_REGEXP){1,63})(?:\.(?:$DOMAINNAME_CHARSET_REGEXP){1,63})*/;

One can use either the enumerated or the more general top-level domain regular expression. The general form accepts only 2 to 6 letters, so it rejects longer top-level domains.
my $DOMAINNAME_TLD_ENUM_REGEXP = qr/\.(?:[a-zA-Z]{2}|aero|biz|com|gov|info|jobs|museum|name|net|org)/i;
my $DOMAINNAME_TLD_REGEXP = qr/(?:\.[a-zA-Z]{2,6})/;
my $DOMAINNAME_FQ_REGEXP = qr/$DOMAINNAME_UQ_REGEXP$DOMAINNAME_TLD_REGEXP/;

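This regular expression detects misplaced hyphens: at the beginning, after a dot, consecutive, before a dot, or at the end. A valid name must not match it. Because it flags consecutive hyphens, it also rejects internationalized labels that begin with xn--.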
my $DOMAINNAME_MISPLACED_HYPHENS_REGEXP = qr/(?:\A\-)|(?:\.\-)|(?:\-\-)|(?:\-\.)|(?:\-\z)/;

Keep in mind that something as simple as the word "a" or the text "0.7-1.2" matches as an unqualified hostname, so this check is good for validation but not for searching. The pattern is anchored with \A and \z so that the whole string, not just a fragment of it, must match.
my $domainLength_i = length( $hostName_str );
my $isValidDomainLength_b = (($domainLength_i >= 1) && ($domainLength_i <= 255));
my $isUqDomainName_b = ($isValidDomainLength_b && ($hostName_str !~ $DOMAINNAME_MISPLACED_HYPHENS_REGEXP) && ($hostName_str =~ m/\A$DOMAINNAME_UQ_REGEXP\z/));

Additionally requiring a top-level domain at the end of the name is much better suited for searching.
my $isFqDomainName_b = ($isUqDomainName_b && ($hostName_str =~ m/$DOMAINNAME_TLD_REGEXP\z/));
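The three checks (overall length, misplaced hyphens, label syntax) and the top-level domain test can be wrapped in one routine. The following is a sketch under stated assumptions: classify_domain and the shortened variable names are hypothetical, and the patterns are anchored so that the whole string must match.

```perl
use strict;
use warnings;

# Hypothetical helper names; the patterns mirror the ones in this section.
my $LABEL_REGEXP            = qr/[\w\-]{1,63}/;
my $UQ_REGEXP               = qr/\A$LABEL_REGEXP(?:\.$LABEL_REGEXP)*\z/;
my $MISPLACED_HYPHEN_REGEXP = qr/\A-|\.-|--|-\.|-\z/;
my $TLD_REGEXP              = qr/\.[a-zA-Z]{2,6}\z/;

# Returns 'invalid', 'unqualified' or 'fully qualified'.
sub classify_domain {
    my ($name) = @_;
    my $length = length $name;
    return 'invalid' if $length < 1 || $length > 255;
    return 'invalid' if $name =~ $MISPLACED_HYPHEN_REGEXP;
    return 'invalid' if $name !~ $UQ_REGEXP;
    return ( $name =~ $TLD_REGEXP ) ? 'fully qualified' : 'unqualified';
}

print "$_ => ", classify_domain($_), "\n"
    for ( 'a', '0.7-1.2', 'www.example.com', '-bad.example.com', 'two..dots.com' );
```

Both "a" and "0.7-1.2" come out as unqualified names, which is why the unqualified check alone is a poor search pattern.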

Here is a reasonable one-line regular expression that does not check for overall length greater than 255 or misplaced hyphens.
my $DOMAINNAME_FQ_REASONABLE_REGEXP = qr/(?:[\w\-]{1,63})(?:\.[\w\-]{1,63})*(?:\.[a-zA-Z]{2,6})/;

E-Mail Address / Username

my $USERNAME_CHARSET_REGEXP = qr/[\w\!\#\$\%\&\'\`\*\+\/\=\?\^\{\|\}\~\-]/;
my $USERNAME_REASONABLE_REGEXP = qr/(?:$USERNAME_CHARSET_REGEXP)+(?:\.(?:$USERNAME_CHARSET_REGEXP)+)*/;
# Uses $DOMAINNAME_FQ_REGEXP from the Domain Name section.
my $EMAIL_REASONABLE_REGEXP = qr/$USERNAME_REASONABLE_REGEXP\@$DOMAINNAME_FQ_REGEXP/;
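A self-contained sketch of how the pieces fit together for validation. The helper is_reasonable_email is hypothetical, the domain part reuses the one-line "reasonable" pattern from the Domain Name section, and the whole pattern is anchored. It is a plausibility check only, not a full RFC-grade address parser.

```perl
use strict;
use warnings;

my $USERNAME_CHARSET_REGEXP = qr/[\w\!\#\$\%\&\'\`\*\+\/\=\?\^\{\|\}\~\-]/;
# One or more runs of permitted characters, separated by single dots.
my $USERNAME_REGEXP = qr/$USERNAME_CHARSET_REGEXP+(?:\.$USERNAME_CHARSET_REGEXP+)*/;
my $DOMAIN_REGEXP   = qr/[\w\-]{1,63}(?:\.[\w\-]{1,63})*\.[a-zA-Z]{2,6}/;
my $EMAIL_REGEXP    = qr/\A$USERNAME_REGEXP\@$DOMAIN_REGEXP\z/;

# Hypothetical helper: returns 1 if the whole string looks like an address.
sub is_reasonable_email {
    my ($text) = @_;
    return ( $text =~ $EMAIL_REGEXP ) ? 1 : 0;
}

for my $candidate (
    'john.doe+tag@example.com', 'john..doe@example.com',
    '.john@example.com',        'user@localhost',
) {
    printf "%-28s %s\n", $candidate,
        is_reasonable_email($candidate) ? 'accepted' : 'rejected';
}
```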

IP Address

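# Each unit is 0-255 with no leading zeros; longer alternatives come first.
my $IP4_UNIT_REGEXP = qr/(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]\d|\d)/;
my $IP4_REGEXP = qr/$IP4_UNIT_REGEXP(?:\.$IP4_UNIT_REGEXP){3}/;
# Full eight-group form only; "::" compression is not handled.
my $IP6_REGEXP = qr/(?:[\dA-Fa-f]{1,4})(?:\:[\dA-Fa-f]{1,4}){7}/;
my $IP_REGEXP = qr/(?:$IP4_REGEXP|$IP6_REGEXP)/;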
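When composing patterns like these, keep in mind that alternation binds more loosely than concatenation, so a leading dot or a choice between two sub-patterns has to be grouped explicitly. The sketch below shows one way to do that; the helper is_ip_address and the variable names are hypothetical, and the IPv6 part covers only the uncompressed eight-group notation.

```perl
use strict;
use warnings;

# One decimal unit, 0-255, no leading zeros; longer alternatives first.
my $UNIT_REGEXP = qr/(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]\d|\d)/;
# The dot sits outside the alternation so it applies to every alternative.
my $IP4_REGEXP  = qr/$UNIT_REGEXP(?:\.$UNIT_REGEXP){3}/;
# Full eight-group IPv6 form only; "::" compression is not handled.
my $IP6_REGEXP  = qr/[\dA-Fa-f]{1,4}(?:\:[\dA-Fa-f]{1,4}){7}/;
# Either family, anchored so that the whole string must match.
my $IP_REGEXP   = qr/\A(?:$IP4_REGEXP|$IP6_REGEXP)\z/;

# Hypothetical helper: returns 1 if the whole string is an address.
sub is_ip_address {
    my ($text) = @_;
    return ( $text =~ $IP_REGEXP ) ? 1 : 0;
}

for my $candidate (
    '192.168.0.1', '256.1.1.1', '192.168.05.1', '1.2.3',
    '2001:0db8:85a3:0000:0000:8a2e:0370:7334', '::1',
) {
    printf "%-40s %s\n", $candidate,
        is_ip_address($candidate) ? 'accepted' : 'rejected';
}
```

The compressed address '::1' is rejected here, which is a known limitation of the eight-group pattern rather than a property of IPv6.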

MAC Address

# Six hexadecimal groups separated by colons or hyphens.
my $MAC_REGEXP = qr/[0-9a-fA-F]{1,2}(?:[:\-][0-9a-fA-F]{1,2}){5}/;
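A minimal validation sketch, assuming the common notation of six hexadecimal groups separated by colons or hyphens. The helper is_mac_address is hypothetical; the pattern is anchored and, as written, does not require the same separator throughout.

```perl
use strict;
use warnings;

# Six groups of one or two hex digits, separated by ':' or '-' (assumed notation).
my $MAC_REGEXP = qr/\A[0-9a-fA-F]{1,2}(?:[:\-][0-9a-fA-F]{1,2}){5}\z/;

# Hypothetical helper: returns 1 if the whole string is a MAC address.
sub is_mac_address {
    my ($text) = @_;
    return ( $text =~ $MAC_REGEXP ) ? 1 : 0;
}

for my $candidate (
    '00:1A:2B:3C:4D:5E', '00-1a-2b-3c-4d-5e', '0:1a:2b:3c:4d:5e',
    '001A2B3C4D5E',      '00:1A:2B:3C:4D',    '00:1A:2B:3C:4D:5G',
) {
    printf "%-18s %s\n", $candidate,
        is_mac_address($candidate) ? 'accepted' : 'rejected';
}
```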

URI

my $URI_REGEXP = qr/(?:([^:\/?#]+):)?(?:\/\/([^\/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?/;
my($scheme, $authority, $path, $query, $fragment) = $uri =~ m/$URI_REGEXP/;
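The pattern has the same shape as the generic URI-splitting expression in RFC 3986, Appendix B, with the outer groups made non-capturing so that exactly five values come back. A short sketch of what it returns; split_uri and the sample URIs are illustrative assumptions. Components that are absent come back as undef, while the path is always defined (possibly empty).

```perl
use strict;
use warnings;

my $URI_REGEXP = qr/(?:([^:\/?#]+):)?(?:\/\/([^\/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?/;

# Hypothetical helper: returns (scheme, authority, path, query, fragment).
sub split_uri {
    my ($uri) = @_;
    return $uri =~ m/$URI_REGEXP/;
}

for my $uri ( 'http://www.example.com/a/b?x=1#frag', 'mailto:user@example.com' ) {
    my @parts = split_uri($uri);
    my @names = qw(scheme authority path query fragment);
    print "$uri\n";
    for my $i ( 0 .. $#names ) {
        printf "  %-9s %s\n", $names[$i],
            defined $parts[$i] ? "'$parts[$i]'" : '(undef)';
    }
}
```

The pattern splits a string into components but does not validate them, so nearly any input produces a match.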