Friday, January 7, 2011

How to use regular expressions with PHP


Introduction to PHP Regex


Contents

  1. What is a regex?
  2. Where do I begin?
  3. Match beginning of a string
  4. How do I find a pattern at the end of a string?
  5. Meta Characters
  6. Special Sequences
  7. Putting it together
  8. Modifiers and Assertions
  9. Evaluation using preg_replace()
  10. Look Aheads
  11. Look behinds
  12. Make Search Un-greedy
  13. Delimiters
  14. Credits
  15. Examples
  16. Cheat Sheet

What is a regex?


At its most basic level, a regex is a method of matching patterns within a string. In PHP the most commonly used flavour is PCRE, or "Perl Compatible Regular Expressions". Here we will try to decipher the seemingly meaningless hieroglyphics and set you on your way to a powerful tool for use in your applications. Do not try to understand all this in a single sitting. Instead, take in a little and come back as you grasp the various concepts.

Where do I begin?


At the beginning.
Let's create a string.

<?php
// create a string
$string = 'abcdefghijklmnopqrstuvwxyz0123456789';
// echo our string
echo $string;
?>
If we simply wanted to see if the pattern 'abc' was within our larger string we could easily do something like this:
<?php
// create a string
$string = 'abcdefghijklmnopqrstuvwxyz0123456789';

echo preg_match("/abc/", $string);
?>
The code above will echo '1'. This is because it has found one match for the regex: the pattern 'abc' occurs in the larger string. preg_match() returns 0 if there are no matches, or 1 if it finds a match, and it stops searching after the first match. Of course, you would not do this in a real world situation, as PHP has functions such as strpos() and strstr() which will do this much faster.
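As a rough sketch of the return values just described, the snippet below puts a matching pattern, a non-matching pattern and the plain strpos() alternative side by side. The variable names $found, $missing and $plain are only illustrative.

```php
<?php
$string = 'abcdefghijklmnopqrstuvwxyz0123456789';

// 1 when the pattern is found, 0 when it is not
$found   = preg_match("/abc/", $string);
$missing = preg_match("/xyz123/", $string);

// for a fixed substring, strpos() does the same job without a regex;
// compare with !== false because a match at offset 0 is falsy
$plain = strpos($string, 'abc') !== false;

echo $found, ' ', $missing, ' ', var_export($plain, true);
```

Run as written, this should print 1 0 true.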

Match beginning of a string


Now we wish to see if the string begins with abc. The regex character for beginning is the caret ^. To see if our string begins with abc, we use it like this:
<?php
// create a string
$string = 'abcdefghijklmnopqrstuvwxyz0123456789';
// try to match the beginning of the string
if(preg_match("/^abc/", $string))
    {
    // if it matches we echo this line
    echo 'The string begins with abc';
    }
else
    {
    // if no match is found echo this line
    echo 'No match found';
    }
?>
From the code above we see that it echoes the line
The string begins with abc
The forward slashes are delimiters that hold our regex pattern, and the quotation marks wrap it all up as a PHP string. So we see that the caret (^) matches the beginning of the string itself, NOT any of the characters after it.

What if I want case insensitive?

If you used the above code to find the pattern ABC like this:
if(preg_match("/^ABC/", $string))
the script would have returned the message:
No match found
This is because the search is case sensitive. The pattern 'abc' is not the same as 'ABC'. To match both 'abc' and 'ABC' we need to use a modifier to make the search case-insensitive. PHP regex, like most regex flavours, uses 'i' for insensitive. So now our script might look like this:
<?php
// create a string
$string = 'abcdefghijklmnopqrstuvwxyz0123456789';
// try to match our pattern
if(preg_match("/^ABC/i", $string))
    {
    // echo this if it matches
    echo 'The string begins with abc';
    }
else
    {
    // if no match is found echo this line
    echo 'No match found';
    }
?>
Now the script will find the pattern abc. It would also match any other combination of upper and lower case, such as ABC, Abc, aBc, and so on.
More on modifiers later.

How do I find a pattern at the end of a string?


This is done in much the same way as finding a pattern at the beginning of a string. A common mistake is to assume that the $ character matches only at the very end of a string. It does not, and when that is what you need, \z should be used. Consider this..
preg_match("/^foo$/", "foo\n")
This will return 1, as $ is like \Z, which is like (?=\z|\n\z): it also matches just before a newline at the end of the string. So when a trailing newline is not wanted, $ should not be used. Also, $ will match at the end of every line with the m modifier, whereas \z will not. Let's make a small change to the code from above by removing the caret (^) at the beginning of the pattern and putting \z at the end of the pattern. We will keep the case-insensitive modifier in to match any case.
<?php
// create a string
$string = 'abcdefghijklmnopqrstuvwxyz0123456789';
// try to match our pattern at the end of the string
if(preg_match("/89\z/i", $string))
    {
    // if our pattern matches we echo this
    echo 'The string ends with 89';
    }
else
    {
    // if no match is found we echo this line
    echo 'No match found';
    }
?>
The script now will show the line
The string ends with 89
because we have matched the end of the string with the pattern 89. Pretty easy stuff so far.
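To see the difference between $ and \z for yourself, here is a minimal sketch using the "foo\n" string from the start of this section. The variable names are made up for the example, and the patterns are in single quotes simply to keep PHP from touching the $ or the backslash.

```php
<?php
$withNewline = "foo\n";

// $ also matches just before a trailing newline
$dollar = preg_match('/^foo$/', $withNewline);

// \z only matches at the very end of the subject
$strict = preg_match('/^foo\z/', $withNewline);

echo $dollar, ' ', $strict;
```

This should print 1 0: the $ version accepts the trailing newline, the \z version does not.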

Meta characters


During our first look at regex we did some simple pattern matching. We also introduced the caret(^) and the dollar($). These characters have special meaning. As we saw, the caret(^) matched the beginning of a string and the dollar matched the end of a string. These characters, along with others are called Meta characters. Here is a list of the Meta characters used for regex:
  • . (Full stop)
  • ^ (Caret)
  • $ (Dollar sign)
  • * (Asterisk)
  • + (Plus)
  • ? (Question Mark)
  • { (Opening curly brace)
  • [ (Opening square bracket)
  • ] (Closing square bracket)
  • \ (Backslash)
  • | (Pipe)
  • ( (Opening parens)
  • ) (Closing parens)
  • } (Closing curly brace)
We will look at each of these during this tutorial, but it is important that you know what they are. If you wish to search a string that contains one of these characters, eg: "1+1", then you need to escape the meta character with a backslash like this:
<?php
// create a string
$string = '1+1=2';
// try to match our pattern
if(preg_match("/^1\+1/i", $string))
    {
    // if the pattern matches we echo this
    echo 'The string begins with 1+1';
    }
else
    {
    // if no match is found we echo this line
    echo 'No match found';
    }
?>
From the code above you will see the script print:
The string begins with 1+1
because it found the pattern 1+1 and ignored or escaped the special meaning of the + symbol. If you were to not escape the meta character and use the regex
preg_match("/^1+1/i", $string)
you would not find a match.
If you are looking for a backslash, you need to escape that also. But the backslash is also the escape character inside a PHP string, so each of the two backslashes the regex needs must itself be escaped. Hence we need to escape twice, like this:
\\\\
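A small sketch of the double escaping in practice follows. It counts the backslashes in a Windows-style path, then uses PHP's preg_quote() function to escape a meta character automatically. The $path value and variable names are just examples.

```php
<?php
$path = 'c:\dir\file.php';

// four backslashes in the PHP string become two in the pattern,
// which the regex engine reads as one literal backslash
$count = preg_match_all("/\\\\/", $path, $matches);

// preg_quote() escapes meta characters for us, e.g. the + in 1+1
$quoted = preg_quote('1+1');
$ok = preg_match('/^' . $quoted . '/', '1+1=2');

echo $count, ' ', $quoted, ' ', $ok;
```

This should print 2 1\+1 1: two backslashes found, the escaped form of 1+1, and a successful match.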

What do the other Meta characters do?

We have already seen the caret ^ and the dollar $ in action, so now let's look at the others, beginning with the square brackets [ ]. These Meta characters are used for specifying a character class.
A what?
A Character Class. This is just a set of characters you wish to match. They can be listed individually like:
[abcdef]
or as a range separated by a - symbol like:
[a-f]
<?php
// create a string
$string = 'big';
// Search for a match
echo preg_match("/b[aoiu]g/", $string, $matches);
?>
The above code will echo 1. This is because preg_match() found a match. This pattern would also match the strings 'bag', 'bog' and 'bug', but not the string 'beg'. The character class range [a-f] is the same as [abcdef]. Think of it as [from a to f]. Once again, these are case sensitive, so [A-F] is not the same as [a-f].
Most meta characters do not work inside of classes, so you do not need to escape them within the [ and the ]. You could have the class:
[abcdef$]
This would match the characters a b c d e f $. The dollar ($) sign within the class is just a simple dollar sign with no special meaning.
Having established that most meta characters have no special meaning inside a class, we will now look at one that does.
The powerful nature of regex also allows us to match characters NOT within a set. To do this we use the caret (^) as the first character of the class. If the caret (^) appears anywhere else in the class, it is simply regarded as a caret with no special meaning. Here we will see how to match any character except b.
<?php
// create a string
$string = 'abcefghijklmnopqrstuvwxyz0123456789';
// match the first character that is not b
preg_match("/[^b]/", $string, $matches);
// loop through the matches with foreach
foreach($matches as $key=>$value)
    {
    echo $key.' -> '.$value;
    }
?>
From the code above, we get the result
0 -> a
What has happened is that the preg_match() function has found the first character that matches the pattern /[^b]/, that is, the first character that is not b. Let's make a small change to our script and this time use preg_match_all() to find every character in the string that matches the pattern /[^b]/.
<?php
// create a string
$string = 'abcefghijklmnopqrstuvwxyz0123456789';
// try to match all characters that are not b
preg_match_all("/[^b]/", $string, $matches);
// loop through the matches with foreach
foreach($matches[0] as $value)
    {
    echo $value;
    }
?>
As you can see from the results of the script above, it prints out all the characters of the string that are NOT "b"
acefghijklmnopqrstuvwxyz0123456789
If we were to take this one step further, we could use it to filter out all the numbers from a string.
<?php
// create a string
$string = 'abcefghijklmnopqrstuvwxyz0123456789';
// match any character that is not a number between 0 and 9
preg_match_all("/[^0-9]/", $string, $matches);
// loop through the matches with foreach
foreach($matches[0] as $value)
    {
    echo $value;
    }
?>
The above script will return the string:
abcefghijklmnopqrstuvwxyz
Let's see what has happened in the above script. We used preg_match_all() to match our pattern. The pattern used the caret (^) symbol inside the class [] to match all characters that are NOT in the range 0-9.
So, you can simply remember that the ^ (caret) means NOT when used as the first character inside a character class.

Still with us?

Ok, moving on, we come to the most used Metacharacter, the backslash(\).
The backslash(\) can be used in several ways with regex. The first we will deal with is escaping. The backslash(\) can be used to escape all Meta Characters, including itself, so you can match them in patterns. If our string looked like
"This is a [templateVar]"
and we wanted to find the square brackets around templateVar within the string, we could use the following code:
<?php
// create a string
$string = 'This is a [templateVar]';
// try to match our pattern
preg_match_all("/[\[\]]/", $string, $matches);
// loop through the matches with foreach
foreach($matches[0] as $value)
    {
    echo $value;
    }
?>
From the above snippet, we get the result
[]
This is because we have specified that we want to match every character that is either [ or ]. Without the backslashes the pattern would look like "/[[]]/", so we had to escape the [ and the ] characters we wanted to search for. Similarly, you must escape a backslash. If you have a string that looks like
c:\dir\file.php
you would need to use \\ in your pattern.
The backslash is also used to signal various special sequences.
The next meta character is the . dot, or full stop.
The dot matches any single character except a newline (\n by default). To make . match any character, including \n, you need to use the s modifier. First we will see how to use the . without the s modifier to match a single character.
<?php
// create a string
$string = 'sex at noon taxes';
// look for a match
echo preg_match("/s.x/", $string, $matches);
?>
As you can see, the result is 1. This is because preg_match() has found one match. This example would also match sax, six, sox, sux, and s x, but would not match stix.
Now let's see how to match a new line character. The dot will not do it by default, but the \s sequence, which matches any whitespace character, will. For our example we will use \n.
<?php
// create a string
$string = 'sex'."\n".'at'."\n".'noon'."\n".'taxes'."\n";
// echo the string
echo nl2br($string);
// look for a match
echo preg_match_all("/\s/", $string, $matches);
?>
The code above will return this:
sex
at
noon
taxes
4
First we echo out the string to see the new lines, then we see a 4 at the bottom. This is because preg_match_all() found four matches for the new line character \n when we used the \s sequence. Do not confuse the \s sequence with the s modifier. More on both later, in the sections on Special Sequences and on Modifiers.
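Since the s modifier and the \s sequence are easy to mix up, here is a minimal sketch that runs both against the same two-line string. The string "a\nb" and the variable names are chosen purely for illustration.

```php
<?php
$string = "a\nb";

// by default the dot does not match \n
$default = preg_match('/a.b/', $string);

// with the s modifier the dot matches \n as well
$dotall = preg_match('/a.b/s', $string);

// the \s sequence matches \n with no modifier at all
$space = preg_match('/a\sb/', $string);

echo $default, ' ', $dotall, ' ', $space;
```

This should print 0 1 1.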
Next in the meta characters is the asterisk * character. This meta character matches zero or more occurrences of the character immediately before it. This means that the character may or may not be there. So the code .* would match any number of any characters. eg:
<?php
// create a string
$string = 'php';
// look for a match
echo preg_match("/ph*p/", $string, $matches);
?>
Again we see the result is 1 as we have found 1 match. In the above example its match was with one h character. This same regex would match pp (zero h characters), and phhhp (3 h characters).
This of course brings us to the + meta character. This behaves in a similar manner to the asterisk *, with the exception that the + matches one or more times where the asterisk * matches zero or more times. As we saw in the previous example, the asterisk * would match the string 'pp'. However, the + meta character would not. Consider this code:
<?php
// create a string
$string = 'pp';
// look for a match
echo preg_match("/ph+p/", $string, $matches);
?>
The above script will echo 0. This is because the h character did not appear one or more times in the string.
Our next meta character is the ?. This character acts like the preceding meta characters, except the ? will match 0 or 1 occurrence of the character or regular expression immediately preceding it. In the following code, we will see how this can be helpful if we wish to make something optional, like the hyphen in a phone number, which in Australia is formatted 1234-5678.
<?php
// create a string
$string = '12345678';
// look for a match
echo preg_match("/1234-?5678/", $string, $matches);
?>
The above code will echo 1. This is because the ? character has matched the - (hyphen) character zero times. Changing the string to 1234-5678 would yield the same result, as seen below.
<?php
// create a string
$string = '1234-5678';
// look for a match
echo preg_match("/1234-?5678/", $string, $matches);
?>
Next we have the curly braces or the { } meta characters. These simply match a specific number of instances of the preceding character or range of characters. Here we will match the letters PHP which must be followed by exactly 3 digits.
<?php
// create a string
$string = 'PHP123';
// look for a match
echo preg_match("/PHP[0-9]{3}/", $string, $matches);
?>
As we can see, our pattern PHP, then [0-9] (any digit from 0 to 9), then {3} (three times) has matched because this is the format of our string.
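One thing worth trying here: on its own, {3} only says that three digits must be present, not that a fourth may not follow. The sketch below anchors the pattern with ^ and \z to insist on exactly three, and also shows the {min,max} form of the curly braces, which is not used elsewhere in this tutorial. All variable names are just for the example.

```php
<?php
// exactly three digits, nothing before or after
$exact   = preg_match('/^PHP[0-9]{3}\z/', 'PHP123');

// four digits fail the anchored pattern
$tooMany = preg_match('/^PHP[0-9]{3}\z/', 'PHP1234');

// {2,4} allows between two and four digits
$range   = preg_match('/^PHP[0-9]{2,4}\z/', 'PHP1234');

// without anchors, the first three of the four digits are enough
$loose   = preg_match('/PHP[0-9]{3}/', 'PHP1234');

echo $exact, $tooMany, $range, $loose;
```

This should print 1011.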

Special Sequences


The backslash (\) is also used for sequences. What are sequences?
Sequences are predefined sets of characters you can use in a pattern, either on their own or inside a class.
  • \d - Matches any numeric character - same as [0-9]
  • \D - Matches any non-numeric character - same as [^0-9]
  • \s - Matches any whitespace character - same as [ \t\n\r\f\v]
  • \S - Matches any non-whitespace character - same as [^ \t\n\r\f\v]
  • \w - Matches any alphanumeric character - same as [a-zA-Z0-9_]
  • \W - Matches any non-alphanumeric character - same as [^a-zA-Z0-9_]
So, with this in mind, we could use these as short-cuts in our patterns to reduce the length of our code. See if you can tell what the following script does:
<?php
// create a string
$string = 'ab-ce*fg@ hi & jkl(mnopqr)stu+vw?x yz0>1234<567890';
// match our pattern containing a special sequence
preg_match_all("/[\w]/", $string, $matches);
// loop through the matches with foreach
foreach($matches[0] as $value)
    {
    echo $value;
    }
?>
Well, let's see what we have done. We have matched all (preg_match_all) characters within the class ( [] ) that are alphanumeric or an underscore (\w). So the resultant output here is:
abcefghijklmnopqrstuvwxyz01234567890
This can be useful for stripping nasty spaces or nasty characters from usernames etc.
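As a rough illustration of that idea, the sketch below keeps only the \w characters of a made-up username by gluing the matches back together with implode(). The username and variable names are invented for the example.

```php
<?php
// a hypothetical username with some unwanted characters
$username = 'j@ne doe!_99';

// collect every alphanumeric or underscore character
preg_match_all('/\w/', $username, $matches);

// join the single-character matches back into one string
$clean = implode('', $matches[0]);

echo $clean;
```

This should print jnedoe_99. Note that the underscore survives, because \w includes it.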
Using this same method, we could check whether a string begins with a number:
<?php
// create a string
$string = '2 bad for perl';
// check whether the string begins with a digit
if(preg_match("/^\d/", $string))
    {
    echo 'String begins with a number';
    }
else
    {
    echo 'String does not begin with a number';
    }
?>
Let's return to the full stop (.) meta character. A full stop (.) is used to match any character (except a new line) one time only. This can be good if you wish to check that a string contains any character at all.
<?php
// create a string
$string = 'abcdefghijklmnopqrstuvwxyz0123456789';
// try to match any character
if(preg_match("/./", $string))
    {
    echo 'The string contains at least one character';
    }
else
    {
    echo 'String does not contain anything';
    }
?>
Of course, this string contains at least one character. Other uses for the full stop (.) could be to check if any character is before a number. Think up some of your own uses and try them.
Earlier we had a look at meta characters and we had a problem of matching a new line character, because the . meta character does not match a new line such as \n. Here we can use the \s sequence to match any whitespace character. For our example we will use \n.
<?php
// create a string
$string = 'sex'."\n".'at'."\n".'noon'."\n".'taxes'."\n";
// echo the string
echo nl2br($string);
// look for a match
echo preg_match_all("/\s/", $string, $matches);
?>
The code above will return this:
sex
at
noon
taxes
4
First we echo out the string to see the new lines, then we see a 4 at the bottom. This is because preg_match_all() found four matches for the new line character \n when we used the \s sequence. More on modifiers later.

Putting it together


If you have come this far you now have the building blocks to match complex patterns. As we progress here, you will see how we use these building blocks in combination with each other. To move on, let's begin with the OR operator. The OR operator, as we saw in the list of meta characters, is the pipe character |. Let's use it in a simple "Hello World" script.

<?php
// a simple string
$string = "This is a Hello World script";
// try to match the patterns This OR That OR There
echo preg_match("/^(This|That|There)/", $string);
?>
Now we can start to see some of the flexibility PHP's regular expressions allow. Let's now try to match Hello OR Jello in our string.

<?php
// a simple string
$string = "This is a Hello World script";
// try to match the patterns Jello or Hello
if(!preg_match("/(Je|He)llo/", $string))
    {
    echo 'Pattern not found';
    }
else
    {
    echo 'pattern found';
    }
?>
This works well and we can see that the pattern matched. But it does not show which of the alternatives matched. To see that, we can make use of preg_match()'s ability to hold results. The addition of a third argument gives us an array of matches. Consider this small addition to the code above.

<?php
// a simple string
$string = "This is a Hello World script";
// try to match the patterns Jello or Hello
// put the matches in a variable called matches
preg_match("/(Je|He)llo/", $string, $matches);
// loop through the array of matches and print them
foreach($matches as $key=>$value)
    {
    echo $key.'->'.$value.'<br />';
    }
?>
The above code gives us a result that looks like this..
  • 0->Hello
  • 1->He
$matches[0] contains the text that matched the full pattern, ie. Hello.
$matches[1] has the text that matched the first captured subpattern.
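The same numbering carries on when a pattern has more than one set of parentheses. As a small sketch, the pattern below adds a second, made-up alternative group (World|There) to the example above, so that $matches gains a third entry.

```php
<?php
$string = "This is a Hello World script";

// two captured subpatterns: (Je|He) and (World|There)
preg_match('/(Je|He)llo (World|There)/', $string, $matches);

// [0] is the full match, [1] and [2] are the groups in order
foreach ($matches as $key => $value) {
    echo $key . '->' . $value . "\n";
}
```

This should list 0->Hello World, 1->He and 2->World.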

Modifiers and Assertions

As we saw earlier in this tutorial, we were able to create a regular expression that was case insensitive by the use of /i. This is a modifier, and is one of many used in regular expressions to modify the behaviour of the pattern matching. Here is a list of regex modifiers and assertions used in PHP.

Modifiers

  • i - Ignore Case, case insensitive
  • U - Make search ungreedy
  • s - Includes New line
  • m - Multiple lines
  • x - Extended for comments and whitespace
  • e - Enables evaluation of replacement as PHP code (preg_replace only; deprecated in PHP 5.5 and removed in PHP 7.0)
  • S - Extra analysis of pattern

Assertions

  • \b - Word boundary
  • \B - Not a word boundary
  • \A - Start of subject
  • \Z - End of subject or newline at end
  • \z - End of subject
  • \G - First matching position in subject
As you can see from the above list, there are many ways to modify your regular expressions. Let's look at them one by one, beginning with the earlier example of using i.
<?php
// create a string
$string = 'abcdefghijklmnopqrstuvwxyz0123456789';
// try to match our pattern
if(preg_match("/^ABC/i", $string))
{
// echo this if it matches
echo 'The string begins with abc';
}
else
{
// if no match is found echo this line
echo 'No match found';
}
?>
If you have read the earlier part of this tutorial, it will be no surprise that "ABC" matched abc, because the case insensitive modifier lets the pattern match either abc or ABC. Moving on from the i modifier we have the s modifier. The s modifier makes the dot (.) match new line characters as well. This was touched on earlier in this tutorial also, but will be reproduced here once again for our country listeners. First we match without the s modifier.
<?php
/*** create a string with new line characters ***/
$string = 'sex'."\n".'at'."\n".'noon'."\n".'taxes'."\n";

/*** look for a match ***/
echo preg_match("/sex.at.noon/", $string, $matches);
?>
The above pattern will provide no matches, and the script echoes 0, because the dot (.) does not match a new line. With the addition of the s modifier we can match the new lines.
<?php
/*** create a string with new line characters ***/
$string = 'sex'."\n".'at'."\n".'noon'."\n".'taxes'."\n";

/*** look for a match using s modifier ***/
echo preg_match("/sex.at.noon/s", $string, $matches);
?>
The above code will echo 1, as each dot in the pattern has now matched a new line character.
Having extended our string variable to include new lines brings us to our next modifier, the m modifier. By default, PCRE treats the subject string as a single line, even if it contains several new lines: the caret (^) matches only at the very start of the string, and the dollar ($) only at the very end or just before a final new line. With the m modifier, ^ and $ also match immediately after and immediately before every new line character inside the string. So, if there are no new line characters in the string, or no ^ or $ in the pattern, this modifier does nothing.

<?php
// create a string
$string = 'sex'."\n".'at'."\n".'noon'."\n".'taxes'."\n";
// look for a match
if(preg_match("/^noon/im", $string))
{
echo 'Pattern Found';
}
else
{
echo 'Pattern not found';
}
?>
Of course, now with the m modifier the regex will match. Try without the m modifier and the pattern will not be found, because the caret then matches only at the very beginning of the string, and the string begins with sex, not noon.
In the above example we have used im for the modifiers. This is to show we can use multiple modifiers for whatever purpose we require. The code above would have matched NOON or Noon also, as we used the i modifier, which we have seen makes the regex case insensitive.
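To make the effect of m on both ^ and $ visible, here is a minimal sketch that counts whole-line matches in the same four-line string, first without and then with the modifier. The pattern /^\w+$/ and the variable names are just for this illustration.

```php
<?php
$string = "sex\nat\nnoon\ntaxes\n";

// without m: ^ and $ only apply to the string as a whole
$withoutM = preg_match_all('/^\w+$/', $string, $single);

// with m: ^ and $ apply at the start and end of every line
$withM = preg_match_all('/^\w+$/m', $string, $lines);

echo $withoutM, ' ', $withM, ' ', implode(',', $lines[0]);
```

This should print 0 4 sex,at,noon,taxes.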
The x modifier allows us to put our regex on several lines, thus making long and complex regular expressions easier to read and debug, and allowing for comments within the regex itself. Let's consider our previous example above, add some comments, and put the regex on multiple lines.

<?php
// create a string
$string = 'sex'."\n".'at'."\n".'noon'."\n".'taxes'."\n";
// create our regex using comments and store the regex
// in a variable to be used with preg_match
$regex = '/   # opening delimiter
^     # caret means beginning of a line
noon  # the pattern to match
/imx';
// look for a match
if(preg_match($regex, $string))
    {
    echo 'Pattern Found';
    }
else
    {
    echo 'Pattern not found';
    }
?>
Well, this particular regex did not need much explanation, but it demonstrates how to insert comments over multiple lines to document your code. Let's move on to some of the other modifiers. The e modifier is a special modifier that evaluates the replacement as PHP code. We have dedicated a special section just for it, so let's move on to the S modifier. This modifier asks for some extra analysis of the pattern before matching, which can help with patterns that are not anchored, that is, patterns that do not have a single fixed starting position, like..

<?php /*** no fixed starting position ***/ preg_match("/abc(.*?)hij/", $string); ?>
If a pattern is to be used more than once, the analysis may speed up the time taken doing multiple matches. In previous examples we have used non-anchored patterns, and one instance is repeated here with the addition of the S modifier.
<?php
// create a string
$string = 'ab-ce*fg@ hi & jkl(mnopqr)stu+vw?x yz0>1234<567890';
// match our pattern containing a special sequence
preg_match_all("/[\w]/S", $string, $matches);
// loop through the matches with foreach
foreach($matches[0] as $value)
    {
    echo $value;
    }
?>
Next we move on to word boundaries. A word boundary is asserted with \b. Placing a pattern between two \b assertions works like a pair of "bookends" that let us specify exactly what must be matched: the text between the "bookends" must be a whole word. The following two scripts demonstrate this. eg: a pattern for "cat" will not match "catalog", but let's see it in practice.

<?php
/*** a simple string ***/
$string = 'eregi will not be available in PHP 6';
/*** here we will try match the string "lab" ***/
if(preg_match("/\blab\b/i", $string))
    {
    /*** if we get a match ***/
    echo $string;
    }
else
    {
    /*** if no match is found ***/
    echo 'No match found';
    }
?>
From the code above, we see we are trying to match the pattern "lab", which does occur inside the string in the word "available". But because we have used word boundaries, it does not match. Let's try again but have the word "lab" on its own.

<?php
/*** a simple string ***/
$string = 'eregi will remain in the computer lab';
/*** here we will try match the string "lab" ***/
if(preg_match("/\blab\b/i", $string))
    {
    /*** if we get a match ***/
    echo $string;
    }
else
    {
    /*** if no match is found ***/
    echo 'No match found';
    }
?>
Now we see from the above code that we have a match, and that the code echoes the string. This is because the pattern is matched within the word boundary. That is, \blab\b.
If the \b occurs within a character class like [\b], then it matches a single backspace character and not a word boundary.
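To put the "cat" versus "catalog" remark into code, here is a quick sketch that counts matches with and without the \b bookends. The sentence and variable names are invented for the example.

```php
<?php
$string = 'The cat sat on the catalog; concat is not a Cat.';

// whole words only: "cat" and "Cat"
$whole = preg_match_all('/\bcat\b/i', $string, $wholeWords);

// anywhere: also inside "catalog" and "concat"
$any = preg_match_all('/cat/i', $string, $anywhere);

echo $whole, ' ', $any;
```

This should print 2 4.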
This brings us to the next assertion, \B. It is related to the previous one, but rather than stipulate a word boundary, it stipulates a non-word boundary. This can be useful if you wish to do something like match text that is contained within a word but not at the start or end of the word. Consider this code.

<?php
/*** a little string ***/
$string = 'This lathe turns wood.';
/*** match word boundary and non-word boundary ***/
if(preg_match("/\Bthe\b/", $string))
    {
    /*** if we match the pattern ***/
    echo 'Matched the pattern "the".';
    }
else
    {
    /*** if no match is found ***/
    echo 'No match found';
    }
?>
When the code above is run, it will find a match for the pattern "the". This is because we have used a non-word boundary to specify that "the" must have at least one other word character directly before the "t", and a word boundary to specify that the word must end with "the". The word lathe fits. Let's change the string and try again.

<?php
/*** a little string ***/
$string = 'The quick brown fox jumps over the lazy dog.';
/*** match word boundary and non-word boundary ***/
if(preg_match("/\Bthe\b/", $string))
    {
    /*** if we match the pattern ***/
    echo 'Matched the pattern "the".';
    }
else
    {
    /*** if no match is found ***/
    echo 'No match found';
    }
?>
This time we find that No match found is echoed. This is because the only lower case "the" in the string is a word on its own, so there is no word character directly before the "t".
The final modifier we will look at is the U modifier. By default, PCRE is greedy. Not that it will eat your last biscuit, but that each quantifier will try to match as much as it can, unless we tell it not to, which can cause a lot of backtracking. The more backtracking, the slower the matching.
This means that .* will try to match every character (except new line) all the way to the end of the string, and will then work backward until the rest of the pattern matches. To make a quantifier miserly, or non-greedy, you follow it with the quantifier limiter ?. This tells PCRE to match as few as possible of the preceding symbol before continuing to the next part of the pattern. Alternatively, the U modifier makes every quantifier in our regex non-greedy.
Unless you have a strong understanding of regular expressions, it is not advisable to switch the default behavior, as this can often lead to confusion. In previous examples we have seen non-greedy regular expressions using (.*?); here we will use the U modifier.

<?php
/*** a simple string ***/
$string = 'foobar foo--bar fubar';
/*** try to match the pattern ***/
if(preg_match("/foo(.*)bar/U", $string))
    {
    echo 'Match found';
    }
else
    {
    echo 'No match found';
    }
?>
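The script above only tells us that a match was found, which is true whether the search is greedy or not. To see the actual difference, the sketch below captures what (.*) swallowed in each case. The variable names are illustrative only.

```php
<?php
$string = 'foobar foo--bar fubar';

// greedy: .* runs to the last "bar" in the string
preg_match('/foo(.*)bar/', $string, $greedy);

// non-greedy with the ? limiter: stops at the first "bar"
preg_match('/foo(.*?)bar/', $string, $lazy);

// non-greedy with the U modifier: same result as the ? limiter
preg_match('/foo(.*)bar/U', $string, $ungreedy);

echo '[', $greedy[1], '] [', $lazy[1], '] [', $ungreedy[1], ']';
```

This should print [bar foo--bar fu] [] []. The greedy version grabs everything up to the final bar, while both non-greedy versions settle for the empty string between foo and bar in foobar.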

Evaluation with preg_replace.


If we have grasped the above we can move on to the e modifier. This modifier evaluates the replacement argument as PHP code when passed to preg_replace(). We have not touched on preg_replace() yet, so a quick demonstration here will get us in the swing.

<?php
// create a string
$string = 'We will replace the word foo';
// substitute the word foo and put in bar
$string = preg_replace("/foo/", 'bar', $string);
// echo the new string
echo $string;
?>
You can see this does a simple replacement, and in the real world str_replace() would be a lot faster, but this example leads us nicely into the next. One of the strengths of preg_replace() is that it will take an array in the same way as str_replace() does.

<?php
// create a string with some template vars. the string and
// the vars would easily have been called from a template.
$string = 'This is the {_FOO_} bought to you by {_BAR_}';
// create an array of template vars
// of course, each variable could be an array itself
$template_vars = array("FOO" => "The PHP Way", "BAR" => "PHPro.orG");
// preg replace our variables and evaluate them
// note: the e modifier was deprecated in PHP 5.5 and removed in PHP 7.0
$string = preg_replace("/{_(.*?)_}/ime", "\$template_vars['$1']", $string);
// echo the new string
echo $string;
?>
Without the e modifier this code would echo
This is the $template_vars['FOO'] bought to you by $template_vars['BAR']
The e modifier has evaluated the replacement as PHP code, so each placeholder is replaced with the matching value from $template_vars. Totally cool and now you begin to see just how bloated most template systems are.
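The e modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so on a current PHP the template example needs another route. One possible substitute, sketched here and not part of the original technique, is preg_replace_callback(), which hands each match to a function instead of evaluating a string. The fallback of leaving unknown placeholders untouched is an assumption made for this example.

```php
<?php
$string = 'This is the {_FOO_} bought to you by {_BAR_}';
$template_vars = array("FOO" => "The PHP Way", "BAR" => "PHPro.orG");

$result = preg_replace_callback(
    '/\{_(.*?)_\}/',
    function ($m) use ($template_vars) {
        // $m[1] is the placeholder name captured by (.*?);
        // unknown placeholders are left as they were
        return isset($template_vars[$m[1]]) ? $template_vars[$m[1]] : $m[0];
    },
    $string
);

echo $result;
```

This should print This is the The PHP Way bought to you by PHPro.orG, the same text the e modifier version was aiming for.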

Look Aheads


A look ahead does exactly what the name suggests: it looks ahead for a pattern to match. Look aheads come in two flavours, negative and positive. Let's first look at the negative look ahead, which is denoted with (?!...). This is useful for checking that a pattern does not come directly after the text we wish to match. Let's take a look at a simple example.

<?php
/*** a simple string ***/
$string = 'I live in the whitehouse';
/*** try to match white not followed by house ***/
if(preg_match("/white(?!house)/i", $string))
    {
    /*** if we find the word white, not followed by house ***/
    echo 'Found a match';
    }
else
    {
    /*** if no match is found ***/
    echo 'No match found';
    }
?>
As you can see, no match is found. This is because the word white is followed by house. Let's change the string a little and run through again.

<?php
/*** a simple string ***/
$string = 'I live in the white house';
/*** try to match white not followed by house ***/
if(preg_match("/white(?!house)/i", $string))
    {
    /*** if we find the word white, not followed by house ***/
    echo 'Found a match';
    }
else
    {
    /*** if no match is found ***/
    echo 'No match found';
    }
?>
Now we see that we have a match, as the word white is not followed immediately by house as it is in whitehouse. Let's move on to a positive lookahead. This is denoted by (?=...) and looks ahead to see if a pattern is there. Let's see it in action.

<?php
/*** a string ***/
$string = 'This is an example eg: foo';
/*** try to match eg followed by a colon ***/
if(preg_match("/eg+(?=:)/", $string, $match))
    {
    print_r($match);
    }
else
    {
    echo 'No match found';
    }
?>
The above code is looking for the pattern eg followed by a colon, and so returns a match of eg. The colon itself is not part of the match. But what if we wanted to match a pattern only when something specific comes before it, such as house when it comes directly after white? For this, we need a lookbehind.
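
Before moving on, it may help to see that a lookahead only checks what follows and never consumes it. The sketch below reuses the string from the last example with preg_replace(); the replacement text 'e.g.' is just an illustration.

```php
<?php
$string = 'This is an example eg: foo';

// Replace eg only where a colon follows; the colon is asserted, not consumed
$result = preg_replace('/eg(?=:)/', 'e.g.', $string);

echo $result; // This is an example e.g.: foo
```

The colon survives the replacement because it was never part of the match.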

Look Behinds


Once again, the name says it all: a lookbehind looks behind to see if it can match a pattern. And like the lookaheads, there are positive lookbehinds and negative lookbehinds. A positive lookbehind looks like ?<=. Let's try it on a string from an earlier example in the Look Aheads section.

<?php
/*** a simple string ***/
$string = 'I live in the whitehouse';
/*** try to match house preceded by white ***/
if(preg_match("/(?<=white)house/i", $string))
    {
    /*** if we find the word house, preceded by white ***/
    echo 'Found a match';
    }
else
    {
    /*** if no match is found ***/
    echo 'No match found';
    }
?>
Here we find a match, as we have found the pattern house immediately preceded by the pattern white. The regex engine has looked behind house and completed the match. But what if we wanted to be sure the pattern was NOT preceded by the word white? This is where we would use a negative lookbehind. A negative lookbehind is denoted like this: ?<!. Consider the following code.

<?php
/*** a simple string ***/
$string = 'I live in the whitehouse';
/*** try to match house not preceded by white ***/
if(preg_match("/(?<!white)house/i", $string))
    {
    /*** if we find the word house, not preceded by white ***/
    echo 'Found a match';
    }
else
    {
    /*** if no match is found ***/
    echo 'No match found';
    }
?>
As you can see from running this code, no match is found. This is because the negative lookbehind only lets the pattern "house" match when the pattern "white" is not directly before it, and here it is. Let's change the colour of the house, white seems much too virginal for this govt office.

<?php
/*** a simple string ***/
$string = 'I live in the bluehouse';
/*** try to match house not preceded by white ***/
if(preg_match("/(?<!white)house/i", $string))
    {
    /*** if we find the word house, not preceded by white ***/
    echo 'Found a match';
    }
else
    {
    /*** if no match is found ***/
    echo 'No match found';
    }
?>
A slight modification of the string by changing whitehouse to bluehouse sees that we have a match because now the pattern "house" is not preceded by the pattern white. The regex engine has looked behind the pattern "house" and does not find the pattern "white", so all is well.
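
Lookbehinds and lookaheads can also be combined in one pattern. As a small sketch (the input string here is hypothetical), the code below picks out the digits that sit between a dollar sign and a period, without including either in the match.

```php
<?php
// Hypothetical input string
$string = 'Total: $42.50 due';

// Match digits preceded by a dollar sign and followed by a period
preg_match('/(?<=\$)\d+(?=\.)/', $string, $match);

echo $match[0]; // 42
```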

Make Search Un-greedy

By default, PHP regular expressions are greedy. This means quantifiers such as *, + and ? will consume as many characters as are available. Let's look at them again.
  • * - zero or more characters, same as {0,}
  • + - 1 or more characters, same as {1,}
  • ? - zero or one character, same as {0,1}
Consider the following regex

<?php
/*** 4 x and 4 z chars ***/
$string = "xxxxzzzz";
/*** greedy regex ***/
preg_match("/^(.*)(z+)$/", $string, $matches);
/*** results ***/
echo $matches[1];
echo "<br />";
echo $matches[2];
?>
The first pattern (.*) has matched all four "x" characters and three of the four "z" characters. It has matched greedily, taking as many as it possibly can while still leaving one "z" for (z+). It is a simple matter to make these quantifiers ungreedy by adding a ? after the quantifier, as in the following code.

<?php
/*** string of characters ***/
$string = "xxxxzzzz";
/*** a non greedy match ***/
preg_match("/^(.*?)(z+)$/", $string, $matches);
/*** show the matches ***/
echo $matches[1];
echo "<br />";
echo $matches[2];
?>
Now $matches[1] contains four "x" characters and $matches[2] contains four "z" characters. This is because the ? quantifier has changed the behaviour from matching as MANY characters as possible, to as FEW characters as possible.
Of course, this behaviour reversal would be rather tedious if you had long and complex pattern matches. To counter this the U modifier can be used to make the regular expression ungreedy. The code below shows the way.

<?php
/*** string of characters ***/
$string = "xxxxzzzz";
/*** a non greedy match ***/
preg_match("/^(.*)(z+)$/U", $string, $matches);
/*** show the matches ***/
echo $matches[1];
echo "<br />";
echo $matches[2];
?>
The pattern is the same as used in the first greedy example, and with the use of the U modifier the match has become ungreedy. The results are that $matches[1] contains four "x" characters and $matches[2] contains four "z" characters.

Gotcha

It is important to note that the U modifier does not merely make the search ungreedy, it reverses the behaviour of the greediness. If a ? were used in conjunction with the U modifier, the non-greedy status of the ? would be reversed, as in this code.

<?php
/*** string of characters ***/
$string = "xxxxzzzz";
/*** U reverses the non greedy ?, so this match is greedy again ***/
preg_match("/^(.*?)(z+)$/U", $string, $matches);
/*** show the matches ***/
echo $matches[1];
echo "<br />";
echo $matches[2];
?>
Now the non-greedy status of the pattern has been reversed with the use of the U modifier and the above code will produce
xxxxzzz
z
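
Greediness matters most when a delimiter can appear more than once in the subject. As a rough illustration (the quoted sentence is a made-up example), compare what a greedy and a non-greedy group capture between double quotes.

```php
<?php
// Hypothetical string containing two quoted words
$string = 'He said "yes" and then "no"';

// Greedy: .* runs on to the last double quote in the string
preg_match('/"(.*)"/', $string, $greedy);

// Non-greedy: .*? stops at the first closing double quote
preg_match('/"(.*?)"/', $string, $lazy);

echo $greedy[1]; // yes" and then "no
echo "\n";
echo $lazy[1];   // yes
```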

Delimiters

This tutorial has seen many regular expressions, and the forward slash / has been used to delimit them. Sometimes the pattern itself needs to match a forward slash. This can be escaped, but if there are many forward slashes, such as in a URL, this can become quite ugly and hard to read. Rather than polluting the pattern, another delimiter can be used to hold the pattern. The following example uses the hash # character as a delimiter.

<?php
/*** get the host name from a url ***/
preg_match('#^(?:http://)?([^/]+)#i', "http://www.phpro.org/tutorials", $matches);
/*** show the host name ***/
echo $matches[1];
?>
Many other characters can be used as delimiters: any character that is not alphanumeric, a backslash, or whitespace will do. Some of these are shown here.
  • /
  • @
  • #
  • `
  • ~
  • %
  • &
  • '
  • "
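
When literal text is dropped into a pattern, preg_quote() can escape the meta characters for you, and its optional second argument escapes the chosen delimiter as well. The sketch below assumes a made-up URL with a query string; the variable names are illustrative only.

```php
<?php
// Hypothetical literal text to search for; note the . and ? meta characters
$needle  = 'http://www.phpro.org/tutorials?id=1';
$subject = 'http://www.phpro.org/tutorials?id=1&page=2';

// Unescaped, the ? makes the "s" optional instead of matching a question mark
$unescaped = preg_match('#^' . $needle . '#i', $subject);

// preg_quote() escapes the meta characters and, via its second argument, the # delimiter
$escaped = preg_match('#^' . preg_quote($needle, '#') . '#i', $subject);

echo $unescaped; // 0
echo $escaped;   // 1
```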

Credits

Thanks to Arpad Ray and Alex Topal for code and typo corrections throughout this tutorial.

Examples


Because PHPro is an amazing place, we could not finish this without a little revision and examples to refresh you and help you on your way. Let's start small and work our way up.
Let's match the beginning and end of a string, or a pattern that occurs anywhere within a string.

<?php
// the string to match against
$string = 'The cat sat on the mat';
// match the beginning of the string
echo preg_match("/^The/", $string); // returns 1
// match the end of the string
echo preg_match("/mat\z/", $string); // returns 1

// match anywhere in the string
echo preg_match("/dog/", $string); // returns 0 as no match was found for dog.
?>
Ok, that was easy, let's move on to matching more than a single pattern.

<?php
// the string to match against
$string = 'The cat sat on the matthew';
// matches the letter "a" followed by zero or more "t" characters
echo preg_match("/at*/", $string);
// matches the letter "a" followed by a "t" character that may or may not be present
echo preg_match("/at?/", $string);
// matches the letter "a" followed by one or more "t" characters
echo preg_match("/at+/", $string);
// matches a possible letter "e" followed by one or more "w" characters anchored to the end of the string
echo preg_match("/e?w+\z/", $string);
// matches the letter "a" followed by exactly two "t" characters
echo preg_match("/at{2}/", $string);
// matches a possible letter "e" followed by exactly two "t" characters
echo preg_match("/e?t{2}/", $string);
// matches the letter "a" followed by 2 to 6 "t" characters (att attt atttttt)
echo preg_match("/at{2,6}/", $string);
?>
Remember, these will only return 0 or 1, as preg_match() will stop looking after the first match. To match all the matches in a string you would use preg_match_all().
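
As a quick sketch of that last point, preg_match_all() returns the number of matches and fills the array with all of them. The pattern below, which finds whole words ending in "at", is just an illustration.

```php
<?php
// the string to match against
$string = 'The cat sat on the mat';

// find every whole word that ends in "at"
$count = preg_match_all('/\b\w*at\b/', $string, $matches);

echo $count;          // 3
print_r($matches[0]); // cat, sat and mat
```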

Cheat Sheet

Special Sequences

  • \w - Any “word” character (a-z A-Z 0-9 _)
  • \W - Any non “word” character
  • \s - Whitespace (space, tab, CR, LF)
  • \S - Any non whitespace character
  • \d - Digits (0-9)
  • \D - Any non digit character
  • . - (Period) - Any character except newline

Meta Characters

  • ^ - Start of subject (or line in multiline mode)
  • $ - End of subject (or line in multiline mode)
  • [ - Start character class definition
  • ] - End character class definition
  • | - Alternates, eg (a|b) matches a or b
  • ( - Start subpattern
  • ) - End subpattern
  • \ - Escape character

Quantifiers

  • n* - Zero or more of n
  • n+ - One or more of n
  • n? - Zero or one occurrences of n
  • {n} - n occurrences exactly
  • {n,} - At least n occurrences
  • {n,m} - Between n and m occurrences (inclusive)

Pattern Modifiers

  • i - Case Insensitive
  • m - Multiline mode - ^ and $ match start and end of lines
  • s - Dotall - . class includes newline
  • x - Extended - whitespace in the pattern is ignored and # comments are allowed
  • e - preg_replace only - enables evaluation of replacement as PHP code (deprecated in PHP 5.5, removed in PHP 7.0)
  • S - Extra analysis of pattern
  • U - Pattern is ungreedy
  • u - Pattern is treated as UTF-8
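
Two of these modifiers are easier to see than to describe. The sketch below is only an illustration with made-up inputs: x lets a pattern be spread over several commented lines, and m lets ^ and $ match at every line.

```php
<?php
// x modifier: whitespace is ignored and # starts a comment
$pattern = '/
    ^(\d{4})    # four digit year
    -(\d{2})    # two digit month
    -(\d{2})$   # two digit day
    /x';
preg_match($pattern, '2011-01-07', $date);
echo $date[1]; // 2011

// m modifier: ^ and $ match at the start and end of each line
$lines = "cat\nsat\nmat";
$without = preg_match_all('/^\w+$/', $lines, $ignored);
$with    = preg_match_all('/^\w+$/m', $lines, $words);
echo $without; // 0
echo $with;    // 3
```

Note that a comment inside an x pattern must not contain the delimiter character itself, since PHP would treat it as the end of the pattern.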

Point based assertions

  • \b - Word boundary
  • \B - Not a word boundary
  • \A - Start of subject
  • \Z - End of subject or newline at end
  • \z - End of subject
  • \G - First matching position in subject
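
The difference between $, \Z and \z only shows when the subject ends with a newline. A minimal sketch, using a made-up subject string:

```php
<?php
// a subject that ends with a newline
$string = "mat\n";

// $ and \Z both match just before a final newline, \z only at the very end
$dollar = preg_match('/mat$/', $string);
$bigZ   = preg_match('/mat\Z/', $string);
$smallZ = preg_match('/mat\z/', $string);

echo $dollar; // 1
echo $bigZ;   // 1
echo $smallZ; // 0
```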

Assertions

  • (?=) - Positive look ahead assertion foo(?=bar) matches foo when followed by bar
  • (?!) - Negative look ahead assertion foo(?!bar) matches foo when not followed by bar
  • (?<=) - Positive look behind assertion (?<=foo)bar matches bar when preceded by foo
  • (?<!) - Negative look behind assertion (?<!foo)bar matches bar when not preceded by foo
  • (?>) - Once-only subpatterns (?>\d+)bar Performance enhancing when bar not present
  • (?(x)) - Conditional subpatterns (?(3)foo|fu)bar matches foo if the 3rd subpattern has matched, fu if not
  • (?#) - Comment (?# Pattern does x y or z)
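
Conditional subpatterns are the least obvious entry above, so here is a small sketch. The pattern and the candidate strings are illustrative assumptions: a closing > is required only if an opening < was captured by the first subpattern.

```php
<?php
// require a closing > only when the optional opening < (subpattern 1) matched
$pattern = '/^(<)?\w+(?(1)>)$/';

$results = array();
foreach (array('<tag>', 'tag', '<tag', 'tag>') as $candidate) {
    $results[$candidate] = preg_match($pattern, $candidate);
    echo $candidate, ' => ', $results[$candidate], "\n";
}
// <tag> and tag give 1, <tag and tag> give 0
```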
