22.4 Using Regular Expressions in PHP Scripts

The basic function for executing regular expressions is ereg. This function evaluates a string against a regular expression, returning TRUE if the pattern described by the regular expression appears in the string. In this minimal form, you can check that a string conforms to a given pattern. For example, you can ensure that a U.S. postal ZIP code is in the proper form of five digits, optionally followed by a dash and four more digits. Listing 22.2 demonstrates this idea; Figure 22.2 shows the output.

Listing 22.2 Checking a ZIP code
<?php
/*
** Check a ZIP code
** This script will test a zip code, which
** must be five digits, optionally followed by
** a dash and four digits.
*/

/*
** if zip submitted evaluate it
*/
if(isset($_REQUEST['zip']))
{
    if(ereg("^([0-9]{5})(-[0-9]{4})?$", $_REQUEST['zip']))
    {
        print("{$_REQUEST['zip']} is a valid ZIP code.<br>\n");
    }
    else
    {
        print("{$_REQUEST['zip']} is <b>not</b> " .
            "a valid ZIP code.<br>\n");
    }
}

//start form
print("<form action=\"{$_SERVER['PHP_SELF']}\">\n");
print("<input type=\"text\" name=\"zip\">\n");
print("<input type=\"submit\">\n");
print("</form>\n");
?>
Figure 22.2. Output from Listing 22.2.
The script offers a form for inputting a ZIP code. It must have five digits and may be followed by a dash and four more digits. The functionality of the script hinges on the regular expression ^([0-9]{5})(-[0-9]{4})?$, which is compared to user input. It's instructive to examine this expression in detail.

The expression starts with a caret. This causes the expression to match only from the beginning of the evaluated string. If this were left out, the ZIP code could be preceded by any number of characters, such as abc12345-1234, and still be a valid match. Likewise, the dollar sign at the end of the expression matches the end of the string. This stops matching of strings like 12345-1234abc. The combination of a caret and a dollar sign allows us to match only exact strings.

The first subexpression is ([0-9]{5}). The square-bracketed range allows only characters from zero to nine. The curly braces specify that there must be exactly five of these characters.

The second subexpression is (-[0-9]{4})?. Like the first, it specifies an exact number of digits, in this case four. The dash is a literal character that must precede the digits. The question mark specifies that the entire subexpression may match once or not at all. This makes the four-digit extension optional.

You can easily expand this idea to check phone numbers or dates. Regular expressions provide a neat way of checking variables returned from forms. Consider the alternative of nesting if statements and searching strings with the strpos function.

You may also choose to have subexpression matches returned in an array. This is useful in situations where you need to break a string into components. The string a browser uses to identify itself is a good candidate for this method. Encoded in this string are the browser's name, version, and the type of computer it's running on. Pulling this information out into separate variables will allow you to customize your site based on the capabilities of the browser.
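The effect of the anchors and the optional subexpression can be checked with a small sketch. It assumes a current PHP release, where the ereg family is no longer available (it was deprecated in PHP 5.3 and removed in PHP 7), so it swaps in preg_match from the PCRE extension. The function name parseZip and the array keys are hypothetical, chosen only for illustration.

```php
<?php
// Hypothetical helper: returns the matched parts of a ZIP code,
// or null when the input does not fit the pattern.
function parseZip(string $zip): ?array
{
    // Same pattern as Listing 22.2, wrapped in PCRE delimiters.
    // The D modifier makes $ match only at the very end of the string;
    // without it PCRE would also accept a single trailing newline.
    if (!preg_match('/^([0-9]{5})(-[0-9]{4})?$/D', $zip, $match)) {
        return null;
    }

    return [
        // first subexpression: the five digits
        'zip5' => $match[1],
        // second subexpression: absent from $match when the
        // optional extension did not take part in the match
        'extension' => $match[2] ?? '',
    ];
}

// Try a few of the strings discussed in the text.
foreach (['12345', '12345-1234', 'abc12345-1234', '12345-1234abc'] as $candidate) {
    print($candidate . ': ');
    print(parseZip($candidate) === null ? "no match\n" : "match\n");
}
```

Removing the caret or the dollar sign from the pattern is a quick way to see the last two strings start matching.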
Listing 22.3 is a script for creating a set of variables that aid in cloaking a site for a particular browser. For the purpose of illustration, we will customize a link based on the browser being used. If the user visits the page with Netscape Navigator, we will provide a link to the download page for Microsoft Internet Explorer. Otherwise, we'll put a link to Netscape's download page. This is an example of customizing content, but the same method can be used to decide whether to use advanced features.

Listing 22.3 Evaluating user agent
<?php
//evaluate user agent like
//Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312461)
ereg("^([[:alpha:]]+)/([[:digit:]\.]+)( .*)$",
    $_SERVER['HTTP_USER_AGENT'], $match);
$browserName = $match[1];
$browserVersion = $match[2];
$browserDescription = $match[3];

//look for clues that this is MSIE
if(eregi("msie", $browserDescription))
{
    //looking for something like:
    //(compatible; MSIE 6.0; Windows NT 5.1; Q312461)
    eregi("MSIE ([[:digit:]\.]+);",
        $browserDescription, $match);
    $browserName = "MSIE";
    $browserVersion = $match[1];
}

print("You are using $browserName " .
    "version $browserVersion!<br>\n" .
    "You might want to try ");

if(eregi("mozilla", $browserName))
{
    print("<a href=\"" .
        "http://www.microsoft.com/ie/download/default.asp\">");
    print("Internet Explorer");
    print("</a> ");
}
else
{
    print("<a href=\"" .
        "http://www.netscape.com/computing/download/".
        "index.html" ."\">");
    print("Navigator");
    print("</a> ");
}

print("for comparison.<br>\n");
?>
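The way subexpression matches fill the match array in Listing 22.3 can be tried on a fixed string instead of a live request. This is only a sketch: it uses preg_match and PCRE syntax rather than ereg (which was removed in PHP 7), and the function name parseUserAgent and its array keys are hypothetical.

```php
<?php
// Hypothetical helper: split a User-Agent string into its parts,
// or return null when it does not look like name/version description.
function parseUserAgent(string $agent): ?array
{
    // Three subexpressions, as in the listing: name, version, description.
    // '#' is used as the delimiter so the slash needs no escaping.
    if (!preg_match('#^([[:alpha:]]+)/([[:digit:].]+)( .*)$#', $agent, $match)) {
        return null;
    }

    $result = [
        'whole'       => $match[0], // text matched by the entire expression
        'name'        => $match[1],
        'version'     => $match[2],
        'description' => $match[3],
    ];

    // Internet Explorer reports itself as Mozilla, so look for
    // "MSIE x.y;" inside the description (case-insensitive).
    if (preg_match('/MSIE ([[:digit:].]+);/i', $match[3], $msie)) {
        $result['name'] = 'MSIE';
        $result['version'] = $msie[1];
    }

    return $result;
}

$sample = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312461)';
print_r(parseUserAgent($sample));
```

Unlike the listing, the sketch checks the return value of the first match, so a string with no slash yields null rather than an unset match array.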
In this script the main ereg function is not used in an if statement. It assumes the browser will identify itself minimally as a name, a slash, and the version. The match array gets set with the parts of the evaluated string that match with the parts of the regular expression. There are three subexpressions for name, version, and any extra description. Most browsers follow this form, including Navigator and Internet Explorer. Since Internet Explorer always reports that it is a Mozilla (Netscape) browser, extra steps must be taken to determine if a browser is really a Netscape browser or an imposter. This is done with a call to eregi.

If you are wondering why element zero is ignored, that's because the zero element holds the substring that matches the entire regular expression. In this situation it is not interesting. Usually, the zero element is useful when you are searching for a particular string in a larger context. For example, you may be scanning the body of a Web page for URLs. Listing 22.4 fetches the PHP home page and lists all the links on the page. The output is shown in Figure 22.3.

Listing 22.4 Scanning text for URLs
<?php
//set URL to fetch
$URL = "http://www.php.net/";

//open file
$page = fopen($URL, "r");

print("Links at $URL<br>\n");
print("<ul>\n");

while(!feof($page))
{
    //get a line
    $line = fgets($page, 1024);

    //loop while there are still URLs present
    while(eregi("href=\"[^\"]*\"", $line, $match))
    {
        //print out URL
        print("<li>{$match[0]}</li>\n");

        //remove URL from line
        $replace = ereg_replace("\?", "\?", $match[0]);
        $line = ereg_replace($replace, "", $line);
    }
}

print("</ul>\n");
fclose($page);
?>
Figure 22.3. Output from Listing 22.4.
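The find-and-remove loop of Listing 22.4 can be sketched against a fixed string rather than a fetched page. The sketch below is an assumption-laden variant, not the listing's method: it uses the PCRE functions (ereg was removed in PHP 7), escapes the matched text with preg_quote, which covers every regular expression metacharacter and not only the question mark, and removes one occurrence per pass. The function names extractLinksByRemoval and extractLinks are hypothetical. The second function shows that preg_match_all can collect all matches in a single call.

```php
<?php
// Hypothetical helper: collect every href attribute by matching
// the first one, recording it, and removing it from the text.
function extractLinksByRemoval(string $html): array
{
    $links = [];

    while (preg_match('/href="[^"]*"/i', $html, $match)) {
        // element zero holds the text matched by the whole expression
        $links[] = $match[0];

        // escape ?, ., + and other metacharacters so the matched
        // text is treated literally when used as a pattern
        $pattern = '/' . preg_quote($match[0], '/') . '/';

        // limit of 1: remove only the occurrence just recorded
        $html = preg_replace($pattern, '', $html, 1);
    }

    return $links;
}

// Hypothetical helper: the same list from a single call.
function extractLinks(string $html): array
{
    preg_match_all('/href="[^"]*"/i', $html, $matches);

    // $matches[0] lists every full-pattern match in order
    return $matches[0];
}

$sample = '<a href="a.php?x=1">A</a> <A HREF="b.html">B</A>';
print_r(extractLinksByRemoval($sample));
print_r(extractLinks($sample));
```

Because the sketch removes one occurrence at a time, a link that appears twice is reported twice, which keeps the two functions in agreement.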
The main loop of this script gets lines of text from the file stream and looks for href attributes. If one is found in a line, it will be placed in the zero element of the match array. The script prints it out and then removes it from the line using the ereg_replace function. This function replaces text matched with a regular expression with a string. In this case the script replaces the href attribute with an empty string.

The reason for finding the link and then removing it is that it is possible for two links to be on one line of HTML. The eregi function will match the first substring only. The solution is to find and remove each link until none remain.

Notice that when removing the link, a replace variable is prepared. Some links might contain a question mark, a valid character in a URL that separates a filename from form variables. Since this character has special meaning in regular expressions, the script places a backslash before it so that the regular expression engine takes it literally.

I frequently use ereg_replace to convert text for use in a new context. You can use ereg_replace to collapse multiple spaces into a single space. Listing 22.5 demonstrates this idea. The output is shown in Figure 22.4.

Listing 22.5 Replacing multiple spaces
<?php
/*
** if text submitted show it
*/
if(isset($_REQUEST['text']))
{
    print("<b>Unfiltered</b><br>\n" .
        "<pre>{$_REQUEST['text']}</pre>" .
        "<br>\n");

    $_REQUEST['text'] = ereg_replace("[[:space:]]+",
        " ", $_REQUEST['text']);

    print("<b>Filtered</b><br>\n" .
        "<pre>{$_REQUEST['text']}</pre>" .
        "<br>\n");
}
else
{
    $_REQUEST['text'] = "";
}

//start form
print("<form action=\"{$_SERVER['PHP_SELF']}\">\n" .
    "<textarea name=\"text\" cols=\"40\" rows=\"10\">" .
    "{$_REQUEST['text']}</textarea><br>\n" .
    "<input type=\"submit\">\n" .
    "</form>\n");
?>
Figure 22.4. Output from Listing 22.5.
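The filtering step of Listing 22.5 can be isolated into a function and run without a form. This sketch assumes a PHP release where ereg_replace is no longer available (the ereg family was removed in PHP 7) and uses preg_replace with the same [[:space:]]+ character class. The name collapseWhitespace is hypothetical.

```php
<?php
// Hypothetical helper: replace each run of whitespace
// (spaces, tabs, newlines) with a single space.
function collapseWhitespace(string $text): string
{
    // preg_replace returns null on a PCRE error; fall back to
    // the unfiltered text in that case
    return preg_replace('/[[:space:]]+/', ' ', $text) ?? $text;
}

$sample = "several   spaces,\ta tab,\n\nand blank lines";
print(collapseWhitespace($sample) . "\n");
```

Leading and trailing runs are collapsed to one space rather than removed, which matches the behavior of the pattern in the listing.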