22.1 Tokenizing

PHP offers a simple model for tokenizing a string. Certain characters of your choice are considered separators, and the strings of characters between separators are considered tokens. You may change the set of separators with each token you pull from a string, which is handy for irregular strings, that is, ones that aren't simply comma-separated lists.

Listing 22.1 accepts a sentence and breaks it into words using the strtok function, described in Chapter 12. As far as the script is concerned, a word is bounded on each side by a space, a punctuation mark, or an end of the sentence. Single and double quotes are left as part of the word. Output is shown in Figure 22.1.

Listing 22.1 Tokenizing a string
<?php
    /*
    ** If a sentence was submitted, parse it
    */
    if(isset($_REQUEST['sentence']))
    {
        $total = 0;

        //start with an empty array so the sorts below always get one
        $word_count = array();

        //escape the input before echoing it back into the page
        print("<b>Submitted text:</b> ");
        print(htmlspecialchars($_REQUEST['sentence']) . "<br>\n<br>\n");

        //set characters that separate tokens
        $separators = " ,!.?";

        //get each token
        for($token = strtok($_REQUEST['sentence'], $separators);
            $token !== FALSE;
            $token = strtok($separators))
        {
            //skip empty tokens
            if($token != "")
            {
                //count each word, ignoring case
                $word = strtolower($token);

                if(!isset($word_count[$word]))
                {
                    $word_count[$word] = 1;
                }
                else
                {
                    $word_count[$word]++;
                }

                $total++;
            }
        }

        //first sort by word
        ksort($word_count);

        //next sort by frequency; ties keep their alphabetical
        //order only where sorting is stable (PHP 8.0 and later)
        arsort($word_count);

        print("<b>$total Words Found</b>\n");
        print("<ul>\n");
        foreach($word_count as $key=>$value)
        {
            //tokens may contain markup characters, so escape them
            print("<li>" . htmlspecialchars($key) . " ($value)</li>\n");
        }
        print("</ul>\n");
    }

    //escape PHP_SELF because the request path is user-controlled
    print("<form action=\"" . htmlspecialchars($_SERVER['PHP_SELF']) . "\" " .
        "method=\"post\">\n");
    print("<input name=\"sentence\" size=\"40\">\n");
    print("<input type=\"submit\" value=\"Parse\">\n");
    print("</form>\n");
?>
Figure 22.1. Output from Listing 22.1.
Note the use of the for loop in this example. Instead of incrementing an integer, it gets tokens one by one. When strtok encounters the end of input, it returns FALSE. Your first inclination might be to test for FALSE in the for loop with the != operator. Recall, however, that both an empty string and the string "0" are considered equivalent to FALSE under loose comparison. Versions of PHP before 4.1.0 returned an empty string when two separators followed each other; later versions skip the empty part, but a token such as "0" can still appear in the input. Since we don't want to stop tokenizing at the first such token, we must check for a genuine FALSE with the !== operator.

The strtok function is useful only in the simplest, most structured situations. An example might be reading a tab-delimited text file. The algorithm might be to read a line from the file, pull each token from the line using the tab character as the separator, then continue with the next line from the file.
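
The tab-delimited case can be sketched as follows. This is a minimal illustration rather than a complete file reader: tokenize_line is a hypothetical helper name, and the $lines array stands in for lines that would normally come from a file, for example via fgets. The second field of the first row is the string "0", which shows why the loop tests with !== rather than !=.

```php
<?php
/*
** Split one tab-delimited line into fields with strtok.
** tokenize_line is a hypothetical helper, not a PHP built-in.
*/
function tokenize_line($line)
{
    $fields = array();

    //strip the line ending so it doesn't become part of the last field
    $line = rtrim($line, "\r\n");

    //a field of "0" is loosely equal to FALSE, so compare with !==
    for($token = strtok($line, "\t");
        $token !== FALSE;
        $token = strtok("\t"))
    {
        $fields[] = $token;
    }

    return($fields);
}

//sample data standing in for lines read from a file
$lines = array(
    "apple\t0\tred\n",
    "pear\t12\tgreen\n");

foreach($lines as $line)
{
    print(implode(" | ", tokenize_line($line)) . "\n");
}
```

Because current versions of strtok skip empty parts, a row with an empty column (two tabs in a row) yields fewer fields than its neighbors. Where column positions matter, explode or fgetcsv with a tab delimiter may be a better fit, since both keep empty fields.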