
27.9 URLs Friendly to Search Engines

Search engines such as Google <http://www.google.com/> and All the Web <http://www.alltheweb.com/> attempt to explore the entire Web. They have become an essential resource for Internet users, and anyone who maintains a public site benefits from being listed. Search engines use robots, or spiders, to explore pages in a Web site, and they index PHP scripts the same way they index HTML files. When links appear in a page, they are followed. Consequently, the entire site becomes searchable.

Unfortunately, many robots do not follow links that appear to contain form variables. A link containing a question mark may lead a robot into an endless loop, so robots are programmed to avoid such links. This presents a problem for sites that use form variables in links. Passing form variables in anchor tags is a natural way for PHP scripts to communicate, but it can keep your pages out of the search engines. To overcome this problem, data must be passed in a format that resembles ordinary URLs.

First, consider how a Web server accepts a URI and matches it to a file. The URI is a virtual path, the part of the URL that comes after the hostname. It begins with a slash and may be followed by a directory, another slash, and so forth. One by one, the Web server matches directories in the URI to directories in the file system. A script is executed when it matches part of the URI, even when more path information follows. Ordinarily, this extra path information is thrown away, but you can capture it.

Look at Listing 27.9. This script works with Apache compiled for UNIX but may not work with other Web servers. It relies on the PATH_INFO environment variable, which may not be present in a different context. Each Web server creates a unique set of environment variables, although there is overlap.

Listing 27.9 Using path info
<?php
    if(isset($_SERVER['PATH_INFO']))
    {
        //remove .html from the end only
        $path = preg_replace('/\.html$/',
            "", $_SERVER['PATH_INFO']);

        //remove leading slash
        $path = substr($path, 1);

        //iterate over parts
        $pathVar = array();
        $v = explode("/", $path);
        $c = count($v);
        //stop early if the last name has no value
        for($i=0; $i+1<$c; $i += 2)
        {
            $pathVar[($v[$i])] = $v[$i+1];
        }


        //the path may not contain a message pair
        if(isset($pathVar['message']))
        {
            print("You are viewing message " .
                htmlspecialchars($pathVar['message']) . "<br>\n");
        }
    }

    //pick a random ID
    $nextID = rand(1, 1000);
    print("<a href=\"{$_SERVER['SCRIPT_NAME']}/message/" .
        "$nextID.html\">" .
        "View Message $nextID</a><br>\n");
?>

You may be accessing the code in Listing 27.9 from the URL http://localhost/corephp/27-9.php/message/1234.html. In this case, you are connecting to a local server that contains a directory named corephp in its document root. A default installation of Apache might place this in /usr/local/apache/htdocs. The name of the script is 27-9.php, and everything after the script name is placed in the PATH_INFO variable. No file named 1234.html exists, but to the Web browser it appears to be an ordinary HTML document. It appears that way to a spider as well.
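To make the split concrete, the following sketch imitates how the request path in that URL divides into the script name and the path info. The function splitRequestPath is a hypothetical helper written for illustration only; on a real server, Apache performs this step and hands the results to PHP in $_SERVER.

```php
<?php
// Hypothetical helper (not part of PHP or Apache): imitates how the
// server divides a request path into script name and extra path info.
function splitRequestPath($requestPath, $scriptName)
{
    // the script name must match the start of the request path
    if (strncmp($requestPath, $scriptName, strlen($scriptName)) !== 0) {
        return null;
    }

    // whatever follows the script name is the extra path information
    $rest = (string) substr($requestPath, strlen($scriptName));
    if ($rest !== '' && $rest[0] !== '/') {
        return null; // e.g. /corephp/27-9.phpx is a different file
    }

    return array('SCRIPT_NAME' => $scriptName, 'PATH_INFO' => $rest);
}

$parts = splitRequestPath('/corephp/27-9.php/message/1234.html',
    '/corephp/27-9.php');
print($parts['PATH_INFO'] . "\n"); // prints /message/1234.html
```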

The code in Listing 27.9 doesn't really do much. It splits the path info into pairs, each made of a variable name and its value. The script pretends message is an identifier. It could be referencing a record in a relational database. I've added some code that uses a random number to create a link to another imaginary record. Remember the BBS from Chapter 23? This method could be applied there, and each message would appear to be a single HTML file.
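If you apply the technique to a larger site, it may help to keep the parsing in one place and to pair it with a function that writes links in the same format. The following is a minimal sketch under that assumption. The names parsePathInfo and buildPathInfo are hypothetical, and the sketch assumes the server has already URL-decoded the path info.

```php
<?php
// Hypothetical helper: turn "/name/value/name/value.html" into an array.
function parsePathInfo($pathInfo)
{
    // strip a trailing .html, then any leading or trailing slashes
    $path = trim(preg_replace('/\.html$/', '', $pathInfo), '/');

    $pathVar = array();
    if ($path === '') {
        return $pathVar;
    }

    $v = explode('/', $path);
    $c = count($v);
    // stop early if the last name has no value
    for ($i = 0; $i + 1 < $c; $i += 2) {
        $pathVar[$v[$i]] = $v[$i + 1];
    }
    return $pathVar;
}

// Hypothetical helper: the inverse, for writing links.
function buildPathInfo($pathVar)
{
    $parts = array();
    foreach ($pathVar as $name => $value) {
        // encode each piece so it is safe inside a URL path
        $parts[] = rawurlencode((string) $name);
        $parts[] = rawurlencode((string) $value);
    }
    return '/' . implode('/', $parts) . '.html';
}

$vars = parsePathInfo('/message/1234.html');
print($vars['message'] . "\n"); // prints 1234
print(buildPathInfo(array('message' => 1234)) . "\n"); // prints /message/1234.html
```

A script would append the result of buildPathInfo to its own name, as Listing 27.9 does with SCRIPT_NAME, so that every link it writes can be read back by parsePathInfo.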

I've introduced only the essential principles of this method. There are a few pitfalls, and there are a few enhancements to be pursued. Keep in mind that Web browsers resolve relative URLs against the apparent directory of the page, which here includes the path information, so requests for images referenced by relative URLs in your scripts may fail. Therefore, you must use absolute paths. You might also wish to name your PHP script so that it doesn't contain an extension. This is possible with Apache by setting the default document type, using the DefaultType configuration directive. (Apache 2.4 and later ignore DefaultType; the ForceType directive can serve the same purpose.) You can also use Apache's mod_rewrite. I encourage you to read about these parts of Apache at its home site <http://www.apache.org/docs/>.
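One way to keep image references working is to build each one from the script's own directory rather than leaving the browser to resolve it. The sketch below assumes the images live in a directory beside the script. The function assetUrl and the file images/logo.png are hypothetical names chosen for illustration.

```php
<?php
// Hypothetical helper: build a root-relative URL for a file stored in
// the same directory as the script, ignoring any path info.
function assetUrl($scriptName, $file)
{
    // dirname('/corephp/27-9.php') is '/corephp'
    $dir = rtrim(str_replace('\\', '/', dirname($scriptName)), '/');
    return $dir . '/' . ltrim($file, '/');
}

// in a live script, pass $_SERVER['SCRIPT_NAME'] as the first argument
$src = assetUrl('/corephp/27-9.php', 'images/logo.png');
print("<img src=\"" . htmlspecialchars($src) . "\" alt=\"logo\">\n");
```

Because the resulting URL begins with a slash, the browser requests the same file no matter how much path information follows the script name.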
