Toolkit Addition
Robot Exclusions
A group of Internet users who recognized the problems that unruly
robots could cause got together by e-mail and developed an unofficial
standard known as the Standard for Robot Exclusion (SRE). The SRE
defines a protocol that permits site managers to exclude robots from
designated areas of their Web sites.
This tool is used to tell search engines "Don't Go There." One very
pragmatic use for the SRE is to keep robots that are run by search
sites from indexing temporary HTML documents that probably won't be
around this time tomorrow. Another is to steer robots away from pages
that are under construction or intended for local use, or to ask them
to avoid a site altogether.
Using the SRE to exclude robots from all or part of a Web site is
simplicity itself. All you do is create a file named robots.txt in the
top-level directory of the Web server. This is a plain-text file
containing simple directives that spell out access policies for robots.
What does a robots.txt file look like? Here's a simple one that asks
all robots to stay away from /phone/ and its subdirectories:
# Sample robots.txt file
User-agent: *
Disallow: /phone/
The first line is a comment line. The second line designates the
robots to which the access policies apply; "*" means all robots. The
third line disallows access to the specified directory and to any
directories below it in the hierarchy.
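As a rough illustration of how a well-behaved robot might apply these rules, the sketch below feeds the sample file to Python's standard urllib.robotparser module. The robot name ExampleBot and the host www.example.com are placeholders chosen for this example, not part of the original file.

```python
from urllib.robotparser import RobotFileParser

# The sample robots.txt file from the text above.
SAMPLE = """\
# Sample robots.txt file
User-agent: *
Disallow: /phone/
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())


def allowed(path, robot="ExampleBot"):
    """Return True if the (hypothetical) robot may fetch the given path."""
    return parser.can_fetch(robot, "http://www.example.com" + path)


print(allowed("/phone/list.html"))        # False: inside /phone/
print(allowed("/phone/sales/east.html"))  # False: a directory below /phone/
print(allowed("/index.html"))             # True: not covered by any rule
```

Note that the check is something the robot performs on itself; nothing on the server enforces it.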
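In other words, the file "robots.txt" tells you which directories a
site has asked search engines to leave out of their indexes. Compliance
is voluntary, so a listed directory is not guaranteed to be missing
from every search engine.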
To find these files, request /robots.txt from the top level of a Web
site (for example, http://www.example.com/robots.txt) and start
searching.
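A robots.txt file is conventionally served from the top level of a site, so its address can be worked out from any page URL on that site. The sketch below shows one way to do that and to list the excluded paths from the file's text. The function names robots_url and disallowed_paths are made up for this example, and the parser is deliberately simplified: it ignores which User-agent each rule belongs to.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Build the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def disallowed_paths(robots_text):
    """Collect every non-empty Disallow path (simplified: all robots lumped together)."""
    paths = []
    for line in robots_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


sample = "# Sample robots.txt file\nUser-agent: *\nDisallow: /phone/\n"
print(robots_url("http://www.example.com/phone/list.html"))
print(disallowed_paths(sample))
```

Fetching the file itself is left out so the sketch runs without network access.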
Search Tips
Search Tip: Nesting
Another invaluable search tool is nesting. Nesting enables you to set
up more complex logic statements than you could with plain old keyword
searches. Here are two examples of increasingly complex queries that
use Alta Vista and a combination of several operators.
Nesting operators can be used to create highly complex queries ...

intel +(+objective +unix)

that become more precise as you specify more parameters.

(+objective +unix +skills) +link:intel

Please note that url:+intel +(+objective +unix) is not a valid
expression, because the search engine would look for all URLs that
contain the expression +intel instead of intel.
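One way to avoid misplacing the "+" is to assemble nested queries from small helpers instead of typing them by hand. The sketch below is only a string builder: the helper names required and group are hypothetical, and it assumes that "+" attaches directly in front of a term or a field prefix such as link: or url:.

```python
def required(term, field=None):
    """Mark a term as required, placing '+' before any field prefix."""
    body = f"{field}:{term}" if field else term
    return "+" + body


def group(*parts):
    """Nest several parts inside parentheses."""
    return "(" + " ".join(parts) + ")"


# First example: a keyword plus a required nested group.
simple = " ".join(["intel", required(group(required("objective"), required("unix")))])

# Second example: a larger group plus a required link: term.
precise = " ".join([
    group(required("objective"), required("unix"), required("skills")),
    required("intel", field="link"),
])

print(simple)   # intel +(+objective +unix)
print(precise)  # (+objective +unix +skills) +link:intel
```

Because required always puts the "+" ahead of the field name, it cannot produce the url:+intel form described above.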