UNH Search FAQ:
What is not indexed.
author and contact: cwis.admin@unh.edu
updated 23-Aug-1999
Here are reasons, ordered roughly
from most likely to least likely,
why you may be able to
view a file with your browser but not find it
indexed in Infoseek or another search engine
collection:
- Excluded by server sysadmin.
-
To conserve system resources and to reduce the
number of false positives returned as matches,
the server sysadmin develops a collection
management policy that excludes some servers.
See the
disallowed list for Infoseek on
the UNH Intranet.
- Dynamic rather than static page.
-
On some servers most of the Web pages are
dynamic, i.e., produced from a database or
other application to meet a specific request,
and they do not exist as standing or static
HTML files. The whole concept of a page
becomes somewhat different because there may
be a very large number of different
dynamic pages that could be generated in
response to requests. This is sometimes
called the
invisible Web.
URLs for database access often contain a "?"
character, and Infoseek does not
index such URLs (nor do any of the other
major search engines).
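As a rough illustration of the "?" rule above, the sketch below flags URLs that carry a query string. The function name looks_dynamic and the example.edu URLs are hypothetical, and the test is only an approximation of how a spider might decide to skip a URL, not Infoseek's actual logic.

```python
from urllib.parse import urlsplit


def looks_dynamic(url: str) -> bool:
    """Return True if the URL has a non-empty query string (text after "?")."""
    # urlsplit separates the query from the path and the fragment,
    # so a "?" that appears only inside a "#fragment" is not counted.
    return urlsplit(url).query != ""


if __name__ == "__main__":
    for candidate in (
        "http://www.example.edu/cgi-bin/lookup?id=42",  # hypothetical URL
        "http://www.example.edu/docs/index.html",       # hypothetical URL
    ):
        verdict = "likely skipped" if looks_dynamic(candidate) else "indexable"
        print(candidate, "->", verdict)
```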
- Use of robots.txt files.
-
Web server administrators can place a file
called
"robots.txt"
at the top level of their
server and use it to tell the search engine
spider that part or all of the information
(whole directory paths, not individual files)
is off limits. Well-behaved spiders will
then ignore those parts.
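To see how a well-behaved spider might apply such a file, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt contents, the spider name "ExampleSpider", and the example.edu URLs are made up for illustration; a real spider would fetch the file from the server rather than use a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that closes two directory paths to all spiders.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse() takes an iterable of lines
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.edu/private/notes.html",
        "http://www.example.edu/index.html",
    ):
        print(url, "->", is_allowed(ROBOTS_TXT, "ExampleSpider", url))
```

The first URL falls under a disallowed path and is reported as off limits; the second is not covered by any rule and is allowed.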
- Unusual file format.
-
Infoseek indexes about a dozen formats,
including HTML, plain text, PDF, PostScript,
RTF, Word, Excel, and PowerPoint, but
it does not index the content of graphics,
sounds, and
other unusual formats that may depend on
a special plug-in or application for
browser display. To emphasize, this
limitation applies to content. A workaround,
where HTML syntax allows, is to provide an
ALT= attribute with descriptive text,
which
will be indexed:
<img src="keyboard.gif" alt="keyboard graphic">
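The sketch below shows, under simplifying assumptions, the text an indexer could recover from an image tag like the one above. It uses Python's standard html.parser module; the class and function names are hypothetical, and this is not a description of how Infoseek itself extracts ALT text.

```python
from html.parser import HTMLParser


class AltTextCollector(HTMLParser):
    """Collect the ALT text of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.alt_texts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag == "img":
            for name, value in attrs:
                if name == "alt" and value:
                    self.alt_texts.append(value)


def collect_alt_text(html: str) -> list:
    """Return the ALT strings found in html, in document order."""
    collector = AltTextCollector()
    collector.feed(html)
    collector.close()
    return collector.alt_texts


if __name__ == "__main__":
    print(collect_alt_text('<img src="keyboard.gif" alt="keyboard graphic">'))
```

An image with no ALT attribute contributes nothing to the list, which mirrors the point that the graphic itself offers no indexable text.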
- Use of domain or password restrictions.
-
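Many Web servers support either
domain restrictions or user name and password
restrictions. One convention is the use of
.htaccess files. A spider that falls outside
the allowed domain or lacks the password cannot
retrieve such pages, so they are not indexed.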
- Excluded by page author.
-
Infoseek supports
several conventions, under
the control of Web authors,
that keep part or all of a page
from being indexed.
These include an HTML
comments convention and
robot exclusion via META tags.
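As an illustration of the META tag convention, the sketch below checks a page for a robots META tag whose content includes "noindex" (or "none"), the commonly used values for asking spiders not to index a page. The class and function names are hypothetical, the sample pages are invented, and the HTML comments convention mentioned above is not covered here.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Gather the directives listed in <meta name="robots" content="...">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            # The content value is a comma-separated list of directives.
            for token in attributes.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def may_index(html: str) -> bool:
    """Return False if the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return not ({"noindex", "none"} & parser.directives)


if __name__ == "__main__":
    excluded = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    ordinary = "<html><head><title>Welcome</title></head></html>"
    print(may_index(excluded), may_index(ordinary))
```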
- HTML syntax mistakes.
-
Search engines parse HTML files
much as your browser does, but a
search engine may not forgive
mistakes that your browser lets
pass.
If in doubt about a page, as either
author or reader, you can run it
through a
syntax checker.
- Cut-off limit on full-text.
-
While in principle and in general practice
search engines offer full-text
indexing, operationally there is normally
an upper limit beyond which a search engine stops
indexing a file.
For Infoseek on the UNH Intranet
that cut-off is currently one megabyte
and for Webinator with Pubpages it
is 90K bytes.
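The effect of a cut-off can be sketched as a simple truncation: only the bytes before the limit are seen by the indexer. The byte values below are assumptions, since the text does not say whether "one megabyte" and "90K" are counted in powers of two; the constant and function names are hypothetical.

```python
# Assumed interpretations of the limits quoted above (powers of two).
INFOSEEK_LIMIT = 1024 * 1024   # "one megabyte" taken as 1,048,576 bytes
WEBINATOR_LIMIT = 90 * 1024    # "90K bytes" taken as 92,160 bytes


def indexed_portion(data: bytes, limit: int) -> bytes:
    """Return the leading part of a file that falls inside the cut-off."""
    return data[:limit]


def is_fully_indexed(data: bytes, limit: int) -> bool:
    """Return True if the whole file fits under the cut-off."""
    return len(data) <= limit


if __name__ == "__main__":
    page = b"x" * (100 * 1024)  # a made-up 100K file
    print(is_fully_indexed(page, INFOSEEK_LIMIT))
    print(is_fully_indexed(page, WEBINATOR_LIMIT))
    print(len(indexed_portion(page, WEBINATOR_LIMIT)))
```

Text that appears only after the cut-off point in a long file would not be searchable, so important terms are best placed early.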
- HTML Frames.
-
Infoseek can parse and follow HTML frameset
pages, but that is not true of all search
engines. There is also an issue of context
if someone retrieves and views a framed page.
Anyone authoring framed pages should
read the discussion
at Search Engine Watch.
http://www.unh.edu/NIS/Docs/Search/not-there.html