UNH Search FAQ:
What is not indexed.
author and contact: cwis.admin@unh.edu
updated 23-Aug-1999


Here are the reasons, ordered roughly from most likely to least likely, why you may be able to view a file with your browser but not find it indexed in Infoseek or another search engine collection:

Excluded by server sysadmin.
To conserve system resources and to reduce the number of false positives returned as matches, the server sysadmin develops a collection management policy that excludes some servers. Check here for the disallowed list for Infoseek on the UNH Intranet.

Dynamic rather than static page.
On some servers most of the Web pages are dynamic, i.e., produced from a database or other application to meet a specific request, and they do not exist as standing or static HTML files. The whole concept of a page becomes somewhat different because a very large number of different dynamic pages could be generated in response to requests. This is sometimes called the invisible Web. URLs for database access often contain a "?" symbol, and Infoseek does not index such URLs (nor do any of the other major search engines).
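
As a rough illustration of the "?" rule above, the following Python sketch flags URLs that a spider following this convention would likely skip. The function name and the example.edu URLs are hypothetical, and this is not Infoseek's actual logic, only a minimal approximation of the rule as described.

```python
from urllib.parse import urldefrag


def looks_dynamic(url):
    """Return True if the URL contains a "?" outside of its fragment.

    Assumption: a "?" in the URL signals a database or application
    request, which is the convention described in the text above.
    """
    # Drop any "#fragment" first, since the fragment is never sent to the server.
    url_without_fragment = urldefrag(url).url
    return "?" in url_without_fragment


# Hypothetical example URLs.
print(looks_dynamic("http://www.example.edu/cgi-bin/lookup?id=42"))
print(looks_dynamic("http://www.example.edu/docs/index.html"))
```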

Use of robots.txt files.
Web server administrators can place a file called "robots.txt" at the top level of their server and use it to tell the search engine spider that part or all of the information (whole directory paths, not individual files) is off limits. Well-behaved spiders will then ignore those parts.
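
To see how a well-behaved spider might interpret such a file, the sketch below feeds a small robots.txt to Python's standard urllib.robotparser module. The directory names, the spider name, and the example.edu host are made up for illustration; a real server's robots.txt will differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: all spiders are asked to stay out
# of two directory paths.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/
"""


def build_parser(robots_text):
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleSpider" is a made-up user agent name.
print(parser.can_fetch("ExampleSpider", "http://www.example.edu/private/notes.html"))
print(parser.can_fetch("ExampleSpider", "http://www.example.edu/public/index.html"))
```

The first check reports that the page under /private/ is off limits, and the second reports that the page under /public/ may be fetched.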

Unusual file format.
Infoseek indexes about a dozen formats, including HTML, plain text, PDF, PostScript, RTF, Word, Excel, and PowerPoint, but it does not index the content of graphics, sounds, and other unusual formats that may depend on a special plug-in or application for browser display. Note that this limitation applies to the content of such files. A workaround, where HTML syntax allows, is to provide an ALT attribute with descriptive text, which will be indexed:
<img src="keyboard.gif" alt="keyboard graphic">
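
The following Python sketch shows one way an indexer could pick up ALT text from image tags, using the standard html.parser module. It is only an illustration of the idea; the class and function names are hypothetical, and Infoseek's own parser may behave differently.

```python
from html.parser import HTMLParser


class AltTextCollector(HTMLParser):
    """Collect the ALT text of every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.alt_texts = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.alt_texts.append(alt)


def collect_alt_text(html):
    collector = AltTextCollector()
    collector.feed(html)
    collector.close()
    return collector.alt_texts


print(collect_alt_text('<img src="keyboard.gif" alt="keyboard graphic">'))
```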

Use of domain or password restrictions.
Many Web servers support the use of either domain restrictions or user name and password restrictions. One convention is the use of .htaccess files.

Excluded by page author.
Infoseek supports several conventions, under the control of Web authors, that keep part or all of a page from being indexed. These include an HTML comments convention and robot exclusion via META tags.
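
For the META tag convention, the widely used form is a tag such as <meta name="robots" content="noindex"> in the page head. The Python sketch below checks a page for that directive. It covers only the META tag, not the HTML comments convention, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Gather the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Directives are comma separated and case insensitive.
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def page_allows_indexing(html):
    """Return False if the page author asked robots not to index it."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    finder.close()
    return "noindex" not in finder.directives


# Hypothetical page that opts out of indexing.
page = '<html><head><meta name="robots" content="NOINDEX, nofollow"></head></html>'
print(page_allows_indexing(page))
```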

HTML syntax mistakes.
Search engines parse HTML files much as your browser does, but a search engine may not be as forgiving of mistakes that your browser lets pass. If in doubt about a page, as either author or reader, you can run it through a syntax checker.

Cut-off limit on full-text.
While in principle and in general practice search engines offer full-text indexing, operationally there is normally an upper limit on file size, beyond which the search engine stops indexing a file. For Infoseek on the UNH Intranet that cut-off is currently one megabyte, and for Webinator with Pubpages it is 90K bytes.
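
The effect of such a cut-off can be sketched in a few lines of Python: text that appears past the limit is simply never seen by the indexer. The byte values below are assumptions (one megabyte read as 1024 * 1024 bytes, 90K read as 90 * 1024 bytes), and the function names are hypothetical.

```python
# Assumed interpretations of the limits quoted in the text above.
INFOSEEK_CUTOFF_BYTES = 1024 * 1024
WEBINATOR_CUTOFF_BYTES = 90 * 1024


def indexed_portion(data, cutoff):
    """Return only the leading bytes that fall within the cut-off."""
    return data[:cutoff]


def term_is_indexed(data, term, cutoff):
    """Check whether a term appears in the part of the file that gets indexed."""
    return term in indexed_portion(data, cutoff)


# A made-up 100K file whose only mention of "zebra" is at the very end.
document = b"x" * (100 * 1024) + b"zebra"

print(term_is_indexed(document, b"zebra", WEBINATOR_CUTOFF_BYTES))
print(term_is_indexed(document, b"zebra", INFOSEEK_CUTOFF_BYTES))
```

With these assumed limits, the term falls outside the 90K cut-off but inside the one megabyte cut-off, so only the second check finds it.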

HTML Frames.
Infoseek can parse and follow HTML frameset pages, but that is not true of all search engines. There is also an issue of context if someone retrieves and views a framed page. Anyone authoring framed pages should read the discussion at Search Engine Watch.


Return to FAQ for Search Engine Use at UNH.
http://www.unh.edu/NIS/Docs/Search/not-there.html