HESPERIAN : the World @web

 
Searching the Web.
[Chapter 7 of MSc Dissertation]

In the early days of the web (late 1993) there were very few servers, so it was relatively easy to find information. Only a year later the number of servers had increased from 200 to 1500 (Berlin 1996). Finding something useful was now more difficult and required much more effort. Since then the web has grown exponentially, and unless users know exactly what they are looking for and where it is, they are very unlikely to find it in any reasonable time, if at all. Hence the importance of search facilities such as AltaVista, Excite and Yahoo: these are the ‘indexes’ of the web.

1 Search Engines.

  • Search Engines: Crawl the web, then people search through what they have found.
  • Directories: A directory such as Yahoo depends on humans for its listings. The user submits a short description to the directory for their entire site, or editors write one for sites they review.  A search looks for matches only in the descriptions submitted.
  • Hybrid Search Engines: Some search engines maintain an associated directory.  Being included in a search engine's directory is usually a combination of luck and quality.

Sometimes a site can be "submitted" for review, but there is no guarantee that it will be included.  Reviewers often keep an eye on sites submitted to announcement places, then choose to add those that look appealing.

1.1 Parts Of A Search Engine.

Search engines have three major elements:

  • First is the spider, also called the crawler.  The spider visits a web page, reads it, and then follows links to other pages within the site.  The spider returns to the site on a regular basis, such as every month or two, to look for changes.
  • Everything the spider finds goes into the second part of a search engine, the index.  The index, sometimes called the catalogue, is like a giant book containing a copy of every web page that the spider finds.  If a web page changes, then this book is updated with the new information.

    Sometimes it can take a while for new pages or changes that the spider finds to be added to the index.  Thus, a web page may have been "spidered" but not yet "indexed."  Until it is indexed (added to the index) it is not available to those searching with the search engine.

  • Search engine software is the third part of a search engine.  This is the program that sifts through the millions of pages recorded in the index to find matches to a search and rank them in order of what it believes is most relevant.
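The three parts can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any of the engines described here is implemented: the "web" is a hypothetical in-memory dictionary, PAGES, mapping a URL to its text and outgoing links, and ranking is simply a count of how often the query words occur.

```python
from collections import deque

# Hypothetical three-page "web": url -> (page text, outgoing links).
PAGES = {
    "/index.htm": ("student guide to manchester", ["/clubs.htm", "/food.htm"]),
    "/clubs.htm": ("clubs and nightclubs in manchester", ["/index.htm"]),
    "/food.htm": ("places to eat", []),
}

def spider(start):
    """Part 1: visit a page, read it, then follow its links."""
    seen, queue, found = set(), deque([start]), {}
    while queue:
        url = queue.popleft()
        if url in seen or url not in PAGES:
            continue
        seen.add(url)
        text, links = PAGES[url]
        found[url] = text
        queue.extend(links)
    return found

def build_index(found):
    """Part 2: the index keeps a copy of every page the spider found."""
    return {url: text.split() for url, text in found.items()}

def search(index, query):
    """Part 3: find matching pages and rank them (here: by word count)."""
    terms = query.lower().split()
    scores = {}
    for url, words in index.items():
        score = sum(words.count(term) for term in terms)
        if score:
            scores[url] = score
    # Highest score first; ties broken alphabetically by URL.
    return sorted(scores, key=lambda u: (-scores[u], u))

index = build_index(spider("/index.htm"))
results = search(index, "manchester clubs")
```

A page that the spider has found but that has not yet been passed to build_index is the "spidered but not indexed" case described above: it cannot appear in any result.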

2 The Major Search Facilities.

2.1 Yahoo (www.yahoo.com).

Yahoo is one of the oldest and most visited sites and is currently positioning itself as a portal site.  It is a directory service organised via a hierarchical system and is probably the most popular and important search service available (Sullivan(2) 1998).  

Because of the human role, directories can often provide better results than search engines.  Information can be found by browsing the lists or by using the search facility to find a listing for keyword/s typed in.  If no results are found in the Yahoo listing then it will automatically default to results from AltaVista, whilst offering the user links to other search services.  Yahoo also provides regional directories tailored to certain countries e.g. Yahoo UK for the UK and Yahoo Germany for Germany.

 
2.2 AltaVista (www.altavista.com).

This is a fully-fledged search engine: it constantly visits web sites on the Internet in order to create catalogues of web pages. Because they run automatically and index so many web pages, search engines may often find information not listed in directories (in the early stages of its development Student.Manchester was found listed at AltaVista, although it had not been submitted).

AltaVista was originally set up by Digital (now owned by Compaq) to showcase the speed of its Alpha processors and servers (Keogh 1998).  It has since become one of the first-stop sites of the web.  AltaVista uses a slightly different technique for storing its index from most other engines.  Once a site is indexed, words are referenced by code numbers; for instance, ‘the’ could be referenced as 1.  This cuts down on the memory required to store site information and speeds up the search process.  A page may be indexed here either by submitting its URL to trigger the ‘spider’, in which case the page will appear within a few days, or by simply waiting for a visit from the ‘spider’.
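The idea of referencing words by code numbers can be illustrated with a small Python sketch. It is only an illustration of the general technique (often called dictionary encoding); the function names and the numbering scheme are assumptions, and AltaVista's actual storage format is not described in the sources used here.

```python
def encode_pages(pages):
    """Replace every word with a small integer code."""
    codes = {}      # word -> code number, in order of first appearance
    encoded = {}    # url -> list of code numbers
    for url, text in pages.items():
        encoded[url] = [codes.setdefault(word, len(codes) + 1)
                        for word in text.lower().split()]
    return codes, encoded

def lookup(codes, encoded, word):
    """Find pages containing a word by comparing numbers, not strings."""
    code = codes.get(word.lower())
    if code is None:
        return []   # the word was never seen, so no page can match
    return sorted(url for url, numbers in encoded.items() if code in numbers)

# Hypothetical two-page site.
pages = {"a.htm": "the cat sat on the mat", "b.htm": "the dog"}
codes, encoded = encode_pages(pages)
```

Here ‘the’ is stored once in the code table and every later occurrence costs only a single number.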

2.3 Excite (www.excite.com).

Launched in late 1995, Excite grew quickly and has since eaten two of its competitors. In July 1996, Excite purchased Magellan and in November 1996, it acquired WebCrawler. These continue to run as separate services.

Excite lists sites in one of three ways: Excite Search, Channels By Excite and Excite NewsTracker.

  • Excite Search taps into the traditional search engine listings, created from crawling the web.
  • Channels By Excite lists sites by topics. These sites have been approved by editors, and sometimes also have reviews. There is also much associated subject information, discussion areas and more.
  • Excite NewsTracker allows you to search only listings generated by crawling speciality news sites.

2.4 Lycos (www.lycos.com).

Around since May 1994, Lycos is one of the oldest of the major search engines. It began as a project at Carnegie Mellon University. The name Lycos comes from the Latin for "wolf spider."  Lycos lists sites in two main ways: the first is the standard search engine listing, and the second is an associated directory called "Web Guides."

Lycos will index text in meta tags, but it won’t use them to display a description.  Descriptions are created from the first 275 characters following the <BODY> tag.
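A rough Python sketch of building a description from the text after the <BODY> tag follows. The function name is hypothetical, and it is an assumption that the 275 characters are counted after markup has been stripped; the sources do not say how Lycos treats tags within that span.

```python
import re

def description_after_body(html, limit=275):
    """Return up to `limit` characters of text following the <BODY> tag."""
    match = re.search(r"<body[^>]*>", html, flags=re.IGNORECASE)
    body = html[match.end():] if match else html
    text = re.sub(r"<[^>]+>", " ", body)   # assumption: tags are not counted
    text = " ".join(text.split())          # collapse runs of whitespace
    return text[:limit]

page = ('<html><head><title>T</title></head><body bgcolor="#fff">'
        '<h1>Clubs</h1><p>Nightclubs in Manchester.</p></body></html>')
summary = description_after_body(page)
```

The practical consequence is that whatever appears first in the body of a page, including navigation text, is what a searcher will see as its description.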

2.5 Infoseek (www.infoseek.com).

Around since early 1995, Infoseek is well known, well reviewed and well connected.  The old "Infoseek Guide" index only had about 1 to 2 million URLs catalogued.  In the autumn of 1996, the new service with 50 million URLs was introduced.  Infoseek also runs a separate directory where sites are listed by topic, which are automatically generated using categorisation software.

Infoseek will list any page submitted to it within a few minutes, and up to 50 pages can be submitted per day.  It does, however, dislike text that is too small to read, pages that jump to other pages, and text made invisible by using a font colour the same as the background colour.

2.6 A Comparison of Search Engines.

Table 2.6-1. Comparison of features of the Search Engines listed (after Sullivan(1) 1998).
Yes: feature supported.  -: not supported.  SPAM: treated as spamming.  OK: accepted.

Factors that affect whether a page is indexed or not.

Feature           AltaVista      Excite     Lycos       Infoseek        Description
Size              Big: 140       Big: 55    Medium: 30  Medium: 30      Approximate number of pages (in millions).
Freshness         1 day-1 month  1-3 weeks  1-2 weeks   1 day-2 months  How old a listing will be.
Depth             No Limit       No Limit   Sample      Sample          No Limit: lists as many pages as possible. Sample: lists a selection of pages from a site.
Frames Support    Yes            -          -           -               Does it follow frame links?
Imagemaps         Yes            -          -           Yes             Does it follow links in client-side image maps?
Link Popularity   -              Yes        Yes         -               Does it rank a page according to how many pages link to it?
Learns Frequency  Yes            -          -           Yes             Do pages that change frequently get more visits?
Meta Robots       Yes            -          Yes         Yes             Does it support Meta tags that specify whether a page can be indexed or not?

Factors that affect page ranking.

Feature           AltaVista      Excite     Lycos       Infoseek        Description
Meta Tags         -              -          -           Yes             Does it boost ranking if search terms are used in Meta tags?
Meta Refresh      SPAM           OK         OK          SPAM            Does it index pages that use some form of redirection?
Invisible text    SPAM           OK         SPAM        SPAM            Does it index a page where there is text the same colour as the background?
Tiny text         SPAM           OK         SPAM        OK              Does it index pages where much of the text is in a small font size?
ALT text          Yes            -          Yes         Yes             Does it index the ALT text of images?


SPAM: All major search engines penalise sites that attempt to ‘spam’ them in order to improve their positions.  One common technique is ‘stacking’ or ‘stuffing’ words on a page, where a word is repeated many times in a row.  If the search engine spots a spamming technique it may downgrade the page’s ranking or exclude it from listings altogether.

NB The term SPAM originates from a Monty Python sketch.

3 Improving Search Results.

Once a web site has been designed and is ready to ‘go live’ on the web, the major question facing the designer is ‘How will people find the site?’  The two major methods are:

  • By entering via another (related) site that links to it.
  • Finding it via one of the search engines.

The first method is relatively simple.  All the designer has to do is contact other related sites (if a site isn’t relevant in some way, why would anyone follow the link?) and establish a reciprocal linking arrangement, whereby each site links to the other.  This way traffic from one site can be shared with the other.

The second method involves not only submitting the site to the various search engines, but also designing it so that it is displayed more prominently when a search engine user is looking for it, i.e. improving the site’s search ranking.

3.1 Effective Labels.

An important aspect to be considered during the design stages is how pages should be designed to take best advantage of the various search engines, so that they obtain better rankings in search results.  Probably the best place to start is the <TITLE> tag: here words should be chosen that best reflect the content of that page, e.g. Clubs, Nightclubs.  The next port of call is the main body of the page.  Some engines will produce a better ranking according to the number of times certain keywords are repeated, but avoid overusing keywords, as some engines will ‘filter’ out pages they consider to be ‘spamming’ them.

Some search engines (such as Lycos) look at the first 275 characters following the <BODY> tag and index a description based on that.  One way around this would be to put all the descriptive text in the ALT attribute of an image one pixel in size at the top of the page, or to display it in the same colour as the background, though some engines (Infoseek) will ignore such text.  The most important thing is to have the main search terms appear in the title and the first few paragraphs.  Additionally, it may be useful to change the title regularly so that when robots revisit the site to refresh their information, they interpret the new title as a new site, with the result that the page is listed more than once in a search (Submit It! 1998).

Also there could be problems for a site whose initial page contains frames or imagemaps, as currently only AltaVista supports frames and AltaVista and Infoseek support imagemaps (Sullivan(1) 1998).  Workarounds involve the use of the <NOFRAMES> tag to include descriptions and links (also useful for browsers that do not support frames) and for imagemaps, the use of a text menu is advised (Fig. 3.1-1).


Going back to the idea of reciprocal links, it is useful to know that many search engines will give a higher ranking to a web site that is linked to by many others.  This is known as ‘link popularity’ (Sullivan(1) 1998).

3.2 META Tags.

META tags should be placed in the head of the HTML document, between the <HEAD> and </HEAD> tags (especially important in documents using FRAMES).  They take one of two forms:

<META HTTP-EQUIV="name" CONTENT="content">

<META NAME="name" CONTENT="content">

META tags with an HTTP-EQUIV attribute are equivalent to HTTP headers.  Typically, they control the action of browsers, and may be used for browser redirects, setting the document’s target, setting cookies etc.

META tags with a NAME attribute are used for other types, which do not correspond to HTTP headers.  These tags can be used to:

Specify the author, e.g.

<META Name="author" Content="Martin Allen - martin.allen@excite.co.uk - http://members.tripod.com/~MJA_ENT/index.htm">

Specify "keywords" and "description" attributes. These allow the search engines to index pages easily using the keywords specified, along with a description of the site.  The description attribute gives a description of the site and the keywords attribute tells the search engines which keywords to use, e.g.

<Meta name="description" content="An Interactive Guide to the City of Manchester">

<META name="keywords" content="Manchester,Manchester guide,Student.Manchester,student guide,student,students,bars,clubs,nightclubs,food,eats,cinemas,music">
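How an indexer might read these tags can be sketched with Python's standard html.parser module. This is an illustrative assumption about the reading side, not a description of any particular engine; the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class MetaReader(HTMLParser):
    """Collect NAME/CONTENT pairs from META tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content

def read_meta(html):
    reader = MetaReader()
    reader.feed(html)
    return reader.meta

page = ('<HEAD><Meta name="description" content="An Interactive Guide '
        'to the City of Manchester">'
        '<META name="keywords" content="Manchester,student guide, music">'
        '</HEAD>')
meta = read_meta(page)
keywords = [k.strip() for k in meta["keywords"].split(",")]
```

Because tag and attribute names are matched case-insensitively, the mixed-case forms used in the examples above are all read in the same way.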

There are some parts of the site that shouldn’t be indexed by the spiders at all e.g. a page that is part of a frameset.  The robots META attribute was designed with this problem in mind;

<META NAME="robots" CONTENT="all | none | index | noindex | follow | nofollow">

The default for the robots attribute is "all", which allows all of the files to be indexed.  "none" tells the spider not to index any files, and not to follow the hyperlinks on the page to other pages.  "index" indicates that this page may be indexed by the spider, while "follow" means that the spider is free to follow the links from this page to other pages.  The inverse is also true; thus this META tag

<META NAME="robots" CONTENT="noindex">

tells the spider not to index this page, but allows it to follow subsidiary links and index those pages.  "nofollow" allows the page itself to be indexed, but its links may not be followed (Clark(2) 1998).  An example of this in use would be:

<META name="robots" content="all">

Taken from APPENDIX B INDEX.HTM.
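The meaning of the robots values can be summarised as a pair of yes/no decisions. The Python sketch below shows how a spider might interpret the CONTENT string; the function name is hypothetical, and the handling of comma-separated combinations is an assumption rather than something stated in the sources.

```python
def robots_policy(content=""):
    """Return (may_index, may_follow) for a robots META CONTENT value."""
    may_index = may_follow = True            # the default is "all"
    for token in content.lower().replace(",", " ").split():
        if token == "all":
            may_index = may_follow = True
        elif token == "none":
            may_index = may_follow = False
        elif token == "index":
            may_index = True
        elif token == "noindex":
            may_index = False
        elif token == "follow":
            may_follow = True
        elif token == "nofollow":
            may_follow = False
        # Unknown tokens are ignored, leaving the current setting in place.
    return may_index, may_follow
```

For instance, robots_policy("noindex") gives (False, True): the page itself is skipped but its links are still followed.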

4 The Student.Manchester Search Engine.

As Student.Manchester is a relatively small site, it was decided to implement a simple sequential search script (one that searches every file) rather than an index (where the files to be searched are cross-referenced to keywords).  The script is written in Perl 5.0 and runs in a UNIX environment (on an NCSA server).  So as not to ‘re-invent the wheel’, a ready-made script was found on the web (Wright 1998, also APPENDIX B MSEARCH.PL) and adapted for use at Student.Manchester.  The requirements were for a simple search facility available at all times via the main interface, with the option of a more advanced form of searching available after the first search.
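The difference between the two approaches can be shown with a short sketch. Python is used here for brevity rather than the Perl of the actual script, and the file names and contents are invented for illustration.

```python
# Hypothetical site contents: file name -> page text.
FILES = {
    "clubs.htm": "Clubs and nightclubs",
    "food.htm": "Food and eats",
    "music.htm": "Live music in clubs",
}

def sequential_search(files, term):
    """Read every file on every query: simple, and fine for a small site."""
    term = term.lower()
    return sorted(name for name, text in files.items()
                  if term in text.lower())

def build_keyword_index(files):
    """Cross-reference keywords to files once, before any query is made."""
    index = {}
    for name, text in files.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(name)
    return index

def indexed_search(index, term):
    """Answer a query with a single look-up instead of reading the files."""
    return sorted(index.get(term.lower(), set()))

index = build_keyword_index(FILES)
```

The sequential version needs no preparation and always reflects the current files, at the cost of reading all of them for each query. The indexed version answers faster but must be rebuilt whenever a page changes, which is hard to justify for a site of this size. Note also that the sequential version matches substrings (so ‘club’ finds ‘nightclubs’), whereas this simple index matches whole words only.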


4.1 The User Interface.

Typing in a keyword and clicking the Search button submits the query to the server whilst simultaneously spawning a new window containing two frames (Fig. 4.1-2).

Whilst the server is processing the search query the top frame of the new window displays the message “Searching Please Wait.”  Use has been made of the much maligned (and rightly so) <BLINK> tag for the “Please Wait” part of the message.  This is justified on the grounds that in this case the blinking text gives the impression that the computer is actually doing something and serves to reassure the user (this was borne out in subsequent evaluations).  Once a search has been processed the results are displayed in the top frame of the window (Fig. 4.1-3); clicking on a result displays the required document in the bottom frame of the window.  It is thus possible for a user to go through the entire site without using the main interface.  The search results display is organised so as to be intuitive to use and thus enhance the user experience.


Utilising the option “New Search” changes the results display to that of the Advanced Search interface (Fig. 4.1-4).

4.2 Processing the Search.

After a search of the web a simple search script was found at Matt’s Script Archive (Wright 1998).  Three versions of this script are used at Student.Manchester: the first (SEARCH.PL) is run when the Search button is clicked from the main interface, the second (MSEARCH.PL: APPENDIX B) is run when the button is clicked from the Advanced Search interface, and the third (LSEARCH.PL) is run from the Student.Manchester:Lite search interface.

SEARCH.PL is a cut-down version of the original script: the sections dealing with Case-Sensitive and Boolean AND searches have been deleted.  LSEARCH.PL is optimised for use in the Student.Manchester:Lite part of the site.  All three have had the original script’s code changed to improve the search results:

  • To avoid producing results for navigation windows, the script ignores any file that does not contain “name=SEARCHME” in its Meta tags, e.g.

    <META name=SEARCHME content=yes>

    The check is made by the line:

    if (($string =~ /name=SEARCHME/)) {

    This allows the designer to specify which files can be found via a search of the site.

  • Another change is the addition of code in the search script that recognises that nothing has matched the search query and produces a message (Fig. 4.2-1)

Code section taken from APPENDIX B MSEARCH.PL

    $FLAG = 'NO';                         # assume nothing has matched
    foreach $key (keys %include) {
        if ($include{$key} eq 'yes') {    # at least one file matched
            $FLAG = 'YES';
        }
    }

    if ($FLAG eq 'NO') {                  # no matches: tell the user
        print "<center>\n";
        print "<h4>Sorry</h4><!--br-->\n";
        print "There was no match to your search query.<br>\n";
        print "<!--Please try again.<br-->\n";
        print "</center>\n";
    }

The code sets an initial flag to NO; if any match is found then the flag is set to YES.  If the flag is still set to NO after all the files have been searched then the message is displayed.
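The two changes, the SEARCHME filter and the no-match message, fit together as in the following sketch. It is written in Python rather than the Perl of MSEARCH.PL, and the function names, file contents and result formatting are assumptions made for illustration; only the marker string, the yes/no include table and the flag logic follow the original.

```python
def search_site(files, term):
    """Return an include table: file name -> 'yes' or 'no'."""
    include = {}
    for name, text in files.items():
        if "name=SEARCHME" not in text:
            continue                      # e.g. navigation windows
        include[name] = "yes" if term.lower() in text.lower() else "no"
    return include

def render_results(include):
    """Build the results HTML, or the 'Sorry' message if nothing matched."""
    flag = "NO"
    for key in include:
        if include[key] == "yes":
            flag = "YES"
    if flag == "NO":
        return ("<center>\n<h4>Sorry</h4>\n"
                "There was no match to your search query.<br>\n</center>\n")
    return "".join('<a href="%s">%s</a><br>\n' % (name, name)
                   for name in sorted(include) if include[name] == "yes")

# Hypothetical files: only clubs.htm carries the SEARCHME marker.
files = {
    "nav.htm": "<html>clubs menu</html>",
    "clubs.htm": "<META name=SEARCHME content=yes> clubs and bars",
}
```

Here nav.htm contains the word ‘clubs’ but is never reported, because it lacks the marker.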

If more time were available the search script would have been adapted to ignore words found within HTML tags, except for those within Meta tags.
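One possible shape for that adaptation is sketched below, using Python's standard html.parser module instead of Perl. This was not implemented in the project; the class and function names are hypothetical, and treating the CONTENT of every META tag as searchable is an assumption.

```python
from html.parser import HTMLParser

class SearchableText(HTMLParser):
    """Keep visible text and META content; discard everything else in tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            content = dict(attrs).get("content")
            if content:
                self.parts.append(content)

    def handle_data(self, data):
        self.parts.append(data)

def searchable_text(html):
    parser = SearchableText()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())

page = ('<html><head><meta name="keywords" content="clubs,bars"></head>'
        '<body><a href="table.htm">Food</a></body></html>')
text = searchable_text(page)
```

With this in place a search for ‘table’ would no longer match a page merely because the word appears in a link address.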


 
©Martin Allen 2001-2008 All rights reserved