Searching the Web.
In the early days of the web (late 1993) there were
very few servers, so it was relatively easy to find information. Only a
year later the number of servers had increased from 200 to 1500 (Berlin 1996),
and finding something useful required much more effort.
Since then the web has grown exponentially: unless users know exactly
what they are looking for and where it is, they are very unlikely to find
it in any reasonable time, if at all. Hence the importance of search services
such as AltaVista, Excite and Yahoo, which act as the ‘indexes’ of the web.
1 Search Engines.
- Search Engines: Crawl the web, then people search through what they have
found.
- Directories: A directory such as Yahoo depends on humans for its listings.
The user submits a short description to the directory for their entire site,
or editors write one for sites they review. A search looks for matches only
in the descriptions submitted.
- Hybrid Search Engines: Some search engines maintain an associated directory.
Being included in a search engine's directory is usually a combination of
luck and quality.
Sometimes a site can be "submitted" for review, but there is no guarantee
that it will be included. Reviewers often keep an eye on sites submitted to
announcement places, then choose to add those that look appealing.
1.1 Parts Of A Search Engine.
Search engines have three major elements;
2 The Major Search Facilities.
2.1 Yahoo (www.yahoo.com).
Yahoo
is one of the oldest and most visited sites and is currently positioning itself
as a portal site. It is a directory service organised via a hierarchical system
and is probably the most popular and important search service available (Sullivan(2)
1998).
Because of the human role, directories can often provide
better results than search engines. Information can be found by browsing the
lists or by using the search facility to find a listing for keyword/s typed
in. If no results are found in the Yahoo listing then it will automatically
default to results from AltaVista, whilst offering the user links to other search
services. Yahoo also provides regional directories tailored to certain countries
e.g. Yahoo UK for the UK and Yahoo Germany for Germany.
2.2 AltaVista (www.altavista.com).
AltaVista is a fully-fledged search engine: it constantly visits web sites on the
Internet in order to create catalogues of web pages. Because they run automatically
and index so many web pages, search engines may often find information not listed
in directories (in the early stages of its development Student.Manchester was
found listed at AltaVista even though it had not been submitted).
AltaVista was originally set up by Digital (now owned by Compaq)
to showcase the speed of its Alpha processors and servers (Keogh 1998), and has
since become one of the first-stop sites of the web. It stores its index in a slightly
different way from most other engines. Once a site
is indexed, words are referenced by code numbers; for instance, ‘the’
could be referenced as 1. This cuts down on the memory required to store site
information and speeds up the search process. A page may be indexed either by
submitting its URL, which triggers the ‘spider’ so that the page appears
within a few days, or by simply waiting for the ‘spider’ to visit.
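The word-coding idea described above can be illustrated with a small Python sketch. This is not AltaVista's actual implementation: the function names `build_index` and `search` and the sample pages are hypothetical, and the scheme shown (each distinct word receives an integer code, and the index maps codes to pages) is only one plausible reading of the description.

```python
def build_index(pages):
    """Assign each distinct word an integer code and index pages by code.

    `pages` maps a URL to its text. Returns (codes, postings), where
    `codes` maps word -> integer and `postings` maps integer -> set of URLs.
    """
    codes = {}
    postings = {}
    for url, text in pages.items():
        for word in text.lower().split():
            # The first time a word is seen it receives the next free code.
            code = codes.setdefault(word, len(codes) + 1)
            postings.setdefault(code, set()).add(url)
    return codes, postings


def search(word, codes, postings):
    """Look a word up via its code; unknown words match nothing."""
    code = codes.get(word.lower())
    if code is None:
        return set()
    return postings[code]


# Hypothetical sample data, for illustration only.
pages = {
    "a.html": "the clubs of Manchester",
    "b.html": "the student guide",
}
codes, postings = build_index(pages)
```

Storing a small integer in place of every occurrence of a word is what saves memory in this sketch; the word itself is held only once, in `codes`.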
2.3 Excite (www.excite.com).
Launched in late 1995, Excite grew quickly and has since taken over two of its competitors.
In July 1996 Excite purchased Magellan, and in November 1996 it acquired WebCrawler.
These continue to run as separate services.
Excite lists sites in one of three ways: Excite Search,
Channels By Excite and Excite NewsTracker:
- Excite Search taps into the traditional search engine listings, created
from crawling the web.
- Channels By Excite lists sites by topics. These sites have been approved
by editors, and sometimes also have reviews. There is also much associated
subject information, discussion areas and more.
- Excite NewsTracker allows you to search only listings generated by crawling
speciality news sites.
2.4 Lycos (www.lycos.com).

Around since May 1994, Lycos is one of the oldest
of the major search engines. It began as a project at Carnegie Mellon University.
The name Lycos comes from Lycosidae, the scientific (Latin) name of the wolf spider family. Lycos lists
sites in two main ways: the first is the standard search engine listing, and
the second is an associated directory called "Web Guides."
Lycos will index text in meta tags, but it won’t use
them to display a description. Descriptions are created from the first 275
characters following the <BODY> tag.
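How such a description might be produced can be sketched in Python. The function name `lycos_style_description` is hypothetical, and the handling of markup (tags dropped and whitespace collapsed before the 275 characters are counted) is an assumption; the source states only that the description comes from the first 275 characters following the <BODY> tag.

```python
import re


def lycos_style_description(html, length=275):
    """Return the first `length` characters of text following the <BODY> tag."""
    match = re.search(r"<body[^>]*>", html, flags=re.IGNORECASE)
    body = html[match.end():] if match else html
    # Assumption: markup is removed and whitespace collapsed before counting.
    text = re.sub(r"<[^>]+>", " ", body)
    text = re.sub(r"\s+", " ", text).strip()
    return text[:length]


# Hypothetical page, for illustration only.
sample = (
    "<HTML><HEAD><TITLE>Clubs</TITLE></HEAD>"
    "<BODY BGCOLOR=white><H1>Student.Manchester</H1>"
    "<P>An interactive guide to the city.</P></BODY></HTML>"
)
description = lycos_style_description(sample)
```

The practical consequence is that whatever text appears first on the page, including navigation labels, ends up as the listing's description.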
2.5 Infoseek (www.infoseek.com).
Around since early 1995, Infoseek is well known, well
reviewed and well connected. The old "Infoseek Guide" index only
had about 1 to 2 million URLs catalogued. In the autumn of 1996, the new service
with 50 million URLs was introduced. Infoseek also runs a separate directory
in which sites are listed by topic; the listings are generated automatically using categorisation
software.
Infoseek will list any page submitted to it within a few
minutes, and up to 50 pages can be submitted per day. It does, however, dislike text that’s
too small to read, pages that jump to other pages, and text made invisible by
using a font colour the same as the background colour.
2.6 A Comparison of Search Engines.
Table 2.6-1. Comparison of features of the Search Engines listed (after Sullivan(1) 1998).

| Feature | AltaVista | Excite | Lycos | Infoseek | Notes |
|---|---|---|---|---|---|
| Factors that affect if a page is indexed or not. | | | | | |
| Size | Big: 140 | Big: 55 | Medium: 30 | Medium: 30 | Approximate number of pages (in millions). |
| Freshness | 1 day-1 month | 1-3 weeks | 1-2 weeks | 1 day-2 months | How old a listing will be. |
| Depth | No Limit | No Limit | Sample | Sample | No Limit: lists as many pages as possible. Sample: lists a selection of pages from a site. |
| Frames Support | Yes | No | No | No | Does it follow frame links? |
| Imagemaps | Yes | No | No | Yes | Does it follow links in client-side imagemaps? |
| Link Popularity | No | Yes | Yes | No | Does it rank according to how many pages link to it? |
| Learns Frequency | Yes | No | No | Yes | Do pages that change frequently get more visits? |
| Meta Robots | Yes | No | Yes | Yes | Does it support Meta tags that specify if a page can be indexed or not? |
| Factors that affect page ranking. | | | | | |
| Meta Tags | No | No | No | Yes | Does it boost ranking if search terms are used in Meta tags? |
| Meta Refresh | SPAM | OK | OK | SPAM | Does it index pages that use some form of redirection? |
| Invisible text | SPAM | OK | SPAM | SPAM | Does it index a page where there is text the same colour as the background? |
| Tiny text | SPAM | OK | SPAM | OK | Does it index pages where much of the text is in a small font size? |
| ALT text | Yes | No | Yes | Yes | Does it index the ALT text of images? |
SPAM: All major search engines penalise sites that
attempt to ‘spam’ them in order to improve their positions. One common technique
is ‘stacking’ or ‘stuffing’ words on a page, where a
word is repeated many times in a row. If a search engine spots a spamming
technique it may downgrade the page’s ranking or exclude it from its listings altogether.
NB The term SPAM originates from a Monty Python
sketch.
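A check for word stacking of the kind described above might look like the following Python sketch. It does not reproduce any engine's real filter: the function names and the threshold of five consecutive repeats are assumptions chosen purely for illustration.

```python
import re


def longest_repeat_run(text):
    """Return (word, count) for the longest run of one word repeated in a row."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    best_word, best_count = None, 0
    run_word, run_count = None, 0
    for word in words:
        if word == run_word:
            run_count += 1
        else:
            run_word, run_count = word, 1
        if run_count > best_count:
            best_word, best_count = run_word, run_count
    return best_word, best_count


def looks_stuffed(text, threshold=5):
    """Flag text in which any word appears `threshold` or more times in a row.

    The threshold is a hypothetical value, not one published by any engine.
    """
    return longest_repeat_run(text)[1] >= threshold
```

For example, `looks_stuffed("clubs " * 6)` flags the text, whereas ordinary prose such as "Manchester clubs and bars" passes.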
3 Improving Search Results.
Once a web site has been designed and is ready to
‘go live’ on the web, the major question facing the designer is ‘How will people
find the site?’ The two major methods are:
- By entering via another (related) site that links to it.
- Finding it via one of the search engines.
The first method is relatively simple: the designer contacts
other related sites (if a site isn’t relevant in some way, why would anyone follow
the link?) and establishes a reciprocal linking arrangement, whereby each
site links to the other. This way traffic from one site can be shared with
the other.
The second method involves not only submitting the site to the various search
engines, but also designing it so that it is displayed more prominently when a search
engine user is looking for it, i.e. improving the site’s search ranking.
3.1 Effective Labels.
An important aspect to consider during the design
stage is how pages should be built to take best advantage of the various
search engines, so that they obtain better rankings in search results. Probably
the best place to start is the <TITLE> tag: here words should
be chosen that best reflect the content of that page, e.g. Clubs, Nightclubs.
The next port of call is the main body of the page. Some engines
will give a better ranking according to the number of times certain keywords
are repeated, but keywords should not be overused, as some engines will ‘filter’
out pages they consider to be ‘spamming’ them.
Some search engines (such as Lycos) look at the first
275 characters following the <BODY> tag and build a description from
them. One way around this is to put all the descriptive text in
the ALT attribute of a one-pixel image at the top of the page, or to display it
in the same colour as the background, though some engines (Infoseek) will ignore
these. The most important thing is to have the main search terms appear
in the title and the first few paragraphs. Additionally, it may be useful to
change the title regularly, so that when robots revisit the site to refresh
their information they interpret the new title as indicating
a new site. The result is that the page will be listed more than once in a search
(Submit It! 1998).
There could also be problems for a site whose initial
page contains frames or imagemaps, as currently only AltaVista supports frames
and only AltaVista and Infoseek support imagemaps (Sullivan(1) 1998). Workarounds
involve using the <NOFRAMES> tag to include descriptions and links
(also useful for browsers that do not support frames) and, for imagemaps,
providing a text menu (Fig. 3.1-1).
Going back to the idea of reciprocal links,
it is useful to know that many search engines will give a higher ranking to
a web site that is linked to by many others. This is known as ‘link popularity’
(Sullivan(1) 1998).
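The counting behind link popularity can be sketched in Python as follows. How any particular engine weighs the count is not stated in the source, so this shows only the raw tally of inbound links; the function name `link_popularity` and the example URLs are hypothetical.

```python
from collections import Counter


def link_popularity(links):
    """Count, for each page, how many distinct other pages link to it.

    `links` maps a source URL to the list of URLs it links to.
    """
    inbound = Counter()
    for source, targets in links.items():
        for target in set(targets):   # count each source at most once
            if target != source:      # ignore links from a page to itself
                inbound[target] += 1
    return inbound


# Hypothetical link data, for illustration only.
links = {
    "clubs.example/index.htm": ["guide.example/index.htm"],
    "bars.example/index.htm": ["guide.example/index.htm", "clubs.example/index.htm"],
    "guide.example/index.htm": ["clubs.example/index.htm", "guide.example/index.htm"],
}
popularity = link_popularity(links)
```

In this sample the guide and clubs pages each have two inbound links and the bars page has none, which is why a reciprocal arrangement benefits both partners.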
3.2 META Tags.
META tags should be placed in the head of the HTML document, between the <HEAD>
and </HEAD> tags (especially important in documents using FRAMES). They
take one of two forms:
<META HTTP-EQUIV="name" CONTENT="content">
<META NAME="name" CONTENT="content">
META tags with an HTTP-EQUIV attribute are equivalent to HTTP headers. Typically,
they control the action of browsers, and may be used for browser redirects,
setting the document’s target, setting cookies etc.
META tags with a NAME attribute are used for other types of information, which do not correspond
to HTTP headers. They can be used to:
Specify the author, e.g.
<META Name="author" Content="Martin Allen - martin.allen@excite.co.uk - http://members.tripod.com/~MJA_ENT/index.htm">
Specify "keywords" and "description" attributes. These allow
the search engines to index pages easily using the keywords specified, along with
a description of the site. The description attribute supplies a description
of the site and the keywords attribute tells the search engines which
keywords to use, e.g.
<Meta name="description" content="An Interactive Guide to the City of Manchester">
<META name="keywords" content="Manchester,Manchester guide,Student.Manchester,student guide,student,students,bars,clubs,nightclubs,food, eats,cinemas, music">
There are some parts of a site that shouldn’t be indexed by the spiders at
all, e.g. a page that is part of a frameset. The robots META tag was designed
with this problem in mind:
<META NAME="robots" CONTENT="all | none | index | noindex | follow | nofollow">
The default for the robots attribute is "all", which allows all of
the files to be indexed. "none" tells the spider not to index any files and
not to follow the hyperlinks on the page to other pages. "index" indicates that
the page may be indexed by the spider, while "follow" means that the spider
is free to follow the links from this page to other pages. The inverse is also
true, thus this META tag:
<META NAME="robots" CONTENT="noindex">
tells the spider not to index this page, but allows it to follow subsidiary
links and index those pages. "nofollow" allows the page itself to be indexed,
but its links may not be followed (Clark(2) 1998). An example of this in
use would be:
<META name="robots" content="all">
Taken from APPENDIX B INDEX.HTM.
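How a spider might interpret the robots values described above can be sketched in Python using the standard `html.parser` module. The class and function names are hypothetical, and two points are assumptions rather than statements from the source: that several values may be combined in one CONTENT string separated by commas, and that later values override earlier ones.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the CONTENT values of <META NAME="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            content = attributes.get("content", "")
            # Assumption: multiple values are separated by commas.
            self.directives += [d.strip().lower()
                                for d in content.split(",") if d.strip()]


def robots_policy(html):
    """Return (may_index, may_follow) for a page; the default is "all"."""
    parser = RobotsMetaParser()
    parser.feed(html)
    may_index, may_follow = True, True
    for directive in parser.directives:
        if directive == "none":
            may_index, may_follow = False, False
        elif directive == "all":
            may_index, may_follow = True, True
        elif directive == "noindex":
            may_index = False
        elif directive == "index":
            may_index = True
        elif directive == "nofollow":
            may_follow = False
        elif directive == "follow":
            may_follow = True
    return may_index, may_follow
```

With this sketch a page carrying `CONTENT="noindex"` yields `(False, True)`: the page is skipped but its links are still followed, matching the description above.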
4 The Student.Manchester Search Engine.
As Student.Manchester is a relatively small site, it was
decided to implement a simple sequential search script (one that searches every file)
rather than an index (where the files to be searched are cross-referenced to keywords).
The script is written in Perl 5.0 and runs in a UNIX environment (an NCSA server). In order
to ‘not re-invent the wheel’, a ready-made script was found on the web (Wright
1998, also APPENDIX B MSEARCH.PL) and adapted for use at Student.Manchester.
The requirements were for a simple search facility available at all times via
the main interface, with the option of a more advanced form of searching available
after the first search.
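The difference between the two approaches can be shown with a short Python sketch. This is not the Perl script used at Student.Manchester: the function names and the in-memory sample documents are hypothetical and stand in for files on the server.

```python
def sequential_search(documents, keyword):
    """Scan every document in turn and return the names of those that match."""
    keyword = keyword.lower()
    return sorted(name for name, text in documents.items()
                  if keyword in text.lower())


def build_keyword_index(documents):
    """Cross-reference each word to the documents that contain it."""
    index = {}
    for name, text in documents.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(name)
    return index


def indexed_search(index, keyword):
    """Answer a query from the prebuilt index without rereading any document."""
    return sorted(index.get(keyword.lower(), set()))


# Hypothetical sample documents, for illustration only.
documents = {
    "clubs.htm": "nightclubs and bars in Manchester",
    "food.htm": "places to eat in Manchester",
}
index = build_keyword_index(documents)
```

The sequential version needs no preparation but reads everything on each query, which is acceptable for a small site. The indexed version answers from a lookup table, at the cost of building and maintaining the index. In this sketch the sequential search also matches partial words ("club" finds "nightclubs"), whereas the index matches whole words only.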
4.1 The User Interface.
Typing in a keyword and clicking the
Search button submits the query to the server whilst simultaneously spawning
a new window containing two frames (Fig. 4.1-2).
Whilst the server is processing the search query the top
frame of the new window displays the message “Searching Please Wait.” Use has
been made of the much maligned (and rightly so) <BLINK> tag for the “Please
Wait” part of the message. This is justified on the grounds that in this case
the blinking text gives the impression that the computer is actually doing something
and serves to reassure the user (this was confirmed in subsequent evaluations).
Once a search has been processed the results are displayed in the top frame
of the window (Fig. 4.1-3); clicking on a result displays the required document
in the bottom frame of the window. It is thus possible for a user to go through
the entire site without using the main interface. The search results display
is organised in such a way as to be intuitive to use and thus enhance the user
experience.
Utilising the option “New Search” changes the results display to that of the Advanced
Search interface (Fig. 4.1-4).
4.2 Processing the Search.
After a search of the web a simple search script was found
at Matt’s Script Archive (Wright 1998). Three versions of this script are used
at Student.Manchester: the first (SEARCH.PL) is run when the Search button is
clicked from the main interface, the second (MSEARCH.PL: APPENDIX B) is run
when the button is clicked from the Advanced Search interface, and the third
(LSEARCH.PL) is run from the Student.Manchester:Lite search interface.
SEARCH.PL is a cut-down version of the original script:
the sections dealing with case-sensitive and Boolean AND searches have been
deleted. LSEARCH.PL is optimised for use in the Student.Manchester:Lite
part of the site. All three have had the original script’s code changed to improve
the search results:
- To avoid producing results for navigation windows, the script ignores any
file not containing “name=SEARCHME” in its Meta tags, e.g.
<META name=SEARCHME content=yes>
This is checked using the line:
if (($string =~ /name=SEARCHME/)) {
This allows the designer to specify which files can be found via a search of
the site.
- Another change is the addition of code in the search script that recognises
that nothing has matched the search query and produces a message (Fig. 4.2-1)
Code section taken from APPENDIX B MSEARCH.PL
    $FLAG = 'NO';                          # assume no match until one is found
    foreach $key (keys %include) {
        if ($include{$key} eq 'yes') {     # this file matched the query
            $FLAG = 'YES';
        }
    }

    if ($FLAG eq 'NO') {                   # nothing matched: show the message
        print "<center>\n";
        print "<h4>Sorry</h4><!--br-->\n";
        print "There was no match to your search query.<br>\n";
        print "<!--Please try again.<br-->\n";
        print "</center>\n";
    }
The code sets an initial flag to NO; if any match
is found the flag is set to YES. If the flag is still set to NO after
all the files have been searched, the message is displayed.
If more time were available the search script would have been adapted
to ignore words found within HTML tags, except for those within Meta tags.
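One way this adaptation might work is sketched below in Python, using the standard `html.parser` module. It is not part of the Student.Manchester scripts, which are written in Perl; the class name `SearchableText` and the sample page are hypothetical. The sketch keeps the visible text and the CONTENT of META tags and discards everything else that appears inside tags.

```python
from html.parser import HTMLParser


class SearchableText(HTMLParser):
    """Keep visible text and META content; discard all other tag contents."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Only META tags contribute attribute text to the search.
        if tag == "meta":
            for name, value in attrs:
                if name == "content" and value:
                    self.parts.append(value)

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())


def searchable_text(html):
    """Return the text of a page that a search should be allowed to match."""
    parser = SearchableText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical page, for illustration only.
page = (
    '<HTML><HEAD><META name="keywords" content="bars,clubs">'
    '<TITLE>Guide</TITLE></HEAD>'
    '<BODY><A HREF="cinemas.htm">Films</A></BODY></HTML>'
)
text = searchable_text(page)
```

A search for "cinemas" would then no longer match this page merely because the word appears in a link address, while the META keywords remain searchable.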