How Search Engines Work

Internet search engines are special sites on the Internet that are
designed to help people find information stored on other sites.
There are differences in the ways various search engines work, but
they all perform three basic tasks:
-They search the Internet, or select pieces of the Internet, based on
important words.
-They keep an index of the words they find, and where they
find them.
-They allow users to look for words or combinations of words
found in that index.
Early search engines held an index of a few hundred thousand
pages and documents, and received maybe one or two thousand
inquiries each day. Today, a top search engine will index hundreds of
millions of pages, and respond to tens of millions of queries per day.
Spidering
Before a search engine can tell you where a file or document is, that
file or document must be found. To find information on the hundreds of
millions of web pages that exist, a search engine employs special
software robots, called spiders, to build lists of the words found on
websites.
When a spider is building its lists, the process is called crawling.
In order to build and maintain a useful list of words, a search
engine's spiders have to look at a lot of pages. How does any spider
start its travels over the web? The usual starting points are lists of
heavily used servers and very popular pages. The spider will begin
with a popular site, indexing the words on its pages and following
every link found within the site. In this way, the spidering system
quickly begins to travel, spreading out across the most widely used
portions of the web.
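The crawl described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "web" is a small in-memory dictionary rather than real sites, and the names TOY_WEB and crawl, along with all of the example URLs, are hypothetical. The spider starts from a list of seed pages, records the words on each page, and follows every link it has not already seen.

```python
from collections import deque

# Hypothetical stand-in for the web: URL -> (page text, outgoing links).
TOY_WEB = {
    "http://popular.example/": (
        "search engines index words",
        ["http://popular.example/a", "http://other.example/"],
    ),
    "http://popular.example/a": (
        "spiders follow links",
        ["http://popular.example/"],
    ),
    "http://other.example/": ("words and more words", []),
    "http://unlinked.example/": ("never reached", []),
}


def crawl(seeds, web=TOY_WEB):
    """Breadth-first crawl from the seed URLs.

    Returns a dict mapping each visited URL to the list of words on it.
    """
    seen = set(seeds)      # URLs already queued, so no page is visited twice
    queue = deque(seeds)   # pages waiting to be visited, oldest first
    word_lists = {}
    while queue:
        url = queue.popleft()
        if url not in web:  # a link that leads nowhere in the toy web
            continue
        text, links = web[url]
        word_lists[url] = text.lower().split()
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return word_lists
```

Calling crawl(["http://popular.example/"]) visits the three pages reachable from the seed. The page at http://unlinked.example/ is never visited, because no crawled page links to it.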
Indexing
Once the spiders have completed the task of finding information
on web pages, the search engine must store the information in a way
that makes it useful. There are two key components involved in
making the gathered data accessible to users:
-The information stored with the data 
-The method by which the information is indexed
In the simplest case, a search engine could just store the word
and the URL where it was found. In reality, this would make for an
engine of limited use, since there would be no way of telling whether
the word was used in an important or a trivial way on the page,
whether the word was used once or many times, or whether the page
contained links to other pages containing the word. In other words,
there would be no way of building the ranking list that tries to present
the most useful pages at the top of the list of search results. 
To make for more useful results, most search engines store more
than just the word and URL. An engine might store the number of
times that the word appears on a page. The engine might assign a
weight to each entry, with increasing values assigned to words as they
appear near the top of the document, in sub-headings, in links, in the
meta tags or in the title of the page. Each commercial search engine
has a different formula for assigning weight to the words in its index.
This is one of the reasons that a search for the same word on different
search engines will produce different lists, with the pages presented in
different orders. 
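As a rough sketch of what "more than just the word and URL" could look like, the Python below stores a count and a weight for each word on each page. The region names and the numbers in REGION_WEIGHTS are invented for illustration, as is the build_index function. Each real engine uses its own formula, and this is not any particular engine's.

```python
from collections import defaultdict

# Hypothetical weights: words in a title count for more than words in the body.
REGION_WEIGHTS = {"title": 5.0, "heading": 3.0, "body": 1.0}


def build_index(pages):
    """Build an index from pages shaped like {url: {region: text}}.

    Returns {word: {url: {"count": int, "weight": float}}}.
    """
    index = defaultdict(dict)
    for url, regions in pages.items():
        for region, text in regions.items():
            for word in text.lower().split():
                entry = index[word].setdefault(url, {"count": 0, "weight": 0.0})
                entry["count"] += 1  # how many times the word appears on the page
                # Regions not listed above are treated like body text.
                entry["weight"] += REGION_WEIGHTS.get(region, 1.0)
    return dict(index)
```

With these assumed weights, a word that appears once in a page's title and once in its body ends up with a count of 2 and a weight of 6.0 for that page. Changing the numbers in REGION_WEIGHTS changes the weights, which is the kind of difference that leads two engines to order the same pages differently.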
An index has a single purpose: It allows information to be found
as quickly as possible. There are quite a few ways for an index to be
built, but one of the most effective ways is to build a hash table. In
hashing, a formula is applied to attach a numerical value to each word.
The formula is designed to evenly distribute the entries across a
predetermined number of divisions. This numerical distribution is
different from the distribution of words across the alphabet, and that is
the key to a hash table's effectiveness. 
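A minimal sketch of the idea in Python follows. The formula shown (multiply the running value by 31 and add each character's code) is one common textbook choice, not the formula any particular search engine uses, and the bucket count of 7 is arbitrary. The names NUM_BUCKETS, bucket_for, build_table and contains are hypothetical.

```python
NUM_BUCKETS = 7  # the predetermined number of divisions (arbitrary here)


def bucket_for(word, num_buckets=NUM_BUCKETS):
    """Turn a word into a number, then reduce it to a bucket position."""
    value = 0
    for ch in word:
        value = value * 31 + ord(ch)  # every character affects the result
    return value % num_buckets


def build_table(words, num_buckets=NUM_BUCKETS):
    """Place each distinct word in the bucket its hash value selects."""
    table = [[] for _ in range(num_buckets)]
    for word in words:
        bucket = table[bucket_for(word, num_buckets)]
        if word not in bucket:
            bucket.append(word)
    return table


def contains(table, word):
    """Look in one bucket only, instead of scanning every stored word."""
    return word in table[bucket_for(word, len(table))]
```

Because the bucket depends on every character of the word and not just the first letter, words that would sit side by side in an alphabetical list can land in different buckets. A lookup then has to examine only the few words that share a bucket.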
The Search Engine Program
The search engine software, or program, is the final part. When a
person requests a search on a keyword or phrase, the search engine
software searches the index for relevant information. The software
then provides a report back to the searcher with the most relevant
web pages listed first.
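The lookup step can be sketched as follows, assuming a simplified index shaped like {word: {url: weight}}. The search function and the rule it applies (a page must contain every query word, and pages are ordered by the sum of their weights) are illustrative assumptions, not a description of how any real engine ranks results.

```python
def search(index, query):
    """Return URLs that contain every query word, most relevant first.

    index is assumed to look like {word: {url: weight}}.
    """
    words = query.lower().split()
    if not words:
        return []
    scores = None
    for word in words:
        postings = index.get(word, {})  # pages known to contain this word
        if scores is None:
            scores = dict(postings)
        else:
            # Keep only pages that also contain this word; add up the weights.
            scores = {
                url: scores[url] + postings[url]
                for url in scores
                if url in postings
            }
    # Highest total weight first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))
```

For a phrase, this sketch only checks that all the words occur somewhere on the page. It does not check that they appear next to each other, which a real phrase search would need to do.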