Goal Of Search Engines
Many people think search engines have a hidden agenda. This simply is not true. The goal of a search engine is to point people searching the web to the highest-quality, most relevant content it can find.
Search engines with the broadest distribution networks sell the most advertising space, so relevancy is what they compete on. As I write this, Google is considered the search engine with the best relevancy, and its technology powers the bulk of web searches.
The biggest
problem new websites have is that search engines have no idea they exist. Even
when a search engine finds a new document, it has a hard time determining its
quality. Search engines rely on links to help determine the quality of a
document. Some engines, such as Google, also trust websites more as they age.
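To give a feel for what "relying on links" means, here is a deliberately crude sketch that treats the number of inbound links as a quality signal. The link graph and URLs are invented for the example, and real engines use far more sophisticated link analysis (Google's PageRank, for instance, also weights a link by the importance of the page it comes from); this only illustrates the basic idea.

```python
# Toy link graph: which pages link to which (all URLs invented for the example).
links = {
    "site-a.example/page": ["site-b.example/", "site-c.example/"],
    "site-b.example/":     ["site-c.example/"],
    "site-c.example/":     [],
}

def inbound_link_counts(link_graph):
    """Crude quality signal: how many other pages link to each page."""
    scores = {page: 0 for page in link_graph}
    for source, targets in link_graph.items():
        for target in targets:
            scores[target] = scores.get(target, 0) + 1
    return scores

print(inbound_link_counts(links))
# {'site-a.example/page': 0, 'site-b.example/': 1, 'site-c.example/': 2}
```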
The next few pages touch on a few advanced search topics. It is fine if you do not understand them right away; the average webmaster does not need to know search technology in depth. Some readers are interested in it, though, so I have written a bit about it with them in mind. (If you are new to the web and uninterested in algorithms, you may want to skip past this part.)
I will cover some of the parts of a search engine over the next few pages while trying to keep things fairly basic. It is not important that you fully understand all of it (in fact, I think most webmasters are better off not worrying about concepts like Inverse Document Frequency; I ranked well for competitive SEO-related terms without knowing anything about the technical bits of search); however, I would not feel right leaving the information out.
The vector space model, which search algorithms still rely upon heavily today, dates back to the 1970s. Gerard Salton
was a well-known expert in the field of information retrieval who pioneered
many of today’s modern methods. If you are interested in learning more about
early information retrieval systems, you may want to read A Theory of Indexing, which is a short book by Salton that
describes many of the common terms and concepts in the information retrieval
field.
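To make the idea a little more concrete, here is a minimal sketch of how a vector space model scores documents against a query, using a simple TF-IDF weighting and cosine similarity. The tiny corpus, the weighting scheme, and the scoring function are all illustrative assumptions; no real search engine is this simple.

```python
import math
from collections import Counter

# Toy corpus: in a real engine these would be crawled web pages.
documents = {
    "doc1": "search engines rank pages by relevancy",
    "doc2": "links help search engines judge page quality",
    "doc3": "a crawler follows links to discover new pages",
}

# Document frequency: how many documents each term appears in.
doc_freq = Counter(term for text in documents.values() for term in set(text.lower().split()))

def tf_idf_vector(text, doc_freq, total_docs):
    """Build a sparse TF-IDF vector (term -> weight) for one piece of text."""
    counts = Counter(text.lower().split())
    vector = {}
    for term, tf in counts.items():
        # Inverse Document Frequency: terms that appear in fewer documents get more weight.
        idf = math.log(total_docs / (1 + doc_freq.get(term, 0)))
        vector[term] = tf * idf
    return vector

def cosine_similarity(a, b):
    """How closely two sparse vectors point in the same direction."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc_vectors = {name: tf_idf_vector(text, doc_freq, len(documents)) for name, text in documents.items()}
query_vector = tf_idf_vector("search engine relevancy", doc_freq, len(documents))

# Rank documents by similarity to the query vector; the closest match comes first.
for name, vec in sorted(doc_vectors.items(), key=lambda item: cosine_similarity(query_vector, item[1]), reverse=True):
    print(name, round(cosine_similarity(query_vector, vec), 3))
```

The document whose vector points in the most similar direction to the query vector is treated as the most relevant match.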
Mike Grehan’s
book, Search Engine Marketing: The
Essential Best Practices Guide, also discusses some of the technical bits of information retrieval in more detail than this book does. My book is meant to be a current how-to guide, while his is geared more toward explaining how information retrieval works.
While there
are different ways to organize web content, every crawling search engine has
the same basic parts:
- a crawler
- an index (or catalog)
- a search interface
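To visualize how these parts fit together, here is a rough sketch of the three components as plain data structures. The class names and the in-memory inverted index are assumptions made purely for illustration; real engines distribute these pieces across enormous clusters of machines.

```python
from collections import defaultdict

class Index:
    """Inverted index (the catalog): maps each term to the documents containing it."""
    def __init__(self):
        self.postings = defaultdict(set)

    def add_document(self, url, text):
        for term in text.lower().split():
            self.postings[term].add(url)

class Crawler:
    """Follows links and hands fetched pages to the index (toy version)."""
    def __init__(self, index):
        self.index = index
        self.frontier = []                      # URLs waiting to be crawled

    def crawl(self, url, text, outlinks):
        self.index.add_document(url, text)
        self.frontier.extend(outlinks)          # newly discovered pages

class SearchInterface:
    """Looks up query terms in the index and returns matching documents."""
    def __init__(self, index):
        self.index = index

    def search(self, query):
        results = [self.index.postings[t] for t in query.lower().split()]
        return set.intersection(*results) if results else set()

# Usage: crawl one page, then search for it.
index = Index()
crawler = Crawler(index)
crawler.crawl("http://example.com", "search engines rely on links", ["http://example.com/about"])
print(SearchInterface(index).search("search links"))   # {'http://example.com'}
```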
The crawler
does just what its name implies. It scours the web following links, updating
pages, and adding new pages when it comes across them. Each search engine has
periods of deep crawling and periods of shallow crawling. There is also a
scheduler mechanism to prevent a spider from overloading servers and to tell
the spider what documents to crawl next and how frequently to crawl them.
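As a rough illustration of the scheduler idea, here is a minimal sketch of a polite crawl queue: each host gets a minimum delay between requests, and URLs are pulled off a priority queue in order of their desired fetch time. The delay values and queue structure are assumptions for the example, not any engine's actual policy.

```python
import heapq
import time
from urllib.parse import urlparse

class CrawlScheduler:
    """Toy scheduler: decides which URL to fetch next without hammering any one host."""

    def __init__(self, per_host_delay=10.0):
        self.per_host_delay = per_host_delay    # seconds between hits to the same host
        self.queue = []                         # min-heap of (desired_fetch_time, url)
        self.last_fetch = {}                    # host -> time of its last fetch

    def add(self, url, priority_delay=0.0):
        """Queue a URL; more important pages can pass a smaller priority_delay."""
        heapq.heappush(self.queue, (time.time() + priority_delay, url))

    def next_url(self):
        """Return the next URL to fetch, waiting out the per-host delay if needed."""
        desired_time, url = heapq.heappop(self.queue)
        host = urlparse(url).netloc
        earliest = max(desired_time, self.last_fetch.get(host, 0.0) + self.per_host_delay)
        wait = earliest - time.time()
        if wait > 0:
            time.sleep(wait)                    # politeness: do not overload the server
        self.last_fetch[host] = time.time()
        return url

scheduler = CrawlScheduler(per_host_delay=2.0)
scheduler.add("http://example.com/")
scheduler.add("http://example.com/about")
print(scheduler.next_url())   # fetched immediately
print(scheduler.next_url())   # waits roughly 2 seconds so the same host is not hit too fast
```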
Rapidly
changing or highly important documents are more likely to get crawled
frequently. The frequency of crawl should typically have little effect on
search relevancy; it simply helps the search engines keep fresh content in
their index. The home page of CNN.com might get crawled once every ten minutes.
A popular, rapidly growing forum might get crawled a few dozen times each day.
A static site with little link popularity and rarely changing content might
only get crawled once or twice a month.
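One common way to get this varying behavior is to keep a per-page revisit interval: pages that keep changing get revisited sooner, pages that never change get visited less and less often. The specific intervals and multipliers below are made-up numbers purely to illustrate the idea.

```python
def next_crawl_interval(current_interval_hours, page_changed,
                        min_hours=0.2, max_hours=24 * 30):
    """Shrink the revisit interval when a page changed, grow it when it did not.
    All constants here are illustrative, not any search engine's real policy."""
    if page_changed:
        new_interval = current_interval_hours / 2     # changed: come back sooner
    else:
        new_interval = current_interval_hours * 1.5   # unchanged: back off
    return max(min_hours, min(max_hours, new_interval))

# A news home page that changes on every visit converges toward the minimum interval,
# while a static brochure site drifts out toward a crawl every few weeks.
interval = 24.0
for changed in [True, True, True, False, False]:
    interval = next_crawl_interval(interval, changed)
    print(round(interval, 2), "hours until the next crawl")
```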
The biggest benefit of having a frequently crawled page is that you can get your new sites, pages, or projects crawled quickly by linking to them from a powerful or frequently changing page.