Goal Of Search Engines
Many people think search engines have a hidden agenda. This simply is not true. The goal of a search engine is to point people searching the web to the highest-quality, most relevant content it can find.
Search engines with the broadest distribution networks sell the most advertising space, so relevancy is what they compete on. As I write this, Google is considered the search engine with the best relevancy, and its technology powers the bulk of web searches.
The biggest
problem new websites have is that search engines have no idea they exist. Even
when a search engine finds a new document, it has a hard time determining its
quality. Search engines rely on links to help determine the quality of a
document. Some engines, such as Google, also trust websites more as they age.
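To give a feel for what "relying on links" means, here is a deliberately crude sketch that treats the number of inbound links as a quality signal. The link graph and URLs are invented for the example, and real engines use far more sophisticated link analysis (Google's PageRank, for instance, also weights a link by the importance of the page it comes from); this only illustrates the basic idea.

```python
# Toy link graph: which pages link to which (all URLs invented for the example).
links = {
    "site-a.example/page": ["site-b.example/", "site-c.example/"],
    "site-b.example/":     ["site-c.example/"],
    "site-c.example/":     [],
}

def inbound_link_counts(link_graph):
    """Crude quality signal: how many other pages link to each page."""
    scores = {page: 0 for page in link_graph}
    for source, targets in link_graph.items():
        for target in targets:
            scores[target] = scores.get(target, 0) + 1
    return scores

print(inbound_link_counts(links))
# {'site-a.example/page': 0, 'site-b.example/': 1, 'site-c.example/': 2}
```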
The next few pages touch on a few advanced search topics. It is fine if you do not understand them right away; the average webmaster does not need to know search technology in depth. Some readers are interested in it, though, so I have written a bit about it with them in mind. (If you are new to the web and uninterested in algorithms, you may want to skip past this part.)
I will cover some of the parts of a search engine over the next few pages while trying to keep things fairly basic. It is not important that you fully understand all of it (in fact, I think most webmasters are better off not worrying about concepts like Inverse Document Frequency; I ranked well for competitive SEO-related terms without knowing anything about the technical bits of search); however, I would not feel right leaving the information out.
The vector space model, which search algorithms still rely upon heavily today, dates back to the 1970s. Gerard Salton
was a well-known expert in the field of information retrieval who pioneered
many of today’s modern methods. If you are interested in learning more about
early information retrieval systems, you may want to read A Theory of Indexing, which is a short book by Salton that
describes many of the common terms and concepts in the information retrieval
field.
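To make the idea a little more concrete, here is a minimal sketch of how a vector space model scores documents against a query, using a simple TF-IDF weighting and cosine similarity. The tiny corpus, the weighting scheme, and the scoring function are all illustrative assumptions; no real search engine is this simple.

```python
import math
from collections import Counter

# Toy corpus: in a real engine these would be crawled web pages.
documents = {
    "doc1": "search engines rank pages by relevancy",
    "doc2": "links help search engines judge page quality",
    "doc3": "a crawler follows links to discover new pages",
}

# Document frequency: how many documents each term appears in.
doc_freq = Counter(term for text in documents.values() for term in set(text.lower().split()))

def tf_idf_vector(text, doc_freq, total_docs):
    """Build a sparse TF-IDF vector (term -> weight) for one piece of text."""
    counts = Counter(text.lower().split())
    vector = {}
    for term, tf in counts.items():
        # Inverse Document Frequency: terms that appear in fewer documents get more weight.
        idf = math.log(total_docs / (1 + doc_freq.get(term, 0)))
        vector[term] = tf * idf
    return vector

def cosine_similarity(a, b):
    """How closely two sparse vectors point in the same direction."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc_vectors = {name: tf_idf_vector(text, doc_freq, len(documents)) for name, text in documents.items()}
query_vector = tf_idf_vector("search engine relevancy", doc_freq, len(documents))

# Rank documents by similarity to the query vector; the closest match comes first.
for name, vec in sorted(doc_vectors.items(), key=lambda item: cosine_similarity(query_vector, item[1]), reverse=True):
    print(name, round(cosine_similarity(query_vector, vec), 3))
```

The document whose vector points in the most similar direction to the query vector is treated as the most relevant match.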
Mike Grehan’s
book, Search Engine Marketing: The
Essential Best Practices Guide, also discusses some of the technical bits of information retrieval in more detail than this book does. My book is meant to be a current how-to guide, while his is geared more toward explaining how information retrieval works.
While there
are different ways to organize web content, every crawling search engine has
the same basic parts:
- a crawler
- an index (or catalog)
- a search interface
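To visualize how these parts fit together, here is a rough sketch of the three components as plain data structures. The class names and the in-memory inverted index are assumptions made purely for illustration; real engines distribute these pieces across enormous clusters of machines.

```python
from collections import defaultdict

class Index:
    """Inverted index (the catalog): maps each term to the documents containing it."""
    def __init__(self):
        self.postings = defaultdict(set)

    def add_document(self, url, text):
        for term in text.lower().split():
            self.postings[term].add(url)

class Crawler:
    """Follows links and hands fetched pages to the index (toy version)."""
    def __init__(self, index):
        self.index = index
        self.frontier = []                      # URLs waiting to be crawled

    def crawl(self, url, text, outlinks):
        self.index.add_document(url, text)
        self.frontier.extend(outlinks)          # newly discovered pages

class SearchInterface:
    """Looks up query terms in the index and returns matching documents."""
    def __init__(self, index):
        self.index = index

    def search(self, query):
        results = [self.index.postings[t] for t in query.lower().split()]
        return set.intersection(*results) if results else set()

# Usage: crawl one page, then search for it.
index = Index()
crawler = Crawler(index)
crawler.crawl("http://example.com", "search engines rely on links", ["http://example.com/about"])
print(SearchInterface(index).search("search links"))   # {'http://example.com'}
```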
The crawler
does just what its name implies. It scours the web following links, updating
pages, and adding new pages when it comes across them. Each search engine has
periods of deep crawling and periods of shallow crawling. There is also a
scheduler mechanism to prevent a spider from overloading servers and to tell
the spider what documents to crawl next and how frequently to crawl them.
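As a rough illustration of the scheduler idea, here is a minimal sketch of a polite crawl queue: each host gets a minimum delay between requests, and URLs are pulled off a priority queue in order of their desired fetch time. The delay values and queue structure are assumptions for the example, not any engine's actual policy.

```python
import heapq
import time
from urllib.parse import urlparse

class CrawlScheduler:
    """Toy scheduler: decides which URL to fetch next without hammering any one host."""

    def __init__(self, per_host_delay=10.0):
        self.per_host_delay = per_host_delay    # seconds between hits to the same host
        self.queue = []                         # min-heap of (desired_fetch_time, url)
        self.last_fetch = {}                    # host -> time of its last fetch

    def add(self, url, priority_delay=0.0):
        """Queue a URL; more important pages can pass a smaller priority_delay."""
        heapq.heappush(self.queue, (time.time() + priority_delay, url))

    def next_url(self):
        """Return the next URL to fetch, waiting out the per-host delay if needed."""
        desired_time, url = heapq.heappop(self.queue)
        host = urlparse(url).netloc
        earliest = max(desired_time, self.last_fetch.get(host, 0.0) + self.per_host_delay)
        wait = earliest - time.time()
        if wait > 0:
            time.sleep(wait)                    # politeness: do not overload the server
        self.last_fetch[host] = time.time()
        return url

scheduler = CrawlScheduler(per_host_delay=2.0)
scheduler.add("http://example.com/")
scheduler.add("http://example.com/about")
print(scheduler.next_url())   # fetched immediately
print(scheduler.next_url())   # waits roughly 2 seconds so the same host is not hit too fast
```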
Rapidly
changing or highly important documents are more likely to get crawled
frequently. The frequency of crawl should typically have little effect on
search relevancy; it simply helps the search engines keep fresh content in
their index. The home page of CNN.com might get crawled once every ten minutes.
A popular, rapidly growing forum might get crawled a few dozen times each day.
A static site with little link popularity and rarely changing content might
only get crawled once or twice a month.
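One common way to get this varying behavior is to keep a per-page revisit interval: pages that keep changing get revisited sooner, pages that never change get visited less and less often. The specific intervals and multipliers below are made-up numbers purely to illustrate the idea.

```python
def next_crawl_interval(current_interval_hours, page_changed,
                        min_hours=0.2, max_hours=24 * 30):
    """Shrink the revisit interval when a page changed, grow it when it did not.
    All constants here are illustrative, not any search engine's real policy."""
    if page_changed:
        new_interval = current_interval_hours / 2     # changed: come back sooner
    else:
        new_interval = current_interval_hours * 1.5   # unchanged: back off
    return max(min_hours, min(max_hours, new_interval))

# A news home page that changes on every visit converges toward the minimum interval,
# while a static brochure site drifts out toward a crawl every few weeks.
interval = 24.0
for changed in [True, True, True, False, False]:
    interval = next_crawl_interval(interval, changed)
    print(round(interval, 2), "hours until the next crawl")
```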
The biggest benefit of having a frequently crawled page is that you can get your new sites, pages, or projects crawled quickly by linking to them from a powerful or frequently changing page.