"Anatomy" of a Search Engine

Search engine optimizers regularly try to figure out how search engine algorithms work so that they can explain what the process actually is, from indexing to delivering the desired result.

A good search engine operating at peak performance must provide effective discovery of web pages, complete coverage of the Web, up-to-date information, unbiased and equal access to all information, a convenient user interface, and, in addition, the most relevant results at the time of the request.

Providing meaningful access to large amounts of information is a difficult task. The most successful methods draw heavily on information retrieval, and the categorization of documents relies largely on statistical techniques.

Five very important modules make a search engine work. Different terms are used for specific components, but we have chosen the standard ones and hope these descriptions and explanations are easier to understand than those given in technical documents.

Let's highlight the five most important components of a search engine (a rough sketch of how they might fit together follows the list):

□ crawler/spider module;

□ warehouse/database module;

□ indexer/link analysis module;

□ search/ranking module;

□ query user interface.
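To make the division of labor between these modules more concrete, here is a rough, purely illustrative sketch in Python of how they might hand work to one another. None of this is any real engine's code; all class and method names are hypothetical, and each module is reduced to a stub.

```python
# Illustrative only: a toy pipeline connecting the five modules.
# All names are hypothetical; no real search engine works this simply.

class Crawler:                       # crawler/spider module
    def fetch(self, url):
        """Download a page and return its raw HTML (stubbed here)."""
        return "<html>example page about search engines</html>"

class Warehouse:                     # warehouse/database module
    def __init__(self):
        self.pages = {}              # url -> raw HTML
    def store(self, url, html):
        self.pages[url] = html

class Indexer:                       # indexer/link analysis module
    def __init__(self):
        self.index = {}              # word -> set of URLs containing it
    def index_page(self, url, html):
        for word in html.lower().split():
            self.index.setdefault(word, set()).add(url)

class Ranker:                        # search/ranking module
    def search(self, index, query):
        # Trivial ranking: pages matching more query words rank higher.
        hits = {}
        for word in query.lower().split():
            for url in index.get(word, ()):
                hits[url] = hits.get(url, 0) + 1
        return sorted(hits, key=hits.get, reverse=True)

def query_interface(ranker, indexer, query):   # query user interface
    return ranker.search(indexer.index, query)
```

In this sketch the crawler feeds the warehouse, the indexer reads from the warehouse, and the ranker answers queries against the index, which is the general flow the rest of this article walks through.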

Search engines carefully guard their methods for indexing and ranking web pages as trade secrets. Each search engine has its own unique system, and while the algorithms they use may differ, in practice there are many similarities in the way they build their index systems.

Understanding search engine optimization

Search engines find web pages in three ways:

□ by starting from a seed collection of URLs (in other words, web pages) and extracting links from them to follow (for example, pages chosen from directories);

□ from a list of URLs obtained during a previous crawl of the Web (using earlier results);

□ from URLs submitted manually by webmasters directly to the search engine (using the Add URL option).

Search engine spiders face many problems because of the size of the Web and its constant growth and change. Unlike traditional information retrieval, where all the data is collected in one place and ready for inspection, information on the Web is distributed across millions of different servers.

This means the information must first be collected and systematically distributed across large "warehouses" before it becomes available for processing and indexing.

In addition, a good search engine needs filters that guard against the many problems website owners create, deliberately or not. These filters automatically discard millions of unnecessary pages.

Modern search engines are self-tuning: they determine how often to scan a particular site based on factors such as how frequently the resource is updated, its rating, and so on.
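The real scheduling formulas are not public, but a toy heuristic makes the idea of self-tuning concrete. The sketch below is an assumption for illustration only: the inputs (observed change rate, a 0..1 site rating) and the weighting are invented, not taken from any actual engine.

```python
# Toy recrawl-interval heuristic; purely illustrative, not a real engine's policy.

def recrawl_interval_days(observed_changes_per_month: float,
                          site_rating: float) -> float:
    """Return how many days to wait before revisiting a site.

    observed_changes_per_month: how often past crawls saw the content change.
    site_rating: a hypothetical 0..1 importance score.
    """
    base = 30.0 / max(observed_changes_per_month, 0.1)   # track the observed change rate
    boost = 1.0 - 0.5 * min(max(site_rating, 0.0), 1.0)  # important sites get revisited sooner
    return max(1.0, base * boost)                        # floor the interval at one day

# A frequently updated, highly rated site ends up near the one-day floor:
print(recrawl_interval_days(observed_changes_per_month=20, site_rating=0.9))  # 1.0
```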

There are many different types of crawlers on the Web: some are meant for personal use directly from your desktop, some collect e-mail addresses, and all sorts of commercial crawlers perform research, measure the Web, and detect spyware activity.

The crawlers, spiders, and robots described here are automated programs, usually controlled by search engines, that "crawl" the links of the Web and collect raw text and other information for indexing. Early crawlers were programmed for fairly general purposes.

They paid little attention to the quality or content of pages and focused instead on quantity: their goal was to collect as many pages as possible. However, the Web was much smaller then, so they were efficient enough at discovering and indexing new web pages.

As the Web grew, crawlers faced many challenges: scalability, fault tolerance, and bandwidth limitations. The rapid growth of the Web outstripped the capabilities of systems that were not prepared to thoroughly examine the volume of downloaded information they encountered.

Managing a set of programs running simultaneously at that scale without damaging the system became impossible.

Today's crawlers, which have emerged over the past few years as the Web has grown, are completely different from the early robots. Although they still use the same basic technology, they are now programmed as more specialized, multi-level systems.

Although "crawling" the Web is a very fast process, in fact, the crawler does the same actions as an ordinary Internet user.

The crawler starts with either a single URL or a set of pages, for example, pages listed in a specific directory. It downloads them, extracts the hyperlinks, and then crawls to the pages those links point to. As soon as the crawler hits a page with no links to follow, it goes back up one level and jumps to links it may have missed earlier, or to links that were already queued for the next pass.

The process repeats from server to server until there are no more pages to download, or until some resource limit (time, bandwidth) is reached or exhausted.
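A minimal sketch of that crawl loop, written with only the Python standard library, looks like the code below. It is a simplification under stated assumptions: the seed URL is a placeholder, and a real crawler would also obey robots.txt, throttle requests per host, and deduplicate URLs at massive scale.

```python
# Minimal crawl-loop sketch: fetch a page, extract links, enqueue new URLs.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, page_limit=50):
    frontier = deque([seed_url])        # URLs still waiting to be visited
    seen = {seed_url}
    while frontier and page_limit > 0:  # stop when frontier or budget runs out
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                    # skip unreachable pages
        page_limit -= 1
        print("fetched", url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)

# crawl("https://example.com")  # hypothetical seed URL
```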

The word "crawler" is almost always used in the singular, but most search engines have many crawlers with a whole "fleet" of agents, doing massive work.

For example, Google, as a new generation of search engine, started with four crawlers, each keeping roughly three hundred connections open at a time. At peak speeds they downloaded more than 100 pages per second. Google now runs on 3,000 Linux computers whose hard drives total more than 90 TB.

Google adds 30 new machines a day to its server farm just to keep up with this growth. Crawlers use traditional graph algorithms to explore the network. A graph is made up of what are called nodes and edges: the nodes are URLs, and the edges are the links embedded within pages.

Edges are the outbound links on your web pages that point to other pages, and the inbound links that point back to your pages from elsewhere on the Web. Web graphs can be represented mathematically and traversed for search purposes using algorithms based on either breadth-first or depth-first traversal.

Searching based on "starting latitude" means that the crawler restores all pages around the starting point of the "crawl" before links lead even further from start. This is the most common way that crawlers follow links.

A search based on "initial depth" can be used to follow all links, starting with the first link of the first page, then from the first link on the second page, and so on. As soon as the first link on each page is visited, the spider moves to the second link, and then to each subsequent one.

The breadth-first method reduces the load on servers: requests are spread out quickly, so no single server has to answer a constant stream of requests.

The "initial depth" method is easier to program than the "initial latitude", but due to its capabilities, it can lead to the addition of less important pages and a lack of fresh search results.

This raises a question: how deep can a crawler penetrate a website? Much depends on the content of the sites the crawler encounters, as well as on which pages the search engine already has in its database.

In many cases the most important information appears at the beginning of a page, and the further you go from the beginning, the less important the information becomes. The logic is that people always try to put the information most important to the user at the beginning of anything.

Visit almost any site and you will find that, like many sites, it lacks a clear structure, rules, and standards, yet the links that point to information important to the user are usually located at the top of the page.

Search engines generally prefer to follow shorter URLs on each visited server, assuming that URLs with shorter components are likely to contain more useful information.

Spiders can be limited to a certain number of subsections (slashes) of a site within which they will look for information. Ten slashes is usually the maximum depth; three slashes is generally considered the average depth.
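A crawler could enforce such a limit simply by counting path segments in each URL before queueing it. The sketch below uses the standard library; the threshold of three follows the article's "average depth" and is not any real engine's published setting.

```python
# Sketch of a slash-count (path depth) filter for candidate URLs.
from urllib.parse import urlparse

def url_depth(url: str) -> int:
    """Count non-empty path segments in a URL."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])

def within_crawl_depth(url: str, max_depth: int = 3) -> bool:
    return url_depth(url) <= max_depth

print(url_depth("https://example.com/a/b/c/page.html"))           # 4
print(within_crawl_depth("https://example.com/a/b/c/page.html"))  # False
```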

For important pages buried deeper in the site, the owner will probably have to submit them to the search engine directly, as well as create a sitemap.

With the constant development of the Web and related technologies such as ASP, PHP, and ColdFusion, many important pages end up "hidden" in the depths of online databases, but this is no longer an insurmountable obstacle for search engine algorithms.

This content is accurate and true to the best of the author’s knowledge and is not meant to substitute for formal and individualized advice from a qualified professional.

© 2022 Temoor Dar
