Crawler

A crawler is a computer program that traverses the internet and inspects the content of web pages. Crawlers are almost always automated: they navigate their way around the web on their own, without the need for human interaction.

Let’s look into how they operate in a little more detail.

For the purposes of this article we will be talking about the only crawler SEOs really need to worry about: GoogleBot.

How The Crawler Works: Step 1 – Scraping

So GoogleBot arrives on a page and gets to work. If it has not visited your website recently, it will first check your robots.txt file to see which parts of the site it is allowed to crawl (and to find any sitemaps you have declared).
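To make that step concrete, here is a minimal sketch of the kind of robots.txt check any well-behaved crawler performs, using Python's standard urllib.robotparser module. The URL and user-agent string are placeholders for illustration, not what GoogleBot actually uses internally.

```python
from urllib import robotparser

# Hypothetical site used for illustration only.
ROBOTS_URL = "https://www.example.com/robots.txt"

parser = robotparser.RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()  # download and parse robots.txt

# Ask whether a given user-agent may fetch a given URL.
allowed = parser.can_fetch("Googlebot", "https://www.example.com/private/page.html")
print("Allowed to crawl:", allowed)

# robots.txt files often also declare sitemaps (Python 3.8+; returns None if none declared).
print("Declared sitemaps:", parser.site_maps())
```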

Then, in a matter of seconds, it downloads the entire content of your web page. It does much more than just download the raw HTML: it executes some JavaScript, renders CSS and reportedly downloads images. (However, not all content is reachable; elements like Flash and other dynamic content are often not detected.)
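As a rough illustration of the download step, the sketch below fetches a page's raw HTML with the third-party requests library. It is only an approximation: the real GoogleBot also renders the page with a headless browser to execute JavaScript and apply CSS, which a plain HTTP fetch does not do. The URL and user-agent string are placeholders.

```python
import requests

# Placeholder URL and user-agent string for illustration.
url = "https://www.example.com/some-page"
headers = {"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1)"}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()

html = response.text  # the raw HTML; JS execution and rendering would happen separately
print(f"Fetched {len(html)} characters of HTML with status {response.status_code}")
```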

Once it has rendered all this data, GoogleBot begins its analysis.

How The Crawler Works: Step 2 – Analysis

GoogleBot starts looking at the text of a web page on an incredibly detailed level. For example, it looks at keyword density, relevancy, LSI (latent semantic indexing) keywords and synonyms, and it even attempts to understand the topic of the page as a whole.
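Keyword density itself is simple arithmetic: the number of times a term appears divided by the total number of words on the page. The toy calculation below is purely illustrative; nobody outside Google knows how, or even whether, GoogleBot weights this particular signal.

```python
import re
from collections import Counter

def keyword_density(text: str, keyword: str) -> float:
    """Occurrences of `keyword` as a fraction of all words in `text`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    return counts[keyword.lower()] / len(words)

sample = "Crawlers crawl the web. A crawler follows links so crawlers never stop."
print(f"{keyword_density(sample, 'crawlers'):.2%}")  # 2 of 12 words -> ~16.67%
```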

It also looks at non-text signals: things like the date the page was published, how quickly it loads, and how accessible it is.

It also dissects the page into sections rather than taking the article as a whole. It uses the HTML elements on the page, and the CSS styling, to determine what is a title or a caption, and to try to identify things like the comments section.
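Here is a very rough sketch of that kind of structural dissection, using the third-party BeautifulSoup library on some made-up HTML. The real pipeline also uses rendered CSS and page layout, which this sketch does not.

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Example Article Title</h1>
  <article><p>Main body text goes here.</p>
    <figure><img src="chart.png"><figcaption>A caption.</figcaption></figure>
  </article>
  <div id="comments"><p>First comment!</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

title = soup.find("h1")
captions = [c.get_text(strip=True) for c in soup.find_all("figcaption")]
comments = soup.find(id="comments")

print("Title:", title.get_text(strip=True))
print("Captions:", captions)
print("Comments section found:", comments is not None)
```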

Once the various elements have been analyzed, the page (if it has been deemed high enough quality) is added to Google's index.

How The Crawler Works: Step 3 – Moving On

One of the things every crawler looks for when analyzing a web page is its hyperlinks. GoogleBot compiles them all into one long list and then separates them into various categories.

For the purposes of this example, it will store the links in two lists: external links in one, and internal links in the other.

Nobody outside Google knows how it actually sorts these links, but for the purposes of our example, assume this two-list idea holds.
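Under that assumption, separating the links is straightforward: compare each link's host to the host of the page being crawled. A minimal sketch, with placeholder URLs:

```python
from urllib.parse import urljoin, urlparse

def sort_links(page_url, hrefs):
    """Split a page's hyperlinks into internal and external lists."""
    page_host = urlparse(page_url).netloc
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(page_url, href)   # resolve relative links against the page URL
        host = urlparse(absolute).netloc
        (internal if host == page_host else external).append(absolute)
    return internal, external

internal, external = sort_links(
    "https://www.example.com/blog/post",
    ["/about", "contact.html", "https://www.othersite.org/page"],
)
print("Internal:", internal)
print("External:", external)
```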

It will look through the internal links list first and see when the pages in the list were last crawled. If a page is due to be re-crawled then GoogleBot will probably navigate to that page next and start step 1 all over again.

If a link in that list has been crawled recently, it is discarded and the next one is checked, until a suitable one is found.

When the internal links list runs out, GoogleBot takes a link from the external links list. If it needs to be re-crawled then it goes to that website and starts step 1 again.
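Putting steps 1 to 3 together, the crawl loop behaves roughly like a queue that prefers internal links and skips anything crawled too recently. The sketch below is a deliberately simplified model of that behaviour, not Google's actual scheduler; the recrawl interval and URLs are arbitrary placeholders.

```python
import time
from collections import deque

RECRAWL_AFTER_SECONDS = 24 * 60 * 60   # arbitrary placeholder interval, not Google's
last_crawled = {}                       # url -> timestamp of the last crawl

def is_due(url):
    """A URL is due if it has never been crawled, or was last crawled too long ago."""
    return time.time() - last_crawled.get(url, 0.0) > RECRAWL_AFTER_SECONDS

def pick_next(internal_links, external_links):
    """Prefer internal links; within each list, discard anything crawled recently."""
    for queue in (internal_links, external_links):
        while queue:
            url = queue.popleft()
            if is_due(url):
                return url              # crawl this page next and start step 1 again
            # crawled recently -> discard it and try the next link
    return None                         # nothing due: GoogleBot moves on elsewhere

# Example usage with placeholder URLs.
internal = deque(["https://www.example.com/about", "https://www.example.com/blog"])
external = deque(["https://www.othersite.org/"])
print("Next page to crawl:", pick_next(internal, external))
```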

Whether a page needs to be re-crawled is determined by a metric known as crawl budget. A website's crawl budget is Google's estimate of how often the site is updated with new content that should be re-crawled.

The crawl budget is determined by two things: the frequency of content updates and the size of the website's backlink profile.

The more links a page has pointing to it, the more often it ends up in GoogleBot's hypothetical external-links list, and the more often it gets crawled.
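As a purely illustrative model of that idea (Google does not publish its formula, and the weights here are invented), crawl priority can be thought of as a score that rises with update frequency and with the number of links pointing at a page:

```python
def crawl_priority(updates_per_month, inbound_links):
    """Toy heuristic: more frequent updates and more inbound links -> crawl sooner.

    An invented illustration of the idea in the text, not Google's actual metric.
    """
    return updates_per_month * 2.0 + inbound_links * 0.5

pages = {
    "/news":        crawl_priority(updates_per_month=30, inbound_links=120),
    "/about":       crawl_priority(updates_per_month=0.1, inbound_links=15),
    "/old-archive": crawl_priority(updates_per_month=0, inbound_links=2),
}
for url, score in sorted(pages.items(), key=lambda item: -item[1]):
    print(f"{url}: priority {score:.1f}")
```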