CRAWLING & HOW IT WORKS: CRAWLERS, SPIDERS, AND ROBOTS
In this knowledge article, we discuss all aspects of website crawling: how it works, what types of crawlers there are, and how you can influence them. Read on!
Expertise SEO
By Rick de Bie


Introduction to crawling
When we talk about crawling, we are referring to the way search engines such as Google and Microsoft Bing find your pages, view them and store the information in their database. Those first two steps are part of crawling. Storing the information in the database is called “indexing” (another important concept).
Crawling and optimising it are highly technical parts of SEO, so we will try to explain it in an accessible way.

The importance of crawling
Crawling is a fundamental part of the search engine process. By crawling, search engines can discover new pages or include changes to existing pages in the index. If the crawler encounters many links (both internal and external) pointing to a page, it can refresh that page in the index more often, which increases the chance that it will rank higher in search results or be better understood by the search engine. Optimising your website for crawling is therefore extremely important for your SEO success!
How does a crawler work?
A crawler, also known as a spider, bot or robot, is essentially a system that is released onto the internet to explore it and view the content it encounters.
When it encounters a hyperlink to another page, it follows that link and views the next page. This is how the bot crawls the internet, like a spider in a web.
Discovering and viewing these pages is the first step towards indexing, or “recording these pages in the search engine’s database”.
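To make this a little more concrete, the sketch below shows the core loop of a crawler in heavily simplified form: fetch a page, extract its hyperlinks, and queue the links it has not seen before. This is purely illustrative and not how Googlebot or any other real bot is built; the start URL and the page limit are assumptions for the example.

```python
# Minimal illustrative crawler: fetch a page, collect its links, repeat.
# A simplified sketch, not a reproduction of any search engine's bot.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen
from collections import deque

class LinkParser(HTMLParser):
    """Collects the href values of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    queue = deque([start_url])
    seen = {start_url}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue  # a real crawler would log the error and retry later
        fetched += 1
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)  # follow the link, like a spider in a web
    return seen

# Example (hypothetical domain):
# crawl("https://www.example.com/")
```

A real crawler adds much more on top of this: respecting robots.txt, prioritising URLs, scheduling revisits and rendering JavaScript, but the follow-the-links principle stays the same.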
What tools do you have for crawl optimisation?
There are various tools you can use to make it as easy as possible for search engines to crawl your website in the most efficient way possible:
Robots.txt
With robots.txt, you can tell search engine robots, SEO tools, and other crawlers which parts of the website should not be crawled. Basically, robots.txt is the first line of defence. You block access to irrelevant (or unwanted) pages, media files, and source files.
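As a sketch, a simple robots.txt file (placed at the root of the domain) could look like the example below. The blocked paths are hypothetical; which directories you exclude depends entirely on your own website.

```
# Hypothetical example of a robots.txt file
User-agent: *
Disallow: /internal-search/
Disallow: /scripts/

# Pointing crawlers to the sitemap is a common addition
Sitemap: https://www.example.com/sitemap.xml
```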
Sitemap.xml
An XML sitemap is a file that search engines can use to get an overview of all pages, files and paths within the website at once. If a URL is included in the sitemap, the search engine will recognise the URL and the chance of it being crawled is much greater. Most CMSs can be provided with a sitemap via a plug-in.
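For illustration, a minimal XML sitemap might look like this; the URLs and dates are of course placeholders for your own pages.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/services/seo/</loc>
    <lastmod>2024-01-10</lastmod>
  </url>
</urlset>
```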
Links (internal and external)
Both internal and external links (also known as “hyperlinks”) contribute to the crawlability of your website. It is important that the hyperlinks in the code comply with all HTML standards. The search engine can follow the links to discover new pages. If the link has good “anchor text”, this also gives the crawler some context about the page being linked to.
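A quick illustration of the difference good anchor text makes (the URL is a placeholder):

```html
<!-- Descriptive anchor text gives the crawler context about the target page -->
<a href="https://www.example.com/services/technical-seo/">technical SEO services</a>

<!-- Less useful: this anchor text says nothing about the destination -->
<a href="https://www.example.com/services/technical-seo/">click here</a>
```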
Meta-robots
The meta name="robots" tag has two important functions (per individual page). With "index" and "noindex", you indicate whether or not a page may be included in the index. With "follow" and "nofollow", you indicate whether the hyperlinks on the page may be followed by the search engine, which directly influences crawler behaviour. These directives work independently of each other.
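In practice, the tag is placed in the <head> of the page. Two common combinations, by way of example:

```html
<!-- Allow indexing and let the crawler follow the links (this is also the default) -->
<meta name="robots" content="index, follow">

<!-- Keep the page out of the index, but still let the crawler follow its links -->
<meta name="robots" content="noindex, follow">
```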

What types of crawlers are there?
There are various types of crawlers that can visit your website, but they can almost all be divided into one of the following categories:
| Type of crawler | Companies | Name of bot(s) | Purpose |
| --- | --- | --- | --- |
| Search engines | Google, Bing | Googlebot, AdsBot; Bingbot, AdIdxBot | Crawling and indexing |
| LLMs | OpenAI, Anthropic | GPTBot, ChatGPT-User/2.0; ClaudeBot, anthropic-ai | Use as a search engine, train AI models, and perform AI tasks |
| SEO tools | Ahrefs, Moz, Semrush | AhrefsBot, Rogerbot, SemrushBot | Reporting website signals (browser) |
| Crawl tools | Screaming Frog | Screaming Frog SEO Spider | Reporting website signals (application) |
The purpose of the various crawlers varies. While search engines are busy finding new, relevant pages for the index, the tools are mainly concerned with mapping the entire site in order to find areas for improvement in terms of crawling, indexing and optimising the website.
For most people reading this, the bots of AI platforms will be new. These bots and crawlers work slightly differently from, for example, Google's bots. Because AI platforms also want to perform 'tasks', their agents can be divided into bots that crawl and so-called users that carry out tasks. Finally, there are crawlers whose purpose is to 'train' the LLMs. For your website, you therefore need to decide what you want to make your content available for.
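If you decide, for example, that your content should not be crawled by AI bots, robots.txt is again the place to express that. The sketch below blocks the OpenAI and Anthropic crawlers named in the table while leaving other bots alone; whether a given bot respects these rules is ultimately up to the platform behind it.

```
# Hypothetical robots.txt rules for AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# All other bots may crawl everything
User-agent: *
Disallow:
```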

How does crawl budget work?
Crawl budget is basically an indication of how often search engines want to crawl your website and how long they want to do so. There is a maximum limit, which makes sense, because the internet is many times larger than the search engines’ index can handle. How much crawl budget there is for your website depends on factors such as:
- The total number of pages the search engine can crawl.
- The popularity of your website (i.e. external factors).
- The extent to which your website is updated or expanded.
From this, you can conclude, for example, that it is important for the website to remain “fresh”, that the search engine finds links to your website on other sites, and that there is enough information to crawl.
Influencing crawl budget
There are several ways in which you can positively (or negatively) influence the crawl budget or how it is used. First of all, good technical health is required. Negative response codes or slow response times cause the crawl budget to be used inefficiently, which can result in the search engine executing fewer crawl requests.
Adding content and updating pages creates the "freshness" that keeps search engines interested. This has a positive impact on the crawl budget, just as outdated, rarely updated websites have a negative impact.
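One practical way to keep an eye on this is to look at your server logs: how often does the bot come by, and which status codes does it receive? The sketch below counts Googlebot requests per status code, assuming the common combined log format; the log path and the parsing are assumptions you will need to adapt to your own server setup.

```python
# Sketch: summarise crawler requests per status code from an access log.
# Assumes the common/combined log format; adjust the parsing to your server.
from collections import Counter

def crawl_status_summary(log_path, bot="Googlebot"):
    counts = Counter()
    with open(log_path, encoding="utf-8", errors="ignore") as log:
        for line in log:
            if bot not in line:
                continue
            parts = line.split('"')
            # parts[1] is the request line; parts[2] starts with the status code
            try:
                status = parts[2].split()[0]
            except IndexError:
                continue
            counts[status] += 1
    return counts

# Example (hypothetical log path):
# print(crawl_status_summary("/var/log/nginx/access.log"))
```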
Optimisation checklist
The checklist below will help you optimise your website’s crawling in the best possible way. Read through the steps, and if you need more information or are unable to figure it out, please contact us!
- Robots.txt has been updated
Robots.txt is used to prevent search engines from visiting certain pages. This may be unnecessary for small websites, but for larger websites (1,000+ pages), you can prevent search engines from visiting irrelevant pages. In other words: more budget for more relevant pages.
Test: Test your robots.txt file
- Sitemap.xml file is up to date
The sitemap.xml should contain all the pages you want to have indexed. The sitemap serves as a menu for search engines, allowing them to see at a glance which pages can be found and crawled within the website. Deleted and redirected pages are (usually) better left out.
Create: Create an XML sitemap
- 4xx, 3xx and 5xx status codes checked
When crawlers encounter negative status codes (list of negative status codes for crawling), they will adjust the speed at which they crawl. You can check this by regularly performing a crawl yourself (a small sketch follows after this checklist).
- Overview of pages checked in Search Console
In Google Search Console, you can see an overview of "Found", "Crawled" and "Indexed" pages. Have you noticed that a new or important page is not appearing in the search results? Then there is a good chance that it has not yet been found or indexed. In that case, you can try to request indexing manually!
- Fast server response time
The server response time affects the total number of pages that the crawler can crawl in a session. If you have a longer (read: slower) Time to First Byte, the crawler can visit and crawl fewer pages per session. This has a negative impact on how quickly changes or new pages are indexed. Test this with PageSpeed Insights or check the crawl statistics in Search Console.
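As promised in the status-code step above, here is a small sketch that requests a handful of URLs (for example, the URLs from your sitemap) and reports everything that does not return a 200. The URLs are placeholders, and the script only uses the Python standard library.

```python
# Sketch: check the status codes of a list of URLs, e.g. taken from your sitemap.
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

URLS = [
    "https://www.example.com/",
    "https://www.example.com/old-page/",
]

def check_status(url):
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=5) as response:
            # Note: urlopen follows redirects, so a 301 reports the final target's status.
            return response.status
    except HTTPError as error:   # 4xx and 5xx responses end up here
        return error.code
    except URLError:
        return None              # DNS failure, timeout, etc.

for url in URLS:
    status = check_status(url)
    if status != 200:
        print(f"{url} -> {status}")
```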
Crucial role of crawling during a migration
During the migration of your website, you want the search engine to crawl and index the correct pages in a timely and accurate manner, and remove the old URLs (where necessary) from the index. For this reason, search engines such as Google have a mechanism that increases the crawl budget and frequency when site-wide changes occur, such as a migration. That’s nice!

Crawling and indexing
As mentioned above, crawling a page is the first step that must be taken before it can be found in the search engine. After it has been crawled, the search engine assesses whether the page will be indexed. Usually, if the content is unique and relevant enough, the search engine will include it in the index, after which it can be displayed as a result during a search query. Simply put, every search engine follows these steps:
- Crawling
- Indexing
- Serving
It is not always that simple, and each step involves many processes, mathematical formulas and sometimes a few flaws, but in essence this is how search engines work.

Frequently asked questions
What is a crawler?
A crawler, also known as a spider, bot or robot, is essentially a system that is released onto the internet to explore it and view the content that the crawler finds.
What is the crawl budget?
The search engine allows a certain number of pages per website to be crawled per day. This is called the crawl budget, and it is different for each website. It is based on the number of pages the search engine estimates it can crawl, how popular your website is, and how often the pages and content are updated.
Why is crawling important?
The crawlability of a website ensures that it and its content can be found in search engines. These search engines determine which pages are relevant to include in their index so that they can then be displayed in the search results.