When diagnosing bugs and predicting Search behaviour on your site, it’s critical to understand how Google Search crawls, indexes and serves content.
Crawling
Crawling is the process by which Googlebot discovers new and updated pages so they can be added to the Google index.
Google uses a massive set of computers to fetch (or “crawl”) billions of pages on the web. Googlebot (also known as a robot, bot or spider) is the program that does the fetching. It uses an algorithmic process to determine which sites to crawl, how often, and how many pages to fetch from each site.
Google’s crawl process starts with a list of web page URLs compiled from previous crawls, which is supplemented with data from sitemaps submitted by website owners. When Googlebot visits a page, it looks for links and adds them to its list of crawlable pages. The Google index is updated by noting new sites, changes to existing sites and dead links.
During the crawl, Google renders the page using a recent version of Chrome and runs any page scripts it finds as part of the rendering process. If your site relies on dynamically generated content, make sure you follow the JavaScript SEO principles.
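To make “dynamically generated content” concrete, here is a minimal, hypothetical page (the product text and element id are invented) whose main content only exists after the script runs; Google can only index that text if rendering succeeds.

    <!-- hypothetical page whose visible content is injected by JavaScript -->
    <html>
      <head><title>Winter jackets</title></head>
      <body>
        <div id="app"></div>
        <script>
          // Googlebot only sees this text after it renders the page and runs the script
          document.getElementById('app').innerHTML =
            '<h1>Winter jackets</h1><p>Browse our collection of winter jackets.</p>';
        </script>
      </body>
    </html>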
Primary Crawl and Secondary Crawl
Google crawls websites using two separate crawlers: a mobile crawler and a desktop crawler. Each crawler type simulates a user using a device of that type to visit your page.
Google designates one of these crawler types (mobile or desktop) as the primary crawler for your site. The primary crawler crawls all of the pages on your site that Google indexes. For all new websites, the mobile crawler is the primary crawler.
In addition, Google re-crawls a few pages on your site with the other crawler type (mobile or desktop). This is known as the secondary crawl, and it checks how well your site works on the other device type.
How does Google figure out which pages aren’t worth crawling?
- Pages that are blocked in robots.txt will not be crawled, but they may still be indexed if another page links to them: Google can infer a page’s content from links pointing to it and index the URL without analysing its contents (see the robots.txt example after this list).
- Google cannot crawl any page that is not accessible to an anonymous user, so any login or authorization requirement prevents a page from being crawled.
- Pages that have already been crawled and are considered duplicates of another page are crawled less frequently.
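As a concrete illustration of the first point, a robots.txt rule like the sketch below (the directory name is invented) stops Googlebot from fetching the listed URLs, but it does not keep them out of the index if other pages link to them; use noindex when a page must not appear in Search.

    # robots.txt at https://www.example.com/robots.txt
    User-agent: *
    # these URLs will not be crawled, but they can still be indexed from inbound links
    Disallow: /internal-reports/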
For Improved Crawling
To assist Google in finding the right pages on your site, use the following techniques:
- Provide a sitemap (see the sample sitemap after this list).
- Make requests for individual pages to be crawled.
- Use simple, human-readable and logical URL paths for your pages, and provide clear, direct internal links within the site.
- If you use URL parameters for navigation on your site, such as indicating the user’s country on a worldwide shopping site, use the URL Parameters tool to tell Google about the important parameters.
- Use robots.txt wisely: use it to indicate which pages you’d prefer Google to know about or crawl first and to protect your server load, not as a way to block material from appearing in the Google index.
- Use hreflang to point to alternate language versions of your page (see the example link tags after this list).
- Clearly identify your canonical page and any alternate pages.
- Use the Index Coverage report to monitor your crawl and index coverage.
- Ensure that Google can access the important pages on your site, as well as the significant resources (images, CSS files and scripts) needed to render each page properly.
- Run the URL Inspection tool on the live page to confirm that Google can access and render it properly.
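A sitemap does not need to be elaborate. The sketch below (URLs and dates are invented) lists the canonical URLs you want Google to discover and when they last changed; submit it through Search Console or reference it from robots.txt.

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://www.example.com/</loc>
        <lastmod>2021-06-01</lastmod>
      </url>
      <url>
        <loc>https://www.example.com/winter-jackets</loc>
        <lastmod>2021-05-20</lastmod>
      </url>
    </urlset>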
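Both the hreflang and canonical hints mentioned above are expressed as <link> elements in the page’s <head>. The snippet below is a minimal sketch with invented URLs: it declares an English and a German alternate of the same page and marks the clean English URL as the canonical version.

    <head>
      <!-- alternate language versions of this page -->
      <link rel="alternate" hreflang="en" href="https://www.example.com/winter-jackets" />
      <link rel="alternate" hreflang="de" href="https://www.example.com/de/winterjacken" />
      <!-- the preferred (canonical) URL for this content -->
      <link rel="canonical" href="https://www.example.com/winter-jackets" />
    </head>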
Indexing
Each page Googlebot crawls is processed so that Google can understand its content. This includes textual content, key content tags and attributes such as <title> tags and alt attributes, and images and videos. Googlebot can process many, but not all, content types; for example, it cannot process the content of some rich media files.
Between crawling and indexing, Google determines whether a page is a duplicate of another page or the canonical version. Pages flagged as duplicates are crawled less frequently. Similar pages are grouped into a document: a collection of one or more pages that includes the canonical page (the most representative of the group) and any duplicates found (which might simply be alternate URLs that reach the same page, or the mobile and desktop versions of the same page).
Please note: Google does not index pages that carry a noindex directive (in a header or tag). However, it must be able to see that directive; if the page is blocked by a robots.txt file, a login page or some other mechanism, the page may still be indexed even though Google never read it. Examples of both forms of the directive follow.
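The noindex directive can be delivered either as a robots meta tag in the HTML or as an HTTP response header; the header form is the only option for non-HTML files such as PDFs. Both forms are shown below as generic examples.

    <!-- as a meta tag in the page's <head> -->
    <meta name="robots" content="noindex" />

    <!-- or as an HTTP response header, e.g. for a PDF -->
    X-Robots-Tag: noindex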
For Improved Indexing
There are a number of ways to improve Google’s ability to understand your page’s content:
- Use the noindex directive to keep pages you want to hide out of Google’s index. Don’t apply noindex to a page that is blocked by robots.txt: if Google cannot crawl the page, it never sees the noindex directive and the page may still be indexed.
- Make use of structured data (see the JSON-LD sketch after this list).
- Make sure you’re following Google’s Webmaster Guidelines.
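Structured data is most commonly added as a JSON-LD block in the page’s <head>. The sketch below uses schema.org’s Article type with invented values; the Rich Results Test can confirm whether markup like this qualifies for result features such as article cards.

    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "Article",
      "headline": "How Google Search crawls, indexes and serves content",
      "author": { "@type": "Person", "name": "Jane Doe" },
      "datePublished": "2021-06-01"
    }
    </script>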
Serving
When a user enters a query, Google’s machines search the index for matching pages and return the results Google believes are the most relevant to the user. Relevance is influenced by hundreds of factors, and Google is constantly working to improve its systems. Google also evaluates the user experience when choosing and ranking results, so make sure your page loads quickly and is mobile-friendly.
For Improved Serving
There are several strategies to improve how Google serves your page’s content:
- If your content targets users in specific locations or languages, you can tell Google your preferences.
- Make sure your page is mobile-friendly and loads quickly (see the viewport example after this list).
- Follow the Webmaster Guidelines to avoid common problems and improve your site’s ranking.
- Consider implementing search result features for your site, such as recipe cards or article cards.
- Use AMP for faster-loading pages on mobile devices. Some AMP pages are also eligible for additional search features, such as the top stories carousel.
- Google’s algorithm is continually being improved; rather than trying to guess the algorithm and design your site accordingly, focus on producing high-quality, original content that users want, and follow Google’s guidelines.
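On the mobile-friendliness point, one small but essential piece is the viewport meta tag; without it, mobile browsers render the page at desktop width and shrink it. A minimal example:

    <meta name="viewport" content="width=device-width, initial-scale=1" />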