What are Web crawlers or Web Spiders?
Put simply, web spiders (also called web crawlers) are the bots or programs that search engines use to collect details about your website and index it. They automatically browse every kind of content, such as text, images, links on pages, and sitemaps.
Here, we are sharing a list of the web crawlers used by different search engines. This list will help you write a better robots.txt file for your website by allowing or blocking the right user agents.
Types of Web Crawlers
SearchBots: These are the bots that search engines use to crawl websites, view images and links, and index them.
Some common SearchBots are Googlebot (used by Google), Bingbot (used by Bing), and Slurp (used by Yahoo).
CommercialBots: These are the bots used by SEO services to produce SEO reports for a particular website, so that you can fix any SEO issues on the site. For example, AhrefsBot (used by ahrefs.com) and SemrushBot (used by semrush.com).
Feed Fetcher Bots: These bots collect thumbnails and titles of content to display on their own websites. For example, Facebook external hit (used by Facebook) and Twitterbot (used by Twitter).
Monitoring Bots: These bots check the performance of websites, such as uptime and pingbacks. For example, WordPress pingback (used by WordPress). Monitoring bots are not covered in this post.
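As a rough illustration of these categories, the Python sketch below labels a request by looking for a known crawler token in its User-Agent header. The `BOT_CATEGORIES` table and the `classify_user_agent` helper are hypothetical names made up for this example, and the token list only covers a few of the bots named above. A User-Agent header can be faked, so treat this as a first-pass filter and not as proof of identity.

```python
# Hypothetical lookup table: crawler token (lowercase) -> category from the list above.
BOT_CATEGORIES = {
    "googlebot": "SearchBot",
    "bingbot": "SearchBot",
    "slurp": "SearchBot",
    "ahrefsbot": "CommercialBot",
    "semrushbot": "CommercialBot",
    "facebookexternalhit": "Feed Fetcher Bot",
    "twitterbot": "Feed Fetcher Bot",
}


def classify_user_agent(user_agent: str) -> str:
    """Return the category of the first known token found in the header."""
    ua = user_agent.lower()
    for token, category in BOT_CATEGORIES.items():
        if token in ua:
            return category
    return "Unknown"


print(classify_user_agent(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
))
print(classify_user_agent(
    "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"
))
```

Matching on a short token instead of the full string keeps the check working when a crawler changes the rest of its header.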
List of web crawlers and their User-agents
1. GoogleBot
What is Googlebot?
Googlebot is the crawler Google uses to view the contents of your website and index them, and it is the most active of the good bots. It visits your website regularly and goes through all your content.
User-Agent
Googlebot
User-Agent string
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Googlebot example in robots.txt
Below is an example showing how to block Googlebot from crawling your webpage https://example.com/exnoindex/donotindexthis.html
User-agent: Googlebot
Disallow: /exnoindex/donotindexthis.html
If you want to block Googlebot from crawling your entire website, use the lines below in your robots.txt
User-agent: Googlebot
Disallow: /
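Before publishing rules like these, you can check how a parser reads them. Here is a minimal sketch using Python's standard `urllib.robotparser` module. The rules are pasted in as a string instead of being fetched from a live site, and the second URL is a made-up page used for comparison. Python's parser may differ in small ways from Google's own, so take the result as a sanity check only.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /exnoindex/donotindexthis.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

blocked_page = "https://example.com/exnoindex/donotindexthis.html"
other_page = "https://example.com/index.html"  # hypothetical page, not covered by the rule

print(parser.can_fetch("Googlebot", blocked_page))  # the disallowed page
print(parser.can_fetch("Googlebot", other_page))    # any other page
print(parser.can_fetch("Bingbot", blocked_page))    # a bot the group does not name
```

Only the first call should report that fetching is not allowed, because the rule names one page and one user agent.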
Apart from Googlebot, Google uses more than 9 other user agents for different crawling purposes.
Below are some of the web crawlers used by Google.
User Agents | Crawler Details | Full User-Agent String |
---|---|---|
Mediapartners-Google | Used for Google AdSense | Mediapartners-Google |
AdsBot-Google-Mobile | Used to show ads in mobile apps (Android/iPhone) | Android: Mozilla/5.0 (Linux; Android 5.0; SM-G920A) AppleWebKit (KHTML, like Gecko) Chrome Mobile Safari (compatible; AdsBot-Google-Mobile; +http://www.google.com/mobile/adsbot.html) iPhone: Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13B143 Safari/601.1 (compatible; AdsBot-Google-Mobile; +http://www.google.com/mobile/adsbot.html) |
AdsBot-Google | Used to show ads on the web | AdsBot-Google (+http://www.google.com/adsbot.html) |
Googlebot-Image, Googlebot | Used to crawl images from websites | Googlebot-Image/1.0 |
Googlebot-News, Googlebot | Used to crawl news | In 2011, Google announced that Googlebot would be used to crawl news. However, Googlebot-News still respects the website's robots.txt. |
Googlebot-Video, Googlebot | Used to index videos from websites and YouTube | Googlebot-Video/1.0 |
Google Favicon | Shows your favicon in Google search results | Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 Google Favicon |
You can find details of the remaining bots in Google's documentation on Googlebots.
2. Bingbot
What is Bingbot?
Bingbot is the web crawler Bing uses to crawl website content and images and index them in its search engine. It replaced MSNbot in 2010.
User-Agent
Bingbot
User-Agent string
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/W.X.Y.Z Safari/537.36
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
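The W.X.Y.Z part of these strings stands in for a browser version number that changes over time, so comparing a header against one fixed string is fragile. A more forgiving approach is to look only for the crawler token, such as `bingbot/2.0`. The sketch below does that with a regular expression. The `find_bing_crawler` helper is a hypothetical name, and the Chrome version in the sample header is invented for the example.

```python
import re

# Crawler tokens taken from the Bing user-agent strings in this section.
BING_TOKEN = re.compile(
    r"\b(bingbot|adidxbot|MicrosoftPreview)/(\d+(?:\.\d+)*)",
    re.IGNORECASE,
)


def find_bing_crawler(user_agent):
    """Return (name, version) for a Bing crawler token, or None if absent."""
    match = BING_TOKEN.search(user_agent)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Sample header; the Chrome version 116.0.0.0 is a made-up value.
desktop = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
    "bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.0.0 Safari/537.36"
)
print(find_bing_crawler(desktop))
print(find_bing_crawler("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/116.0.0.0 Safari/537.36"))
```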
Below is the list of all web crawlers used by Bing:
User Agents | Crawler Details | Full User-Agent String |
---|---|---|
Bingbot | Used to crawl website contents | Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/W.X.Y.Z Safari/537.36 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) |
AdIdxBot | Used by Bing Ads. It crawls ads and follows the links in them | Mozilla/5.0 (compatible; adidxbot/2.0; +http://www.bing.com/bingbot.htm) Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; adidxbot/2.0; +http://www.bing.com/bingbot.htm) Mozilla/5.0 (Windows Phone 8.1; ARM; Trident/7.0; Touch; rv:11.0; IEMobile/11.0; NOKIA; Lumia 530) like Gecko (compatible; adidxbot/2.0; +http://www.bing.com/bingbot.htm) |
BingPreview | Used to generate previews of the website for Bing | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/W.X.Y.Z Safari/537.36 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) |
MicrosoftPreview | It generates snapshots for Microsoft products | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; MicrosoftPreview/2.0; +https://aka.ms/MicrosoftPreview) Chrome/W.X.Y.Z Safari/537.36 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; MicrosoftPreview/2.0; +https://aka.ms/MicrosoftPreview) |
Bingbot example in robots.txt
Use the rule below in your robots.txt to block Bingbot from crawling a particular page
User-agent: Bingbot
Disallow: /exnoindex/donotindexthis.html
If you want to block Bingbot from crawling your entire website, use the lines below in your robots.txt
User-agent: Bingbot
Disallow: /
You can use the Robots.txt tester to validate your robots.txt file. Find more detail about creating robots.txt for Bing.
3. Slurpbot
Slurp is a web crawler used by Yahoo. Yahoo gets its search results from both the Slurp and Bing web crawlers. While the majority of Yahoo results are powered by Bing, it is advisable to allow Slurp so that your website can appear in Yahoo mobile search results.
Apart from search, Slurp also collects content from websites for inclusion in Yahoo properties such as Yahoo News, Yahoo Finance, and Yahoo Sports.
User-agent
Slurp
User-Agent string
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
Example of robots.txt rules that allow Slurp to crawl your whole site:
User-agent: Slurp
Allow: /
Read more documentation on Slurp
4. DuckDuckBot
Similar to other search engines, DuckDuckGo uses a web crawler, known as DuckDuckBot. DuckDuckGo has become quite a popular search engine because it doesn't track users and respects their privacy. DuckDuckBot respects robots.txt rules as well.
User-agent
DuckDuckBot
User-Agent string
DuckDuckBot/1.0; (+http://duckduckgo.com/duckduckbot.html)
Read more about DuckDuckBot
5. Baiduspider
As Google doesn't operate in China, Baidu is the most used search engine there, and Baiduspider is the official name of its crawler.
Like any other search engine crawler, Baiduspider visits your website, reads your content, and indexes it based on relevance.
User-agent
Baiduspider
User-Agent string
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
Just like Google and Bing, Baidu uses multiple bots for different content. List of all the crawlers of Baidu:
User Agents | Crawler Details |
---|---|
Baiduspider-image | Baidu Image Search |
Baiduspider | Baidu Web/Mobile Search |
Baiduspider-video | Baidu Video Search |
Baiduspider-cpro | Baidu Union Search |
Baiduspider-news | Baidu News Search |
Baiduspider-favo | Baidu Bookmark Search |
Baiduspider-ads | Baidu Business Search |
Read more about Baidu Spider
6. Yandex Bot
Yandex Bot is the Yandex search engine crawler. It visits your website and helps your pages get indexed in Yandex search results.
Yandex is the largest search engine in Russia. So if your target audience is in Russia or other Russian-speaking countries, you probably don't want to block Yandex.
User-agent
YandexBot
User-Agent string
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
User Agents | Crawler Details | Full User-Agent String | Follows robots.txt? |
---|---|---|---|
YandexAccessibilityBot | YandexAccessibilityBot downloads pages to check their accessibility for users. | Mozilla/5.0 (compatible; YandexAccessibilityBot/3.0; +http://yandex.com/bots) | No |
YandexAdNet | The Yandex advertising network robot. | Mozilla/5.0 (compatible; YandexAdNet/1.0; +http://yandex.com/bots) | Yes |
YandexBlogs | The blog search robot that indexes post comments. | Mozilla/5.0 (compatible; YandexBlogs/0.99; robot; +http://yandex.com/bots) | Yes |
YandexBot | Detects site mirrors. | Mozilla/5.0 (compatible; YandexBot/3.0; MirrorDetector; +http://yandex.com/bots) | Yes |
YandexFavicons | Downloads the site’s favicon file to display in search results. | Mozilla/5.0 (compatible; YandexFavicons/1.0; +http://yandex.com/bots) | No |
Apart from these, Yandex uses many other bots.
7. MJ12bot
MJ12bot is the web crawler for Majestic, a UK-based search engine that operates in 13 languages and 60+ countries. It helps hundreds of thousands of businesses get their websites online.
It respects robots.txt.
User-agent
MJ12bot
8. Sogou Spider
Sogou is a Chinese search engine that was launched in 2004 and had an Alexa rank of 121 as of 2010. It indexes 10 billion web pages. Sogou Spider is the name of the web crawler used by Sogou.com to read website contents and index them.
User-agent
Sogou web spider
User-Agent string
Sogou Web Spider mobile user agent
MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 4.4.2; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1 (compatible; Sogou web spider/4.0 ; +http://www.sogou.com/docs/help/webmasters.htm#07)
Sogou Web Spider desktop user agent
Sogou web spider/4.0 (+http://www.sogou.com/docs/help/webmasters.htm#07)
9. Exabot
Exabot is the web crawler used by Exalead. It collects data from websites all around the world and adds it to Exalead's main index, so that the pages can appear in Exalead's search results.
User-agent
Exabot
Example of robots.txt rules that block Exabot from crawling the pages in a particular directory (for example, football):
User-agent: Exabot
Disallow: /football/
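Rule paths are compared against the start of the URL path, which always begins with a slash, so a directory rule needs that leading slash to match. The sketch below uses Python's standard `urllib.robotparser` to compare a rule written without the slash against one written with it. The `is_allowed` helper and the example URL are assumptions for illustration, and other parsers may treat a malformed path differently.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


URL = "https://example.com/football/scores.html"  # hypothetical page in the directory

without_slash = "User-agent: Exabot\nDisallow: football\n"
with_slash = "User-agent: Exabot\nDisallow: /football/\n"

print(is_allowed(without_slash, "Exabot", URL))  # rule does not match the path
print(is_allowed(with_slash, "Exabot", URL))     # rule matches as a prefix
```

With Python's parser, only the second rule set blocks the page.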
10. Alexa crawler
Alexa was retired on May 1, 2022. It was an American web traffic analysis company owned by Amazon. Its key metric, popularly known as Alexa rank, was based on the estimated number of daily visitors to a website.
11. Soso Spider
Soso Spider is an automated web crawler for the Soso search engine, which is owned by Tencent Holdings Limited, the company famous for QQ. Soso is the 13th most visited website in China and the 36th in the world, with over 20 million page views daily.
User-agent
Sosospider
Sosospider+
User-Agent string
Mozilla/5.0 (compatible; Sosospider/2.0; +http://help.soso.com/webspider.htm)
12. Pinterestbot
Pinterestbot is a crawler used by Pinterest to download product images from your website's catalog. It also downloads product metadata, including price, availability, and description.
In addition, it checks the authenticity of the websites that pinned pictures link to.
User-agent
Pinterestbot
User-Agent string
Pinterest/0.2 (+https://www.pinterest.com/bot.html)
Mozilla/5.0 (compatible; Pinterestbot/1.0; +https://www.pinterest.com/bot.html)
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Pinterestbot/1.0; +https://www.pinterest.com/bot.html)
You can block Pinterest from crawling your site by using the rule below in robots.txt
user-agent: Pinterestbot
disallow: /
PinterestBot respects robots.txt rules.
13. SemrushBot
SemrushBot is the crawler that Semrush uses to collect SEO data about your site for its analytics, including on-page SEO, backlinks, content analysis, and more.
It constantly crawls your website to keep its data up to date. If you do not use any Semrush tools and do not intend to in the future, it is wise to block this bot.
Semrush uses different bots for different tools:
User Agents | Crawler Details |
---|---|
SiteAuditBot | To find different SEO and technical issues. |
SemrushBot-BA | For the backlink audit tool. |
SemrushBot-SI | On-Page SEO Checker tool and similar tools. |
SemrushBot-SWA | Checking URLs on your site for the SWA tool. |
SemrushBot-CT | Content Analyzer and Post Tracking tools |
SplitSignalBot | SplitSignal tool |
SemrushBot-COUB | Content Outline Builder tool |
Semrush follows robots.txt rules, so you can block these crawlers by adding rules to your robots.txt file
User-agent: SemrushBot
Disallow: /
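If you want to block several of the bots from the table, writing each group by hand gets repetitive. The sketch below builds the robots.txt text from a list of user-agent names. `build_block_rules` is a hypothetical helper, and the bot names are copied from the table above. Whether each tool's bot needs its own group is worth confirming in Semrush's documentation.

```python
def build_block_rules(user_agents):
    """Return robots.txt text with one 'Disallow: /' group per user agent."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in user_agents]
    return "\n\n".join(groups) + "\n"


# Names taken from the Semrush table; extend the list as needed.
semrush_bots = ["SemrushBot", "SiteAuditBot", "SemrushBot-BA", "SemrushBot-SI"]
print(build_block_rules(semrush_bots))
```

Each group is separated by a blank line, which keeps the rules for one bot from running into the next.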
14. Dotbot
Similar to Semrush, Moz uses Dotbot to find SEO and technical issues on a website. Moz is an SEO tool used for keyword research, backlink finding, and more.
Data collected by Dotbot can be accessed only through a Moz Pro account, so if you plan to use a Moz Pro membership, you can allow Dotbot to crawl your site. Otherwise, simply block it to save your bandwidth.
User-agent
dotbot
Block Moz from crawling your site:
User-agent: dotbot
Disallow: /
15. AhrefsBot
Ahrefs is another marketing tool, used for link building and website SEO audits. AhrefsBot scrapes your website data so that Ahrefs can provide audit reports, including technical issues found on your website. These reports are then used to improve your website's SEO.
Again, if you are not planning to use the Ahrefs marketing tool, you can block its bot:
User-agent
AhrefsBot
Block Ahrefs bot from crawling your site:
User-agent: AhrefsBot
Disallow: /
Find more detail on Ahrefsbot
16. Facebook external hit
Facebook external hit is the web crawler used by Facebook to gather metadata such as the thumbnail, title, and description of a post. Whenever you paste a link from a website into Facebook, the crawler visits the website and collects metadata to show to Facebook users.
You should not block this bot if you plan to share your posts on Facebook.
User-agent
facebookexternalhit
User-Agent string
facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
If you do want to block it, use the following in robots.txt:
User-agent: facebookexternalhit
Disallow: /
Read more on the Facebook crawler.
17. archive.org_bot
The Wayback Machine, run by the Internet Archive, saves copies of websites in its database of around 150 billion web pages. It uses archive.org_bot to take snapshots of web pages, books, and most other online content. These snapshots are stored and can be accessed by anyone through the Internet Archive website.
I personally block this bot.
User-agent
archive.org_bot
Example in robots.txt
User-agent: archive.org_bot
Disallow: /
Conclusion
With this, we have come to the end of our web crawler list. I hope it helps you allow the user agents you need and block those that use up your bandwidth and provide no value to you.
You should now be able to distinguish between good and bad bots and design a better robots.txt file for your website.