Nearly 40% of the traffic on the World Wide Web consists of bots, and most of the time we never notice it. But if you are reading this blog, you have probably come face to face with bot traffic and now want to do something about it. Bot traffic can cause a variety of problems: your demand partners may pull the plug on you, your analytics numbers may get skewed, your rankings may fall because of slow page speed, and so on.
In this blog, we will see how to detect and block the bot traffic. But before starting to block it, you need to check whether there is bot traffic on your site. Even if it is there, you need to verify whether you should block it or not. Let us begin by validating the presence of bot traffic on your site.
Table of Contents
- How to Detect Bot Traffic using Google Analytics?
- Understand Ghost Traffic
- Understanding the Importance of Legit Bots
- Blocking the Invalid Bot Traffic
– Bot Management Solution
– Manual Blocking
– Web Application Firewall (WAF)
– WordPress Plugins
– IAB Bot lists
– Improving Paid Traffic
- What’s Next?
How to Detect Bot Traffic using Google Analytics?
Whenever there is bot traffic on a site, there are some common inconsistencies that appear in your website analytics. Here are some of the common anomalies.
- Increased Number of Pageviews
- Decreased Session Duration
- Increased Bounce Rate
- Increased Number of Pages per Session
- Decreased Bounce Rate
- Decreased Page Load Speed
Reasons Behind their Occurrence
- Increased Number of Pageviews – Bots that come to crawl your whole website load multiple pages at the same time, which creates a spike in the number of pageviews.
- Decreased Session Duration – Bots are quick at collecting data from your pages. They do not need to read like humans, so they spend only a few seconds on each page. This brings down the average session duration of your entire website.
- Increased Bounce Rate – Bots like scrapers scrape a single page and move on to the next site. Such behavior causes a spike in the bounce rate.
- Increased Number of Pages per Session – Bots that visit to collect a large amount of data from your site can browse hundreds of pages in a single session. This behavior causes an unnatural increase in the number of pages per session. But sometimes bots are deliberately designed to visit just a few pages (say 2 pages) so that there is no spike in pages per session or the bounce rate.
- Decreased Bounce Rate – As opposed to scrapers that visit a single page, when many bots visit multiple pages on your site you can see the opposite effect too: a sharp decline in the bounce rate.
- Decreased Page Load Speed – When a large number of bots hit your site, they can overload your server and make your website slow.
These were some but not all the signs of bot traffic.
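As a rough illustration of how such anomalies could be spotted automatically, the sketch below flags days whose value is far above the recent baseline. It assumes you have exported a daily metric (for example pageviews) as a plain list; the function name, the 7-day window, and the 3x factor are illustrative choices, not values taken from any analytics tool.

```python
from statistics import median


def flag_spikes(daily_values, window=7, factor=3.0):
    """Return the indexes of days whose value exceeds
    factor x the median of the preceding `window` days."""
    flagged = []
    for i in range(window, len(daily_values)):
        baseline = median(daily_values[i - window:i])
        # Skip days with a zero baseline to avoid flagging everything.
        if baseline > 0 and daily_values[i] > factor * baseline:
            flagged.append(i)
    return flagged


# Hypothetical daily pageviews; day 8 has an unnatural spike.
pageviews = [1000, 1100, 950, 1050, 1020, 980, 1010, 1030, 5200, 1000]
print(flag_spikes(pageviews))  # → [8]
```

The same check can be run on the other metrics (pages per session, bounce rate) by inverting or adjusting the comparison. A flagged day is only a hint to investigate, not proof of bot traffic.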
Understand Ghost Traffic
When we talk about bot traffic, we assume that the bots come to our website, but there are other bots that do not come to our site at all. These bots come under ghost traffic. It appears in our Google Analytics reports but it never goes to our website. It mostly appears as referral traffic from irrelevant sites. It targets Google Analytics servers to add data to your reports.
The sole purpose behind sending ghost traffic is to make the webmaster curious. When you see that a site is sending a lot of referral traffic to your site, then you become tempted to check the source. Once you visit the referral site, the site owner can perform many fraudulent actions like hacking your PC, injecting viruses, cookie stuffing or even earning money by showing ads. Since the traffic is coming only to your analytics tool, you can simply filter it from your reports to see accurate data.
To filter out the ghost traffic, you need to make a list of all the hostnames that appear in your analytics reports and separate the valid ones from the invalid ones. A valid hostname is any domain where your Google Analytics tracking code is present. You need to choose the hostnames carefully. For example, Google hosts a cached version of your pages so that a user can access the content when your site is unavailable. In such a case, your tracking code will count it as a visit but the hostname will be “webcache.googleusercontent.com”. You should not dismiss this visit as a bot visit, because the user was real and the content was yours.
Most of the ghost traffic will come from the “(not set)” hostname. Other invalid hostnames will look genuine, for example “google.com”, but when you look at the Source dimension you will see a spammy URL such as “duckduckgoogle.com”.
Follow these steps to get the list of all the hostnames:
- Go to Google Analytics.
- Select a period of 1 year or more in the date range.
- In the reporting section, click on Audience > Technology > Network.
- Select “Hostname” as the primary dimension.
From this list, pick out the genuine hostnames (including your own site name) and create a regular expression that matches them. Then all you have to do is create a filter that includes only the valid hostnames, so that the data from the ghost traffic is not included in your reports.
- Go to Admin at the bottom left corner of Google Analytics.
- Under the View section, go to Filters > Add Filter and give a name to the filter.
- Select Custom as Filter Type and choose Include.
- Select Hostname under Filter Field.
- Under Filter Pattern, paste the regular expression you created.
- Save the filter once you have verified it.
Do not forget to update the filter whenever you add your tracking ID to a new place, otherwise the reports won’t show the data from there. You can also take the opposite approach, i.e. excluding the bot-related hostnames, but in that case you will have to keep excluding the hostnames of any new bots that appear afterward.
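To make the regular expression step concrete, here is a minimal sketch that builds an include pattern from a list of valid hostnames and checks a few sample values against it. The hostnames are placeholders (example.com stands in for your own site), and the helper name is made up for illustration. Always verify the resulting pattern inside Google Analytics before saving the filter.

```python
import re


def build_hostname_pattern(valid_hostnames):
    """Build an anchored regex that matches only the given hostnames."""
    # Escape dots and other special characters so they match literally.
    escaped = sorted(re.escape(h.lower()) for h in valid_hostnames)
    return "^(" + "|".join(escaped) + ")$"


# Placeholder hostnames; replace with the valid ones from your own report.
valid = ["example.com", "www.example.com", "webcache.googleusercontent.com"]
pattern = build_hostname_pattern(valid)
print(pattern)

# Quick local check of which hostnames the filter would keep.
for host in ["www.example.com", "(not set)", "google.com", "example.com.evil.test"]:
    print(host, "->", "keep" if re.match(pattern, host) else "drop")
```

Anchoring the pattern with `^` and `$` matters: without the anchors, a spoofed hostname that merely contains your domain would still pass the filter.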
Understanding the Importance of Legit Bots
Are all Bots Harmful to the website?
Before moving forward to block the bot traffic from your site, you need to understand that some bots are important. There are bots that keep your site working smoothly, and blocking them can hurt you. Make sure that you are not blocking important bots such as search engine crawlers, social media bots, bots from your partners, and bots that monitor and secure your site.
Related read: What are the Types of Ad Fraud and How Publishers Can Prevent Them?
How to Block the Invalid Bot Traffic?
There are multiple ways of blocking bot traffic from your site. We will look at them one by one.
- Use Bot Management Solution
- Manually Block Invalid IP Addresses
- Use a Web Application Firewall (WAF)
- Use reCAPTCHA
- Use WordPress Plugins
- Use IAB Bot lists
- Improve Your Paid Traffic
– Use Bot Management Solution
Bot management solution providers work toward a single goal: protecting websites from malicious traffic. They are specialists and therefore know a lot more about site protection than a typical website owner. More importantly, they maintain data about the good and bad bots out there on the internet.
An updated database helps in fighting the latest bots coming up on the web. You need to contact a good bot management solution provider, and it will help you set up the necessary system for your needs. Some well-known names in the field of bot management are Akamai, Radware, Netacea, and Cloudflare.
– Manually Block Invalid IP Addresses
Once you know the IP addresses of the bot traffic, you can simply block them from your cPanel. Log in to the cPanel of your website and you will find the IP blocking tools inside the security-related tab. If you do not know the IP addresses, you need to create a “honeypot” for bots. Create an invisible link from your homepage so that only bots will find it. Disallow the linked page in your robots.txt file. Since good bots adhere to the rules of the robots.txt file, we can assume that all the bots reaching this page are bad bots.
Note that since the link is invisible, there is almost no chance that a real human user will reach it, and therefore all the traffic to this page is “bots-only”. Now you can check your server access logs to find the IP addresses of all the bots that visited this page (Google Analytics reports do not expose visitor IP addresses) and block them from accessing your site from the cPanel. The downside of this method is that it is a manual process, so you need to keep updating your block list as you find new IP addresses.
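Analytics reports generally do not list visitor IP addresses, so one practical way to collect them is from the web server's access log. The sketch below assumes the log uses the Common Log Format (the default on many Apache and Nginx setups) and that the honeypot lives at the made-up path `/bot-trap/`. Both are assumptions to adapt to your own setup.

```python
import re
from collections import Counter

# Hypothetical honeypot path; use the page you disallowed in robots.txt.
HONEYPOT_PATH = "/bot-trap/"

# Matches the start of a Common Log Format line:
# ip ident user [timestamp] "METHOD path protocol" ...
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?:\S+) (?P<path>\S+)[^"]*"'
)


def honeypot_ips(log_lines, trap_path=HONEYPOT_PATH):
    """Count requests per IP address that hit the honeypot path."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match and match.group("path").startswith(trap_path):
            hits[match.group("ip")] += 1
    return hits


# Sample lines using documentation-only IP ranges.
sample = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /bot-trap/ HTTP/1.1" 200 512',
    '198.51.100.4 - - [10/Oct/2023:13:55:40 +0000] "GET /blog/ HTTP/1.1" 200 2048',
    '203.0.113.7 - - [10/Oct/2023:13:56:01 +0000] "GET /bot-trap/page2 HTTP/1.1" 200 512',
]
for ip, count in honeypot_ips(sample).items():
    print(ip, count)
```

In practice you would pass an open log file instead of the sample list, since the function only needs an iterable of lines.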
Publishers using the Apache web server can block bots with the help of the .htaccess file. Blocking can be done on the basis of IP address, HTTP referrer, or user agent. You need to create a .htaccess file with all the blocking instructions and upload it to your site's directory using FTP. If you already have a .htaccess file on your server, you can update it with the new instructions.
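As an example of what the IP-based blocking instructions might look like, the sketch below turns a list of IP addresses into a block of rules. It assumes Apache 2.4, where access control uses the `Require` directives, so older 2.2 setups would need different syntax. The helper name and the sample addresses are illustrative only.

```python
import ipaddress


def render_htaccess_block(ips):
    """Render Apache 2.4 rules that allow everyone except the given IPs."""
    lines = ["<RequireAll>", "    Require all granted"]
    for raw in sorted(set(ips)):
        # Validate each entry; raises ValueError on a malformed address.
        ip = ipaddress.ip_address(raw)
        lines.append(f"    Require not ip {ip}")
    lines.append("</RequireAll>")
    return "\n".join(lines) + "\n"


# Documentation-only addresses standing in for real offenders.
print(render_htaccess_block(["203.0.113.7", "198.51.100.23"]))
```

Validating the addresses before writing them out is a small safeguard, because a malformed line in .htaccess can make the server return errors for every request.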
– Use a Web Application Firewall (WAF)
A Web Application Firewall (WAF) is a common solution used by websites to protect themselves from security threats. WAF is a very broad topic in itself, but put simply, it acts as a shield between the website (or web app) and the client. Since it sits between the server and the client, requests and responses pass through the WAF before they reach their destination. In other words, it works as a reverse proxy server.
The setup of the WAF can be host-based, network-based, or cloud-based. There are many free open-source as well as commercial WAFs to choose from. Advanced WAFs like Akamai's can keep observing HTTP requests to stop malicious attacks even before they reach the servers.
– Use reCAPTCHA
You must have seen the small rectangular checkbox on many websites that says, “I’m not a robot”. It is called reCAPTCHA. When a user clicks on the box, it studies the movement of the cursor to differentiate a bot from a human. Humans always have a little randomness in their movement, whereas bots tend to move in straight, uniform paths. If the reCAPTCHA test cannot decide on the basis of the mouse movement, the user is given harder tasks, such as identifying images. The user is allowed to move further only after the test is passed.
Google provides reCAPTCHA v3 for free. You need to register your site before starting the integration. You can visit the reCAPTCHA guide for all the resources required.
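For the server side of a reCAPTCHA v3 integration, a minimal sketch could look like the following. Unlike the checkbox flow, v3 returns a score between 0.0 and 1.0 for each request, and your server decides what to do with it. The 0.5 threshold and the function names here are assumptions for illustration. The secret key comes from registering your site, and the token is sent by the page's reCAPTCHA script.

```python
import json
import urllib.parse
import urllib.request

# Google's token verification endpoint.
VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"


def is_probably_human(result, min_score=0.5):
    """Interpret the verification response: the token must be valid
    and the score must reach the (assumed) threshold."""
    return bool(result.get("success")) and float(result.get("score", 0.0)) >= min_score


def verify_token(secret_key, token, min_score=0.5):
    """Send the token to the verification endpoint and apply the threshold."""
    data = urllib.parse.urlencode({"secret": secret_key, "response": token}).encode()
    with urllib.request.urlopen(VERIFY_URL, data=data, timeout=5) as resp:
        result = json.load(resp)
    return is_probably_human(result, min_score)
```

Keeping the score check in its own function makes it easy to tune the threshold per action, for example stricter on a login form than on a newsletter signup.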
– Use WordPress Plugins
WordPress has a huge library of plugins, and anti-bot plugins are among them. Since a large share of websites run on WordPress, many publishers can use them. Plugins can be a viable and easy option for small publishers with limited resources.
Go to the WordPress Dashboard > Plugins > Add New and search for bot blocking plugins. Perform trial and error with the various plugins to find out which one works the best for you. Go to the settings page of the plugins (after installing them) to check various actions they can perform. You will be able to handle spambots and referral bots easily with the help of these plugins.
– Use IAB Bot lists
The Interactive Advertising Bureau provides two lists of identifiers to help you block bot traffic. One of the lists is a blacklist and the other is a whitelist. The lists are updated every month. When a user agent matches an entry in the whitelist and does not match any entry in the blacklist, it should be considered a real user. When a user agent does not match any entry in either list, it should be considered a bot. If a user agent matches an entry in the whitelist but also matches an entry in the blacklist, it should be considered a bot as well.
You can download the IAB/ABC International Spiders and Bots list from the IAB website. The list is not free; you have to pay for it by taking a yearly membership. After getting access to the lists, you can implement them on your server.
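The matching rules for the two lists can be sketched as a small function. The entries below are placeholders, not the real IAB data, and the sketch assumes simple case-insensitive substring matching. The actual lists come with their own matching instructions, which you should follow once you have access.

```python
def classify_user_agent(user_agent, whitelist, blacklist):
    """Return "human" only if the user agent matches the whitelist
    and does not match the blacklist; otherwise return "bot"."""
    ua = user_agent.lower()
    in_whitelist = any(pattern.lower() in ua for pattern in whitelist)
    in_blacklist = any(pattern.lower() in ua for pattern in blacklist)
    return "human" if in_whitelist and not in_blacklist else "bot"


# Placeholder entries for illustration only.
whitelist = ["mozilla", "opera"]
blacklist = ["bot", "spider", "crawler"]

samples = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (compatible; ExampleBot/1.0)",
    "curl/8.0",
]
for ua in samples:
    print(classify_user_agent(ua, whitelist, blacklist), "<-", ua)
```

The second sample shows why both lists are needed: many bots include a browser-like string, so passing the whitelist alone is not enough.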
– Improve Your Paid Traffic
Many publishers acquire traffic through content recommendation services like Taboola. If your content is recommended on sites full of bots, the acquired traffic will contain bots too. You need to make sure that your traffic sources are of good quality.
Develop best practices for better traffic acquisition. Block spammy websites. Target the websites that worked for you earlier. Use keyword targeting cautiously and do not overdo keyword blocking. Keep improving the process as you find new data on what is working and what is not. Lower CPC and CTR can also land your links on low-quality sites, so find out the sweet spot where your campaign remains in budget while sending traffic from good quality sites.
These were some but not all the ways to keep bots away from your website. Safeguarding your site is a continuous process. As new ways of fighting bots come up, the bad players find even newer ways to attack. Website security solutions and fraudsters are both constantly evolving, so you need to stay on your toes to keep up. Assess the risk your site faces and then take measures accordingly.