Web crawler, web spider, and robot are all names for computer programs that visit websites and review their URLs or data. Each one follows a fixed set of instructions and works through sites methodically. Let’s take a look at what web spiders and web crawlers do, what they mean for a website or blog owner, and how to keep out the bad ones.
What Are They
Web spiders were first designed by search engines to help compile a list of relevant sites, and this remains their main purpose today. These robots automatically index information based on a range of criteria built into the program.
Besides search engines, other companies and individuals use web spiders to gather information on activities such as website visits or surfing behavior. Linguists are also common users of robots. Believe it or not, they use them to work out which words are commonly used on the Internet. Ingenious!
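To make the idea a little more concrete, here is a minimal sketch of one small step a spider performs: reading a page and collecting the links on it so it knows where to go next. This is only an illustration, not how any particular search engine works. The names `LinkCollector` and `extract_links`, and the sample page, are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return every link found in the given HTML, in page order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page content, used only for illustration.
sample_html = '<p><a href="/about">About</a> <a href="https://example.org/">Other</a></p>'
found = extract_links("https://example.com/blog/", sample_html)
```

A real crawler would go on to fetch each of those links, respect the site's rules for robots, and store what it finds. This sketch stops at the link list.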
Why You Care
For blog owners and website owners, it is important to consider these little guys when you are developing your site. Understanding how they work makes it a lot easier to rank well in the search engines. For example, spiders index pages by the keywords they find, so help the spider out by having a specific keyword or keyword phrase for each of your pages. They also look at a site map or index. By providing a site map in a format the spider can read, you make sure that the spider can easily find and index your site.
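If you are curious what a site map looks like under the hood, here is a rough sketch that builds one in the widely used sitemaps.org XML format. Many blogging platforms and plugins generate this file for you, so treat this as an illustration only. The function name `build_sitemap` and the example URLs are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML (as a string) listing the given page URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages, used only for illustration.
sitemap_xml = build_sitemap([
    "https://example.com/",
    "https://example.com/about",
])
```

The resulting text is usually saved as `sitemap.xml` at the top level of the site, where spiders know to look for it.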
Unfortunately not all spiders serve a good or valuable purpose on the Internet.
While search engines and many others use the information their crawlers index for good, some spiders try to obtain nonpublic information and use it for things you don’t want. In the last few years, the most common bad spiders have been those that harvest email addresses for spam. Any time you sign up to a site with an email address, there is a small risk that bad spiders could obtain your data.
The quickest way to tell if a robot is good or bad is to see what impact it has on your website. If its visits bring more good results than bad, such as search traffic and quality backlinks, you’re being visited by a good robot. If they bring more bad results than good, such as spam or unwanted links, it is a bad robot. Simple!
I noticed that last month we had two backlinks (good!) from an escort service in Paris (bad links!! bad robots!!). Crazy, huh? Of course, this could have been from robots that visited our site, or it could have been from robots visiting a site that has one of our articles posted on it. Bad robot behavior! Fortunately, our bad results overall are very few.
How To Handle Bad Robots
Luckily, there is a range of techniques that you can use to stop bad web crawlers from coming onto your site and stealing your data. To be safe, it is best to assume that any robot that isn’t related to a search engine is a robot that you don’t want on your site. The two most common techniques are as follows:
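- Set up a captcha page: Before people sign up to leave comments or anything else on your site, use a captcha page, which requires a human (rather than a robot) to type in data.
- Edit your htaccess file: Add rules that tell your server to refuse requests from robots you have identified as bad.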
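Before you can keep out a robot, you need some way to tell it apart from a search engine. One rough approach is to look at the "user agent" name a visitor announces. The sketch below is only an illustration: the short token lists are my own assumption, the function name `classify_user_agent` is made up, and a bad robot can lie about its name, so don't treat the result as proof.

```python
# Assumed, incomplete list of tokens used by well-known search engine crawlers.
SEARCH_ENGINE_TOKENS = ("googlebot", "bingbot", "duckduckbot", "slurp")

# Assumed hints that a visitor is some kind of robot.
BOT_HINTS = ("bot", "crawl", "spider")


def classify_user_agent(user_agent):
    """Roughly sort a user-agent string into one of three groups."""
    ua = user_agent.lower()
    if any(token in ua for token in SEARCH_ENGINE_TOKENS):
        return "search-engine"
    if any(hint in ua for hint in BOT_HINTS):
        return "other-robot"
    return "probably-human"


# Example user-agent strings; the second one is invented for illustration.
googlebot = classify_user_agent(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)
harvester = classify_user_agent("EmailHarvester-spider/1.0")
visitor = classify_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/120.0")
```

Anything that lands in the "other-robot" group is a candidate for a closer look, following the rule of thumb above.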
If you would like to know which web crawlers have visited your site, there is an easy way to find out. Use the cPanel in your website dashboard. The AWStats function will quickly show you the latest visitors to your site and a lot more really interesting and useful information. In your BlueHost cPanel you can find the AWStats link in the Log section on the main cPanel page.
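If you prefer to look at the raw server log yourself, the same idea can be sketched in a few lines. This assumes your log uses the common "combined" format, where the visitor's user agent is the last quoted item on each line. Your host's format may differ. The function name `count_user_agents` and the sample log lines are invented for illustration.

```python
import re
from collections import Counter

# In the combined log format, the user agent is the last quoted field.
USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(log_lines):
    """Count how many requests each user agent made."""
    counts = Counter()
    for line in log_lines:
        match = USER_AGENT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Invented log lines, used only for illustration.
sample_log = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 2326 "-" "Googlebot/2.1"',
    '203.0.113.5 - - [10/Oct/2023:13:55:40 +0000] "GET /about HTTP/1.1" 200 1804 "-" "Googlebot/2.1"',
    '198.51.100.7 - - [10/Oct/2023:14:02:11 +0000] "GET /contact HTTP/1.1" 200 950 "-" "EmailHarvester-spider/1.0"',
]
visits = count_user_agents(sample_log)
```

Calling `visits.most_common()` lists the busiest visitors first, which is a quick way to spot a robot that is hitting your site unusually often.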
So these are the functions of spiders on the Internet. The most common is an indexer or data collection robot for search engines. While most web spiders are good, some cause problems for your site. By making changes to your htaccess files you can limit the number of bad web crawlers that visit your site. It also pays to check your visitor history to see just who or what has been visiting your site.