Remove URLs using a robots.txt file


It’s likely that you have pages on your website that you don’t want indexed by search engines. These could be pages like your privacy policy, or simply pages you don’t want surfaced in search results. If a page is linked from your website and publicly accessible, such as a privacy policy, you can block crawlers from it using a robots.txt file.

Creating a robots.txt File

A robots.txt file is a plain text file that can be created in any text editor, such as Notepad, and saved with the .txt extension. Upload the robots.txt file to the root of your website so search engines can find it at https://www.domain.com/robots.txt.
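
Once uploaded, you can confirm the file is publicly reachable by requesting it directly, for example with curl (substitute your own domain):

curl https://www.domain.com/robots.txt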

Denying Bots from Crawling Using a robots.txt File

Note that these rules stop compliant bots from crawling the matched URLs; removing pages that are already in Google’s index is covered further below. To deny all bots from accessing an entire website:

User-agent: *
Disallow: /

To deny all bots from crawling a specific page:

User-agent: *
Disallow: /page.html

To deny all bots from crawling a folder:

User-agent: *
Disallow: /folder/
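
Most major crawlers, including Googlebot, also support an Allow directive, which lets you carve out an exception inside a blocked folder. For example (the file name here is just a placeholder):

User-agent: *
Disallow: /folder/
Allow: /folder/public-page.html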

To deny all bots from crawling any URL whose path contains ‘monkey’ (for example, /shop/monkey-toys.html), using a wildcard:

User-agent: *
Disallow: /*monkey

To deny all bots from crawling dynamic URLs that contain a ‘?’ (for example, /products.php?id=123), use the following rule:

User-agent: *
Disallow: /*?

To target a specific bot, change the User-agent line. For example, to deny only Googlebot:

User-agent: Googlebot
Disallow: /page.html
Disallow: /folder/
Disallow: /*monkey
Disallow: /*?
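
Rules can also be grouped so they apply to several named bots at once, by stacking User-agent lines before a shared set of rules. For example, to deny both Googlebot and Bingbot from a folder:

User-agent: Googlebot
User-agent: Bingbot
Disallow: /folder/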

How to Remove a Page from Google that Has Been Indexed

To remove a page that has already been indexed by Google, use the noindex directive, either in an HTML <meta> tag on the page or in an X-Robots-Tag HTTP header. For the directive to work, Googlebot must be able to crawl the page, so don’t also block the URL in robots.txt. You can then log in to Google Search Console and use the Removals tool to request that the URL be removed from search results.

Example using the <meta> tag:

<meta name="robots" content="noindex">
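
The tag belongs in the <head> section of the page, for example (the title text is a placeholder):

<!DOCTYPE html>
<html>
<head>
  <meta name="robots" content="noindex">
  <title>Private page</title>
</head>
<body>
  ...
</body>
</html>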

Example using an X-Robots-Tag via the HTTP header:

X-Robots-Tag: noindex
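
In a raw HTTP response, the directive appears alongside the other response headers, for example:

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
X-Robots-Tag: noindex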

When Not to Use a robots.txt File

Since anyone can view your robots.txt file, it’s important not to use it to block a private page: listing the URL in robots.txt only advertises its existence. There’s also no point blocking a page that isn’t linked from your website, since bots wouldn’t be able to find it anyway.

Another issue is that not all dynamic URLs follow a pattern that can be matched easily by robots.txt rules. In such cases, you can use another method: setting an X-Robots-Tag.

Using an X-Robots-Tag

Setting an X-Robots-Tag is a more discreet way of blocking a URL, since the directive is sent in the HTTP response headers rather than listed in a public file. You can test a page’s headers with any HTTP request and response header tool.
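
For example, you can fetch just the response headers with curl and look for the X-Robots-Tag line (the URL is a placeholder):

curl -I https://www.domain.com/page.html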

With PHP, you can tell bots not to index or archive the page, not to show a snippet of it, and not to follow its links, as long as the header is sent before any output:

header("X-Robots-Tag: noindex, nofollow, noarchive, nosnippet", true);

Using a .htaccess file (this requires Apache’s mod_headers module), you can do the same with a FilesMatch block:

<FilesMatch "page\.html">
  Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>

By using these methods, you can better manage which pages of your website are indexed by search engines and keep private or sensitive information out of search results.