How to Configure Your robots.txt and llms.txt File – Practical Guide for Blogs, Online Shops and Business Websites

What is a robots.txt file?

If you want to optimize your website for search engines, you will sooner or later run into one important task: configuring your robots.txt file correctly.

The robots.txt file is a simple text file in the root of your domain, for example at:

https://www.yoursite.com/robots.txt

In this file you define rules for web crawlers such as Googlebot, Bingbot or other bots. You tell them which areas of your site may be crawled and which should be ignored.

Important:

  • The robots.txt file controls crawling, in other words which URLs a bot is allowed to request.
  • It is not directly responsible for whether a page ends up in the index – that is what meta tags like noindex are for.

You can think of the robots.txt file as a polite house rule for bots: “You may enter here, please stay out there.”

Structure and location of the robots.txt file

To make sure search engines can find your robots.txt file, it must always be stored in the root directory of the domain:

  • Correct: https://www.example.com/robots.txt
  • Wrong: https://www.example.com/folder/robots.txt

Subdomains each need their own robots.txt file if you want to control them separately, for example:

  • https://shop.example.com/robots.txt

A simple base structure looks like this:

User-agent: *
Disallow: 

Sitemap: https://www.example.com/sitemap_index.xml

  • User-agent: * means: the rules apply to all crawlers.
  • Disallow: without a path means: nothing is blocked, everything may be crawled.
  • With Sitemap: you link your XML sitemap, which is very useful for SEO.

A practical example from a live WordPress site:

User-agent: *
Disallow: /sh/
Disallow: /page/
Disallow: /tag/
Disallow: /de/tag/
Disallow: /en/tag/
Disallow: /cookies/
Disallow: /tags/
Disallow: /wp-content/cache/wpo-minify/
Disallow: /wp-content/uploads/wpo-plugins-tables-list.json

Sitemap: https://www.cosci.de/sitemap_index.xml

Here you exclude caches, tag pages and technical files from crawling and at the same time provide your sitemap.

What should a robots.txt file contain?

If you want to configure your robots.txt file correctly, you should keep it as lean and clear as possible. Typical contents are:

  • Standard rules for all crawlers
  • Optional rules for specific crawlers (for example AI bots)
  • Disallow rules for technical and unimportant directories
  • A link to one or more sitemaps

A good robots.txt file should:

  • never block important content
  • avoid wasting crawl budget on useless URLs
  • make it clear where the main content of your site lives
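
Put together, a lean robots.txt that covers these typical contents could look like this. The blocked paths and the extra rule for GPTBot are only illustrative and depend on what you actually want to exclude:

User-agent: *
Disallow: /internal-search/
Disallow: /cache/

User-agent: GPTBot
Disallow: /

Sitemap: https://www.example.com/sitemap_index.xml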

Which directories should be excluded?

Not every URL on your site is interesting for search engines. There are typical candidates you can consider in your robots.txt configuration.

Overview by site type

Blog
  • Often makes sense to block: internal search, pagination like /page/, tag archives
  • Should not be blocked: posts, categories, media

Online shop
  • Often makes sense to block: cart, checkout, customer account, internal search
  • Should not be blocked: product pages, categories, landing pages

Business website
  • Often makes sense to block: admin area, test folders, internal tools
  • Should not be blocked: service pages, contact, blog, portfolio

General
  • Often makes sense to block: cache folders, technical JSON files
  • Should not be blocked: CSS, JS, images, fonts

A common mistake is to block too much.
If you block entire CSS or JavaScript folders, for example, Google can no longer render your site correctly. This can hurt your rankings, because the page looks broken from the crawler's point of view.
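
Based on the overview above, a robots.txt for an online shop could start like this. Treat /cart/, /checkout/, /account/ and /search/ as placeholders for the paths your shop system actually uses:

User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /search/

Sitemap: https://www.example-shop.com/sitemap_index.xml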

Hide or index categories and tags?

Especially on blogs there is a recurring question: should you block tags and categories in the robots.txt file?

Basic idea:

  • Categories are usually structured around clear topics and can be strong landing pages.
  • Tags are often used very loosely and create many thin pages with little unique content.

Possible approach:

  • Categories: usually better to keep them indexable, as long as they are well maintained.
  • Tags: either maintain them properly and use them as landing pages, or remove them from the index using the noindex,follow meta tag.

I would rarely block all categories or tags via robots.txt. Often it is better to:

  • remove weak archives from the index using noindex,follow
  • still link them internally so that the crawler can use the links as signals
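
For a weak tag archive, this means placing a robots meta tag in the head of the page instead of touching robots.txt. A minimal sketch:

<head>
  <title>Tag: example</title>
  <meta name="robots" content="noindex,follow">
</head>

The page can still be crawled and its internal links still count, but it no longer competes in the index.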

What to consider for blogs, online shops and business websites

When you configure a robots.txt file for WordPress, online shops or business websites, you can roughly follow these points:

  1. Keep important content crawlable
    • posts, pages, products, categories
  2. Block technical areas
    • admin areas, cache folders, log files, certain JSON files
  3. Declare your sitemap
    • link one or several sitemaps
  4. Secure staging and test systems
    • ideally use password protection, not just robots.txt
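
For a typical business website, points 1 to 3 could come together like this; /admin/, /intern/ and /test/ stand in for whatever your own admin area, internal tools and test folders are called. Point 4 cannot be handled in robots.txt at all and belongs on the server level:

User-agent: *
Disallow: /admin/
Disallow: /intern/
Disallow: /test/

Sitemap: https://www.example.com/sitemap_index.xml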

robots.txt for WordPress – typical settings

Many WordPress installations use something like this out of the box:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

This is a sensible starting point. In addition you can exclude further directories, as in the example earlier in this guide:

  • /wp-content/cache/
  • technical files inside the uploads folder
  • tag paths if you do not want to use them

The important part is not to accidentally block complete /wp-content/ paths that contain important CSS, JS or images. Otherwise your site will look different for crawlers than for real visitors.
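
A minimal sketch of the difference:

User-agent: *
# Too broad: this would also block themes, plugins, CSS, JS and images
# Disallow: /wp-content/

# Targeted: only the cache folder is excluded
Disallow: /wp-content/cache/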

Linking the sitemap in robots.txt

If you want to configure your robots.txt file correctly, the sitemap entry is almost always part of it:

Sitemap: https://www.yoursite.com/sitemap_index.xml

If you use several sitemaps, you can list all of them, as shown below. Many SEO plugins for WordPress automatically generate sitemaps, for example:

  • /sitemap_index.xml
  • /post-sitemap.xml
  • /page-sitemap.xml
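
In robots.txt, each sitemap simply gets its own line. If your plugin already provides a sitemap index, linking that single file is usually enough; otherwise you can list the individual sitemaps:

Sitemap: https://www.yoursite.com/post-sitemap.xml
Sitemap: https://www.yoursite.com/page-sitemap.xml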

The reference in robots.txt helps crawlers to discover your most important URLs quickly.

Meta tags – noindex, nofollow, dofollow explained

In addition to the robots.txt file there are meta tags you can use to control search engine behavior on page level.

A typical meta tag is:

<meta name="robots" content="noindex,follow">

The most important values:

  • index – the page may be indexed
  • noindex – the page should not appear in the search index
  • follow – links on the page may be used as ranking signals
  • nofollow – links on the page should not be used as ranking signals

“Dofollow” is not an official value. It is simply the default behavior if you use follow or do not specify anything.

Practical examples:

  • blog post with useful content: index,follow
  • internal search, weak filter pages: noindex,follow
  • sponsored links: add rel="sponsored" or rel="nofollow" to the link
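
Expressed as HTML, these three cases could look like this; URLs and anchor text are placeholders:

<!-- blog post with useful content -->
<meta name="robots" content="index,follow">

<!-- internal search or weak filter page -->
<meta name="robots" content="noindex,follow">

<!-- sponsored link in the page body -->
<a href="https://partner.example.com/" rel="sponsored">Partner offer</a>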

robots.txt and meta tags belong together:

  • robots.txt controls whether a URL may be crawled.
  • the meta tag controls whether a crawled URL should appear in the index.

What is an llms.txt file?

With the rise of AI crawlers, the question of how to control content for AI models is becoming more important. This is where the concept of an llms.txt file comes in.

The basic idea:

  • you create a file such as /llms.txt in the root of your website
  • in this file you describe how AI crawlers are allowed to use your content
  • you can highlight recommended areas, restrict some sections or explicitly ask not to use certain content

While the robots.txt file controls crawlers on URL level, the llms.txt file is more like a guideline and context file for large language models. It is not a hard standard yet, but it is increasingly discussed as an addition to the classic robots.txt approach.
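
Because there is no binding standard yet, the exact format varies. A simple sketch in the spirit of the current proposals, with purely illustrative sections and URLs, could look like this:

# Example Company
> Short summary of what the site offers and who it is for.

## Recommended content
- [Guides](https://www.example.com/guides/): in-depth articles on our core topics
- [Blog](https://www.example.com/blog/): regular posts and updates

## Please do not use
- /customer-area/: internal content that should not be used for AI training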

You can also target specific AI crawlers directly in your robots.txt file, for example:

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

This tells these bots that they are not allowed to crawl your site, for example to limit the use of your content in AI training.

Tips and tricks around robots.txt

A few practical tips:

Less is more
Keep the file tidy. A few targeted rules are better than a chaotic rule set.

Do not use robots.txt as a security feature
The robots.txt file is public. Anything that really needs to stay private belongs behind authentication or access restrictions, not in robots.txt.

Test changes
When you change your robots.txt file, use online tools or Google Search Console to check whether the rules behave as expected.

Secure staging environments separately
Staging sites should not only have a robots.txt file, but also be protected by a password.
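
For staging, the robots.txt part is usually just a blanket block; the actual protection has to come from the server, for example via HTTP authentication:

User-agent: *
Disallow: /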

Use a combination of tools
Configuring your robots.txt file correctly always means thinking in combination with meta tags, canonical tags, sitemaps and clean internal linking, not in isolation.

FAQ about robots.txt

What happens if I do not have a robots.txt file at all?
In that case, crawlers may by default crawl everything that is reachable. For many small websites this is completely fine, but for shops and complex projects it can lead to unnecessary crawling.

Can I completely hide pages with robots.txt?
No. The robots.txt file is only a recommendation for crawlers. It does not protect against direct access and does not automatically prevent URLs from showing up in the index if they are linked from other sites.

Should I block images via robots.txt?
In most cases no. Images are an important part of SEO, especially in image search. The exception is internal or sensitive images that should not be public. These belong in protected areas.

How often do search engines read my robots.txt file?
Crawlers request the robots.txt file regularly; Google, for example, generally caches it for up to 24 hours. Changes are therefore usually picked up within about a day.

Can I declare multiple sitemaps in robots.txt?
Yes. You can add several Sitemap: entries, for example for a blog, a shop and image sitemaps.

Glossary

robots.txt
Text file in the root of a domain that tells web crawlers which areas of the site they are allowed to crawl and which they should avoid.

Crawler / bot
Automated program that requests web pages, reads content and processes it for search engines or other services. Examples are Googlebot or Bingbot.

Crawl budget
Roughly describes how many URLs on a site a crawler will request in a certain period of time. The bigger and more complex the site, the more important it is not to waste crawl budget on unimportant URLs.

Meta tag noindex
Meta tag in the HTML head of a page that signals to crawlers that this page should not appear in the search index.

Meta tag nofollow
Meta tag in the HTML head (or attribute on a link) which tells crawlers that links on this page should not be used as ranking signals.

Sitemap
Usually an XML file that contains a list of important URLs on your website. It helps search engines find and understand your content more quickly.

llms.txt
Concept for an additional file in the root of the domain that contains guidelines specifically for AI crawlers and large language models.

User-agent
Name of the bot or browser. In robots.txt you use User-agent: to define which crawler a rule applies to.

Disallow
Directive in robots.txt that excludes paths from crawling.

Allow
Directive in robots.txt that explicitly allows paths to be crawled even if a higher level path is blocked.
