Microsoft to Publishers: Stop Blocking AI Bots or Your Business Is Invisible

Originally published on The Searchless Journal

Nikhil Kolar, VP of Publisher Product at Microsoft AI, delivered a blunt message to the publishing industry at AdExchanger's Prog AI event in Las Vegas. Four out of five websites, he said, block AI bots and crawlers. As a result, their content, their products, and their entire digital presence are invisible to the AI engines that hundreds of millions of people now use to discover information and make decisions.

"Your business is closed," Kolar said. Not metaphorically. Literally. When an AI model like Copilot, ChatGPT, or Perplexity tries to retrieve information about a blocked website, it gets nothing. No product data, no reviews, no expertise, no brand signals. The website might as well not exist.

For publishers, retailers, and anyone who depends on digital discovery, Kolar's statement crystallizes the central strategic dilemma of the AI era. Do you open your site to AI crawlers and risk having your content scraped, summarized, and repackaged without compensation? Or do you block the crawlers and accept irrelevance in the fastest-growing discovery channel since social media?

The answer, it turns out, depends entirely on who you ask.

The Scale of the Blocking Problem

Kolar's four-out-of-five statistic is striking, and it fits a pattern visible across the industry: as AI search engines have grown, website owners have responded by deploying increasingly aggressive blocking strategies.

The most common approach is robots.txt directives that target known AI crawlers by user agent. GPTBot, PerplexityBot, ClaudeBot, Bytespider, and dozens of other AI crawler identifiers are being added to robots.txt files across the web, along with Google-Extended, which is a robots.txt control token rather than a separate crawler. Some publishers block all AI crawlers indiscriminately. Others block specific bots while allowing the rest.
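As a rough illustration of selective blocking, the sketch below parses a small robots.txt with Python's standard urllib.robotparser module and checks which user agents it admits. The robots.txt contents, the is_allowed helper, and the example.com URL are assumptions made for this example, not any publisher's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: two AI crawlers blocked, one explicitly allowed,
# everything else allowed by the wildcard group.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url="https://example.com/reviews/"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "Bytespider", "PerplexityBot", "ClaudeBot"):
        status = "allowed" if is_allowed(ROBOTS_TXT, agent) else "blocked"
        print(f"{agent}: {status}")
```

Agents without their own group fall through to the wildcard rule, so a file that names only today's crawlers says nothing specific about the ones that appear later.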

The motivation is understandable. Publishers have watched AI search engines grow by ingesting their content, and they have seen their traffic decline as AI-generated answers replace the need to click through to the source. Blocking the crawlers feels like the only leverage they have.

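But the consequence is severe. When a website blocks AI crawlers wholesale, it does not just prevent its content from being used as training data. It also keeps that content from being surfaced in AI-generated answers. A product review site that blocks PerplexityBot is unlikely to appear in Perplexity's answer citations. An ecommerce store that blocks OpenAI's crawlers is unlikely to be recommended when someone asks ChatGPT for product suggestions. A publisher that blocks all AI crawlers is unlikely to appear in any AI-generated news summary or research synthesis. The exact effect depends on which agent is blocked: OpenAI, for example, documents GPTBot as its training crawler and uses separate agents (OAI-SearchBot and ChatGPT-User) for search and user-initiated retrieval, so blanket rules tend to shut out both uses at once.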

In Kolar's framing, this is self-sabotage. The web is bifurcating into sites that are legible to AI and sites that are not. And the sites that are not legible are disappearing from the fastest-growing discovery channels on the internet.

Microsoft's Solution: The Publisher Content Marketplace

Microsoft's answer to the blocking problem is its Publisher Content Marketplace, a platform where publishers license their content to AI developers and get paid every time their data informs an AI inference.

The marketplace launched with People Inc. as its founding partner and has since expanded to eight publisher partners. Microsoft's goal, Kolar said, is to sign up "the entire open web." The idea is straightforward: instead of the adversarial dynamic where publishers block crawlers and AI companies scrape anyway, the marketplace creates a commercial relationship. Publishers get paid. AI companies get licensed data. Everyone wins.

Crucially, Microsoft distinguishes between "training" and "grounding." Training is the bulk process in which AI models learn patterns from large volumes of text. Grounding is the real-time retrieval of current, trusted sources to inform a specific AI-generated answer. Microsoft's marketplace focuses on grounding, not training. Publishers who participate are not giving away their content for model training. They are making it available for real-time citation and attribution in AI-generated responses.

This distinction matters because it addresses one of publishers' deepest fears: that licensing their content means giving AI companies a permanent license to reproduce their work. Grounding-based licensing is more like syndication. The content is cited and attributed in the moment, not absorbed into the model's permanent knowledge.

Kolar was also candid about the business model. "All computing runs on Azure," he said. The marketplace is not a charity project. When publishers license content through the platform, the AI inference that uses that content runs on Microsoft's cloud infrastructure. Every query, every citation, every grounded answer generates Azure compute revenue. "That makes it not a cost for Microsoft," Kolar said. "This is a business."

The Counter-Strategy: Block Everything, Then Negotiate

Not everyone at Prog AI was buying Kolar's open-and-license approach. Jonathan Roberts, Chief Innovation Officer at People Inc., presented a very different strategy: block everything first, then selectively unblock to negotiate licensing deals.

People Inc. blocks 30,000 to 35,000 crawlers per day, Roberts said. The company allows only 38 specific crawlers to access its content. This hyper-aggressive blocking is not about protecting content from AI. It is about creating leverage.

Roberts' logic is straightforward. If you leave your content open to all crawlers, AI companies can access it for free and have no incentive to negotiate a licensing deal. But if you block everything, you create scarcity. AI companies that want your content have to come to you and negotiate terms. You control the conversation.
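The default-deny posture Roberts describes can be sketched at the robots.txt level as an allowlist: admit a short list of named agents and disallow everyone else. This is a minimal sketch under stated assumptions. The build_allowlist_robots helper and the agent name LicensedPartnerBot are hypothetical, and nothing here reflects People Inc.'s actual setup. Keep in mind that robots.txt is a voluntary convention, so a publisher turning away tens of thousands of crawlers a day is presumably also enforcing its policy at the server or CDN level, which this sketch does not cover.

```python
from urllib.robotparser import RobotFileParser


def build_allowlist_robots(allowed_agents):
    """Return robots.txt text that admits only the named agents."""
    groups = [f"User-agent: {agent}\nAllow: /\n" for agent in allowed_agents]
    # Every agent not named above falls through to this default-deny group.
    groups.append("User-agent: *\nDisallow: /\n")
    return "\n".join(groups)


def can_crawl(robots_txt, user_agent, url="https://example.com/"):
    """Check one agent against the generated rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical partner that has signed a licensing deal.
    rules = build_allowlist_robots(["LicensedPartnerBot"])
    print(rules)
    print("LicensedPartnerBot:", can_crawl(rules, "LicensedPartnerBot"))
    print("GPTBot:", can_crawl(rules, "GPTBot"))
```

Unblocking a crawler after a deal then amounts to adding one name to the list, which is what makes the posture usable as a negotiating position.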

Similar reasoning about leverage runs through the New York Times' lawsuit against OpenAI. Publishers with high-value, unique content have leverage. Publishers with commodity content that can be easily replicated do not. The blocking strategy works for the former. The open strategy works for the latter.

The tension between Kolar's "open up" and Roberts' "block and negotiate" is the defining strategic debate in AI visibility right now. There is no single right answer. The optimal strategy depends on the type of content, the competitive position of the publisher, and the value of the content to AI engines.

What This Means for Different Types of Businesses

The blocking question plays out very differently depending on what kind of site you run.

Publishers with premium content (news organizations, research firms, industry analysts) have the most leverage. Their content is unique, timely, and difficult to replicate. For these organizations, Roberts' block-and-negotiate strategy makes sense. The New York Times, Wall Street Journal, and similar publishers can extract licensing fees because AI engines need their content to produce high-quality answers.

Ecommerce retailers have less leverage but more to lose. Product pages are not unique. Thousands of retailers sell the same products with similar descriptions. If Amazon blocks AI crawlers, AI engines will just recommend Walmart or Target instead. For retailers, being visible to AI shopping agents is existential. Blocking is not a viable strategy. Optimization is the only option.

B2B companies and SaaS providers occupy a middle ground. Their content is more differentiated than ecommerce but less unique than premium publishers. Comparison pages, pricing pages, and technical documentation are exactly the type of content that AI search engines surface in answer to commercial queries. Blocking this content means losing visibility at the moment of purchase consideration.

Local businesses are the most vulnerable to the blocking trap. A restaurant whose site is closed to the crawlers behind AI-generated local results is unlikely to appear in them. A dentist who blocks PerplexityBot is unlikely to be recommended when someone asks Perplexity for the "best dentist near me." For local businesses, AI visibility is a direct revenue driver, and blocking is self-defeating.

The Technical Reality of AI Crawler Access

Beyond the strategic debate, there is a practical reality: blocking AI crawlers is technically harder than most people think.

The landscape of AI crawlers is fragmented and evolving rapidly. New crawlers appear monthly. Existing crawlers change their user agent strings. Some AI companies use third-party crawling services that do not identify themselves as AI-related. A robots.txt file that blocks GPTBot, PerplexityBot, and ClaudeBot today may be incomplete by next month.
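One way to see this fragmentation in practice is to classify the user agent strings in your own server logs. The sketch below counts hits by matching a list of known tokens as substrings. The token list, the count_ai_hits helper, and the sample strings (simplified, including the made-up ExampleFetcher) are all assumptions for illustration, not real log data.

```python
from collections import Counter

# Tokens to look for; this list goes stale as new crawlers appear.
KNOWN_AI_TOKENS = ("GPTBot", "PerplexityBot", "ClaudeBot", "Bytespider")


def count_ai_hits(user_agents, tokens=KNOWN_AI_TOKENS):
    """Count requests per known AI crawler token; the rest go to 'other'."""
    counts = Counter()
    for ua in user_agents:
        lowered = ua.lower()
        label = next((t for t in tokens if t.lower() in lowered), "other")
        counts[label] += 1
    return counts


# Simplified, illustrative user agent strings.
SAMPLE_USER_AGENTS = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/126.0",
    "ExampleFetcher/2.3",  # hypothetical third-party crawler with no AI label
]

if __name__ == "__main__":
    for label, hits in count_ai_hits(SAMPLE_USER_AGENTS).items():
        print(f"{label}: {hits}")
```

The unlabeled fetcher lands in the same bucket as an ordinary browser, which is exactly the blind spot that name-based rules leave open.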

There is also a growing gap between "crawling" and "knowledge." Even if a website blocks all direct crawlers, AI models may still have information about it from secondary sources. Third-party databases, social media mentions, review sites, and knowledge graph entries all contribute to what an AI model knows about a brand. Blocking direct crawling reduces visibility but does not eliminate it entirely.

This means that the binary choice between "open" and "blocked" is a false dichotomy. The real question is not whether to block AI crawlers but how to manage what they find when they visit your site. Content strategy, structured data, entity clarity, and technical optimization all matter more than the robots.txt file alone.
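To make "structured data" and "entity clarity" concrete, here is a minimal sketch that emits schema.org Organization markup as a JSON-LD script tag. The organization_jsonld helper, the company name, and the URLs are placeholders invented for this example. How much weight any given AI engine gives such markup is not something the talk specified.

```python
import json


def organization_jsonld(name, url, same_as):
    """Build a JSON-LD <script> tag describing an organization."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        # Profiles on other sites that refer to the same entity.
        "sameAs": list(same_as),
    }
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


if __name__ == "__main__":
    # Placeholder values; substitute your own organization's details.
    print(organization_jsonld(
        "Example Co",
        "https://example.com",
        ["https://www.linkedin.com/company/example-co"],
    ))
```

The resulting tag goes in the page's HTML, where any crawler that is allowed in can read an unambiguous statement of who the site belongs to.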

The Path Forward

The AI visibility landscape is still in its early stages. Microsoft's Publisher Content Marketplace is one model, but it is not the only one. OpenAI has its own licensing deals. Google has been more aggressive about using publicly available content for AI Overviews, which has drawn publisher complaints. Perplexity has introduced a publisher program with ad revenue sharing.

For website owners, the practical takeaway is clear: the decision about AI crawler access is not a one-time choice. It is an ongoing strategic calculation that depends on your content type, competitive position, and business model.

Start by auditing your current crawler access. Which AI crawlers can reach your site? Which are blocked? What content are they finding? Use tools that measure your AI visibility across major AI search engines to understand where you appear and where you are missing.
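A first-pass audit can be done with nothing more than your robots.txt file. The sketch below reports, for each AI crawler, which of your key paths it is told not to fetch. The agent list, the sample rules, the paths, and the audit_access helper are assumptions for illustration. It checks only what robots.txt declares, not what a firewall or CDN may block, and not where your brand actually appears in AI answers.

```python
from urllib.robotparser import RobotFileParser

# Crawler names taken from the article; extend as needed.
AI_AGENTS = ["GPTBot", "PerplexityBot", "ClaudeBot", "Bytespider"]

# Hypothetical robots.txt for the site being audited.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /pricing/

User-agent: *
Disallow:
"""


def audit_access(robots_txt, paths, agents=AI_AGENTS):
    """Map each agent to the list of paths it is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        agent: [p for p in paths if not parser.can_fetch(agent, p)]
        for agent in agents
    }


if __name__ == "__main__":
    report = audit_access(SAMPLE_ROBOTS, ["/", "/pricing/", "/docs/"])
    for agent, blocked in report.items():
        print(f"{agent}: blocked from {blocked or 'nothing'}")
```

To audit a live site, RobotFileParser can also load the file itself via set_url() followed by read() instead of parsing a string.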

Then make an informed decision. If you have unique, high-value content, consider the block-and-negotiate approach. If you are an ecommerce retailer or local business, optimize for AI discovery. If you are a B2B company, focus on the content types that AI search engines surface most often: comparison pages, technical documentation, and thought leadership.

The worst option is inaction. Kolar's warning is worth taking seriously: by his count, four out of five websites block AI bots and are effectively invisible to AI engines. If your site is one of them, the competitors who chose visibility are benefiting from your decision.
