Blocking AI Bot Crawlers with Nginx

I've already done this for my robots.txt file, but frankly I don't trust any AI company to respect it.

robots.txt is a bit like asking bots not to visit my site; with .htaccess, you're not asking.

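Ever since then it's been on my list to work out how to do this with nginx, and yesterday I decided to be lazy and ask on Mastodon whether anyone knew how. Luke pointed me to this article about using .htaccess with nginx (which you can't do, because it's an Apache feature), but it did include a link to an .htaccess to nginx converter, which got me heading in the right direction^[1]. After some digging, it turns out there are a few ways to do this. I could block each individual bot in its own block: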

if ($http_user_agent = "BadBotOne") {
	return 403;
}

Or, preferably, I could include them all in a single block:

# case sensitive
if ($http_user_agent ~ (BadBotOne|BadBotTwo)) {
	return 403;
}

# case insensitive
if ($http_user_agent ~* (BadBotOne|BadBotTwo)) {
	return 403;
}
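
One detail worth spelling out: `=` compares the whole header, while `~` and `~*` are regular expression matches that succeed anywhere in the string, so user agents with version suffixes or extra text still get caught. The JavaScript below is only a rough analogy for the three operators. The helper names are hypothetical, and JavaScript's regex engine is not the PCRE engine nginx uses, although a simple alternation like this one behaves the same way in both.

```javascript
// Rough JavaScript analogy for nginx's `=`, `~` and `~*` operators.
// Helper names are hypothetical; nginx itself uses PCRE, not JS regexes.
const pattern = "(BadBotOne|BadBotTwo)";

// `=`  : the whole header must equal the string
const matchesExact = (ua, name) => ua === name;
// `~`  : case-sensitive regex match anywhere in the header
const matchesCaseSensitive = (ua) => new RegExp(pattern).test(ua);
// `~*` : case-insensitive regex match anywhere in the header
const matchesCaseInsensitive = (ua) => new RegExp(pattern, "i").test(ua);

const ua = "Mozilla/5.0 (compatible; badbotone/1.2)";
console.log(matchesExact(ua, "BadBotOne"));   // false
console.log(matchesCaseSensitive(ua));        // false
console.log(matchesCaseInsensitive(ua));      // true
```

This is why the case-insensitive form is the safer default for a block list: a crawler that changes the capitalisation of its name would slip past the other two.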

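Unlike .htaccess, I can't just create an nginx.conf file in my Eleventy site and be done with it: nginx config files don't live in the root of the site they serve. It turns out you can include other conf files from the main nginx.conf, which is handy. I did a quick test with a redirect, included from the main conf file, to confirm it worked as expected: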

# nginx.conf
server {
    include /home/forge/rknight.me/nginx.conf;
	# ... 
	# the rest of my nginx config
}

# nginx.conf file generated by 11ty
rewrite ^/thisisatestandnotarealpage /now permanent;

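The other thing I wanted was to not expose this file publicly on my website, but how do you place a file outside the public folder in Eleventy? In another stroke of luck, if you set permalink: ../nginx.conf (note the ..), the file is created one level up from the output directory. So if we look at my site as a whole, the brand new nginx.conf file is built exactly where I want it: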

cli
config
public <-- the directory my site builds to
src
+ nginx.conf
package-lock.json
package.json
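
Why the `..` puts the file next to `public` rather than inside it can be illustrated with plain path joining. This is only an illustration using Node's `path` module, assuming the output directory is `public` as in the listing above. It is not a claim about how Eleventy resolves permalinks internally.

```javascript
// Illustration only: joining the output directory with a permalink that starts with "..".
const path = require("node:path");

const outputDir = "public";          // directory the site builds to
const permalink = "../nginx.conf";   // permalink from the template front matter

console.log(path.posix.join(outputDir, permalink));         // nginx.conf
console.log(path.posix.join(outputDir, "blog/index.html")); // public/blog/index.html
```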

I don't want to commit this file to version control, so I added it to my .gitignore:

public
node_modules
+ nginx.conf

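I already pull the bot data from this repository to generate my robots.txt file, so I only needed to update my data file to provide a second version of the data in the right format for the nginx config. I also filter out AppleBot, because my understanding is that it's a search crawler and has nothing to do with AI slurping.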

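// Fetch the community-maintained robots.txt listing AI crawlers
const res = await fetch("https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/robots.txt")
let txt = await res.text()

// Drop Applebot from the robots.txt text: it's a search crawler, not an AI scraper
txt = txt.split("\n")
    .filter(line => line !== "User-agent: Applebot")
    .join("\n")

// Collect the bare bot names from the remaining User-agent lines
const bots = txt.split("\n")
    .filter(line => {
        return line.startsWith("User-agent:") && line !== "User-agent: Applebot"
    })
    .map(line => line.split(":")[1].trim())

const data = {
    txt: txt, // used for robots.txt
    nginx: bots.join('|'), // pipe-separated list for the nginx regex
}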
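
Because the joined list ends up inside a regular expression, a name such as `iaskspider/2.0` contains a character (the `.`) that the regex treats specially. It still matches, just slightly more loosely than intended. Below is a minimal sketch of escaping each name before joining, assuming the same `User-agent:` line format as above. `escapeRegex` and `buildNginxPattern` are hypothetical helpers, not part of the original data file.

```javascript
// Hypothetical helpers: escape regex metacharacters in each bot name
// before joining them into the alternation used in the nginx `if`.
const escapeRegex = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

function buildNginxPattern(robotsTxt, exclude = ["Applebot"]) {
  return robotsTxt
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.startsWith("User-agent:"))
    // slice() rather than split(":")[1] so a name containing ":" survives intact
    .map((line) => line.slice("User-agent:".length).trim())
    .filter((name) => name && !exclude.includes(name))
    .map(escapeRegex)
    .join("|");
}

const sample = "User-agent: GPTBot\nUser-agent: Applebot\nUser-agent: iaskspider/2.0\nDisallow: /";
console.log(buildNginxPattern(sample)); // GPTBot|iaskspider/2\.0
```

Names with spaces, like `Kangaroo Bot`, pass through unchanged, which is one reason the pattern sits inside double quotes in the nginx `if`.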

I added a new file called nginx.conf.njk, which looks like this:

---
permalink: ../nginx.conf
eleventyExcludeFromCollections: true
---
# Block AI bots
if ($http_user_agent ~* "(AI2Bot|Ai2Bot-Dolma|Amazonbot|anthropic-ai|Applebot-Extended|Bytespider|CCBot|ChatGPT-User|Claude-Web|ClaudeBot|cohere-ai|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google-Extended|GoogleOther|GoogleOther-Image|GoogleOther-Video|GPTBot|iaskspider/2.0|ICC-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo Bot|Meta-ExternalAgent|Meta-ExternalFetcher|OAI-SearchBot|omgili|omgilibot|PerplexityBot|PetalBot|Scrapy|Sidetrade indexer bot|Timpibot|VelenPublicWebCrawler|Webzio-Extended|YouBot)"){
    return 403;
}

The output looks like this:

if ($http_user_agent ~* "(AdsBot-Google|Amazonbot|anthropic-ai|Applebot|Applebot-Extended|AwarioRssBot|AwarioSmartBot|Bytespider|CCBot|ChatGPT-User|ClaudeBot|Claude-Web|cohere-ai|DataForSeoBot|Diffbot|FacebookBot|FriendlyCrawler|Google-Extended|GoogleOther|GPTBot|img2dataset|ImagesiftBot|magpie-crawler|Meltwater|omgili|omgilibot|peer39_crawler|peer39_crawler/1.0|PerplexityBot|PiplBot|scoop.it|Seekr|YouBot)"){
    return 403;
}

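As a bonus of doing this, I was able to add all the redirects from when I moved my blog posts under the /blog directory, as well as the config for displaying a pretty RSS feed. This way, if I rebuild my server I won't lose any of these rules, and they're kept in version control, which is much nicer.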

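To check it was working as expected, I set a custom user agent in Chrome: click the three dots in the inspector > More tools > Network conditions > User agent. I then set the user agent to ClaudeBot, refreshed my site, and saw a lovely 403 Forbidden page.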
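
The same check can be scripted rather than clicked through. Below is a minimal sketch, assuming Node 18+ where `fetch` is built in. `isBlocked` is a hypothetical helper, and `https://example.com/` is a placeholder for whichever site is being tested.

```javascript
// Hypothetical helper: request a URL with a given User-Agent and report
// whether the server answered 403. `fetchImpl` is injectable so the logic
// can be exercised without a network connection.
async function isBlocked(url, userAgent, fetchImpl = fetch) {
  const res = await fetchImpl(url, {
    headers: { "User-Agent": userAgent },
    redirect: "manual", // report the server's own response instead of following redirects
  });
  return res.status === 403;
}

// Usage (placeholder URL):
// isBlocked("https://example.com/", "ClaudeBot").then(console.log);
```

Running it once with a blocked name and once with an ordinary browser user agent confirms both that the bots are refused and that regular visitors are not.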

In conclusion, fuck AI crawlers.

Update

It would be a real shame if I followed Melanie's suggestion and redirected to a 10GB file:

return 307 https://ash-speed.hetzner.com/10GB.bin;
