nepenthes/README.md
2025-09-10 16:12:55 +00:00


Nepenthes
=========
This is a tarpit intended to catch web crawlers. Specifically, it
targets crawlers that scrape data for LLMs - but really, like the
plants it is named after, it'll eat just about anything that finds its
way inside.

It works by generating an endless sequence of pages, each with
dozens of links that simply lead back into the tarpit. Pages are
randomly generated, but in a deterministic way, causing them to appear
to be flat files that never change. Intentional delay is added to
prevent crawlers from bogging down your server, in addition to wasting
their time. Lastly, Markov-babble is added to the pages, to give the
crawlers something to scrape up and train their LLMs on, hopefully
accelerating model collapse.
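
The "random, but deterministic" idea can be sketched in a few lines of Python. This is only an illustration, not how Nepenthes (which is written in Lua) actually does it: the function name `links_for`, the choice of SHA-256, and the tiny word list are all assumptions made for the example. The point is that the random seed is derived from a per-instance secret plus the requested path, so the same URL always produces the same page.

```python
import hashlib
import random

def links_for(path, instance_seed, wordlist, count=5, depth=3):
    """Return the same pseudo-random links every time for a given path."""
    # Hypothetical scheme: seed the RNG from the instance seed and the URL.
    digest = hashlib.sha256((instance_seed + path).encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return ["/maze/" + "/".join(rng.choice(wordlist) for _ in range(depth))
            for _ in range(count)]

words = ["toque", "Messianism", "narrowly", "sipe", "piles"]
first = links_for("/maze/toque", "secret", words)
again = links_for("/maze/toque", "secret", words)  # identical to `first`
```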

[You can take a look at what this looks like, here. (Note: VERY slow page loads!)](https://zadzmo.org/nepenthes-demo)

WARNING
=======
THIS IS DELIBERATELY MALICIOUS SOFTWARE INTENDED TO CAUSE HARMFUL
ACTIVITY. DO NOT DEPLOY IF YOU AREN'T FULLY COMFORTABLE WITH WHAT YOU
ARE DOING.

ANOTHER WARNING
===============
LLM scrapers are relentless and brutal. You may be able to keep them at
bay with this software; but it works by providing them with a
neverending stream of exactly what they are looking for. YOU ARE LIKELY
TO EXPERIENCE SIGNIFICANT CONTINUOUS CPU LOAD.

Great effort has been taken to make Nepenthes performant and use
the bare minimum of system resources, but it is still trivially easy
to misconfigure in a way that can take your server offline. This is
especially true if some of the aggressive, less well-behaved crawlers
find your instance.

YET ANOTHER WARNING
===================
There is not currently a way to differentiate between web crawlers that
are indexing sites for search purposes, vs crawlers that are training
AI models. ANY SITE THIS SOFTWARE IS APPLIED TO WILL LIKELY DISAPPEAR
FROM ALL SEARCH RESULTS.

So why should I run this, then?
===============================
So that, as I said to
[Ars Technica](https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/),
we can fight back. Make your website indigestible to scrapers and
grow some spikes.

Instead of rolling over and letting these assholes do what they want,
make them have to work for it instead.

Further questions? I made a [FAQ](/code/nepenthes/FAQ.md) page.

Latest Version
--------------

[Nepenthes 2.0](https://zadzmo.org/downloads/nepenthes/file/nepenthes-2.0.tar.gz)

[Docker Image](https://zadzmo.org/downloads/nepenthes/docker)

[Latest Permalink](https://zadzmo.org/downloads/nepenthes/latest)

[All downloads](https://zadzmo.org/downloads/nepenthes)

[RSS feed of releases](https://zadzmo.org/downloads/nepenthes/rss)

Installation
------------
You can use Docker, or install manually. The latest Dockerfile and
compose.yml can be found at the
[Download Manager.](https://zadzmo.org/downloads/nepenthes/docker/latest)

For manual installation, you'll need to install Lua. Nepenthes makes use
of the `<close>` feature, so Lua 5.4 is required. OpenSSL is also needed
for cryptographic functions.

The following Lua modules need to be installed. If they are all present
in your OS's package manager, use that; otherwise you will need to install
LuaRocks and use it to install them:
- [cqueues](https://luarocks.org/modules/daurnimator/cqueues)
- [ossl](https://luarocks.org/modules/daurnimator/luaossl) (aka luaossl)
- [lpeg](https://luarocks.org/modules/gvvaughan/lpeg)
- [lzlib](https://luarocks.org/modules/hisham/lzlib)
(or [lua-zlib](https://luarocks.org/modules/brimworks/lua-zlib),
only one of the two needed)
- [unix](https://luarocks.org/modules/daurnimator/lunix) (aka lunix)

Create a nepenthes user (you REALLY don't want this running as root.)
Let's assume the user's home directory is also your install directory.
```sh
useradd -m nepenthes
```
Unpack the tarball:
```sh
cd scratch/
tar -xvzf nepenthes-2.0.tar.gz
cp -r nepenthes-2.0/* /home/nepenthes/
```
Tweak config.yml as you prefer (see below for documentation.) Then you're
ready to start:
```sh
su -l nepenthes -c '/home/nepenthes/nepenthes /home/nepenthes/config.yml'
```
Sending SIGTERM or SIGINT will shut the process down.

Webserver Configuration
-----------------------
Expected usage is to hide the tarpit behind nginx or Apache, or whatever
else you have implemented your site in. Directly exposing it to the
internet is ill-advised. We want it to look as innocent and normal as
possible; in addition, HTTP headers can be used to configure the tarpit.

I'll be using nginx configurations for examples. Here's a real-world
snippet for the demo above:
```nginx
location /maze/ {
    proxy_pass http://localhost:8893;
    proxy_set_header X-Forwarded-For $remote_addr;
    proxy_buffering off;
}
```
The X-Forwarded-For header is technically optional, but omitting it will
make your statistics largely useless.

The proxy_buffering directive is important. LLM crawlers typically
disconnect if not given a response within a few seconds; Nepenthes
counters this by drip-feeding a few bytes at a time. Buffering breaks
this workaround.
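
To make the drip-feed idea concrete, here is a rough Python sketch. It is not taken from the Nepenthes source: the function names, the 16-byte chunk size, and the even spacing of pauses are assumptions for illustration, and the wait values are assumed to be seconds. It only computes the schedule; a real server would sleep for each pause before writing the chunk.

```python
import random

def pick_delay(min_wait, max_wait, rng=random):
    """Choose a total delay between min_wait and max_wait (assumed seconds)."""
    return rng.uniform(min_wait, max_wait)

def drip_schedule(body, total_delay, chunk_size=16):
    """Split body into small chunks and spread total_delay across them."""
    chunks = [body[i:i + chunk_size] for i in range(0, len(body), chunk_size)]
    pause = total_delay / len(chunks) if chunks else 0.0
    return [(pause, chunk) for chunk in chunks]

page = b"<html>" + b"x" * 100 + b"</html>"
schedule = drip_schedule(page, pick_delay(10, 65))
# A real server would now do something like:
#   for pause, chunk in schedule: sleep(pause), then send(chunk)
```

With a buffering proxy in front, those small writes would be collected and delivered in one burst at the end, which is exactly what `proxy_buffering off` avoids.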

Nepenthes versions 1.x used an X-Prefix header; this has been removed.

Nepenthes Configuration
-----------------------
A very simple configuration, that matches the above nginx configuration
block, could be:
```yaml
---
http_host: '::'
http_port: 8893
templates:
  - '/usr/nepenthes/templates'
  - '/home/nepenthes/templates'
seed_file: '/home/nepenthes/seed.txt'
min_wait: 10
max_wait: 65
silos:
  - name: default
    wordlist: '/usr/share/dict/words'
    corpus: '/home/nepenthes/mishmash.txt'
    prefixes:
      - /maze
```
Most of the values should be self-explanatory. The 'silos' directive
is not optional (more on that later); however, only one silo needs to be
defined.

Multiple template directories can be included, so you can bring your
own in from outside the Nepenthes distribution.

Multiple prefixes can be defined per silo. Sending traffic with a
prefix that is not configured will likely fire the bogon filter, causing
Nepenthes to return a 404 HTTP status.
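
As a rough mental model of a bogon check, consider the Python sketch below. The real filter may well work differently; `is_bogon` is a hypothetical helper, and the rule it applies (a configured prefix must match, and every remaining path segment must be in the wordlist) is my reading of the description here, not a copy of the implementation.

```python
def is_bogon(path, prefixes, wordlist):
    """True if path could not have been generated from prefixes + wordlist."""
    for prefix in prefixes:
        if path == prefix or path.startswith(prefix + "/"):
            rest = path[len(prefix):].strip("/")
            segments = rest.split("/") if rest else []
            # Any word outside the wordlist means we never generated this URL.
            return any(seg not in wordlist for seg in segments)
    return True  # no configured prefix matched

words = {"toque", "Messianism", "narrowly"}
ok = is_bogon("/maze/toque/Messianism/narrowly", ["/maze"], words)
probe = is_bogon("/maze/wp-login.php", ["/maze"], words)
```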

Markov
------
Nepenthes 2.0 and later keep the corpus entirely in memory; real world
testing shows this is a significant (40x) speedup with roughly the same
memory consumption, as SQLite used a significant amount of memory. The
only downside is that startup time has increased, as the corpus is
re-trained every time. For reasonable corpus sizes (60,000 lines or so)
on modern hardware, this training time at startup is several seconds.

Actual Markov parameters (tokens generated, etc.) are now controlled
from within the templates.
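
For readers unfamiliar with Markov babble, here is a minimal in-memory sketch in Python. It assumes a first-order, word-level chain, which may not match what Nepenthes actually trains; `train` and `babble` are names invented for this example. The min/max token counts play the same role as the template parameters described in the next section.

```python
import random
from collections import defaultdict

def train(lines):
    """Build a first-order Markov table: word -> list of following words."""
    table = defaultdict(list)
    for line in lines:
        words = line.split()
        for current, following in zip(words, words[1:]):
            table[current].append(following)
    return table

def babble(table, min_tokens, max_tokens, rng):
    """Generate between min_tokens and max_tokens words of babble."""
    count = rng.randint(min_tokens, max_tokens)
    word = rng.choice(sorted(table))
    out = [word]
    while len(out) < count:
        followers = table.get(word)
        # Dead end: restart from a random known word.
        word = rng.choice(followers) if followers else rng.choice(sorted(table))
        out.append(word)
    return " ".join(out)

corpus = ["the plant eats the fly", "the fly finds the plant"]
table = train(corpus)
text = babble(table, 5, 10, random.Random(1))
```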

Templates
---------
Template files consist of two parts: a YAML prefix and a Handlebars/
Lustache template. The
[default template](https://svn.zadzmo.org/repo/nepenthes/head/templates/default.lmt)
is a good reference to look at.

The 'markov', 'link_array', and 'link' sections in the YAML portion are
used to define variables that are passed to the templating engine.
- markov: Fills a variable with Markov babble.
  - name: Variable name passed to the template.
  - min: Minimum number of 'tokens' (words, essentially) of Markov slop to generate.
  - max: Maximum number of tokens.
- link_array: Creates a variable-sized array of links.
  - min_count: Minimum number of links in the array.
  - max_count: Maximum number of links in the array.
  - depth_min: The smallest number of words (from the given wordlist) to put into a URL;
    e.g., '/toque/Messianism/narrowly' has a depth of three.
  - depth_max: The largest number of words to put into a URL.
- link: Creates a single named link.
  - name: Variable name passed to the template.
  - depth_min: The smallest number of words to put into the URL.
  - depth_max: The largest number of words to put into the URL.
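
To show how the count and depth parameters interact, here is an illustrative Python sketch. The helper names and the dictionary layout of `spec` are assumptions for the example; only the parameter names (`min_count`, `max_count`, `depth_min`, `depth_max`) come from the list above.

```python
import random

def make_link(prefix, wordlist, depth_min, depth_max, rng):
    """Build one URL with depth_min..depth_max words from the wordlist."""
    depth = rng.randint(depth_min, depth_max)
    return prefix + "/" + "/".join(rng.choice(wordlist) for _ in range(depth))

def make_link_array(prefix, wordlist, spec, rng):
    """Build a list of min_count..max_count links from a link_array spec."""
    count = rng.randint(spec["min_count"], spec["max_count"])
    return [make_link(prefix, wordlist, spec["depth_min"], spec["depth_max"], rng)
            for _ in range(count)]

spec = {"min_count": 5, "max_count": 8, "depth_min": 1, "depth_max": 3}
links = make_link_array("/maze", ["toque", "Messianism", "narrowly"], spec,
                        random.Random(42))
```

Each entry in `links` looks like '/maze/toque/narrowly': the prefix followed by one to three words.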
The second portion of the template file is a Lustache template; you
can find detailed documentation at
[Lustache's website](https://olivinelabs.com/lustache/).

Statistics
----------
Nepenthes 2.0 and later do not store persistent statistics. The focus
is now on presenting a snapshot in time; the intent is to offload
detailed analysis to tools intended for such purposes, such as an
external SQL database. The configuration variable stats_remember_time
sets the time horizon and defaults to one hour.
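
Conceptually, a rolling buffer with a time horizon behaves like the Python sketch below. The class and method names are hypothetical and the real data structure is not documented here; the sketch only illustrates that entries older than `stats_remember_time` seconds drop out of the snapshot.

```python
from collections import deque

class RollingStats:
    """Keep only the hits seen within the last remember_time seconds."""

    def __init__(self, remember_time=3600):
        self.remember_time = remember_time
        self.buffer = deque()  # (timestamp, record) pairs, oldest first

    def add(self, now, record):
        self.buffer.append((now, record))
        self.prune(now)

    def prune(self, now):
        # Drop entries that have fallen past the time horizon.
        while self.buffer and self.buffer[0][0] < now - self.remember_time:
            self.buffer.popleft()

    def hits(self):
        return len(self.buffer)

stats = RollingStats(remember_time=3600)
stats.add(0, {"uri": "/maze/a"})
stats.add(1800, {"uri": "/maze/b"})
stats.add(4000, {"uri": "/maze/c"})  # the entry from t=0 is now too old
```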

The top-level /stats endpoint gives a broad overview. Here's a real example,
captured as I'm writing this:
```sh
curl http://localhost:8893/stats | jq
```
```json
{
  "addresses": 1850,
  "unsent_bytes_percent": 0.13952029418754,
  "hits": 10015,
  "agents": 145,
  "unsent_bytes": 20585,
  "cpu_percent": 1.7462639725424,
  "delay": 56020.624358161,
  "active": 25,
  "bytes_sent": 14733541,
  "uptime": 299516,
  "memory_usage": 210422861,
  "cpu_total": 5230.34,
  "bytes_generated": 14754126,
  "cpu": 10.335670266766,
  "bogons": 4
}
```
Here we see that, in the past hour, 1850 distinct clients ('addresses') reached
the tarpit, made 10015 requests ('hits'), and presented 145 different
user-agent strings ('agents'). They were sent 14 megabytes of trash
('bytes_sent') and collectively waited 56020 seconds - 15 hours! - to
get said garbage ('delay').

There are 25 active connections ('active') being served slop as we speak,
and 20 kilobytes of slop that has been generated already but not sent
('unsent_bytes'). 'unsent_bytes_percent' is intended to be a gauge of
the effectiveness of the delay times: here it's less than 1 percent. If
unsent_bytes_percent rises significantly, it means crawlers are
routinely disconnecting before the request is finished.

Four requests ('bogons') asked for a URL that couldn't possibly be
generated by the configured word list.

To serve these 10015 requests, Nepenthes utilized the CPU for 10 seconds
of the previous hour ('cpu'), computed to be about 1.75% of the CPU available
('cpu_percent'). (This isn't intended to be a precise metric; it doesn't
take into account multiple CPUs, and Nepenthes can only utilize one
currently.) 'cpu_total' is as reported by the Lua runtime since Nepenthes
was started, which was 299516 seconds ago ('uptime').

Memory used is 200 megabytes ('memory_usage'), as reported by the Lua
garbage collector.
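
The figures in the sample above can be cross-checked with a little arithmetic, sketched here in Python. The variable names are mine. The relationships are observations about this one sample, not a statement about how Nepenthes computes the fields internally: the unsent bytes equal generated minus sent, and the reported 'cpu_percent' happens to match 'cpu_total' divided by 'uptime'.

```python
sample = {
    "unsent_bytes": 20585,
    "bytes_sent": 14733541,
    "bytes_generated": 14754126,
    "delay": 56020.624358161,
    "uptime": 299516,
    "cpu_total": 5230.34,
}

# Bytes generated but never delivered, and their share of everything generated.
unsent = sample["bytes_generated"] - sample["bytes_sent"]
unsent_percent = 100 * unsent / sample["bytes_generated"]

# Total time crawlers spent waiting, in hours.
delay_hours = sample["delay"] / 3600

# CPU seconds since startup as a share of wall-clock uptime.
lifetime_cpu_percent = 100 * sample["cpu_total"] / sample["uptime"]
```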

Speaking of garbage collection: in some cases, such as abnormal ends to
an HTTP transaction (like a client disconnecting before receiving all
data), the 'active' metric can read higher than it really is. It should
eventually self-correct as the garbage collector does its job.

If you want to see actual agent strings or IP address information, that
can be returned as well:
```sh
curl http://localhost:8893/stats/agents | jq
```
```json
{
  "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36": 289,
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.5 Safari/605.1.15": 3,
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36": 3,
  "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)": 516,
  "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)": 480,
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:137.0) Gecko/20100101 Firefox/137.0": 9,
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:136.0) Gecko/20100101 Firefox/136.0": 2,
  "Mozilla/5.0 (compatible; Barkrowler/0.9; +https://babbar.tech/crawler)": 146
}
```
```sh
curl http://localhost:8893/stats/addresses | jq
```
```json
{
  "3.224.205.25": 7,
  "44.207.207.36": 9,
  "54.89.90.224": 8,
  "66.249.64.9": 24,
  "2a03:2880:f806::": 2,
  "2a03:2880:f806:8::": 2,
  "2a03:2880:f806:29::": 2,
  "94.74.85.29": 2,
  "111.119.233.225": 1
}
```
Want to see the raw data? There's an endpoint for that too.
```sh
curl http://localhost:8893/stats/buffer | jq
```
```json
[
  {
    "address": "fda0:bb68:b812:d00d:8aae:ddff:fe42:62fe",
    "complete": true,
    "when": 886257.9491666,
    "delay": 19.389002208016,
    "cpu": 0.019220833084546,
    "uri": "/maze/mingelen/sipe/piles/suaharo",
    "id": "1757365444.1",
    "silo": "default",
    "bytes_generated": 1188,
    "agent": "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "response": 200,
    "bytes_sent": 1188
  },
  {
    "address": "fda0:bb68:b812:d00d:8aae:ddff:fe42:62fe",
    "complete": true,
    "when": 886291.34533826,
    "delay": 16.663
...etc
```
To facilitate data export to an analysis system, the 'id' field is
unique across all requests, even if Nepenthes is restarted. You can ask
Nepenthes to only send data _after_ a specific ID, preventing duplicates
from being imported:
```sh
curl http://localhost:8893/stats/buffer/from/1757365444.1 | jq
```
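
An export job built on this endpoint might look roughly like the Python sketch below. The two URL shapes come from the curl examples above; everything else is an assumption for illustration, including the helper names, the `localhost:8893` address, and in particular the assumption that the buffer is returned oldest-first, so the last record carries the newest ID.

```python
import json
from urllib.request import urlopen

BASE = "http://localhost:8893"  # assumed address, matching the examples above

def buffer_url(last_id=None):
    """URL for the whole buffer, or only entries after last_id."""
    if last_id is None:
        return BASE + "/stats/buffer"
    return BASE + "/stats/buffer/from/" + last_id

def fetch_json(url):
    """Default fetcher: plain HTTP GET, decoded as JSON."""
    with urlopen(url) as response:
        return json.load(response)

def export_new(last_id, store, fetch=fetch_json):
    """Append unseen records to store; return the newest ID seen."""
    for record in fetch(buffer_url(last_id)):
        store.append(record)
        last_id = record["id"]  # assumes records arrive oldest-first
    return last_id
```

A periodic job would keep the returned ID between runs and pass it back in as `last_id`, so each record is imported only once.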

Silos
-----
Version 2.0 provides silos, which work similarly in concept to
virtual hosts on a web server. Each silo can have its own
configuration, including Markov corpus, wordlist, delay times,
statistics, template, etc. This is specified in the configuration YAML.
```yaml
silos:
  - name: fast
    corpus: /vol/nepenthes/corpus-1.txt
    wordlist: /vol/nepenthes/words-1.txt
    default: true
    min_wait: 15
    max_wait: 20
    prefixes:
      - /maze
  - name: slow
    corpus: /vol/nepenthes/corpus-2.txt
    wordlist: /vol/nepenthes/words-2.txt
    template: slowerpage
    min_wait: 200
    max_wait: 300
    prefixes:
      - /theslowhole
```
Silos can share a Markov corpus or have separate ones. Simply
specify the same filename to share a corpus; Nepenthes won't train it
twice. The same is true of wordlists used to build URLs.

The header X-Silo is used to signify which silo the incoming request
should be put into:
```nginx
location /maze/ {
    proxy_pass http://localhost:8893;
    proxy_set_header X-Silo 'fast';
    proxy_set_header X-Forwarded-For $remote_addr;
    proxy_buffering off;
}

location /theslowhole/ {
    proxy_pass http://localhost:8893;
    proxy_set_header X-Silo 'slow';
    proxy_set_header X-Forwarded-For $remote_addr;
    proxy_buffering off;
}
```
If the X-Silo header is not present, the request will be placed in
the default silo, marked by the 'default' boolean in the configuration
above. Specifying more than one default will cause an error on startup.
If a default silo is not specified, the first silo listed in the
configuration will be assumed to be the default one.
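
The selection rules above can be written out as a short Python sketch. This is a paraphrase of the documented behaviour, not the actual implementation: the function names are invented, header matching is simplified to an exact key lookup, and what happens when X-Silo names an unknown silo is not documented, so the sketch simply returns None in that case.

```python
def default_silo(silos):
    """Apply the default rules: one explicit default, else the first silo."""
    flagged = [s for s in silos if s.get("default")]
    if len(flagged) > 1:
        raise ValueError("more than one default silo configured")
    return flagged[0] if flagged else silos[0]

def pick_silo(headers, silos, silo_header="X-Silo"):
    """Choose a silo from the request headers, falling back to the default."""
    wanted = headers.get(silo_header)
    if wanted is None:
        return default_silo(silos)
    for silo in silos:
        if silo["name"] == wanted:
            return silo
    return None  # unknown silo name: behaviour here is an open question

silos = [{"name": "fast", "default": True}, {"name": "slow"}]
chosen = pick_silo({"X-Silo": "slow"}, silos)
fallback = pick_silo({}, silos)
```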

Statistics can be filtered on a per-silo basis:
```sh
curl http://localhost:8893/stats/silo/slow | jq
```
```sh
curl http://localhost:8893/stats/silo/slow/agents | jq
```
```sh
curl http://localhost:8893/stats/silo/slow/addresses | jq
```
Configuration File Reference
----------------------------
All possible directives in config.yml:
- http_host : sets the host that Nepenthes will listen on; default is
localhost only.
- http_port : sets the listening port number; default 8893
- unix_socket: sets a path to a unix domain socket to listen on.
Default is nil. If specified, will override http_host and http_port,
and only listen on Unix domain sockets.
- nochdir: If true, do not change directory after daemonization.
Default is false. Normally only used for development/debugging, as it
allows for relative paths in the configuration, but is bad practice
(daemons should, in fact, chdir to '/' after forking).
- templates: Paths to the template files. This should include the
'/templates' directory inside your Nepenthes installation, and any
other directories that contain templates you want to use.
- detach: If true, Nepenthes will fork into the background and redirect
logging output to Syslog.
- log_level: Log message filtering; same priorities as syslog. Defaults
to 'info'.
- pidfile: Path to drop a pid file after daemonization. If empty, no
pid file is created.
- real_ip_header: Changes the name of the X-Forwarded-For header that
communicates the actual client IP address for statistics gathering.
- silo_header: Changes the name of the X-Silo header that controls silo
assignment.
- seed_file: Specifies location of persistent unique instance
identifier. This allows two instances with the same corpus to have
different looking tarpits. If not specified, the seed will not
persist, causing pages to change if Nepenthes is restarted.
- stats_remember_time: Sets how long entries remain in the rolling
stats buffer, in seconds. Defaults to 3600 (one hour.)
- min_wait: Default minimum delay time if not specified in a silo
configuration.
- max_wait: Default maximum delay time if not specified in a silo
configuration.
- silos:
  - name: Name of the silo, which is matched against the X-Silo header.
  - template: Template file to use in this silo. Default is 'default',
    included in the Nepenthes distribution.
  - min_wait: Optional. Minimum delay time in this silo.
  - max_wait: Optional. Maximum delay time in this silo.
  - default: If set to 'true', marks this as the default silo.
  - corpus: Path to a text file containing the Markov corpus for training.
  - wordlist: Path to a dictionary file for URL generation, e.g.,
    /usr/share/dict/words
  - prefixes: A list of URL prefixes that are valid for this silo.
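
The fallback from per-silo delays to the global defaults can be pictured with this small Python sketch. The `effective_wait` helper is hypothetical and the configuration is shown as an already-parsed dictionary; only the key names come from the reference above.

```python
def effective_wait(config, silo):
    """Return (min_wait, max_wait) for a silo, using global values as fallback."""
    min_wait = silo.get("min_wait", config.get("min_wait"))
    max_wait = silo.get("max_wait", config.get("max_wait"))
    return min_wait, max_wait

config = {
    "min_wait": 10,
    "max_wait": 65,
    "silos": [
        {"name": "default", "prefixes": ["/maze"]},
        {"name": "slow", "min_wait": 200, "max_wait": 300,
         "prefixes": ["/theslowhole"]},
    ],
}
waits = {s["name"]: effective_wait(config, s) for s in config["silos"]}
```

Here the 'default' silo inherits the global 10 to 65 range, while 'slow' uses its own 200 to 300.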

License Info
------------
Nepenthes is distributed under the terms of the MIT License; see the
file 'LICENSE' in the source distribution. In addition, the release
tarball contains several third-party components; see external/README.
Using or distributing Nepenthes requires agreeing to these license
terms as well. As of v2.0, all are also MIT or X11 licenses; copies
may be found in external/license.

History
-------
Version numbers use a simple process: If the only changes are fully
backwards compatible, the minor number changes. If the user or
administrator needs to change anything after, or as part of, the upgrade, the
major number changes and the minor number resets to zero.

[Legacy 1.x Documentation](https://zadzmo.org/code/nepenthes/version-1-documentation.md)
- v1.0:
  - Initial release
- v1.1:
  - Clearer licensing
  - Small performance improvements
  - Clearer logging
  - Corpus reset
  - Evasion countermeasures
  - Corpus Statistics report endpoint
  - Unix domain socket support
- v1.2:
  - Bugfix in Bogon filter for UTF8 characters
  - Fix rare crash with stacktrace
- v2.0:
  - Total overhaul/refactor
  - In-memory corpus
  - Silos
  - Rolling Stats buffer
  - Expandable templates