Nepenthes

This is a tarpit intended to catch web crawlers. Specifically, it targets crawlers that scrape data for LLMs - but really, like the plants it is named after, it'll eat just about anything that finds its way inside.

It works by generating an endless sequence of pages, each with dozens of links that simply lead back into the tarpit. Pages are randomly generated, but in a deterministic way, causing them to appear to be flat files that never change. Intentional delay is added to prevent crawlers from bogging down your server, in addition to wasting their time. Lastly, Markov-babble is added to the pages to give the crawlers something to scrape up and train their LLMs on, hopefully accelerating model collapse.

You can take a look at what this looks like here. (Note: VERY slow page loads!)

WARNING

THIS IS DELIBERATELY MALICIOUS SOFTWARE INTENDED TO CAUSE HARMFUL ACTIVITY. DO NOT DEPLOY IF YOU AREN'T FULLY COMFORTABLE WITH WHAT YOU ARE DOING.

ANOTHER WARNING

LLM scrapers are relentless and brutal. You may be able to keep them at bay with this software; but it works by providing them with a neverending stream of exactly what they are looking for. YOU ARE LIKELY TO EXPERIENCE SIGNIFICANT CONTINUOUS CPU LOAD.

Great effort has been taken to make Nepenthes performant and use the bare minimum of system resources, but it is still trivially easy to misconfigure in a way that can take your server offline. This is especially true if some of the aggressive, less well-behaved crawlers find your instance.

YET ANOTHER WARNING

There is not currently a way to differentiate between web crawlers that are indexing sites for search purposes, vs crawlers that are training AI models. ANY SITE THIS SOFTWARE IS APPLIED TO WILL LIKELY DISAPPEAR FROM ALL SEARCH RESULTS.

So why should I run this, then?

So that, as I said to Ars Technica, we can fight back. Make your website indigestible to scrapers and grow some spikes.

Instead of rolling over and letting these assholes do what they want, make them have to work for it instead.

Further questions? I made a FAQ page.

Latest Version

Nepenthes 2.0

Docker Image

Latest Permalink

All downloads

RSS feed of releases

Installation

You can use Docker, or install manually. The latest Dockerfile and compose.yml can be found at the Download Manager.

For manual installation, you'll need to install Lua. Nepenthes makes use of features specific to Lua 5.4, so that version is required. OpenSSL is also needed for cryptographic functions.

The following Lua modules need to be installed - if they are all present in your OS's package manager, use that; otherwise you will need to install Luarocks and use it to install the following:

Create a nepenthes user (you REALLY don't want this running as root.) Let's assume the user's home directory is also your install directory.

useradd -m nepenthes

Unpack the tarball:

cd scratch/
tar -xvzf nepenthes-2.0.tar.gz
cp -r nepenthes-2.0/* /home/nepenthes/

Tweak config.yml as you prefer (see below for documentation.) Then you're ready to start:

su -l nepenthes -c '/home/nepenthes/nepenthes /home/nepenthes/config.yml'

Sending SIGTERM or SIGINT will shut the process down.
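
If you'd rather have a supervisor handle startup and restarts, a minimal systemd unit works. This is a sketch, not something shipped with Nepenthes; it assumes the user and paths from the steps above, and leaves the detach option (documented below) off so the process stays in the foreground:

[Unit]
Description=Nepenthes web crawler tarpit
After=network.target

[Service]
User=nepenthes
ExecStart=/home/nepenthes/nepenthes /home/nepenthes/config.yml
# systemd stops the service with SIGTERM, which Nepenthes handles cleanly
Restart=on-failure

[Install]
WantedBy=multi-user.target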

Webserver Configuration

Expected usage is to hide the tarpit behind nginx, Apache, or whatever else your site is implemented in. Directly exposing it to the internet is ill-advised. We want it to look as innocent and normal as possible; in addition, HTTP headers can be used to configure the tarpit.

I'll be using nginx configurations for examples. Here's a real world snippet for the demo above:

location /maze/ {
	proxy_pass http://localhost:8893;
	proxy_set_header X-Forwarded-For $remote_addr;
	proxy_buffering off;
}

The X-Forwarded-For header is technically optional, but omitting it will make your statistics largely useless.

The proxy_buffering directive is important. LLM crawlers typically disconnect if not given a response within a few seconds; Nepenthes counters this by drip-feeding a few bytes at a time. Buffering breaks this workaround.
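
If your site runs on Apache instead, a rough (untested) sketch of the equivalent, with mod_proxy loaded, is below. mod_proxy adds an X-Forwarded-For header on its own and, as far as I know, streams responses to the client as they arrive rather than buffering whole responses, so the drip-feed workaround should survive:

<Location "/maze/">
	ProxyPass "http://localhost:8893/maze/"
</Location>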

Nepenthes versions 1.x used an X-Prefix header; this has been removed.

Nepenthes Configuration

A very simple configuration, that matches the above nginx configuration block, could be:

---
http_host: '::'
http_port: 8893
templates:
  - '/usr/nepenthes/templates'
  - '/home/nepenthes/templates'
seed_file: '/home/nepenthes/seed.txt'

min_wait: 10
max_wait: 65

silos:
  - name: default
    wordlist: '/usr/share/dict/words'
    corpus: '/home/nepenthes/mishmash.txt'
    prefixes:
      - /maze

Most of the values should be self-explanatory. The 'silos' directive is not optional (more on that later); however, only one silo needs to be defined.

Multiple template directories can be included, so you can bring your own in from outside the Nepenthes distribution.

Multiple prefixes can be defined per silo. Sending traffic with a prefix that is not configured will likely fire the bogon filter, causing Nepenthes to return a 404 HTTP status.
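
You can check this from the command line. Note that the words in the path also need to be producible from the silo's wordlist, or the bogon filter fires as well, so reuse a path Nepenthes generated (the first URI below is borrowed from the stats example further down):

curl -i http://localhost:8893/maze/mingelen/sipe/piles/suaharo    # 200, generated page
curl -i http://localhost:8893/elsewhere/anything                  # unconfigured prefix: 404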

Markov

Nepenthes 2.0 and later keep the corpus entirely in memory; real-world testing shows this is a significant (40x) speedup with roughly the same memory consumption, as SQLite used a significant amount of memory. The only downside is that startup time has increased, as the corpus is re-trained every time Nepenthes starts. For reasonable corpus sizes (60,000 lines or so) on modern hardware, this training time is several seconds.

Actual Markov parameters (tokens generated, etc.) are now controlled from within the templates.

Templates

Template files consist of two parts: a YAML prefix, and a Handlebars/Lustache template. The default template is a good reference to look at.

The 'markov', 'link_array', and 'link' sections in the YAML portion are used to define variables that are passed to the templating engine.

  • markov: Fills a variable with markov babble.

    • name: Variable name passed to the template.
    • min: Minimum number of 'tokens' - words, essentially - of markov slop to generate.
    • max: Maximum number of tokens.
  • link_array: Creates a variable-sized array of links.

    • min_count: Size of the smallest list of links to generate
    • max_count: Maximum number of links in the array
    • depth_min: The smallest number of words (from the given wordlist) to put into a URL; ie, '/toque/Messianism/narrowly' has a depth of three.
    • depth_max: The largest number of words.
  • link: Creates a single named link.

    • name: Variable name passed to the template.
    • depth_min: The smallest number of words to put into the URL
    • depth_max: The largest number of words to put into the URL

The second portion of the template file is a Lustache template; you can find detailed documentation at Lustache's website.
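
For illustration only, here's a sketch of what a complete template file might look like. The '---' separator, the exact shape of each section (list vs. single mapping), and the 'links' variable name used by link_array are my assumptions; check the bundled default template for the real layout. The body itself is ordinary Mustache syntax, which Lustache implements:

markov:
  - name: body_text
    min: 100
    max: 300
link_array:
  min_count: 8
  max_count: 20
  depth_min: 2
  depth_max: 4
link:
  - name: onward
    depth_min: 1
    depth_max: 3
---
<html><body>
<p>{{body_text}}</p>
<ul>
{{#links}}
<li><a href="{{.}}">{{.}}</a></li>
{{/links}}
</ul>
<p><a href="{{onward}}">continue</a></p>
</body></html>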

Statistics

Nepenthes 2.0 and later do not store persistent statistics. The focus is now on presenting a snapshot in time; the intent is to offload detailed analysis to tools built for that purpose, such as an external SQL database. The configuration variable stats_remember_time sets the time horizon and defaults to one hour.

The top-level /stats endpoint gives a broad overview; here's a real example as I'm writing this:

curl http://localhost:8893/stats | jq
{
  "addresses": 1850,
  "unsent_bytes_percent": 0.13952029418754,
  "hits": 10015,
  "agents": 145,
  "unsent_bytes": 20585,
  "cpu_percent": 1.7462639725424,
  "delay": 56020.624358161,
  "active": 25,
  "bytes_sent": 14733541,
  "uptime": 299516,
  "memory_usage": 210422861,
  "cpu_total": 5230.34,
  "bytes_generated": 14754126,
  "cpu": 10.335670266766,
  "bogons": 4
}

Here we see that, in the past hour, 1850 distinct clients ('addresses') reached the tarpit, made 10015 requests ('hits'), and presented 145 different user-agent strings ('agents'). They were sent 14 megabytes of trash ('bytes_sent') and collectively waited 56020 seconds - 15 hours! - to get said garbage ('delay').

There are 25 active connections ('active') being served slop as we speak, and 20 kilobytes of slop have been generated but not yet sent ('unsent_bytes'). 'unsent_bytes_percent' is intended as a gauge of the effectiveness of the delay times: here it's less than 1 percent. If unsent_bytes_percent rises significantly, it means crawlers are routinely disconnecting before the request is finished.

Four requests ('bogons') asked for a URL that couldn't possibly be generated by the configured word list.

To serve these 10015 requests, Nepenthes utilized the CPU for 10 seconds of the previous hour ('cpu'), computed to be 1.74% of the CPU available ('cpu_percent'). (This isn't intended to be a precise metric; it doesn't take multiple CPUs into account, and Nepenthes can only utilize one currently.) 'cpu_total' is the total as reported by the Lua runtime since Nepenthes was started, which was 299516 seconds ago ('uptime').

Memory used is 200 megabytes ('memory_usage'), as reported by the Lua garbage collector.

Speaking of garbage collection: in some cases, such as abnormal ends to an HTTP transaction (like a client disconnecting before receiving all data), the 'active' metric can read higher than is real. It should eventually self-correct as the garbage collector does its job.
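
Since everything is JSON, these numbers are easy to watch mechanically. For example, to poll the early-disconnect gauge once a minute (the interval and jq filter here are just an example, not anything Nepenthes prescribes):

watch -n 60 'curl -s http://localhost:8893/stats | jq .unsent_bytes_percent'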

If you want to see actual agent strings or IP address information, that can be returned as well:

curl http://localhost:8893/stats/agents | jq
{
  "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36": 289,
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.5 Safari/605.1.15": 3,
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36": 3,
  "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)": 516,
  "Mozilla/5.0 (compatible; BLEXBot/1.0; +https://help.seranking.com/en/blex-crawler)": 480,
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:137.0) Gecko/20100101 Firefox/137.0": 9,
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:136.0) Gecko/20100101 Firefox/136.0": 2,
  "Mozilla/5.0 (compatible; Barkrowler/0.9; +https://babbar.tech/crawler)": 146
}
curl http://localhost:8893/stats/addresses | jq
{
  "3.224.205.25": 7,
  "44.207.207.36": 9,
  "54.89.90.224": 8,
  "66.249.64.9": 24,
  "2a03:2880:f806::": 2,
  "2a03:2880:f806:8::": 2,
  "2a03:2880:f806:29::": 2,
  "94.74.85.29": 2,
  "111.119.233.225": 1
}

Want to see the raw data? There's an endpoint for that too.

curl http://localhost:8893/stats/buffer | jq
[
  {
    "address": "fda0:bb68:b812:d00d:8aae:ddff:fe42:62fe",
    "complete": true,
    "when": 886257.9491666,
    "delay": 19.389002208016,
    "cpu": 0.019220833084546,
    "uri": "/maze/mingelen/sipe/piles/suaharo",
    "id": "1757365444.1",
    "silo": "default",
    "bytes_generated": 1188,
    "agent": "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "response": 200,
    "bytes_sent": 1188
  },
  {
    "address": "fda0:bb68:b812:d00d:8aae:ddff:fe42:62fe",
    "complete": true,
    "when": 886291.34533826,
    "delay": 16.663

...etc

To facilitate data export to an analysis system, the 'id' parameter is unique across all requests, even if Nepenthes is restarted. You can ask Nepenthes to only send data after a specific ID, preventing duplicates from being imported:

curl http://localhost:8893/stats/buffer/from/1757365444.1 | jq
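
A minimal export loop built on that endpoint could look like the sketch below; the state-file and output paths are arbitrary choices, and it assumes entries come back in chronological order:

#!/bin/sh
# Sketch: append new stats-buffer entries to a JSONL file, skipping
# anything already imported by remembering the last ID seen.
state=/var/tmp/nepenthes.last_id
if [ -f "$state" ]; then
	url="http://localhost:8893/stats/buffer/from/$(cat "$state")"
else
	url="http://localhost:8893/stats/buffer"
fi
curl -s "$url" | jq -c '.[]' >> /var/log/nepenthes-buffer.jsonl
# Remember the newest ID for the next run (skip if nothing arrived yet).
if [ -s /var/log/nepenthes-buffer.jsonl ]; then
	tail -n 1 /var/log/nepenthes-buffer.jsonl | jq -r '.id' > "$state"
fi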

Silos

Version 2.0 provides Silos, which are similar in concept to virtual hosts on a web server. Each silo can have its own configuration, including Markov corpus, wordlist, delay times, statistics, template, etc. This is specified in the configuration YAML.

silos:
  - name: fast
    corpus: /vol/nepenthes/corpus-1.txt
    wordlist: /vol/nepenthes/words-1.txt
    default: true
    min_wait: 15
    max_wait: 20
    prefixes:
      - /maze

  - name: slow
    corpus: /vol/nepenthes/corpus-2.txt
    wordlist: /vol/nepenthes/words-2.txt
    template: slowerpage
    min_wait: 200
    max_wait: 300
    prefixes:
      - /theslowhole

Silos can share a markov corpus, or each have their own. Simply specify the same filename to share a corpus; Nepenthes won't train it twice. The same is true of wordlists used to build URLs.

The header X-Silo is used to signify which silo the incoming request should be put into:

location /maze/ {
	proxy_pass http://localhost:8893;
	proxy_set_header X-Silo 'fast';
	proxy_set_header X-Forwarded-For $remote_addr;
	proxy_buffering off;
}

location /theslowhole/ {
	proxy_pass http://localhost:8893;
	proxy_set_header X-Silo 'slow';
	proxy_set_header X-Forwarded-For $remote_addr;
	proxy_buffering off;
}

If the X-Silo header is not present, the request will be placed in the default silo, marked by the 'default' boolean in the configuration above. Specifying more than one default will cause an error on startup. If a default silo is not specified, the first silo listed in the configuration will be assumed to be the default one.

Statistics can be filtered on a per-silo basis:

curl http://localhost:8893/stats/silo/slow | jq
curl http://localhost:8893/stats/silo/slow/agents | jq
curl http://localhost:8893/stats/silo/slow/addresses | jq

Configuration File Reference

All possible directives in config.yml (a combined example follows this list):

  • http_host: sets the host that Nepenthes will listen on; default is localhost only.
  • http_port : sets the listening port number; default 8893
  • unix_socket: sets a path to a unix domain socket to listen on. Default is nil. If specified, will override http_host and http_port, and only listen on Unix domain sockets.
  • nochdir: If true, do not change directory after daemonization. Default is false. Normally only used for development/debugging, as it allows for relative paths in the configuration, but it is bad practice (daemons should, in fact, chdir to '/' after forking).
  • templates: Paths to the template files. This should include the '/templates' directory inside your Nepenthes installation, and any other directories that contain templates you want to use.
  • detach: If true, Nepenthes will fork into the background and redirect logging output to Syslog.
  • log_level: Log message filtering; same priorities as syslog. Defaults to 'info'.
  • pidfile: Path to drop a pid file after daemonization. If empty, no pid file is created.
  • real_ip_header: Changes the name of the X-Forwarded-For header that communicates the actual client IP address for statistics gathering.
  • silo_header: Changes the name of the X-Silo header that controls silo assignment.
  • seed_file: Specifies location of persistent unique instance identifier. This allows two instances with the same corpus to have different looking tarpits. If not specified, the seed will not persist, causing pages to change if Nepenthes is restarted.
  • stats_remember_time: Sets how long entries remain in the rolling stats buffer, in seconds. Defaults to 3600 (one hour.)
  • min_wait: Default minimum delay time if not specified in a silo configuration.
  • max_wait: Default maximum delay time if not specified in a silo configuration.
  • silos:
    • name: Name of the silo, which is matched against the X-Silo header.
    • template: Template file to use in this silo. Default is 'default', included in the Nepenthes distribution.
    • min_wait: Optional. Minimum delay time in this silo.
    • max_wait: Optional. Maximum delay time in this silo.
    • default: If set to 'true', marks this as the default silo.
    • corpus: Path to a text file containing the markov corpus for training.
    • wordlist: Path to a dictionary file for URL generation, e.g., /usr/share/dict/words
    • prefixes: A list of URL prefixes that are valid for this silo.
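
Putting these together, a fuller config.yml might look like the sketch below (paths and values are placeholders, not recommendations):

---
http_host: '::'
http_port: 8893
detach: true
pidfile: '/run/nepenthes.pid'
log_level: 'info'
templates:
  - '/home/nepenthes/templates'
seed_file: '/home/nepenthes/seed.txt'
stats_remember_time: 3600
min_wait: 10
max_wait: 65

silos:
  - name: default
    default: true
    template: 'default'
    corpus: '/home/nepenthes/corpus.txt'
    wordlist: '/usr/share/dict/words'
    prefixes:
      - /maze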

License Info

Nepenthes is distributed under the terms of the MIT License; see the file 'LICENSE' in the source distribution. In addition, the release tarball contains several 3rd-party components; see external/README. Using or distributing Nepenthes requires agreeing to these license terms as well. As of v2.0, all are also MIT or X11 licenses; copies may be found in external/license.

History

Version numbers use a simple process: if the only changes are fully backwards compatible, the minor number changes. If the user or administrator needs to change anything after or as part of the upgrade, the major number changes and the minor number resets to zero.

Legacy 1.x Documentation

  • v1.0:

    • Initial release
  • v1.1:

    • Clearer licensing
    • Small performance improvements
    • Clearer logging
    • Corpus reset
    • Evasion countermeasures
    • Corpus Statistics report endpoint
    • Unix domain socket support
  • v1.2:

    • Bugfix in Bogon filter for UTF8 characters
    • Fix rare crash with stacktrace
  • v2.0:

    • Total overhaul/refactor
    • In-memory corpus
    • Silos
    • Rolling Stats buffer
    • Expandable templates