Nepenthes 1.x Documentation
===========================
Usage
-----
Expected usage is to hide the tarpit behind nginx or Apache, or whatever
else you have implemented your site in. Directly exposing it to the
internet is ill-advised: we want it to look as innocent and normal as
possible. In addition, HTTP headers are used to configure the tarpit.
I'll be using nginx configurations for examples. Here's a real-world
snippet for the demo above:
```
location /nepenthes-demo/ {
    proxy_pass http://localhost:8893;
    proxy_set_header X-Prefix '/nepenthes-demo';
    proxy_set_header X-Forwarded-For $remote_addr;
    proxy_buffering off;
}
```
You'll see several headers are added here. X-Prefix tells the tarpit
the path prefix that all generated links should use; make it match the
path in the 'location' directive. X-Forwarded-For is optional, but makes
any statistics gathered significantly more useful.
The proxy_buffering directive is important. LLM crawlers typically
disconnect if not given a response within a few seconds; Nepenthes
counters this by drip-feeding a few bytes at a time. Buffering breaks
this workaround.
You can have multiple proxies to an individual Nepenthes instance;
simply set the X-Prefix header accordingly.
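For example, a single instance can sit behind two different paths; the
second path below is purely illustrative:

```
location /nepenthes-demo/ {
    proxy_pass http://localhost:8893;
    proxy_set_header X-Prefix '/nepenthes-demo';
    proxy_set_header X-Forwarded-For $remote_addr;
    proxy_buffering off;
}

location /secret-garden/ {
    proxy_pass http://localhost:8893;
    proxy_set_header X-Prefix '/secret-garden';
    proxy_set_header X-Forwarded-For $remote_addr;
    proxy_buffering off;
}
```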
Installation
------------
You can use Docker, or install manually.
A Dockerfile and compose.yaml are provided in the
[/docker directory.](https://svn.zadzmo.org/repo/nepenthes/head/docker/)
Simply tweak the configuration file to your preferences, then run
'docker compose up'.
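In practice that looks roughly like this; the checkout path is only an
example, and the exact location of the config file within the
distribution may differ:

```
# from a checkout of the Nepenthes source tree
cd nepenthes/docker/
$EDITOR config.yml        # adjust to taste; see "Configuration File" below
docker compose up -d      # build the image and start it in the background
docker compose logs -f    # watch it come up
```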
You will still need to bootstrap a Markov corpus if you enable the
feature (see next section.)
For manual installation, you'll need to install Lua (5.4 preferred),
SQLite (if using Markov), and OpenSSL. The following Lua modules need to
be installed - if they are all present in your package manager, use
that; otherwise you will need to install
[Luarocks](https://luarocks.org/) and use it to install the following
(an example invocation appears after the list):
- [cqueues](https://luarocks.org/modules/daurnimator/cqueues)
- [ossl](https://luarocks.org/modules/daurnimator/luaossl) (aka luaossl)
- [lpeg](https://luarocks.org/modules/gvvaughan/lpeg)
- [lzlib](https://luarocks.org/modules/hisham/lzlib) or [lua-zlib](https://luarocks.org/modules/brimworks/lua-zlib) (only one of the two is needed)
- [dbi-sqlite3](https://luarocks.org/modules/sparked435/luadbi-sqlite3) (aka luadbi-sqlite3)
- [unix](https://luarocks.org/modules/daurnimator/lunix) (aka lunix)
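If you go the Luarocks route, the whole set can be pulled in roughly
like so (rock names as linked above; install only one of lzlib and
lua-zlib):

```
luarocks --lua-version=5.4 install cqueues
luarocks --lua-version=5.4 install luaossl
luarocks --lua-version=5.4 install lpeg
luarocks --lua-version=5.4 install lzlib            # or: lua-zlib
luarocks --lua-version=5.4 install luadbi-sqlite3
luarocks --lua-version=5.4 install lunix
```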
Create a nepenthes user (you REALLY don't want this running as root.)
Let's assume the user's home directory is also your install directory.
```
useradd -m nepenthes
```
Unpack the tarball:
```
cd scratch/
tar -xvzf nepenthes-1.0.tar.gz
cp -r nepenthes-1.0/* /home/nepenthes/
```
Tweak config.yml as you prefer (see below for documentation). Then
you're ready to start:

```
su -l nepenthes -c '/home/nepenthes/nepenthes /home/nepenthes/config.yml'
```
Sending SIGTERM or SIGINT will shut the process down.
Bootstrapping the Markov Babbler
--------------------------------
The Markov feature requires a trained corpus to babble from. One was
intentionally omitted because, ideally, everyone's tarpits should look
different to evade detection. Find a source of text in whatever language
you prefer: there are lots of research corpora out there, or you could
pull in some very long Wikipedia articles, grab some books from Project
Gutenberg, or use the Unix fortune file; it really doesn't matter at
all. Be creative!
Training is accomplished by sending data to a POST endpoint. This only
needs to be done once. Sending training data more than once cumulatively
adds to the existing corpus, allowing you to mix different texts - or
train in chunks.
Once you have your body of text, assuming it's called corpus.txt, in
your working directory, and you're running with the default port:
```
curl -XPOST -d @./corpus.txt -H'Content-type: text/plain' http://localhost:8893/train
```
This could take a very, VERY long time - possibly hours. curl may
potentially time out. See
[load.sh](https://svn.zadzmo.org/repo/nepenthes/head/load.sh) in the
nepenthes distribution for a script that incrementally loads training
data.
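The same idea in a few lines of shell, if you'd rather roll your own;
the chunk size here is arbitrary, and --data-binary is used so newlines
in the corpus are preserved:

```
# split the corpus into 1000-line chunks and POST them one at a time,
# so no single request runs long enough to hit a timeout
split -l 1000 corpus.txt chunk_
for f in chunk_*; do
    curl -XPOST --data-binary @"$f" -H 'Content-Type: text/plain' \
        http://localhost:8893/train
done
rm chunk_*
```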
The Markov module returns an empty string if there is no corpus. Thus,
the tarpit will continue to function as a tarpit without a corpus
loaded. The extra CPU consumed for this check is almost nothing.
If you want to delete the Markov corpus and start over, simply curl the
same endpoint using the DELETE method:

```
curl -XDELETE http://localhost:8893/train
```
Statistics
----------
Want to see what prey you've caught? There are several statistics
endpoints, all returning JSON. To see everything:
```
http://{http_host:http_port}/stats
```
To see user agent strings only:
```
http://{http_host:http_port}/stats/agents
```
Or IP addresses only:
```
http://{http_host:http_port}/stats/ips/
```
These can get quite big, so it's possible to filter both 'agents' and
'ips': simply add a minimum hit count to the URL. For example, to see a
list of all IPs that have visited more than 100 times:

```
http://{http_host:http_port}/stats/ips/100
```
Simply curl the URLs and pipe into 'jq' to pretty-print as desired.
Script away!
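For example, assuming the default host and port:

```
# pretty-print everything
curl -s http://localhost:8893/stats | jq .

# just the heavy hitters: IPs seen more than 100 times
curl -s http://localhost:8893/stats/ips/100 | jq .
```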
New in v1.1, there's a corpus statistics endpoint, if you're curious how
big it is, and don't want to fumble around with SQLite:
```
http://{http_host:http_port}/stats/markov
```
Nepenthes used Defensively
--------------------------
A link to a Nepenthes location from your site will flood crawlers with
an endless supply of generated URLs under your site's domain name,
drowning out the valid ones and making it unlikely the crawler will
access real content.
In addition, the aggregated statistics will provide a list of IP
addresses that are almost certainly crawlers and not real users. Use
this list to create ACLs that block those IPs from reaching your
content: return a 403 or 404, or just block them at the firewall level.
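As an illustration only, generating an nginx deny list from the
statistics could look something like the sketch below. The jq filter
assumes the endpoint returns a JSON object keyed by IP address; inspect
your own /stats/ips output first and adjust accordingly.

```
# hypothetical: turn IPs with >100 hits into nginx 'deny' directives
curl -s http://localhost:8893/stats/ips/100 \
    | jq -r 'keys[] | "deny \(.);"' > /etc/nginx/conf.d/nepenthes-deny.conf
nginx -s reload
```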
Integration with fail2ban or blocklistd (or similar) is a future
possibility, allowing realtime reactions to crawlers, but not currently
implemented.
When using Nepenthes defensively, it's best to turn off the Markov
module and set both max_wait and min_wait to something large, to
conserve your CPU.
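In config.yml terms that might look like the following; the directive
names are documented below, the values are just a starting point, and
the units are assumed to be seconds:

```
# defensive posture: long waits, no Markov babble
min_wait: 20
max_wait: 60
# leave 'markov' unset so the babbler stays disabled
```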
Enforcing robots.txt
--------------------
I get asked this a lot: yes, this is a valid use case. It's not what I
intended (causing AI companies pain is a very different thing from
making bots respect your robots.txt), but it works nicely when applied
this way.
Just add:
```
User-agent: *
Disallow: /nepenthes-demo
```
To your robots.txt, and those that respect the rules will stay out. Then
your IP statistics can be used as a banlist to save your resources.
Nepenthes used Offensively
--------------------------
Let's say you've got horsepower and bandwidth to spare, and just want to
see these AI models burn. Nepenthes has what you need:
Don't make any attempt to block crawlers with the IP stats. Set the
delay times as low as you are comfortable with. Train a big Markov
corpus, leave the Markov module enabled, and set the maximum babble size
to something big. In short, let them suck down as much bullshit as they
have disk space for and choke on it.
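A sketch of that posture in config.yml; the values are illustrative and
the paths are examples:

```
# offensive posture: fast responses, maximum babble
min_wait: 1
max_wait: 2
markov: /home/nepenthes/markov.db   # your trained corpus
markov_min: 100
markov_max: 500                     # big pages; watch your CPU
```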
Advanced
--------
As of v1.1, Nepenthes can listen on a unix domain socket instead of
binding to a host and port. Set the 'unix_socket' directive in
config.yml to the path to bind to.
Note, you MUST set the X-Forwarded-For or similar header in the upstream
proxy! Nepenthes will malfunction without it when listening on a unix
socket.
This feature has not been rigorously tested, use caution.
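An nginx location proxying to the socket might look like this; the
socket path is an example, and note the client-IP header, which is
required here:

```
location /nepenthes-demo/ {
    proxy_pass http://unix:/run/nepenthes/nepenthes.sock:;
    proxy_set_header X-Prefix '/nepenthes-demo';
    proxy_set_header X-Forwarded-For $remote_addr;   # mandatory with a unix socket
    proxy_buffering off;
}
```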
Configuration File
------------------
All possible directives in config.yml:
- http_host : sets the host that Nepenthes will listen on; default is localhost only.
- http_port : sets the listening port number; default 8893
- unix_socket: sets a path to a unix domain socket to listen on. Default is nil.
- prefix: Prefix that all generated links should be given. Can be overridden with the X-Prefix HTTP header. Defaults to nothing.
- templates: Path to the template files. This should be the '/templates' directory inside your Nepenthes installation.
- detach: If true, Nepenthes will fork into the background and redirect logging output to Syslog.
- pidfile: Path to drop a pid file after daemonization. If empty, no pid file is created.
- max_wait: The longest delay to add to every request. Increase to slow crawlers down; if it's too slow, they might not come back.
- min_wait: The smallest delay to add to every request. A random value is chosen between min_wait and max_wait.
- real_ip_header: Changes the name of the X-Forwarded-For header that communicates the actual client IP address for statistics gathering.
- prefix_header: Changes the name of the X-Prefix header that overrides the prefix configuration variable.
- forget_time: Length of time, in seconds, that a given user-agent can go missing before being deleted from the statistics table.
- forget_hits: A user-agent that generates more than this number of requests will not be deleted from the statistics table.
- persist_stats: A path to write a JSON file to, allowing statistics to survive crashes and restarts.
- seed_file: Specifies location of persistent unique instance identifier. This allows two instances with the same corpus to have different looking tarpits.
- words: Path to a dictionary file, usually '/usr/share/dict/words', but could vary depending on your OS.
- markov: Path to a SQLite database containing a Markov corpus. If not specified, the Markov feature is disabled.
- markov_min: Minimum number of words to babble on a page.
- markov_max: Maximum number of words to babble on a page. Very large values can cause serious CPU load.
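Putting it together, a minimal config.yml might look like the following;
every path and value here is an example only:

```
http_host: 127.0.0.1
http_port: 8893
templates: /home/nepenthes/templates
words: /usr/share/dict/words
min_wait: 2
max_wait: 10
forget_time: 86400
forget_hits: 1000
persist_stats: /home/nepenthes/stats.json
seed_file: /home/nepenthes/seed
markov: /home/nepenthes/markov.db
markov_min: 50
markov_max: 300
```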