Frequently Asked Questions / Common Comments
============================================
I get emailed about these so often I decided it's time to make a page
about it. More to come as it happens / as I remember.

But AI is the future!
---------------------
[First tech bubble, huh?](https://www.wheresyoured.at/longcon/)
Seriously, (a) these AI things [don't really work](https://pivot-to-ai.com/2025/05/13/if-ai-is-so-good-at-coding-where-are-the-open-source-contributions/),
(b) they have significant ethical and [ecological](https://www.greenmatters.com/big-impact/how-much-water-does-ai-use)
problems, and (c) they are causing [real harm](https://www.rollingstone.com/culture/culture-features/ai-spiritual-delusions-destroying-human-relationships-1235330175/)
to people with [little benefit](https://www.livescience.com/technology/artificial-intelligence/using-ai-reduces-your-critical-thinking-skills-microsoft-study-warns).
But what about the [economic](https://www.wheresyoured.at/wheres-the-money/)
effects? I'll just quote Brian Merchant, who seems to be saying,
[That doesn't look good either!](https://www.bloodinthemachine.com/p/the-ai-bubble-is-so-big-its-propping)

> "It's to the point that we're well past dot com boom levels of investment, and, as Kedrosky points out, approaching railroad-levels of investment, last seen in the days of the robber barons.
>
> I have no idea what's going to happen next. But if AI investment is so massive that it's quite actually helping to prop up the US economy in a time of growing stress, what happens if the AI stool does get kicked out from under it all?"

You can't stop AI!
------------------
I am unsure how effective Nepenthes really is; but I can see, from the way
crawlers have reacted and changed their patterns since January 2025, that
they see this as a problem. I don't know whether they consider tarpitting
and poisoning a minor nuisance or an existential threat. They are
definitely reacting, though.

And there's good reason for them to react: none of these companies doing
actual AI stuff, as opposed to selling hardware for AI stuff, are
profitable (see above). Nepenthes raises their costs.
And if we keep pushing, we can make
[Sam Altman](https://dair-community.social/@timnitGebru/114230667735623641)
or [Elon Musk](https://www.newsweek.com/elon-musk-tesla-protest-hate-2046937)
cry. And who doesn't want that? Sad billionaires are funny.

Hasn't this been done before?
-----------------------------
I never claimed to be the first. I remember infinite websites from the
early 2000s; my inspiration is the [anti-spam SMTP tarpitting](https://www.benzedrine.ch/relaydb.html)
that was briefly popular a short time later. I've since been made aware of
others, some of whom predated Nepenthes, using these same tactics against
AI crawlers. Nepenthes is simply the first that went viral for this
particular use case.

Is this actually effective at stopping crawlers?
------------------------------------------------
Honestly, it varies. I've gotten reports from multiple site operators that
Nepenthes saved them from going offline. Sometimes the crawler gets contained;
more recently I'm hearing that sites stop being crawled at all once
Nepenthes slows the crawlers down enough.

I've also heard reports of crawlers doubling down, becoming even more
overwhelming in response. The operators on the unlucky end of that usually
go offline, or switch to something like [Anubis](https://anubis.techaro.lol/).
I think the differentiator is how badly the crawler wants your data. For
sites they consider medium or low value to scrape, Nepenthes is quite
effective at getting them to leave you alone. If you have something they
consider very high value: well...

The sad truth is, [if a dedicated adversary has resources on par with all Azure
at their disposal](https://vercel.com/blog/the-rise-of-the-ai-crawler), and/or
[is willing to use quasi-legal botnets](https://social.wildeboer.net/@jwildeboer/114358972839151578) to attack you,
it becomes a matter of who has more cash to spend - and until the bubble finally
pops, OpenAI and Anthropic definitely have more than you.

This is easy for crawlers to detect!
------------------------------------
At the time I released Nepenthes v1.0, I had logged nearly 30 million
hits from Facebook, 5 million from Claude, 3 million from Google, and
several hundred thousand from OpenAI.
If it's that easy, then at least as of January 2025, they weren't even trying.
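(If you want to tally your own traffic, below is a rough sketch of counting
hits by user-agent substring in an access log; the crawler names listed are
my assumptions about common identifiers, so swap in whatever your logs
actually show.)

```python
import sys
from collections import Counter

# User-agent substrings to tally. These names are assumptions about common
# crawler identifiers; check what actually appears in your own logs.
CRAWLERS = ["facebookexternalhit", "meta-externalagent", "ClaudeBot",
            "GPTBot", "Googlebot", "Bytespider", "Amazonbot"]

counts = Counter()
for line in sys.stdin:                 # e.g.  cat access.log | python3 tally.py
    for name in CRAWLERS:
        if name in line:               # combined log format keeps the UA on each line
            counts[name] += 1
            break

for name, hits in counts.most_common():
    print(f"{hits:>12,}  {name}")
```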
Doesn't this just increase energy consumption?
----------------------------------------------
On the AI side, probably. I hope so, anyway. Again,
none of these companies are profitable: Adding to their costs will
hopefully cause them to fail sooner. There's only so much cash an
investor will put in without a return before they give up and pull
funding.
A rational investor, anyway. I'm sure some aren't; thankfully, those
tend to go bankrupt.
As for my energy consumption: my busiest Nepenthes instance was drawing
about the same power as a Raspberry Pi 3. Meh.
As I was quoted in Ars Technica, and I stand by this: "If I do nothing,
AI models, they boil the planet. If I switch this on, they boil the
planet. How is that my fault?"

Doesn't this cost you a lot of money in bandwidth overages?
-----------------------------------------------------------
No.
We live in an era with multi-hundred-gigabit internet backbones.
Bandwidth is cheap now. Typically a terabyte of transfer is included
in a $5/month VPS plan. Thank video streaming services for pushing so
much high-def video that capacity got overbuilt.

I suppose you could tune for maximum throughput and do some damage to
your wallet, but text is so much smaller than multimedia that it has
not been an issue.
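For a rough sense of scale, here's a back-of-envelope sketch; the page size
and request rate below are numbers I picked for illustration, not
measurements from any real instance:

```python
# Back-of-envelope only; both inputs are assumptions, not measurements.
page_size_kib = 8             # a generated Markov page is a few KiB of text
requests_per_day = 500_000    # a very busy tarpit

monthly_gib = page_size_kib * requests_per_day * 30 / (1024 * 1024)
print(f"~{monthly_gib:.0f} GiB per month")   # ~114 GiB, nowhere near 1 TB
```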
Won't crawlers adapt by discarding slow websites?
-------------------------------------------------
That'd be great! Please do that!

Isn't this illegal?!
--------------------
No, it's not illegal. I made a website. They send HTTP requests to it,
it responds with fully HTTP-compliant responses like all other websites.
If that's a problem for them, they should simply stop sending more requests.
(Spoiler alert: They don't.)
If you call Grampa Simpson trying to sell him something, and he just
talks your ear off without letting you get a word in, whose fault is it
if you stay on the line? Did Grampa do something illegal by being terrible
at conversation?

Why not use an LLM to generate the bullshit that gets sent back?
----------------------------------------------------------------
Markov models are very simple: they produce good enough quality for poison
and use very little CPU compared to LLM techniques.

Remember, all an LLM does is try to predict the next most likely word,
which is also what a Markov model does. The output of a Markov babbler
is statistically identical to the input the model was trained on, by
definition. The semantic information was discarded along the way, but IMHO
that's a feature for this use case.
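To make that concrete, here's a minimal sketch of the technique in Python.
This is an illustration, not Nepenthes' actual code, and the training file
name is a placeholder:

```python
import random
from collections import defaultdict

def train(corpus: str, order: int = 2) -> dict:
    """Map each run of `order` words to the list of words seen after it."""
    words = corpus.split()
    table = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        table[key].append(words[i + order])
    return table

def babble(table: dict, length: int = 100) -> str:
    """Random-walk the table; the output mimics the corpus statistically
    but carries none of its meaning."""
    state = random.choice(list(table.keys()))
    out = list(state)
    for _ in range(length):
        followers = table.get(state)
        if not followers:                       # dead end: jump somewhere new
            state = random.choice(list(table.keys()))
            followers = table[state]
        out.append(random.choice(followers))
        state = tuple(out[-len(state):])
    return " ".join(out)

# Usage: train on any text you have lying around (file name is a placeholder).
# table = train(open("some_training_text.txt").read())
# print(babble(table))
```

Each generated word costs one table lookup and one random pick, which is why
a Markov babbler runs happily on hardware that would choke doing LLM
inference.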
This is useless, the AI companies just filter out garbage data!
---------------------------------------------------------------
Then why are they spending so much time and resources pulling anything they can
from the internet, often so fast it overwhelms small websites? Why are they paying
money to crawl whatever they find, and then paying money to store it, only to throw
it all out?
[I want them to stop crushing websites.](https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries/)
Even if the models survive the poison, that alone would be a win.