Gitea Scraping Captcha
Blocking stupid scraper bots with a JavaScript challenge, served directly from nginx.
Web scraping is a controversial topic, and by that I mean everyone enjoys doing it and hates having it done to them. Personally, I last got to pick up bs4 before the AI boom, so I will consider myself not a problem. Additionally, I never wrote a spider. This is about a spider.
For some time now, my Gitea instance has been getting inundated with requests from a lot of really repetitive link-following scrapers. As a webapp, Gitea uses and generates a ton of internal links - every file, in every commit, as a blame, as a diff, all diffs in two formats, then various forms of issue search filtering, et cetera. While this is convenient to browse, scrapers and spiders following these links get stuck in the resulting mess like a human on Wikipedia armed with the middle click. Even more annoyingly, the requests come from different IPs, making throttling non-trivial. Mind you, it’d be fine if the scrapers just grabbed the code and fucked off, but the nonsensical behaviour just implies that they follow links blindly and don’t care about the actual code.
124.243.xxx.xxx - - [06/Mar/2025:11:58:24 +0100] "GET /szymonszl/OfflineSkins-HSPatched/commit/ad6461754f57a72f8a06b92fc63b046f55c7f126.diff HTTP/2.0" 200 278 "https://git.szy.lol/szymonszl/OfflineSkins-HSPatched/commit/ad6461754f57a72f8a06b92fc63b046f55c7f126.diff" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
159.138.xxx.xxx - - [06/Mar/2025:12:01:13 +0100] "GET /szymonszl/teststand/raw/commit/578ededc2d435f3e10f8cc369dbbc61a251ab55d/dekoder.py HTTP/2.0" 200 278 "https://git.szy.lol/szymonszl/teststand/raw/commit/578ededc2d435f3e10f8cc369dbbc61a251ab55d/dekoder.py" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
122.8.xxx.xxx - - [06/Mar/2025:12:04:03 +0100] "GET /szymonszl/komeiji/raw/commit/518c001a3f4204996c18e37af7bbdce1d302bb9c/src/ts3/ts3.c HTTP/2.0" 200 278 "https://git.szy.lol/szymonszl/komeiji/raw/commit/518c001a3f4204996c18e37af7bbdce1d302bb9c/src/ts3/ts3.c" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
119.8.xxx.xxx - - [06/Mar/2025:12:06:49 +0100] "GET /szymonszl/teststand/blame/commit/6906e0005788dfa8996e30a164fae91a7d902752/tensometr.cpp HTTP/2.0" 200 278 "https://git.szy.lol/szymonszl/teststand/blame/commit/6906e0005788dfa8996e30a164fae91a7d902752/tensometr.cpp" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36"
119.13.xxx.xxx - - [06/Mar/2025:12:12:27 +0100] "GET /szymonszl/teststand/raw/commit/bb7edb8a4b96397462d83853ec742772d8a3127d/teststand.hpp HTTP/2.0" 200 278 "https://git.szy.lol/szymonszl/teststand/raw/commit/bb7edb8a4b96397462d83853ec742772d8a3127d/teststand.hpp" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
111.119.xxx.xxx - - [06/Mar/2025:12:18:02 +0100] "GET /szymonszl/teststand/blame/commit/55bb1beec311646576a43a3b17066f8d4c1e4f39/lcd.cpp HTTP/2.0" 200 278 "https://git.szy.lol/szymonszl/teststand/blame/commit/55bb1beec311646576a43a3b17066f8d4c1e4f39/lcd.cpp" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
122.8.xxx.xxx - - [06/Mar/2025:12:20:51 +0100] "GET /szymonszl/teststand/raw/branch/master/battery.cpp HTTP/2.0" 200 278 "https://git.szy.lol/szymonszl/teststand/raw/branch/master/battery.cpp" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
94.74.xxx.xxx - - [06/Mar/2025:12:23:37 +0100] "GET /szymonszl/komeiji/raw/commit/012f890174ffd4ab7ad0bf942acdd8255c4b1f08/README.md HTTP/2.0" 200 278 "https://git.szy.lol/szymonszl/komeiji/raw/commit/012f890174ffd4ab7ad0bf942acdd8255c4b1f08/README.md" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
46.250.xxx.xxx - - [06/Mar/2025:12:29:15 +0100] "GET /szymonszl/teststand/blame/commit/122bc7795717c8d49b707b1e3dd5700ce5a74e5d/teststand.hpp HTTP/2.0" 200 278 "https://git.szy.lol/szymonszl/teststand/blame/commit/122bc7795717c8d49b707b1e3dd5700ce5a74e5d/teststand.hpp" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
46.250.xxx.xxx - - [06/Mar/2025:12:32:03 +0100] "GET /szymonszl/teststand/raw/commit/13135717f5627b8e98787ba2e57255ca0d7666c1/encoder.cpp HTTP/2.0" 200 278 "https://git.szy.lol/szymonszl/teststand/raw/commit/13135717f5627b8e98787ba2e57255ca0d7666c1/encoder.cpp" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
188.239.xxx.xxx - - [06/Mar/2025:12:34:50 +0100] "GET /szymonszl/komeiji/raw/commit/1d04180fe68c356337fd8c537ceaeee159563500/.gitmodules HTTP/2.0" 200 278 "https://git.szy.lol/szymonszl/komeiji/raw/commit/1d04180fe68c356337fd8c537ceaeee159563500/.gitmodules" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
124.243.xxx.xxx - - [06/Mar/2025:12:37:38 +0100] "GET /szymonszl/drawpad/src/branch/master/README.md?display=source HTTP/2.0" 200 278 "https://git.szy.lol/szymonszl/drawpad/src/branch/master/README.md?display=source" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
188.239.xxx.xxx - - [06/Mar/2025:12:40:26 +0100] "GET /szymonszl/komeiji/raw/commit/c70bfdb2881854463ad13d249810f53475b4983d/src/utils/map.h HTTP/2.0" 200 278 "https://git.szy.lol/szymonszl/komeiji/raw/commit/c70bfdb2881854463ad13d249810f53475b4983d/src/utils/map.h" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.47"
190.92.xxx.xxx - - [06/Mar/2025:12:43:14 +0100] "GET /szymonszl/teststand/blame/commit/16bb6079ae6862c47f462295a8c643b06d7faa7b/intro.cpp HTTP/2.0" 200 278 "https://git.szy.lol/szymonszl/teststand/blame/commit/16bb6079ae6862c47f462295a8c643b06d7faa7b/intro.cpp" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
190.92.xxx.xxx - - [06/Mar/2025:12:46:02 +0100] "GET /szymonszl/komeiji/raw/commit/519a95d9117cfd107410b663823c0912b413778d/src/sock/url.c HTTP/2.0" 200 278 "https://git.szy.lol/szymonszl/komeiji/raw/commit/519a95d9117cfd107410b663823c0912b413778d/src/sock/url.c" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
190.92.xxx.xxx - - [06/Mar/2025:12:51:39 +0100] "GET /szymonszl/teststand/raw/commit/47b9546b1a9f554a9952755603e9ffa6802a7abe/tensometr.cpp HTTP/2.0" 200 278 "https://git.szy.lol/szymonszl/teststand/raw/commit/47b9546b1a9f554a9952755603e9ffa6802a7abe/tensometr.cpp" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
27.106.xxx.xxx - - [06/Mar/2025:12:57:14 +0100] "GET /szymonszl/teststand/blame/commit/db9835c438a850a5ea5138557a579bdde09c4f0b/ HTTP/2.0" 200 278 "https://git.szy.lol/szymonszl/teststand/blame/commit/db9835c438a850a5ea5138557a579bdde09c4f0b/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
119.13.xxx.xxx - - [06/Mar/2025:13:00:04 +0100] "GET /szymonszl/teststand/raw/commit/5ace600576361ce776ce9534e4f0b3c2dfd03429/network.cpp HTTP/2.0" 200 278 "https://git.szy.lol/szymonszl/teststand/raw/commit/5ace600576361ce776ce9534e4f0b3c2dfd03429/network.cpp" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
101.46.xxx.xxx - - [06/Mar/2025:13:02:49 +0100] "GET /szymonszl/ppk23/issues?q=&type=all&sort=&state=closed&labels=&project=-1&assignee=1&poster=0 HTTP/2.0" 200 1270 "https://git.szy.lol/szymonszl/ppk23/issues?q=&type=all&sort=&state=closed&labels=&project=-1&assignee=1&poster=0" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
190.92.xxx.xxx - - [06/Mar/2025:13:05:39 +0100] "GET /szymonszl/drawpad/src/commit/27bf160f34d224f5fa747637e5e6d568865e9422/util.c HTTP/2.0" 200 278 "https://git.szy.lol/szymonszl/drawpad/src/commit/27bf160f34d224f5fa747637e5e6d568865e9422/util.c" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
166.108.xxx.xxx - - [06/Mar/2025:13:08:27 +0100] "GET /szymonszl/komeiji/raw/commit/95ca83505c43daf5d5fdbf44bde78c8838f19b99/README.md HTTP/2.0" 200 278 "https://git.szy.lol/szymonszl/komeiji/raw/commit/95ca83505c43daf5d5fdbf44bde78c8838f19b99/README.md" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
111.119.xxx.xxx - - [06/Mar/2025:13:11:15 +0100] "GET /szymonszl/komeiji/raw/commit/519a95d9117cfd107410b663823c0912b413778d/src/utils/buffer.c HTTP/2.0" 200 278 "https://git.szy.lol/szymonszl/komeiji/raw/commit/519a95d9117cfd107410b663823c0912b413778d/src/utils/buffer.c" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.47"
46.250.xxx.xxx - - [06/Mar/2025:13:14:05 +0100] "GET /szymonszl/teststand/raw/commit/393a4ce44fd6536acdf33a83cb45c5d827246420/measure.cpp HTTP/2.0" 200 278 "https://git.szy.lol/szymonszl/teststand/raw/commit/393a4ce44fd6536acdf33a83cb45c5d827246420/measure.cpp" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36"
110.238.xxx.xxx - - [06/Mar/2025:13:22:28 +0100] "GET /szymonszl/teststand/raw/commit/c0c25c1e3ce99ba7537e439faeb32b64b70370b1/teststand.ino HTTP/2.0" 200 278 "https://git.szy.lol/szymonszl/teststand/raw/commit/c0c25c1e3ce99ba7537e439faeb32b64b70370b1/teststand.ino" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
124.243.xxx.xxx - - [06/Mar/2025:13:25:16 +0100] "GET /szymonszl/drawpad/src/commit/80b7531a89bd560088b4124e81406f9c2aab9200/drawpad.h HTTP/2.0" 200 278 "https://git.szy.lol/szymonszl/drawpad/src/commit/80b7531a89bd560088b4124e81406f9c2aab9200/drawpad.h" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
49.0.xxx.xxx - - [06/Mar/2025:13:30:52 +0100] "GET /szymonszl/teststand/raw/commit/c0c25c1e3ce99ba7537e439faeb32b64b70370b1/intro.cpp HTTP/2.0" 200 278 "https://git.szy.lol/szymonszl/teststand/raw/commit/c0c25c1e3ce99ba7537e439faeb32b64b70370b1/intro.cpp" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
111.119.xxx.xxx - - [06/Mar/2025:13:33:40 +0100] "GET /szymonszl/komeiji/raw/commit/c70bfdb2881854463ad13d249810f53475b4983d/.gitmodules HTTP/2.0" 200 278 "https://git.szy.lol/szymonszl/komeiji/raw/commit/c70bfdb2881854463ad13d249810f53475b4983d/.gitmodules" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.47"
124.243.xxx.xxx - - [06/Mar/2025:13:36:27 +0100] "GET /szymonszl/teststand/raw/commit/c49626bc3bdb57a6ad95d716f50dd118e37235e2/intro.cpp HTTP/2.0" 200 278 "https://git.szy.lol/szymonszl/teststand/raw/commit/c49626bc3bdb57a6ad95d716f50dd118e37235e2/intro.cpp" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
This is not the first time I've heard of this problem - most notably, Fossil has a dedicated option to require JavaScript to fill in all HTML links to stop this exact thing from happening. I guess that in some time Gitea will add such a feature too. In the meantime, I’m on my own.
I considered writing a more proper FastCGI proxy in C, which, while fun, would likely be a little overkill. I can’t just enable Cloudflare, as I want to use SSH. There are other ready-made solutions* for simple anti-DDoS captchas, but that’d be a bother to set up, and boring to the point I’d rather write my own. Instead, I added a very simple protection page, right in the nginx config. It uses an HTML form to set a “captcha” cookie with the result of a simple mathematical expression. The form isn’t traditionally submitted; the cookie is set client-side by a script. This seems to do the trick. I only gated the deeper link pages, though if it comes down to it I might lock down the whole website. I do not want to do this though, since I presume it’ll interfere with git HTTP cloning, and I do not have anon SSH cloning up. None of the git cloning URLs are blocked, only the UI ones.
location ~ \/(src|commit|blame|raw|graph|compare|activity|milestones|pulls|archive|issues) { # sigh
    # No valid cookie yet: serve the challenge page instead of proxying.
    if ($cookie_crappy_gitea_captcha != '4') {
        add_header Content-Type 'text/html; charset=utf-8' always;
        # The inline script sets the cookie client-side, then reloads the page.
        return 429 '<!doctype html><meta name=viewport content="width=device-width, initial-scale=1.0"><title>CAPTCHA</title><h1>CAPTCHA</h1><p>2 + 2 = <input type=number id=n></p><button onclick="yes();">OK</button><script>function yes() { document.cookie = "crappy_gitea_captcha="+encodeURIComponent(n.value)+";max-age=31536000;path=/"; location.reload(); } n.addEventListener("keypress", (e)=>((e.code=="Enter"||e.code=="NumpadEnter")&&yes()));</script>';
    }
    limit_req zone=gitea burst=100 delay=10;
    limit_req_log_level warn;
    limit_req_status 429;
    proxy_set_header Host git.szy.lol;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_pass http://127.0.0.1:8880;
}
location / {
    proxy_set_header Host git.szy.lol;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_pass http://127.0.0.1:8880;
}
The equation and cookie name have been changed in this snippet. Remember to change the proxy headers and address to suit your setup. Additionally, some further bot protection measures have been elided, like blocking Alibaba Cloud and Huawei Cloud IPs (which I also recommend doing). Also, there’s probably a better way to pick the locations without doubling the proxy_pass mess, but icba this works. Yay for inline HTML in configs!
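One possible way to cut down on the doubled proxy lines, as a sketch rather than what actually runs on my server: nginx inherits proxy_set_header directives from the enclosing level as long as the location doesn't declare any of its own, so the four headers can live once in the server block. The server block wrapper here is an assumption about the surrounding config; the location regex and upstream address are the ones from the snippet above.

```nginx
server {
    # listen / server_name / TLS directives omitted; keep your own here.

    # Defined once at server level. Both locations inherit these, because
    # neither of them declares a proxy_set_header of its own.
    proxy_set_header Host git.szy.lol;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;

    location ~ \/(src|commit|blame|raw|graph|compare|activity|milestones|pulls|archive|issues) {
        # The cookie check (if + return 429) and the limit_req lines
        # stay here exactly as in the snippet above.
        proxy_pass http://127.0.0.1:8880;
    }
    location / {
        proxy_pass http://127.0.0.1:8880;
    }
}
```

The catch is that the inheritance is all-or-nothing: adding even a single proxy_set_header inside a location makes that location drop all four inherited ones, so they would have to be repeated there.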
Added (2025-07-27):
You can also change the status code in return 429 … to 200, apparently doing
so causes bots to stop retrying as they’ll be satisfied. You can of course
add more styling, more instruction or flavour text,
or a backlink to this post if you like it :^)
(though if you do, preferably don’t make it a hyperlink, I don’t need the bot traffic here too!)
About rate limiting, the above snippet includes the necessary
limit_req directive, but you will also need a limit_req_zone defined
outside the server {}. Check out the documentation, tune to taste.
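For reference, a minimal sketch of such a zone. Only the zone name gitea comes from the snippet above; the key, the 10m size and the 1r/s rate are placeholder assumptions, not my actual values.

```nginx
# Goes in the http {} context, outside any server {} block.
# "gitea" must match zone=gitea in the limit_req directive.
# Key, zone size (10m) and rate (1r/s) are assumed placeholders; tune to taste.
limit_req_zone $binary_remote_addr zone=gitea:10m rate=1r/s;
```

Keying on $binary_remote_addr limits each client address separately, which only helps so much when the requests are spread over many IPs. Keying on $server_name instead gives one shared limit for the whole virtual host, at the cost of throttling legitimate visitors along with the bots.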
Added (2025-10-30):
Small update! Added issues to the list, as the various sorting options seem
to be a fun playground for spiders. On top of that, I made submitting via Enter
work, and added the funny viewport tag so it’s usable on mobile.
You can test it out by going to any random file on my Gitea, like this one.
* Shoutouts to Anubis, a tool for fixing the exact same issue, for simultaneously complaining about AI scrapers and using AI-generated imagery in the blog post as well as the tool itself. It blocked me from the GNOME GitLab today, which motivated me to write my setup up. I don’t think I’ll draw an anime girl mascot for CGC though.
PS. If you’re surprised that a captcha this simple works, notice how according to Codeberg, the scraper operators would rather spend $$$ running the PoW challenges than figure out that they can send a UA without “Mozilla” in it, bypassing Anubis altogether. (unless cb patched that out, i haven’t checked)