Crawling a billion web pages in just over 24 hours, in 2025

148 points by pseudolus 18 hours ago | 50 comments

bndr
I run a small startup called SEOJuice, where I need to crawl a lot of pages all the time, and I can say that the biggest issue with crawling is the blocking part and how much you need to invest to circumvent Cloudflare and similar, just to get access to any website. The bandwidth and storage are the smallest cost factor.
Even though, in my case, users add their own domains, it still took me quite a bit of time to reach a 99% chance of crawling a website, using a mix of residential proxies, captcha solvers, rotating user-agents, and stealth chrome binaries. Otherwise I would get a 403 immediately, with no HTML being served.
- mettamage
  I wonder if circumvention is legal. It's so odd. In the US it seems you can just do this, whereas if I started something like this in the EU, I don't think I could.
  fuomag9
  In Italy it’s a crime punishable by up to 12 years to access any protected computer system without authorization, especially if it causes a DoS to the owner
  Consider the case of self-hosting a web service on a low-performance server while abusive crawling fetches data in a loop (which was happening when I was self-hosting GitLab!)
  https://www.brocardi.it/codice-penale/libro-secondo/titolo-x...
- mrweasel
  Can't your users just whitelist your IPs?
  dewey
  I'm in a similar boat and getting customers to whitelist IPs is always a big ask. In the best case they call their "tech guy", in the worst case it's a department far away and it has to go through 3 layers of reviews for someone to adapt some Cloudflare / Akamai rules.
  And then you better make sure your IP is stable and a cloud provider isn't changing any IP assignments in the future, where you'll then have to contact all your clients again with that ask.
  bndr
  They're mostly non-technical/marketing people, but yes that would be a solution. I try to solve the issue "behind the scenes" so for them it "just works", but that means building all of these extra measures.
  cassepipe
  Would it make sense to offer the more technically minded a discount if they set up an IP whitelist, with a tutorial you could provide? A discount in exchange for reduced costs to you?
- 0xdeadbeefbabe
  Blocking seems really popular. I wonder if it coincides with stack overflow closing.
- gilrain
  > the biggest issue with crawling is the blocking part and how much you need to invest to circumvent Cloudflare and similar … mix of residential proxies, captcha solvers, rotating user-agents, stealth chrome binaries
  I would like to register my hatred and contempt for what you do. I sincerely hope you suffer drastic consequences for your antisocial behavior.
  bndr
  Please elaborate, why exactly is it antisocial? Because Cloudflare decides who can or can't access a user's website, even when they specifically signed up for my service?
  demetris
  But how does that work?
  Does Cloudflare force firewall rules for those who choose to use it for their websites?
  If the tool that does the crawling identifies itself properly, does Cloudflare block it even if users do not tell Cloudflare to block it?
  gilrain
  It intentionally circumvents the explicit desires of those who own the websites being exploited. It is nonconsensual. It says “fuck you, yes” to a clearly-communicated “please no”.
  joncrane
  OP literally said that users add their domains, meaning they are explicitly ASKING OP to scrape their websites.
  bndr
  Users sign up for my service.
  gilrain
  You employ residential proxies. As such, you enable and exploit the ongoing destruction of the Internet commons. Enjoy the money!
  christoff12
  This is kind of like getting upset with people who go to ATMs because drug dealers transact in cash lol.
  toomuchtodo
  Cloudflare and Big Tech are primary contributors to the impairment and decline of the Internet commons for moats, control, and profit; you are upset at the wrong parties.
  prettyblocks
  I would argue that the ability to crawl and scrape is core to the original ethos of the internet and all the hoops people jump through to block non-abusive scraping of content is in fact more anti-social than circumventing these mechanisms.
- spiderfarmer
  Just stop scraping. I'll do everything to block you.
  ssgodderidge
  > in my case, users add their own domains
  Seems like they're only scraping websites their clients specifically ask them to
  Keyframe
  Now you've gamified it :)
  shimman
  It's a pretty easy game to win as the blocker. If you receive too many 404s against pages that don't exist, just ban the IP for a month. Actually got the idea from a hackernews comment too. Also thinking that if you crawl too many pages you should get banned as well.
  There's no point in playing tug of war against unethical actors, just ban them and be done with it.
  I don't think this is an uncommon stance either, and crawlers are not users I want to help in any capacity.
  stevewodil
  What if the crawler is using a shared IP and you end up blocking legitimate users along with the bad actor?
  Keyframe
  He said "it's pretty easy", probably not realizing there are whole industries on both sides of that cat and mouse game, making it not easy.
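The 404-based ban described in this subthread can be sketched roughly as follows. This is a minimal illustration, not anyone's production setup: the class name `NotFoundBanList` and the thresholds (20 misses per 60 seconds) are assumptions made up for the example; only the one-month ban comes from the comment.

```python
import time
from collections import defaultdict, deque


class NotFoundBanList:
    """Ban an IP once it triggers too many 404s inside a sliding window."""

    def __init__(self, max_404s=20, window_s=60.0, ban_s=30 * 24 * 3600):
        self.max_404s = max_404s          # hypothetical threshold
        self.window_s = window_s          # hypothetical window length
        self.ban_s = ban_s                # roughly one month, as in the comment
        self._hits = defaultdict(deque)   # ip -> timestamps of recent 404s
        self._banned_until = {}           # ip -> time at which the ban expires

    def is_banned(self, ip, now=None):
        now = time.monotonic() if now is None else now
        return self._banned_until.get(ip, 0.0) > now

    def record(self, ip, status, now=None):
        """Record one response; return True if the IP is banned afterwards."""
        now = time.monotonic() if now is None else now
        if status == 404:
            hits = self._hits[ip]
            hits.append(now)
            # Forget 404s that have fallen out of the sliding window.
            while hits and now - hits[0] > self.window_s:
                hits.popleft()
            if len(hits) > self.max_404s:
                self._banned_until[ip] = now + self.ban_s
                hits.clear()
        return self.is_banned(ip, now)
```

Because the key is the client IP, this has exactly the weakness raised above: everyone behind a shared IP or proxy is banned together.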
throwaway77385
> spinning disks have been replaced by NVMe solid state drives with near-RAM I/O bandwidth
Am I missing something here? Even Optane is an order of magnitude slower than RAM.
Yes, under ideal conditions, SSDs can have very fast linear reads, but IOPS / latency have barely improved in recent years. And that's what really makes a difference.
Of course, compared to spinning disks, they are much faster, but the comparison to RAM seems wrong.
In fact, for applications like AI, even using system RAM is often considered too slow, simply because of the distance to the GPU, so VRAM needs to be used. That's how latency-sensitive some applications have become.
- fluoridation
  >for applications like AI, even using system RAM is often considered too slow, simply because of the distance to the GPU
  That's not why. It's because RAM has a narrower bus than VRAM. If it was a matter of distance it'd just have greater latency, but that would still give you tons of bandwidth to play with.
  dist-epoch
  You could be charitable and say the bus is narrow because it has to travel a long distance and this makes it hard to have a lot of traces.
  fluoridation
  It's not. It's narrow even between the CPU and RAM. That's just the way x86 is designed. Nvidia and AMD by contrast have the luxury of being able to rearchitect their single-board computers each generation as long as they honor the PCIe interface.
  It is also true that having a 384-bit memory bus shared with the video card would necessitate a redesigned PCIe slot as well as an outrageous number of traces on the motherboard, though.
  adrian_b
  Traditionally, the width of the GPU memory interfaces was many times greater than that of CPUs.
  However, the maximum width in consumer GPUs, of up to 1024-bit, was reached many years ago.
  Since then the width of the memory interfaces in consumer GPUs has been decreasing continuously, and this decrease has been only partially compensated by higher memory clock frequencies. This reduction has been driven by NVIDIA, in order to increase their profit margins by reducing the memory cost.
  Nowadays, most GPU owners must be content with a memory interface no better than 192-bit, like in RTX 5070, which is only 50% wider than for a desktop CPU and much narrower than for a workstation or server CPU.
  The reason why using the main memory in GPUs is slow has nothing to do with the width of the CPU memory interface, but it is caused by the fact that the GPU accesses the main memory through PCIe, so it is limited by the throughput of at most 16 PCIe lanes, which is much lower than that of either the GPU memory interface or the CPU memory interface.
  dist-epoch
  ThreadRipper has 8 memory channels versus 2 for a desktop AMD CPU. It's not an x86 limitation.
  fluoridation
  "x86" as in the computer architecture, not the ISA. Why do you think they put extra channels instead of just having a single 512-bit bus?
  adrian_b
  The memory interface of CPUs is made wider by adding more channels because there are no memory modules with a 512-bit interface. Thus you must add multiples of the module width to the CPU memory interface.
  This has nothing to do with x86, but it is determined by the JEDEC standards for DRAM packages and DRAM modules. The ARM server CPUs use the same number of memory channels, because they must use the same memory modules.
  A standard DDR5 memory module has a memory interface that is 64, 72, or 80 bits wide, depending on how many extra bits are available for ECC. The module's interface is partitioned into 2 channels, to allow concurrent accesses at different memory addresses. Although current memory channels are therefore 32, 36, or 40 bits wide, few people are aware of this, so by "memory channel" most people mean 64 bits (or 72 bits with ECC), because that was the width of the memory channel in older memory generations.
  Not counting ECC bits, most desktop and laptop CPUs have a 128-bit memory interface, some cheaper server and workstation CPUs have a 256-bit memory interface, many server CPUs and some workstation CPUs have a 512-bit memory interface, while the state-of-the-art server CPUs have a 768-bit memory interface.
  For comparison, RTX 5070 has a 192-bit memory interface, RTX 5080 has a 256-bit memory interface and RTX 5090 has a 512-bit memory interface. However, the GDDR7 memory has a transfer rate that is 4 to 5 times higher than DDR5, which makes the GPU interfaces faster, despite their similar or even lower widths.
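The width-versus-transfer-rate argument in this subthread comes down to one multiplication: peak bandwidth is bus width in bytes times transfers per second. A rough sketch follows; the data rates are assumptions not stated in the thread (DDR5-6000 for the desktop CPU, 28 GT/s per pin for GDDR7, and PCIe 5.0 with 128b/130b encoding for the x16 link).

```python
def peak_bandwidth_gb_s(bus_width_bits, giga_transfers_per_s):
    """Peak bandwidth in GB/s: bus width in bytes times GT/s."""
    return bus_width_bits / 8 * giga_transfers_per_s


# Assumed data rates (not from the thread): DDR5-6000 and 28 GT/s GDDR7.
desktop_ddr5 = peak_bandwidth_gb_s(128, 6.0)   # 128-bit desktop CPU interface
rtx_5070 = peak_bandwidth_gb_s(192, 28.0)      # 192-bit GDDR7 interface
rtx_5090 = peak_bandwidth_gb_s(512, 28.0)      # 512-bit GDDR7 interface

# Assumed PCIe 5.0 x16: 16 lanes, 32 GT/s per lane, one bit per transfer,
# 128b/130b line encoding, one direction only.
pcie5_x16 = 16 * 32.0 / 8 * (128 / 130)

for name, gb_s in [
    ("desktop DDR5, 128-bit", desktop_ddr5),
    ("RTX 5070, 192-bit", rtx_5070),
    ("RTX 5090, 512-bit", rtx_5090),
    ("PCIe 5.0 x16", pcie5_x16),
]:
    print(f"{name}: {gb_s:.0f} GB/s")
```

Under these assumptions the PCIe link is the narrowest of the four paths, which is consistent with the point that a GPU reading main memory is limited by PCIe rather than by either memory interface.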
finnlab
Nice work, but I feel like it's not required to use AWS for this. There are small hosting companies with specialized servers (50gbit shared medium for under 10$), you could probably do this under 100$ with some optimization.
- nurettin
  I did some crawling on Hetzner back in the day. They monitor traffic and make sure you don't automate retrieval of publicly available data. They send you an email telling you that they are concerned because you got the IP blacklisted. Funny thing is: they own the blacklist that they refer to.
  jeroenhd
  If Hetzner actually puts their own customers on their blacklist then that list becomes more trustworthy.
  They were right to blacklist you, they were right to complain to you, and they were right not to assume malice and kick you off their platform/shut down your server.
  qingcharles
  This. I tried to run a very slow DHT scraper I was writing on a Hetzner server and within minutes they were on my ass. I don't want to make an enemy of them so I killed it immediately, but they are clearly very sensitive to anything outside of "normal".
- varispeed
  This. AWS is like a cash furnace, only really usable for VC backed efforts with more money than sense.
snowhale
The anti-bot stuff mentioned upthread is real, but at this scale per-domain politeness queuing also becomes a genuine headache. You end up needing to track crawl-delay directives per domain, rate-limit your outbound queues by host, and handle DNS TTL properly to avoid hammering a CDN edge that's mapping thousands of domains to the same IPs. Most crawlers that work fine at 100M pages break somewhere in that machinery at 1B+.
- overfeed
  > this scale per-domain politeness queuing also becomes a genuine headache
  Not really a headache - if you've ever implemented resource-based, server-side rate limiting (per-endpoint, with client-ID and/or IP buckets), that's all the logic that's required, adapted for the client side. One could wrap rate-limiting libraries designed for server-side usage and call it a day.
  I hate how people who are bad at parallelizing their user-agents across the internet are causing needless pain and giving scrapers a bad name. They are also causing blowback on the more well-behaved scrapers.
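The per-host politeness queuing discussed in this subthread can be sketched as a small scheduler. This is a simplified, single-process illustration: `PolitenessQueue`, `example-bot`, and the one-second default delay are hypothetical names and values, and `RobotFileParser.crawl_delay` from the Python standard library is used to read the Crawl-delay directive.

```python
import heapq
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PolitenessQueue:
    """Hand out URLs so each host is fetched at most once per its delay."""

    def __init__(self, default_delay_s=1.0):
        self.default_delay_s = default_delay_s  # hypothetical default
        self._delay = {}      # host -> seconds between requests
        self._pending = {}    # host -> URLs waiting, in FIFO order
        self._next_ok = {}    # host -> earliest time of the next fetch
        self._ready_at = []   # heap of (earliest fetch time, host)

    def set_robots(self, host, robots_txt, user_agent="example-bot"):
        """Read Crawl-delay for `user_agent` from robots.txt text, if present."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        delay = parser.crawl_delay(user_agent)
        if delay is not None:
            self._delay[host] = float(delay)

    def add(self, url, now=0.0):
        host = urlsplit(url).hostname
        if host not in self._pending:
            self._pending[host] = []
            ready = max(now, self._next_ok.get(host, 0.0))
            heapq.heappush(self._ready_at, (ready, host))
        self._pending[host].append(url)

    def pop(self, now):
        """Return a URL whose host may be fetched at `now`, else None."""
        if not self._ready_at or self._ready_at[0][0] > now:
            return None
        _, host = heapq.heappop(self._ready_at)
        url = self._pending[host].pop(0)
        delay = self._delay.get(host, self.default_delay_s)
        self._next_ok[host] = now + delay
        if self._pending[host]:
            heapq.heappush(self._ready_at, (now + delay, host))
        else:
            del self._pending[host]
        return url
```

Keying by hostname does not address the CDN concern raised above: when thousands of domains resolve to the same edge IPs, a crawler would presumably also need a second limit keyed by resolved address.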
dangoodmanUT
> because redis began to hit 120 ops/sec and I’d read that any more would cause issues
Suspicious. I don’t think I’ve ever read anything that says redis taps out below tens of thousands of ops…
thefounder
Well, the most important part seems to be glossed over, and that’s the IP addresses. Many websites simply block (or want to block) anything that’s not Google and is not a “real user”.
ph4rsikal
When I read this, I realize how small Google makes the Internet.
sunpolice
I was able to get 35k req/sec on a single node with Rust (custom http stack + custom html parser, custom queue, custom kv database) with obsessive optimization. It's possible to scrape Bing size index (say 100B docs) each month with only 10 nodes, under 15k$.
Thought about making it public but probably no one would use it.
- charlesdenault
  please do
  mamsouuu
  Yes! Please do!
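As a sanity check on the figures in this subthread and in the headline, the arithmetic can be worked through directly. The only inputs are numbers quoted above (35k req/sec per node, 10 nodes, 100B documents per month, one billion pages in 24 hours); the 30-day month is an assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30  # assumption for the estimate

# Sustained rate needed for the headline: one billion pages in 24 hours.
pages_per_s_for_1b_per_day = 1_000_000_000 / SECONDS_PER_DAY

# Capacity implied by the comment: 35k req/s per node across 10 nodes.
docs_per_month = 35_000 * 10 * DAYS_PER_MONTH * SECONDS_PER_DAY

# Total rate needed to fetch 100B documents in one month.
req_per_s_for_100b = 100_000_000_000 / (DAYS_PER_MONTH * SECONDS_PER_DAY)

print(f"1B pages/day needs about {pages_per_s_for_1b_per_day:,.0f} pages/s")
print(f"10 nodes at 35k req/s: {docs_per_month:,} docs/month")
print(f"100B docs/month needs about {req_per_s_for_100b:,.0f} req/s in total")
```

At the quoted peak rate, ten nodes would have roughly nine times the raw request capacity needed for 100B documents a month, so the claim leaves room for per-host rate limits, retries, and downtime.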
handfuloflight
There was a time when being able to do this meant you were on the path to becoming a (m)(b)illionaire. Still is, I think.
corv
Python is obviously too slow for web-scale
gethly
> I also truncated page content to 250KB before passing it to the parser.
WTF did I just read?
- tengada1
  It's just HTML, presumably not requesting JS libraries. So 250K is a large amount.
  gethly
  Exactly - how can an HTML page need to be trimmed to 250 KB??? That is insane. Something is not right with this article.
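For reference, the truncation step quoted from the article could look something like the sketch below. The article's actual code is not shown in this thread, so the function name `truncate_for_parser`, the decimal reading of "250KB", and the UTF-8 default are all assumptions.

```python
MAX_BYTES = 250 * 1000  # "250KB" from the quote, read as decimal kilobytes


def truncate_for_parser(body: bytes, limit: int = MAX_BYTES,
                        encoding: str = "utf-8") -> str:
    """Cut the raw body to `limit` bytes, then decode it for the parser.

    errors="ignore" drops any bytes that do not decode, which includes a
    multi-byte character split in half by the cut.
    """
    return body[:limit].decode(encoding, errors="ignore")
```

A cap like this bounds parser time and memory per page regardless of how large the response is, at the cost of missing links that appear late in very long documents.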