How would you implement rate limiting in this? It would kind of involve shared state between all threads, no? Or can you somehow apply a limit to the executor?
Sharing things between tasks is ok with threads, as long as you do atomic operation or you lock.
Of course, as soon as you share anything in a concurrent environment, you invite a lot of complexity back.
If you want to to scrap a few URLs, the ThreadExecutor makes sense. If you want to do it seriously, using a framework like scrapy (delegating to libs and infras) is preferable, as it has done the hard part for you. Espacially, it will implement retry, exponential backoff, jitter, insert randomness in everything, etc. Not to mention with cloudflare protecting half the web nowaday, you may need a full Saas to bypass it.
But if you have to crawl your intranet, an executor is the Pareto solution.
How would you implement rate limiting in this? It would kind of involve shared state between all threads, no? Or can you somehow apply a limit to the executor?
With a semaphore (https://docs.python.org/3/library/threading.html#semaphore-objects) or something similar.
Sharing things between tasks is ok with threads, as long as you do atomic operation or you lock.
Of course, as soon as you share anything in a concurrent environment, you invite a lot of complexity back.
If you want to to scrap a few URLs, the ThreadExecutor makes sense. If you want to do it seriously, using a framework like scrapy (delegating to libs and infras) is preferable, as it has done the hard part for you. Espacially, it will implement retry, exponential backoff, jitter, insert randomness in everything, etc. Not to mention with cloudflare protecting half the web nowaday, you may need a full Saas to bypass it.
But if you have to crawl your intranet, an executor is the Pareto solution.