Thursday, February 6, 2025 at 8:29pm.
PDXRust Meetup: Spidering Wikipedia Politely In Async Rust
Please plan to arrive between 6:30 and 7. Due to limitations of the venue, we need to have someone stand outside and let people in, and we'd like them to be able to attend, so the doors will be effectively closed at 7:00, unless you're a PSU student.
Description
How many pages are reachable from Wikipedia's page on the Rust programming language in two hops? Around 30,000, it turns out, including pages on wheat flour, Welsh orthography, and the zombie apocalypse.
This exploration is super easy to do with asynchronous Rust code. Wikipedia offers a cute little REST API for querying links, and Serde makes it easy to generate requests and parse replies. And if you're feeling guilty about flooding a precious public resource with silly API requests, rate limiting is just as easy to add.
Jim Blandy will show how to wire up Tokio, Reqwest, and Serde to do the spidering, and whip up a mock server for testing using Warp. The techniques shown work nicely for all kinds of REST API scripting, including, say, GitHub.
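For a taste of the idea, here is a minimal sketch of a polite two-hop crawl. It is not the code from the talk: it is synchronous and uses only the standard library, where the talk uses Tokio, Reqwest, and Serde. The names `RateLimiter`, `reachable`, `toy_graph`, and `crawl_toy` are made up for illustration, and the link graph is a tiny in-memory stand-in for Wikipedia's API.

```rust
use std::collections::{HashMap, HashSet};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical helper: enforces a minimum interval between requests.
struct RateLimiter {
    min_interval: Duration,
    last: Option<Instant>,
}

impl RateLimiter {
    fn new(min_interval: Duration) -> Self {
        RateLimiter { min_interval, last: None }
    }

    /// Blocks until at least `min_interval` has passed since the previous call.
    fn wait(&mut self) {
        if let Some(last) = self.last {
            let elapsed = last.elapsed();
            if elapsed < self.min_interval {
                sleep(self.min_interval - elapsed);
            }
        }
        self.last = Some(Instant::now());
    }
}

/// Returns the start page plus every page reachable from it in at most
/// `hops` link-follows. `fetch_links` is called once per expanded page,
/// and the limiter is consulted before each call.
fn reachable<F>(
    start: &str,
    hops: usize,
    limiter: &mut RateLimiter,
    mut fetch_links: F,
) -> HashSet<String>
where
    F: FnMut(&str) -> Vec<String>,
{
    let mut seen: HashSet<String> = HashSet::new();
    seen.insert(start.to_string());
    let mut frontier = vec![start.to_string()];
    for _ in 0..hops {
        let mut next = Vec::new();
        for page in &frontier {
            limiter.wait();
            for link in fetch_links(page) {
                // `insert` returns true only for pages not seen before.
                if seen.insert(link.clone()) {
                    next.push(link);
                }
            }
        }
        frontier = next;
    }
    seen
}

/// Assumption: a toy link graph standing in for the remote API.
fn toy_graph() -> HashMap<&'static str, Vec<&'static str>> {
    HashMap::from([
        ("Rust", vec!["Mozilla", "Compiler"]),
        ("Mozilla", vec!["Firefox", "Rust"]),
        ("Compiler", vec!["Parsing"]),
        ("Firefox", vec!["Browser"]),
    ])
}

/// Crawls the toy graph from "Rust", waiting 10 ms between lookups.
fn crawl_toy(hops: usize) -> HashSet<String> {
    let graph = toy_graph();
    let mut limiter = RateLimiter::new(Duration::from_millis(10));
    reachable("Rust", hops, &mut limiter, |page| {
        graph
            .get(page)
            .map(|links| links.iter().map(|s| s.to_string()).collect::<Vec<String>>())
            .unwrap_or_default()
    })
}

fn main() {
    let pages = crawl_toy(2);
    println!("{} pages found within two hops", pages.len());
}
```

In an async version, the closure would presumably become an HTTP request and the blocking sleep an awaited timer, but the shape of the loop stays the same.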