```python
from wikidl import WikiDL

downloader = WikiDL(
    num_proc=3,
    snapshot_date='20240801',
)
downloaded_files = downloader.start(output_dir='./output')
```
WikiDL is a command-line downloader, with an accompanying Python library, for fetching Wikipedia data dumps, primarily from Wikimedia. It is designed for researchers who need the latest Wikipedia content, which may not yet be captured in recent LLM training data; this matters because data a model has already seen can drastically skew experimental results. WikiDL keeps this workflow as simple as possible and integrates cleanly with Slurm.
In the current version of WikiDL, the default download source is Wikimedia Dumps. You can specify an alternative source (for example, a mirror), but third-party sources are not officially supported.
WikiDL can in principle support all dump types, but for now it focuses on the latest article dump (LAD) and the edit history dump (EHD).
WikiDL is designed to be researcher-friendly and easy to use in common workflows.
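Since the library exposes a plain Python entry point, running it under Slurm can be as simple as wrapping the Python snippet above in a batch script. A minimal sketch, where the `#SBATCH` job parameters and paths are placeholders chosen for illustration (they are not part of WikiDL itself):

```shell
#!/bin/bash
#SBATCH --job-name=wikidl-download
#SBATCH --cpus-per-task=3        # matches num_proc below
#SBATCH --time=12:00:00          # placeholder walltime
#SBATCH --mem=8G                 # placeholder memory request

# Run the download with the Python API shown above.
python - <<'EOF'
from wikidl import WikiDL

downloader = WikiDL(
    num_proc=3,
    snapshot_date='20240801',
)
downloaded_files = downloader.start(output_dir='./output')
print(f'Downloaded {len(downloaded_files)} files')
EOF
```

Submitting with `sbatch download.sh` then runs the download on a compute node rather than the login node.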
© 2025 Lingxi Li.
San Francisco