WikiDL
Say goodbye to the hassle of downloading Wikipedia data dumps at scale.
WikiDL is a CLI downloader (a Python version is also available) for Wikipedia data dumps, primarily those published by Wikimedia. The tool is designed for researchers who want to quickly and conveniently stay up to date with the latest Wikipedia content, which is less likely to have been seen by recent LLMs. This matters because previously-seen data can drastically affect experimental results. WikiDL makes the process as easy as possible and fits naturally into Slurm workflows.
In the current version of WikiDL, the download source is Wikimedia Dumps. You can specify an alternative route (e.g. a mirror), but third-party sources are not officially supported.
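To make the idea of a swappable route concrete, here is a small Python sketch of building dump URLs against a configurable base. The helper function and the mirror host are made up for illustration; they are not part of WikiDL's actual interface, but the URL layout matches how Wikimedia Dumps organizes files.

```python
# Illustrative sketch: the official route and a mirror share the same path layout,
# so switching routes only means swapping the base URL.
OFFICIAL_BASE = "https://dumps.wikimedia.org"

def dump_url(base: str, wiki: str, dump_date: str, filename: str) -> str:
    """Build a dump file URL relative to a configurable base (official or mirror)."""
    return f"{base}/{wiki}/{dump_date}/{filename}"

# Official route
print(dump_url(OFFICIAL_BASE, "enwiki", "latest",
               "enwiki-latest-pages-articles.xml.bz2"))

# Hypothetical mirror route -- same layout, different host
print(dump_url("https://mirror.example.org/wikimedia", "enwiki", "latest",
               "enwiki-latest-pages-articles.xml.bz2"))
```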
WikiDL can in principle cover all dump types, but we are currently focusing on the latest article dump (LAD) and the edit history dump (EHD).
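For reference, the two dump types can usually be told apart by their filenames on Wikimedia Dumps. The patterns in the sketch below are rough and only illustrative; exact filenames vary by wiki, dump date, and compression format, so check the actual dump listing.

```python
# Illustrative sketch: rough filename patterns for the two dump types WikiDL targets.
import fnmatch

LAD_PATTERN = "*-pages-articles.xml.bz2"            # latest article dump (current revisions)
EHD_PATTERNS = ["*-pages-meta-history*.xml*.bz2",   # edit history dump (all revisions)
                "*-pages-meta-history*.xml*.7z"]

def classify(filename: str) -> str:
    if fnmatch.fnmatch(filename, LAD_PATTERN):
        return "LAD"
    if any(fnmatch.fnmatch(filename, p) for p in EHD_PATTERNS):
        return "EHD"
    return "other"

print(classify("enwiki-latest-pages-articles.xml.bz2"))                 # LAD
print(classify("enwiki-20240601-pages-meta-history1.xml-p1p844.bz2"))   # EHD
```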
The design of WikiDL is researcher-oriented: we consider researchers' use cases first. Slurm is the most common environment in which WikiDL runs, so we have made it friendly and easy to use inside Slurm jobs. In particular:
We support file-level resuming. When you download a large batch of files, WikiDL keeps track of which files have already been downloaded and does not fetch them again. Resuming a partially downloaded file is not yet supported, but we may add it in the future (see the sketch below for the general idea).
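The sketch below shows how file-level resuming and Slurm array jobs can fit together. The manifest format, function names, and download logic here are illustrative assumptions, not WikiDL's actual implementation; the Slurm environment variables (SLURM_ARRAY_TASK_ID, SLURM_ARRAY_TASK_COUNT) are standard ones set inside array jobs.

```python
# Illustrative sketch: split a batch of URLs across Slurm array tasks and
# skip files that a simple manifest records as already finished.
import os
import urllib.request
from pathlib import Path

DONE_MANIFEST = Path("downloaded.txt")   # one completed filename per line (assumed format)

def already_done() -> set[str]:
    return set(DONE_MANIFEST.read_text().split()) if DONE_MANIFEST.exists() else set()

def mark_done(name: str) -> None:
    with DONE_MANIFEST.open("a") as f:
        f.write(name + "\n")

def download(url: str, dest: Path) -> None:
    # Whole-file download; half-finished files are not resumed (matching the note above).
    tmp = dest.with_suffix(dest.suffix + ".part")
    urllib.request.urlretrieve(url, tmp)
    tmp.rename(dest)

def run(urls: list[str], out_dir: Path) -> None:
    # Outside Slurm this falls back to a single task handling everything.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))
    task_count = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", 1))
    done = already_done()
    for i, url in enumerate(urls):
        if i % task_count != task_id:
            continue  # another array task handles this file
        name = url.rsplit("/", 1)[-1]
        if name in done:
            continue  # file-level resume: skip files recorded as finished
        download(url, out_dir / name)
        mark_done(name)
```

The key design point this illustrates is that resume state lives at the granularity of whole files, which keeps the bookkeeping trivial and makes restarts after a killed or timed-out Slurm job cheap.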