WikiDL

Simple yet scalable downloader for Wikipedia research data.

Python

from wikidl import WikiDL

downloader = WikiDL(
    num_proc=3,                # number of parallel download workers
    snapshot_date='20240801',  # Wikimedia dump snapshot dated 2024-08-01
)
downloaded_files = downloader.start(output_dir='./output')  # files fetched into ./output

What is WikiDL?

WikiDL is a CLI downloader, also usable as a Python library, for fetching Wikipedia data dumps, primarily from Wikimedia. It is designed for researchers who need to stay up to date with the latest Wikipedia content, which recent LLM training data may not fully capture; data a model has already seen can drastically affect experimental results. WikiDL makes this workflow as simple as possible and integrates cleanly with Slurm.

Source of Data

In the current version of WikiDL, the default download source is Wikimedia Dumps. You can specify an alternative source (for example, a mirror), but third-party sources are not officially supported.
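
As a rough sketch of what pointing WikiDL at a mirror might look like, consider the snippet below. The `mirror_url` keyword is an assumed name used only for illustration, not a documented part of the WikiDL API; check the project documentation for the actual option.

Python

from wikidl import WikiDL

# Sketch only: `mirror_url` is a hypothetical name for the option that
# would override the default Wikimedia Dumps endpoint.
downloader = WikiDL(
    num_proc=3,
    snapshot_date='20240801',
    mirror_url='https://dumps.example.org/wikimedia/',  # assumed parameter
)
downloaded_files = downloader.start(output_dir='./output')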

WikiDL can in principle support any type of download, but for now we focus on the latest article dump (LAD) and the edit history dump (EHD):

  • LAD: The latest revision of every article that exists as of the snapshot date.
  • EHD: All revisions, from the very beginning, of every article that exists as of the snapshot date.
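
To make the distinction concrete, here is a hedged sketch of fetching one dump type or the other in separate runs. The `dump_type` parameter and its 'LAD' / 'EHD' values are hypothetical, standing in for however WikiDL actually selects between the two dumps.

Python

from wikidl import WikiDL

# Sketch only: `dump_type` is a hypothetical parameter used to illustrate
# that the two dump kinds are downloaded by separate runs.
lad = WikiDL(num_proc=3, snapshot_date='20240801', dump_type='LAD')
lad_files = lad.start(output_dir='./output/lad')  # latest revisions only

ehd = WikiDL(num_proc=3, snapshot_date='20240801', dump_type='EHD')
ehd_files = ehd.start(output_dir='./output/ehd')  # full revision history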

Key Features

WikiDL is designed to be researcher-friendly and easy to use in common workflows:

  • Slurm-compatible: Slurm is the most common environment in which WikiDL is used, so we have made it easy to run there (a sketch of a typical job follows this list).
    • We avoid progress bars, since they render poorly in Slurm logs.
    • We keep WikiDL's setup as simple as possible, because reconfiguring and resubmitting jobs under a Slurm scheduler is cumbersome.
    • We provide example files you can copy, paste, and run directly.
  • Resumable downloads: When you download a large batch of files, WikiDL keeps track of which files are already complete and skips them on the next run. Resuming a partially downloaded file is not supported yet, but we may consider it in the future.
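
As an illustration of a typical Slurm workflow, the sketch below is a minimal Python download script that could be submitted as a batch job. It only uses the WikiDL calls shown earlier; the job name, log path, and script filename are assumptions for illustration, and the snippet assumes start() returns a list of downloaded files. If the job is interrupted, resubmitting it resumes the batch: files that already finished are skipped.

Python

"""download_wikidl.py -- a minimal sketch of a WikiDL job for Slurm.

Submit with, for example:
    sbatch --job-name=wikidl --output=wikidl-%j.out --wrap "python download_wikidl.py"

No progress bars are printed, so the Slurm log stays readable. If the job
is interrupted, resubmit it as-is: files that already finished downloading
are skipped on the next run.
"""
from wikidl import WikiDL

downloader = WikiDL(
    num_proc=3,                # number of parallel download workers
    snapshot_date='20240801',  # Wikimedia dump snapshot dated 2024-08-01
)
downloaded_files = downloader.start(output_dir='./output')

# Assumes start() returns the list of files it fetched.
print(f'{len(downloaded_files)} files are now present in ./output')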

© 2025 Lingxi Li.
