Download the Corpus
CatholicCorpus is 35.9 GB of Catholic texts spanning 2,000 years. There are three ways to obtain it, depending on your needs. Whether you are building a RAG-based study tool, training a language model, or conducting digital humanities research, you are contributing to the shared infrastructure the Catholic AI community needs — grounded in authentic sources, ordered to the dignity of the human person and the common good.
1a. Extracted Text (Recommended)
Pre-extracted plain text from all 16 collections, ready for tokenization, embedding, and retrieval. This is what most NLP, RAG, and machine learning users want.
from datasets import load_dataset
ds = load_dataset("CatholicCorpus/catholiccorpus-text")
huggingface.co/datasets/CatholicCorpus/catholiccorpus-text
Every source file (PDF, EPUB, HTML, TEI XML) has been converted to plain .txt with provenance metadata preserved. The text corpus is significantly smaller than the raw source files, making it faster to download and easier to work with.
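To inspect a few records before committing to a full download, streaming may be useful. The following is a minimal sketch, not the project's documented workflow. It assumes the dataset exposes a split named "train", which this page does not confirm, and it makes no assumption about field names. The helpers `peek` and `show_sample` are hypothetical names introduced for illustration.

```python
from itertools import islice


def peek(records, n=3):
    """Return the first n records from any iterable as a list."""
    return list(islice(records, n))


def show_sample(n=3):
    """Stream the text dataset and print the field names of the first n records."""
    # Imported here so that peek() works without the `datasets` package.
    from datasets import load_dataset

    # streaming=True yields records lazily instead of downloading everything.
    # The split name "train" is an assumption; check the dataset page.
    ds = load_dataset(
        "CatholicCorpus/catholiccorpus-text", split="train", streaming=True
    )
    for record in peek(ds, n):
        # Field names are not documented here, so print whatever keys exist.
        print(sorted(record.keys()))
```

Calling `show_sample()` requires network access and the `datasets` package. `peek` works on any iterable, so it can also be reused on a fully downloaded dataset.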
1b. Raw Source Files
The full corpus in original formats — PDF, EPUB, TEI XML, HTML — with provenance metadata for every file. Use this if you need page layouts, original markup, scanned images, or want to run your own extraction pipeline.
from datasets import load_dataset
ds = load_dataset("CatholicCorpus/catholiccorpus")
2. Download Scripts (Git Clone)
The CatholicCorpus GitHub repository contains the download scripts used to assemble the corpus from its original sources. Clone the repo and run the scripts to reconstruct the entire corpus from scratch — useful for reproducibility, for customizing which collections to include, or for updating the corpus with new sources.
git clone https://github.com/CatholicCorpus/catholiccorpus.git
cd catholiccorpus
bash run_all.sh
The scripts download from archive.org, CCEL, Corpus Corporum, Project Gutenberg, and other original sources. Each script is self-contained and can be run independently to download a single collection.
3. Individual Downloads
Every text in the corpus is drawn from a publicly accessible source. If you only need specific texts rather than the full corpus, you can download them directly from their original hosts:
Archive.org: Patrologia Graeca, Catholic Encyclopedia, mystics, scholastics, hagiography, and more.
CCEL: Ante-Nicene Fathers, Nicene and Post-Nicene Fathers, devotional works.
Corpus Corporum: Patrologia Latina and medieval Latin texts.
Project Gutenberg: Catholic classics in English.
eBible.org: Bible translations in multiple formats.
Browse the full corpus contents on the Browse page to find source URLs for individual texts.
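Once you have a source URL for a single text, fetching it can be sketched as below. This is an illustration under stated assumptions, not the code used by the project's download scripts. The function names, the `.part` temporary suffix, and the retry settings are all hypothetical choices. The sketch uses only the Python standard library, skips files that already exist, and waits longer after each failed attempt to stay polite to the host.

```python
import time
import urllib.request
from pathlib import Path


def fetch_bytes(url, timeout=60):
    """Download a URL and return its body as bytes (standard library only)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_if_missing(url, dest, fetch=fetch_bytes, retries=3, wait_seconds=5.0):
    """Save url to dest unless dest already exists; return True if downloaded."""
    dest = Path(dest)
    if dest.exists():
        return False  # already on disk, nothing to do
    dest.parent.mkdir(parents=True, exist_ok=True)
    retries = max(1, retries)  # always make at least one attempt
    for attempt in range(1, retries + 1):
        try:
            data = fetch(url)
            break
        except OSError:  # covers urllib errors and socket timeouts
            if attempt == retries:
                raise
            # Wait a little longer after each failure before trying again.
            time.sleep(wait_seconds * attempt)
    # Write to a temporary name first so an interrupted run never leaves
    # a partial file that would later be mistaken for a finished download.
    tmp = dest.with_name(dest.name + ".part")
    tmp.write_bytes(data)
    tmp.replace(dest)
    return True
```

A call might look like `download_if_missing(source_url, "texts/example.txt")`, where `source_url` is an address taken from the Browse page and the destination path is a placeholder. The `fetch` parameter exists so the download step can be swapped out, for example in tests.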
Requirements
If you plan to run the download scripts to build the corpus from source, you will need:
Software: Python 3.8 or later, git, and the Python packages requests and lxml. Some scripts also use beautifulsoup4 and internetarchive.
Disk space: At least 40 GB free. The final corpus is 35.9 GB, but intermediate downloads and extraction may require additional space.
Time and patience: Several collections are downloaded from archive.org, which enforces rate limits. A full build may take several hours. The scripts are designed to be re-run safely — they skip files that have already been downloaded.
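The requirements above can be checked before starting a long build. The following is a rough sketch, assuming decimal gigabytes (10^9 bytes) for the 40 GB threshold. The function name `missing_requirements` is hypothetical and is not part of the repository.

```python
import importlib.util
import shutil
import sys


def missing_requirements(path=".", min_free_gb=40, packages=("requests", "lxml")):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if sys.version_info < (3, 8):
        problems.append("Python 3.8 or later is required")
    if shutil.which("git") is None:
        problems.append("git was not found on PATH")
    for name in packages:
        # find_spec returns None when a top-level package cannot be imported.
        if importlib.util.find_spec(name) is None:
            problems.append(f"Python package not installed: {name}")
    # Assumption: 1 GB = 10**9 bytes; adjust if you prefer binary units.
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.1f} GB free, need {min_free_gb} GB")
    return problems
```

To include the optional packages, pass their import names, for example `packages=("requests", "lxml", "bs4", "internetarchive")`. Note that beautifulsoup4 is imported as `bs4`.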