CatholicCorpus

The open textual layer for Catholic AI — 2,000 years of the Catholic intellectual tradition, structured for NLP, RAG, and digital humanities.

2.6B Tokens
67,772 Documents
3 Languages
2,000 Years of Texts

Latin (~60%) · English (~25%) · Greek (~15%)


What's Included

The corpus is organized into 16 thematic collections, drawn from major open-access repositories of Catholic scholarship. Every file is accompanied by provenance metadata documenting its source, license, and attribution requirements.

| Collection | Files | Size |
| --- | ---: | ---: |
| Git Repositories (Aquinas, CSEL, Bible Texts) | 4,668 | 3.4 GB |
| Corpus Corporum (Latin Patristics & Medieval) | 29,820 | 4.2 GB |
| Project Gutenberg (Catholic Classics) | 18 | 9.0 MB |
| Direct Downloads (Bibles, Catechisms, Canon Law) | 1,244 | 42.1 MB |
| CCEL (Church Fathers & Devotional) | 61 | 691.2 MB |
| Catholic Encyclopedia (1913) | 18 | 1.0 GB |
| Latin Scholastics & Franciscan Texts | 30 | 1.1 GB |
| Liturgical Texts, Hymns & Encyclicals | 31,141 | 202.4 MB |
| Archive.org (Hagiography, Councils, Albert) | 65 | 2.1 GB |
| Catholic Bibles (Douay-Rheims, Knox, etc.) | 12 | 673.3 MB |
| English Catholic Thinkers (Newman, More, etc.) | 45 | 841.2 MB |
| Catholic Mystics (Teresa, John of the Cross, etc.) | 13 | 182.2 MB |
| Liturgical Expansion (Missals, Breviaries) | 16 | 2.2 GB |
| Late Scholastics (Suárez, Bellarmine, etc.) | 42 | 3.9 GB |
| Patrologia Graeca (Migne, 161 volumes) | 471 | 15.5 GB |
| Modern Magisterium (Vatican II, CCC, Encyclicals) | 108 | 7.4 MB |
| Total | 67,772 | 35.9 GB |

Why This Exists

Catholic intellectual history is one of the longest, richest, and most internally coherent textual traditions in the world — from the Apostolic Fathers through the Patristic era, the Scholastic synthesis, the Counter-Reformation, and into the modern Magisterium. Yet until now, there has been no single, open, machine-readable corpus that makes this tradition accessible to researchers working in natural language processing, computational theology, or digital humanities.

CatholicCorpus exists so that builders working in the Catholic intellectual tradition have an open, citable, ML-ready textual foundation. It is neither a substitute for the Church's teaching nor a competitor to commercial Catholic chatbots, but the open data layer that makes both more accountable to source texts. Pope Leo XIV has described AI development as a form of "participation in the divine act of creation" that carries "ethical and spiritual weight, for every design choice expresses a vision of humanity" (Message to the Builders AI Forum, 3 November 2025, citing Antiqua et Nova §37).

The Vatican's Antiqua et Nova and the Rome Call for AI Ethics both identify grounded, verifiable AI as a moral priority. CatholicCorpus provides the open infrastructure that retrieval-augmented generation (RAG) systems, scholarly search tools, and Catholic AI projects need to ground their outputs in authentic sources — with citations and provenance, not hallucination.

Existing efforts are either proprietary and commercially gated, fragmentary (individual digitization projects covering one era or one language), or locked behind interfaces that prohibit bulk access. CatholicCorpus changes this. Every text here is either public domain or openly licensed. Every file has a provenance sidecar documenting where it came from, who digitized it, and under what terms it may be used. The corpus spans Latin, Greek, and English — serving the global Church, not only the English-speaking world.
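
A provenance sidecar can be checked programmatically before a file is used. The sketch below is illustrative only. The sidecar layout (a JSON file beside each text, named `<file>.provenance.json`) and the field names `source`, `digitized_by`, and `license` are assumptions made for this example, not the corpus's documented schema; consult the actual sidecar files for the real names.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical field names; the real sidecar schema may differ.
REQUIRED_FIELDS = ("source", "digitized_by", "license")


def sidecar_path(text_file: Path) -> Path:
    """Assumed naming convention: 'summa.txt' -> 'summa.txt.provenance.json'."""
    return text_file.with_name(text_file.name + ".provenance.json")


def load_provenance(text_file: Path) -> dict:
    """Read a file's sidecar and fail loudly if a required field is missing or empty."""
    record = json.loads(sidecar_path(text_file).read_text(encoding="utf-8"))
    missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
    if missing:
        raise ValueError(f"{text_file.name}: sidecar lacks {missing}")
    return record


def demo() -> dict:
    """Write one placeholder text/sidecar pair to a temp dir and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        text_file = Path(tmp) / "example.txt"
        text_file.write_text("placeholder text", encoding="utf-8")
        sidecar_path(text_file).write_text(
            json.dumps({
                "source": "example repository",    # placeholder values
                "digitized_by": "example project",
                "license": "Public Domain",
            }),
            encoding="utf-8",
        )
        return load_provenance(text_file)


if __name__ == "__main__":
    print(demo())
```

Failing on a missing field, rather than silently defaulting, keeps unattributed text out of downstream pipelines.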


How to Use It

CatholicCorpus is designed for three primary use cases — all oriented toward AI that augments the human person rather than substituting for her:

Grounded generation (RAG). Every file is indexed with structured metadata. The corpus can serve as the knowledge base for AI systems that need to ground their responses in the Catholic textual tradition — with citations and source accountability — rather than relying on ungrounded generation that hallucinates or distorts Church teaching. This is the use case most directly aligned with Catholic AI principles: expanding knowledge bases with authentic Magisterial teaching so that AI tools serve truth, not replace discernment.
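
As a rough illustration of grounded generation, the sketch below retrieves passages by simple word overlap and builds a prompt in which every context line carries its own citation. It is a toy, not the project's pipeline: the `Passage` fields, the scoring function, and the prompt wording are all assumptions made for this example, and a real system would likely use embedding-based retrieval and a stopword-aware tokenizer. The sample passages are Douay-Rheims verses entered by hand.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    text: str
    source: str    # hypothetical provenance fields
    license: str


PASSAGES = [
    Passage("In the beginning God created heaven, and earth.",
            "Douay-Rheims Bible, Genesis 1:1", "Public Domain"),
    Passage("In the beginning was the Word, and the Word was with God, "
            "and the Word was God.",
            "Douay-Rheims Bible, John 1:1", "Public Domain"),
    Passage("The Lord ruleth me: and I shall want nothing.",
            "Douay-Rheims Bible, Psalm 22:1", "Public Domain"),
]


def tokenize(text: str) -> list[str]:
    """Lowercase word tokens; real systems would use a proper tokenizer."""
    return re.findall(r"\w+", text.lower())


def score(query: str, passage: Passage) -> int:
    """Size of the multiset overlap between query tokens and passage tokens."""
    q, p = Counter(tokenize(query)), Counter(tokenize(passage.text))
    return sum((q & p).values())


def retrieve(query: str, passages: list[Passage], k: int = 2) -> list[Passage]:
    """Return up to k passages with non-zero overlap, best first."""
    ranked = sorted(passages, key=lambda p: score(query, p), reverse=True)
    return [p for p in ranked[:k] if score(query, p) > 0]


def build_prompt(query: str, hits: list[Passage]) -> str:
    """Assemble a prompt whose context lines carry their own citations."""
    context = "\n".join(
        f"[{i}] {p.text} (Source: {p.source}; License: {p.license})"
        for i, p in enumerate(hits, start=1)
    )
    return f"Answer using only the numbered sources.\n{context}\nQuestion: {query}"


if __name__ == "__main__":
    question = "Who was with God in the beginning?"
    print(build_prompt(question, retrieve(question, PASSAGES)))
```

Keeping the citation attached to each passage, rather than looking it up afterwards, means a generated answer can always be traced back to a numbered source.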

NLP and machine learning. The corpus provides over 35 GB of source material in three languages (Latin, Greek, English) across diverse genres: patristic theology, scholastic philosophy, liturgical poetry, hagiography, canon law, biblical texts, and papal encyclicals. This makes it suitable for training or fine-tuning language models on Catholic theological and philosophical discourse, supporting the kind of tool AI that keeps the human person in the loop as the moral agent and interpreter of meaning.
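
For planning a training mix, the headline figures give a rough per-language token budget. The calculation below assumes the approximate language percentages (Latin ~60%, English ~25%, Greek ~15%) are shares of the 2.6B tokens rather than of documents or bytes; the source figures do not say which, so treat the results as order-of-magnitude estimates.

```python
TOTAL_TOKENS = 2_600_000_000  # headline figure: 2.6B tokens

# Approximate shares from the language breakdown; ASSUMED to be token shares.
LANGUAGE_SHARE = {"Latin": 0.60, "English": 0.25, "Greek": 0.15}


def tokens_by_language(total: int, shares: dict[str, float]) -> dict[str, int]:
    """Split a total token count according to fractional shares."""
    return {lang: round(total * share) for lang, share in shares.items()}


estimate = tokens_by_language(TOTAL_TOKENS, LANGUAGE_SHARE)

if __name__ == "__main__":
    for lang, count in estimate.items():
        print(f"{lang}: ~{count / 1e9:.2f}B tokens")
```

Under that assumption the split is about 1.56B Latin, 0.65B English, and 0.39B Greek tokens.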

Digital humanities research. The breadth of the corpus (from 1st-century Church Fathers through 21st-century papal documents) enables longitudinal studies of theological vocabulary, doctrinal development, genre evolution, and cross-linguistic transmission. Semantic search across Latin, Greek, and English can surface connections across the global Church's traditions, including the Greek-speaking Eastern voices preserved in collections such as the Patrologia Graeca.
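
A longitudinal vocabulary study can start from something as simple as a term's relative frequency per century. The sketch below is a minimal version under stated assumptions: each document is supplied as a `(year, text)` pair (how dates are actually recorded in the corpus metadata is not specified here), matching is exact on the surface form, and the sample strings are toy token sequences rather than quotations. Inflected languages such as Latin and Greek would need lemmatization for meaningful counts.

```python
import re
from collections import defaultdict


def century(year: int) -> int:
    """Map a year AD to its century, e.g. 1274 -> 13, 400 -> 4."""
    return (year - 1) // 100 + 1


def relative_frequency_by_century(docs, term):
    """docs: iterable of (year, text). Returns {century: hits per 1,000 tokens}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for year, text in docs:
        tokens = re.findall(r"\w+", text.lower())
        c = century(year)
        totals[c] += len(tokens)
        hits[c] += tokens.count(term.lower())
    return {c: 1000 * hits[c] / totals[c] for c in sorted(totals) if totals[c]}


# Toy token strings for illustration only; not quotations from any work.
SAMPLE_DOCS = [
    (397, "gratia gratia fides"),
    (1270, "gratia esse esse essentia"),
]

if __name__ == "__main__":
    print(relative_frequency_by_century(SAMPLE_DOCS, "esse"))
```

Normalizing by tokens per century matters because the collections are very uneven in size; raw counts would mostly reflect how much text survives from each period.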

Ready to get started? Visit the Download page for instructions on obtaining the full corpus.


Roadmap

v1.0: Foundation Release (Shipped)
The full corpus is live on Hugging Face and GitHub. 2.6 billion tokens, 67,772 documents, 16 collections, provenance metadata for every file.

v1.1: Extracted Text Layer (Shipped)
Pre-extracted plain text from all PDF, EPUB, HTML, and TEI XML sources, published as a separate dataset: catholiccorpus-text.

v1.2: OCR Enhancement (Target: Q4 2026)
Re-OCR of low-quality archive.org scans with quality comparison reports. Unlocks full-text search across scan-heavy collections.

v2.0: Aligned Bilingual Editions (Open-ended)
Sentence-aligned Latin-English parallel texts for Aquinas, Bonaventure, and Augustine.

See ROADMAP.md for full milestone definitions.