About CatholicCorpus

Project Overview

CatholicCorpus is an open-access, NLP-ready corpus of Catholic texts spanning the entire arc of the Catholic intellectual tradition — from the Apostolic Fathers and the Patristic era through the Scholastic synthesis, the Counter-Reformation mystics, and into the modern Magisterium. It contains 67,772 content files totaling 35.9 GB, organized into 16 thematic collections.

The project exists to provide open infrastructure for Catholic AI, NLP research, and digital humanities. Pope Leo XIV has described AI development as a form of "participation in the divine act of creation" that carries "ethical and spiritual weight, for every design choice expresses a vision of humanity" (Message to the Builders AI Forum, 3 November 2025, citing Antiqua et Nova §37). Making the Church's textual heritage machine-readable is itself an act of service to this vision: it enables AI systems to ground their outputs in authentic, verifiable sources rather than drifting on unaccountable training data.

Every file in the corpus is accompanied by a provenance sidecar (_source.json) documenting its origin, license, and attribution requirements. The goal is a corpus that is fully transparent about what it contains, where it came from, and how it may be used — ordered to truth, the dignity of the human person, and the common good.
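As a minimal sketch of how a consumer might read this provenance data, the Python below looks for a _source.json file next to a content file and summarizes it. The layout (a sidecar in the same directory as the content file) and the field names origin and license are assumptions for illustration only; the actual sidecar schema may differ.

```python
import json
from pathlib import Path
from typing import Optional

SIDECAR_NAME = "_source.json"  # sidecar filename named in the project description


def load_provenance(content_file: Path) -> Optional[dict]:
    """Return the parsed sidecar for content_file, or None if there is none.

    Assumption: the sidecar sits in the same directory as the content file.
    """
    sidecar = content_file.parent / SIDECAR_NAME
    if not sidecar.is_file():
        return None
    with sidecar.open(encoding="utf-8") as fh:
        return json.load(fh)


def describe(content_file: Path) -> str:
    """Build a one-line provenance summary using hypothetical field names."""
    meta = load_provenance(content_file)
    if meta is None:
        return f"{content_file.name}: no provenance sidecar found"
    # "origin" and "license" are illustrative keys, not a confirmed schema.
    origin = meta.get("origin", "unknown origin")
    license_name = meta.get("license", "unknown license")
    return f"{content_file.name}: {origin} ({license_name})"
```

Returning None for a missing sidecar, rather than raising, lets a caller log the gap and continue when scanning many files.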


Principles

CatholicCorpus is guided by the alignment framework articulated at the Builders AI Forum (BAIF 2024 and 2025), the Rome Call for AI Ethics, and the Church's social teaching tradition. In practice, this means:

Serve the human person, do not substitute for her. The corpus is infrastructure for AI tools that augment human understanding — seminary study aids, scholarly research assistants, catechetical resources — not systems that replace the priest, the teacher, or the human encounter. As Antiqua et Nova §39 states, "full moral causality belongs only to personal agents, not artificial ones."

Ground generation in verifiable truth. Retrieval-augmented generation (RAG) with citations and source accountability is the Catholic alternative to ungrounded AI that hallucinates or distorts Church teaching. CatholicCorpus provides the knowledge base; the provenance sidecars provide the audit trail.
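The idea of pairing retrieval with an audit trail can be sketched in a few lines of Python. This is a toy illustration, not the project's pipeline: the Passage type, its source and license fields, and the word-overlap scoring are all hypothetical stand-ins, and a real system would more likely use embedding search over the corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    text: str
    source: str   # hypothetical citation string, e.g. taken from a sidecar
    license: str  # hypothetical license label


def tokenize(text: str) -> set[str]:
    """Lowercase and split on whitespace, stripping surrounding punctuation."""
    return {w.strip(".,;:!?()\"'").lower() for w in text.split()} - {""}


def retrieve(query: str, passages: list[Passage], k: int = 2) -> list[Passage]:
    """Rank passages by the number of query words they share; keep the top k.

    Word overlap stands in for the embedding search a real system would use.
    Passages with no overlap are dropped; ties keep their original order.
    """
    q = tokenize(query)
    scored = [(len(q & tokenize(p.text)), p) for p in passages]
    scored = [(s, p) for s, p in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]


def build_prompt(query: str, hits: list[Passage]) -> str:
    """Number each retrieved passage so an answer can cite it as [1], [2], ..."""
    lines = ["Answer using only the numbered sources and cite them by number.", ""]
    for i, p in enumerate(hits, start=1):
        lines.append(f"[{i}] {p.text} (Source: {p.source}; License: {p.license})")
    lines += ["", f"Question: {query}"]
    return "\n".join(lines)
```

Carrying the source and license alongside each passage means every numbered citation in a generated answer can be traced back to a provenance record.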

Digitize the tradition as an act of stewardship. Over 120 years of Magisterial teaching, along with 2,000 years of patristic, scholastic, and devotional writing, belong to the whole Church. Making these texts machine-readable and openly accessible is not a technical convenience — it is a service to the global Church, including communities that have been excluded from digitization efforts centered on the English-speaking world.

Build open, resist mission drift. CatholicCorpus is open-access and forkable. Economic and structural decisions are guided by stewardship, not extraction, in the spirit of related initiatives such as the Catholic Digital Commons Foundation, which is building open-source tools for the Catholic digital community.

Operate by subsidiarity. Antiqua et Nova §42 and §110 name subsidiarity as the governance principle for AI: appropriate responses at each level of society. CatholicCorpus is open-access and forkable so that parishes, dioceses, schools, and developers can extend it without permission from a central authority.


Licensing

The corpus uses a three-tier licensing model:

Tier 1: Public Domain (the vast majority of the corpus). This includes all pre-1928 U.S. publications and works by authors who died more than 70 years ago. The Patrologia Latina (Migne, 1844–1865), the Patrologia Graeca, the Ante-Nicene and Nicene Fathers, the Catholic Encyclopedia (1913), the Aquinas Opera Omnia, and most other collections fall into this tier. These texts are free to use for any purpose without restriction.

Tier 2: Creative Commons Licensed. A small number of texts carry Creative Commons licenses. The SBL Greek New Testament is CC BY 4.0 (with an additional EULA for commercial use exceeding 25% of a work). The Rahlfs Septuagint (LXX) is CC BY-NC-SA 4.0. Attribution requirements are documented in each file's _source.json sidecar.

Tier 3: Copyrighted / Referenced Only. Task 17, the Modern Magisterium collection, includes download scripts for Vatican II documents, the Catechism of the Catholic Church, the 1983 Code of Canon Law, and modern papal encyclicals. Because this content is under copyright, the scripts are provided but the downloaded content is not redistributed with the corpus. Users who run the scripts accept responsibility for their own use.
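A downstream tool might map license labels onto these three tiers before deciding how a text may be used. The sketch below assumes labels such as "Public Domain" or "CC BY-NC-SA 4.0"; the label format, the Tier names, and both helper functions are hypothetical, and the check is deliberately conservative rather than legal advice.

```python
from enum import Enum


class Tier(Enum):
    PUBLIC_DOMAIN = 1      # Tier 1: free to use without restriction
    CREATIVE_COMMONS = 2   # Tier 2: attribution and other CC terms apply
    REFERENCED_ONLY = 3    # Tier 3: copyrighted, fetched by script, not redistributed


def classify(license_label: str) -> Tier:
    """Map a license label to a tier.

    The label format is an assumption. Unrecognized labels fall into the
    most restrictive tier so that nothing is treated as free by accident.
    """
    label = license_label.strip().lower()
    if label == "public domain":
        return Tier.PUBLIC_DOMAIN
    if label.startswith("cc "):
        return Tier.CREATIVE_COMMONS
    return Tier.REFERENCED_ONLY


def allows_commercial_use(license_label: str) -> bool:
    """Conservative check: True only for Tier 1 and CC labels without an NC clause.

    Extra terms, such as the SBLGNT EULA for larger commercial excerpts,
    are not modeled here and still need manual review.
    """
    tier = classify(license_label)
    if tier is Tier.PUBLIC_DOMAIN:
        return True
    if tier is Tier.CREATIVE_COMMONS:
        return "-nc" not in license_label.lower()
    return False
```

Defaulting unknown labels to the most restrictive tier trades convenience for safety: a missing or misspelled license never unlocks unrestricted use.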


What's Included

The corpus is organized into 16 collections:

Git Repositories (Aquinas, CSEL, Bible Texts)

4,668 content files · 3.4 GB

Sources: Aquinas Opera Omnia — Complete Works (Bilingual Latin-English); Rahlfs Septuagint (1935) — Digital Edition; Robinson-Pierpont Byzantine Majority Text (2018); CSEL — Corpus Scriptorum Ecclesiasticorum Latinorum; BibleNLP / eBible Corpus. Primary languages: Latin, English.

Corpus Corporum (Latin Patristics & Medieval)

29,820 content files · 4.2 GB

Sources: Corpus Corporum — Platform Software (GitHub); Corpus Corporum — Repositorium operum Latinorum apud universitatem Turicensem. Primary language: N/A (software).

Project Gutenberg (Catholic Classics)

18 content files · 9.0 MB

Direct Downloads (Bibles, Catechisms, Canon Law)

1,244 content files · 42.1 MB

Sources: Brenton English Septuagint (1851); 1917 Pio-Benedictine Code of Canon Law (Latin); Clementine Vulgate (Tweedale Transcription); Council of Trent — Canons and Decrees (Waterworth Translation, 1848); Roman Catechism — McHugh & Callan Translation (1923); SBL Greek New Testament (SBLGNT); Textus Receptus — Greek New Testament; True Devotion to Mary — St. Louis de Montfort; Westcott-Hort Greek New Testament (1881). Primary language: English.

CCEL (Church Fathers & Devotional)

61 content files · 691.2 MB

Sources: Ante-Nicene Fathers (ANF), 10 Volumes; Catholic Devotional Classics Collection (from CCEL); Nicene and Post-Nicene Fathers, Series I (14 Volumes); Nicene and Post-Nicene Fathers, Series II (14 Volumes); Summa Theologiae — English Translation (Dominican Province). Primary language: English.

Catholic Encyclopedia (1913)

18 content files · 1.0 GB

Sources: The Catholic Encyclopedia (1913 Edition, 15 Volumes). Primary language: English.

Latin Scholastics & Franciscan Texts

30 content files · 1.1 GB

Sources: Anselm of Canterbury (Latin); Aquinas Summa Contra Gentiles (English Translation); Aquinas Summa Theologiae (Latin); Bonaventure — Works (Latin/English); Church Fathers — Latin Texts (Augustine, Jerome, Ambrose, Gregory, Bernard); Peter Lombard — Sentences (Latin). Primary language: Latin.

Liturgical Texts, Hymns & Encyclicals

31,141 content files · 202.4 MB

Sources: Divinum Officium — Traditional Latin Divine Office; Latin and English Hymns (Public Domain); Pre-1928 Papal Encyclicals (English Translations). Primary language: Latin, with English translations.

Archive.org (Hagiography, Councils, Albert)

65 content files · 2.1 GB

Sources: Albert the Great — Philosophical and Theological Works; Catholic Catechisms — Catechism of Pius X (1908) and Penny Catechism; Ecumenical Councils (Latin) — Vatican I and Council of Trent; Butler's Lives of the Saints (1756-1759); The Golden Legend (Jacobus de Voragine, Ellis translation 1900). Primary language: Latin.

Catholic Bibles (Douay-Rheims, Knox, etc.)

12 content files · 673.3 MB

English Catholic Thinkers (Newman, More, etc.)

45 content files · 841.2 MB

Catholic Mystics (Teresa, John of the Cross, etc.)

13 content files · 182.2 MB

Liturgical Expansion (Missals, Breviaries)

16 content files · 2.2 GB

Late Scholastics (Suárez, Bellarmine, etc.)

42 content files · 3.9 GB

Patrologia Graeca (Migne, 161 volumes)

471 content files · 15.5 GB

Modern Magisterium (Vatican II, CCC, Encyclicals)

108 content files · 7.4 MB


Known Gaps

The corpus is comprehensive but not exhaustive. Known gaps include:

Patrologia Graeca volumes 16 and 86 are missing from the archive.org holdings and could not be downloaded. The remaining 159 of 161 volumes are present.

23 lending-restricted items on archive.org were identified but could not be bulk-downloaded due to the Internet Archive's controlled digital lending program. These are documented in LENDING_RESTRICTED.md at the corpus root.

No Task 10. Task 10 was originally planned for a collection that was incorporated into other tasks during development. The numbering gap (09 to 11) is intentional.

Modern Magisterium (Task 17) is script-only. The downloaded content is copyrighted and cannot be redistributed.

Text extraction quality varies by source format. The extracted plain text in catholiccorpus-text is machine-generated from the raw source files. XML and HTML sources produce clean text; PDF sources (especially older archive.org scans) may contain OCR artifacts. The v1.2 OCR Enhancement milestone will address the lowest-quality extractions.
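Users who want to triage extracted text before the OCR milestone lands could apply a rough heuristic like the one sketched below. It is not the project's method: the function names, the 80% per-token rule, and the 0.7 threshold are illustrative assumptions that would need tuning against real files.

```python
PUNCT = ".,;:!?()[]\"'"  # common sentence punctuation to ignore at token edges


def clean_token_ratio(text: str) -> float:
    """Return the share of whitespace-separated tokens made up mostly of letters.

    A token counts as clean when, after stripping edge punctuation, at least
    80% of its characters are alphabetic. str.isalpha accepts non-ASCII
    letters such as precomposed accented Latin and Greek characters.
    Empty text scores 0.0.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    clean = 0
    for tok in tokens:
        core = tok.strip(PUNCT)
        if core and sum(ch.isalpha() for ch in core) / len(core) >= 0.8:
            clean += 1
    return clean / len(tokens)


def looks_noisy(text: str, threshold: float = 0.7) -> bool:
    """Flag text whose clean-token ratio falls below an illustrative threshold."""
    return clean_token_ratio(text) < threshold
```

A check like this only ranks files for review; passages dense with numerals or references, such as tables or verse citations, may score low without being OCR damage.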


Attribution

CatholicCorpus was assembled from the work of many digitization projects. We are especially grateful to:

The Internet Archive (archive.org) for hosting millions of public-domain texts and providing open programmatic access.

Corpus Corporum (mlat.uzh.ch), a project of Prof. Philipp Roelli at the University of Zurich, for providing structured TEI XML editions of the Patrologia Latina and other Latin collections.

The Christian Classics Ethereal Library (CCEL) at Calvin University, founded by Harry Plantinga, for making the Ante-Nicene Fathers, Nicene and Post-Nicene Fathers, and other patristic and devotional works freely available in multiple digital formats.

Project Gutenberg for its foundational work in open-access digital publishing.

eBible.org for providing Bible translations in machine-readable formats.

The Franciscan Archive (franciscan-archive.org) for comprehensive digital editions of Bonaventure and other Franciscan authors.

The Society of Biblical Literature for making the SBL Greek New Testament available under a Creative Commons license.


Contact

For questions, corrections, or contributions, write to admin@catholiccorpus.org.