Tech & Innovation - December 11, 2024

Harvard University Releases Public-Domain Book Dataset fo...

Harvard University has announced the release of a high-quality dataset of nearly one million public-domain books. The dataset, created by Harvard's newly formed Institutional Data Initiative with funding from Microsoft and OpenAI, is available for anyone to train large language models and other AI tools.

Democratizing AI Development

The Institutional Data Initiative's database spans genres, decades, and languages. Greg Leppert, executive director of the Institutional Data Initiative, says the project is an attempt to level the playing field by giving the general public, including small players in the AI industry and individual researchers, access to the sort of highly refined, curated content repositories that normally only established tech giants have the resources to assemble.

Future Implications for AI Training

The new public-domain database could be used in conjunction with other licensed materials to build artificial intelligence models. As lawsuits over the use of copyrighted data for training AI wind their way through the courts, the future of how AI tools are built hangs in the balance. Projects like the Harvard database are plowing forward under the assumption that there will always be an appetite for public-domain datasets.

Additional Projects and Future Collaborations

In addition to the trove of books, the Institutional Data Initiative is working with the Boston Public Library to scan millions of articles from different newspapers now in the public domain. The initiative is open to forming similar collaborations in the future. Other projects, startups, and initiatives promise to give companies access to substantial and high-quality AI training materials without the risk of running into copyright issues.

I think about it a bit like the way that Linux has become a foundational operating system for so much of the world. Companies would still need to use additional training data to differentiate their models from those of their competitors. - Greg Leppert, Executive Director of the Institutional Data Initiative