AI2 Releases Huge Open Language Dataset

AI2 releases Dolma, a huge open language dataset

AI2 has released Dolma, a dataset made up of over 8 million English-language documents, to the public. Researchers can use Dolma to train language models, study the data, and explore potential applications. AI2 hopes that the availability of this data will promote more transparent and responsible practices for building and using AI.

AI2 researchers argue that the dataset used to create a model should be as free to use and modify as the model itself. In a blog post, Luca Soldaini of AI2 explains the choice of sources and the processes used to make the dataset suitable for AI consumption. AI2 also states that a more comprehensive paper is in the works.

OpenAI and Meta, among other companies, release some of the crucial information about the datasets they use to build their language models. However, much of that information is kept private. This closed approach not only thwarts scrutiny and improvement in the field, but also raises suspicions that the data may have been obtained unethically or unlawfully, such as via pirated copies of numerous authors’ books.

Authors throughout the world have signed a letter calling on AI creators to stop taking books without permission. They emphasize the necessity of obtaining proper authorization before using any copyrighted material. They also demand more transparency surrounding the criteria used to determine the quality of text and accuracy of data. Lastly, they insist that any personal data be removed from the datasets.

[Image: Dolma dataset charts. Image Credit: AI2]

Companies such as OpenAI and Meta have the right to protect the secrets of their models’ training processes in the highly competitive AI landscape. However, this lack of transparency makes it harder for researchers outside the companies to study or replicate the datasets and models. In contrast, AI2’s Dolma discloses all its sources and processes, such as how and why it was limited to original English-language texts, in order to promote openness.

AI2 is also developing and optimizing a large open language model specifically for science.

Prospective users of Dolma must submit their contact information and intended use cases, disclose any Dolma-derivative creations, and distribute those derivatives under the same license. They must also agree not to apply Dolma to various prohibited areas, such as surveillance or disinformation.

Anyone concerned that their personal data may have inadvertently been included in the dataset can submit a removal request form on AI2’s website.

Dolma is available through Hugging Face.