When developing machine learning (ML) models with existing datasets, practitioners need to understand the data, interpret its structure, and decide which subsets to use as features. The wide range of data formats is a barrier to ML progress: datasets may contain text, structured data, images, audio, and video, to name a few examples. Even datasets covering the same subject matter follow no standard file layout or data format, which slows machine learning development.
Formats like schema.org and DCAT represent dataset metadata, but they were not designed with ML data in mind. ML datasets have specific requirements, such as combining and extracting data from structured and unstructured sources, including metadata for responsible data use, and describing ML usage characteristics such as training, test, and validation sets.
To address this, Google has introduced Croissant, a new metadata format for ML-ready datasets. Croissant comes with an open-source Python library for validating, consuming, and generating metadata, and an open-source visual editor for loading, inspecting, and creating dataset descriptions. The format describes and organizes a dataset more richly without changing how the data itself is represented. It extends schema.org, the most widely used standard for publishing structured data online.
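To make the idea of "metadata that describes the data without changing it" concrete, here is a minimal sketch of a Croissant-style JSON-LD description built with the Python standard library only. It does not use the official Croissant library. The key names (`distribution`, `recordSet`, `field`, `source`) follow the Croissant vocabulary as I understand it and should be checked against the specification; the dataset name, URL, and the `find_problems` helper are hypothetical.

```python
import json


def build_description(name, csv_url):
    """Return a trimmed-down, Croissant-style JSON-LD description as a dict.

    Illustrative only: this is not a complete or spec-validated Croissant file.
    """
    return {
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
        },
        "@type": "sc:Dataset",
        "name": name,
        # Resources: the files that hold the data, left in their own format.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": "reviews.csv",
                "contentUrl": csv_url,
                "encodingFormat": "text/csv",
            }
        ],
        # Structure: how columns of those files map to typed ML fields.
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "reviews",
                "field": [
                    {
                        "@type": "cr:Field",
                        "@id": "reviews/text",
                        "dataType": "sc:Text",
                        "source": {
                            "fileObject": {"@id": "reviews.csv"},
                            "extract": {"column": "text"},
                        },
                    },
                    {
                        "@type": "cr:Field",
                        "@id": "reviews/label",
                        "dataType": "sc:Integer",
                        "source": {
                            "fileObject": {"@id": "reviews.csv"},
                            "extract": {"column": "label"},
                        },
                    },
                ],
            }
        ],
    }


def find_problems(description):
    """Hypothetical sanity check: required keys exist and fields reference declared files."""
    problems = []
    for key in ("name", "distribution", "recordSet"):
        if not description.get(key):
            problems.append(f"missing '{key}'")
    declared = {f.get("@id") for f in description.get("distribution", [])}
    for record_set in description.get("recordSet", []):
        for field in record_set.get("field", []):
            ref = field.get("source", {}).get("fileObject", {}).get("@id")
            if ref not in declared:
                problems.append(f"{field.get('@id')}: unknown file '{ref}'")
    return problems


description = build_description("toy-reviews", "https://example.org/reviews.csv")
print(json.dumps(description, indent=2))
print(find_problems(description))
```

The CSV file is only referenced, never rewritten, which mirrors the point that the format adds a descriptive layer on top of existing data. A real workflow would rely on the official library's validation rather than a hand-written check like this one.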
Croissant places particular emphasis on Responsible AI: it includes an array of properties covering use cases such as data life cycle management, labeling, participatory data, ML safety and fairness, accountability, and more. Good metadata also helps users locate the right dataset and eases data cleaning, refinement, and analysis.
With Croissant, dataset authors can increase their datasets’ discoverability and usability with little additional effort. The Croissant editor's UI lets users inspect and modify the metadata, an essential step in publishing a dataset. Publishing the Croissant metadata on a dataset's website makes the dataset easier to discover and reuse. In addition, Croissant metadata is generated automatically when users upload their data to a Croissant-compatible repository.
Popular ML dataset repositories, such as Kaggle, Hugging Face, and OpenML, now support the Croissant format. Popular ML frameworks, including TensorFlow, PyTorch, and JAX, can load Croissant datasets more easily through the TensorFlow Datasets package.
The creators of Croissant encourage platforms that host datasets to make Croissant files available for download and to provide this information on their dataset web pages, making it easier for search engines to locate them. Toolmakers for data analysis and labeling, and others that assist users with ML datasets, should also consider adding support for Croissant datasets. These coordinated efforts, the creators believe, could lighten the load of data development and pave the way for improved ML research and development.
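One common way to expose schema.org-style metadata on a web page so that search engines can read it is to embed the JSON-LD in a `<script type="application/ld+json">` element. The sketch below shows that general technique with the standard library only; it is an assumption that a hosting platform would do it this way, and the `jsonld_script_tag` helper, dataset name, and URL are hypothetical.

```python
import json


def jsonld_script_tag(metadata):
    """Render a metadata dict as a JSON-LD <script> element for a dataset page."""
    payload = json.dumps(metadata, indent=2)
    # "\/" is a valid JSON escape for "/", so this keeps the JSON equivalent
    # while preventing a string value from closing the script element early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Made-up metadata for illustration; a real page would embed the full description.
metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "toy-reviews",
    "description": "A small, made-up dataset used only for illustration.",
    "url": "https://example.org/toy-reviews",
}

tag = jsonld_script_tag(metadata)
print(tag)
```

The resulting string can be placed in the page's HTML, while the same JSON-LD can also be offered as a separate downloadable file.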
More detail on the Croissant project is available in the published blog post. Credit for the research goes to the project's researchers.