Data scientists and engineers often struggle to collaborate on machine learning (ML) tasks because of concerns about data reproducibility and traceability. Source code is usually transparent about its origin and modification history, but it is often hard to ascertain the exact provenance of the data used to train ML models and the transformations applied to it.
To tackle this issue, Amazon's AI researchers and engineers developed DATALORE, a machine learning system that automatically identifies and generates the data transformations linking tables in a shared data repository. The system employs a generative strategy, using large language models (LLMs) trained on billions of lines of code to reduce semantic ambiguity and manual work.
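To make the generative strategy concrete, the sketch below shows one way a system might assemble a prompt asking an LLM to propose the transformation between two table versions. The prompt format and function name are assumptions for illustration, not DATALORE's actual interface.

```python
# Hypothetical prompt builder: the format is an assumption, not
# DATALORE's real interface. The idea is to hand the model both
# schemas plus a few target rows and ask for reproducing code.
def build_transform_prompt(source_cols, target_cols, sample_rows):
    """Assemble a plain-text prompt describing the transformation task."""
    lines = [
        "Given a source table with columns:",
        ", ".join(source_cols),
        "and a target table with columns:",
        ", ".join(target_cols),
        "Sample target rows:",
    ]
    lines += [str(row) for row in sample_rows]
    lines.append("Write a pandas snippet that transforms source into target.")
    return "\n".join(lines)
```

The returned string would then be sent to a code-trained LLM; the model's reply is the candidate transformation to be validated against the actual target table.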
The method begins by searching for candidate tables related to a given base table using data discovery algorithms. This step narrows the space of possible data transformations and improves the system's efficiency. To refine the result, DATALORE then follows the Minimum Description Length (MDL) principle, minimizing the number of related tables used and thereby avoiding an intractable search space.
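The two steps above can be sketched with simple heuristics: schema overlap (Jaccard similarity on column names) as a stand-in for data discovery, and a greedy set cover as an MDL-flavored preference for explanations that use the fewest tables. Both function names and the scoring are illustrative assumptions, not DATALORE's actual algorithms.

```python
# Illustrative stand-ins for DATALORE's discovery and MDL steps,
# not the paper's actual algorithms.

def jaccard(a, b):
    """Jaccard similarity between two column-name sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def discover_candidates(base_cols, repo, threshold=0.2):
    """Keep repository tables whose schema overlaps the base table's."""
    return {name: cols for name, cols in repo.items()
            if jaccard(base_cols, cols) >= threshold}

def minimal_cover(target_cols, candidates):
    """Greedy set cover: prefer explanations using the fewest tables,
    in the spirit of Minimum Description Length."""
    remaining, chosen = set(target_cols), []
    while remaining:
        name, cols = max(candidates.items(),
                         key=lambda kv: len(remaining & set(kv[1])))
        gained = remaining & set(cols)
        if not gained:
            break  # no candidate explains the leftover columns
        chosen.append(name)
        remaining -= gained
    return chosen
```

In a real system the similarity would be computed over column values as well as names, but the shape of the pipeline (discover, then minimize) is the same.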
Users can employ DATALORE through cloud computing platforms such as Amazon Web Services, Microsoft Azure, and Google Cloud for data governance, data integration, and ML services. However, determining the most suitable tables or datasets for a search query and verifying their validity can be time-consuming. DATALORE addresses this by improving search results, refining table selection, and reducing users' code-writing burden.
Its performance compares favorably with Explain-Da-V (EDV), a method that generates data transformations to explain the changes between two given dataset versions. DATALORE outperforms EDV in handling numeric, textual, and categorical data, as well as transformations involving a join. It also excels at text-to-text and text-to-numeric transformations, though it leaves room for improvement on numeric-to-numeric and numeric-to-categorical transformations.
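The kind of transformation being compared here is easy to picture with a toy example: a derived table that differs from its base by a join plus a text-to-text change. The tables and column names below are invented for illustration; recovering exactly this pair of operations from the before/after tables is the task both systems tackle.

```python
# Toy example of a transformation DATALORE would aim to recover:
# a join with a second table followed by a text-to-text change.
# All table and column names here are made up for illustration.
import pandas as pd

base = pd.DataFrame({"user": ["ann", "bob"], "region_id": [1, 2]})
regions = pd.DataFrame({"region_id": [1, 2], "region": ["EU", "US"]})

# Candidate explanation: join on region_id, then lowercase the text column.
derived = base.merge(regions, on="region_id")
derived["region"] = derived["region"].str.lower()
```

Given only `base`, `regions`, and the final `derived` table, the system's job is to synthesize the two lines of transformation code above.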
In conclusion, Amazon's DATALORE is an innovative ML system that aims to enhance data traceability and reproducibility, streamlining collaboration between data scientists and engineers. By minimizing manual effort, improving data search results, and reducing users' code-writing burden, it provides a valuable tool for the data science community.