Detecting personally identifiable information (PII) in documents can be a complex task due to numerous regulations like the EU’s GDPR and multiple U.S. data protection laws. A flexible approach is needed given the variations in data formats and domain-specific requirements. In response, Gretel has developed a synthetic dataset to help with PII detection.
Gretel’s Navigator tool lets developers create synthetic datasets tailored to their specific needs, at lower cost and in less time than traditional manual labeling. One such dataset released by Gretel is the multilingual Financial Document Dataset.
The dataset contains 55,940 records, split into 50,776 training samples and 5,164 test samples. It covers 100 distinct financial document formats, with 20 subtypes specific to each format. It includes 29 distinct types of synthetic PII, aligned with the generators of the Python Faker library so that detected values are easy to replace. Supported languages include English, Spanish, Swedish, German, Italian, Dutch, and French. Data quality is checked with an LLM-as-a-Judge technique, in which the Mistral-7B language model evaluates records for conformance, quality, toxicity, bias, and groundedness.
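To illustrate why aligning PII types with generator functions makes replacement straightforward, here is a minimal sketch in plain Python. The span layout (start, end, label), the label names, and the helper names `find_span` and `replace_pii` are assumptions made for illustration and are not taken from the dataset's actual schema. The fixed-value generators are stand-ins; with Faker installed, callables such as `Faker().name` and `Faker().email` could be passed instead.

```python
from typing import Callable, Dict, List, Tuple

# (start, end, label) character offsets. This layout is an assumption,
# not the dataset's documented schema.
Span = Tuple[int, int, str]


def find_span(text: str, value: str, label: str) -> Span:
    """Locate the first occurrence of value in text and label it."""
    start = text.index(value)
    return (start, start + len(value), label)


def replace_pii(
    text: str,
    spans: List[Span],
    generators: Dict[str, Callable[[], str]],
) -> str:
    """Replace each labeled span with a surrogate from its label's generator."""
    out = text
    # Work right to left so earlier offsets stay valid after each substitution.
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        make_value = generators.get(label)
        if make_value is None:
            continue  # leave spans with unknown labels untouched
        out = out[:start] + make_value() + out[end:]
    return out


# Hypothetical stand-in generators that return fixed values.
generators = {
    "name": lambda: "Jane Doe",
    "email": lambda: "jane.doe@example.com",
}

text = "Invoice for John Smith, contact john@corp.test"
spans = [
    find_span(text, "John Smith", "name"),
    find_span(text, "john@corp.test", "email"),
]
redacted = replace_pii(text, spans, generators)
print(redacted)  # Invoice for Jane Doe, contact jane.doe@example.com
```

Because each PII label maps to exactly one generator, supporting a new PII type only requires adding one entry to the dictionary.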
The dataset can be used to train Named Entity Recognition (NER) models, evaluate PII scanning systems, assess the performance of de-identification systems, and develop data privacy solutions for the financial industry. Each generated record is evaluated by the LLM judge on the criteria listed above.
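As a rough illustration of how a labeled test split can be used to evaluate a PII scanner, the sketch below computes exact-match span precision, recall, and F1. The span layout (start, end, label), the function name `score_spans`, and the example spans are hypothetical; a real evaluation might also give credit for partial overlaps, which this sketch does not.

```python
from typing import Dict, Set, Tuple

# (start, end, label) character offsets. This layout is an assumption.
Span = Tuple[int, int, str]


def score_spans(gold: Set[Span], predicted: Set[Span]) -> Dict[str, float]:
    """Exact-match span scoring: a prediction counts only if offsets and label agree."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical annotations for one document.
gold = {(12, 22, "name"), (32, 46, "email"), (50, 60, "iban")}
# Hypothetical scanner output: two correct spans, one spurious, one missed.
predicted = {(12, 22, "name"), (32, 46, "email"), (70, 75, "date")}

scores = score_spans(gold, predicted)
print(scores)
```

Here two of the three predicted spans are correct and two of the three gold spans are found, so precision, recall, and F1 all come out to two thirds.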
Gretel’s release of this dataset underlines their commitment to promoting open data and encouraging collaboration within the Artificial Intelligence community. They aim to speed up the development of trustworthy AI systems by sharing high-quality and ethically sourced datasets.
In conclusion, Gretel’s synthetic financial document dataset is a useful step forward for PII detection. By providing a comprehensive, customizable data resource, it helps developers build more practical, domain-specific PII detection systems. Beyond addressing the technical challenges of PII detection, it also supports data privacy and compliance across industries, helping to keep sensitive data handling secure and responsible as AI continues to evolve.