Phishing is a technique malicious actors use to trick individuals into revealing sensitive data such as usernames, passwords, and credit card details. Attackers pose as a trustworthy entity over email, telephone, or text message. Traditional rule-based approaches to detecting email phishing struggle to keep up with emerging trends, which motivates the use of machine learning (ML) techniques.
Amazon Comprehend is a natural-language processing (NLP) service that uses ML to extract valuable insights and connections from text. Out of the box, the service can identify the language of a text and detect key phrases, places, people, brands, or events. It can also be customized: here it is used to train and host an ML model that classifies emails as potential phishing attempts.
When integrated with an email server, the phishing detector flags potential phishing emails and warns the recipient with a banner, though the email still lands in their inbox. The setup described here is well suited to experimentation. For commercial use, Amazon recommends building a training pipeline.
Creating a phishing detection model with Amazon Comprehend involves gathering and preparing the training data, loading it into an Amazon Simple Storage Service (Amazon S3) bucket, creating the Amazon Comprehend custom classification model and its corresponding endpoint, and then testing the model.
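The model-creation and endpoint steps can be sketched with the AWS SDK for Python (boto3). This is a minimal sketch under stated assumptions, not the post's own code: the function names, the endpoint naming scheme, the choice of English as the language, and the polling interval are all illustrative. The `comprehend` argument is expected to be a `boto3.client("comprehend")` object, and the training file is assumed to be in S3 already.

```python
import time


def classifier_request(name, role_arn, train_s3_uri):
    """Build the arguments for Comprehend's CreateDocumentClassifier call."""
    return {
        "DocumentClassifierName": name,
        "DataAccessRoleArn": role_arn,  # IAM role Comprehend assumes to read S3
        "InputDataConfig": {"S3Uri": train_s3_uri},
        "LanguageCode": "en",           # assumption: the emails are in English
        "Mode": "MULTI_CLASS",          # single-label mode: one class per email
    }


def train_and_deploy(comprehend, name, role_arn, train_s3_uri, poll_seconds=60):
    """Train a custom classifier, wait for it, then create a real-time endpoint."""
    model_arn = comprehend.create_document_classifier(
        **classifier_request(name, role_arn, train_s3_uri)
    )["DocumentClassifierArn"]

    # Training is asynchronous, so poll until the model reaches a final state.
    while True:
        status = comprehend.describe_document_classifier(
            DocumentClassifierArn=model_arn
        )["DocumentClassifierProperties"]["Status"]
        if status in ("TRAINED", "TRAINED_WITH_WARNING"):
            break
        if status in ("IN_ERROR", "STOPPED", "DELETING"):
            raise RuntimeError(f"training ended with status {status}")
        time.sleep(poll_seconds)

    # Hypothetical naming scheme: the endpoint is named after the classifier.
    return comprehend.create_endpoint(
        EndpointName=f"{name}-endpoint",
        ModelArn=model_arn,
        DesiredInferenceUnits=1,
    )["EndpointArn"]
```

The upload step that precedes this is a single call such as `boto3.client("s3").upload_file("train.csv", bucket, key)`, where the file, bucket, and key names are placeholders. Passing the client in as an argument keeps the functions easy to exercise with a stub before any AWS resources are created.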
Training data should include both phishing and non-phishing emails, with at least 10 examples per class. The model is trained in either single-label mode or multi-label mode. For plain-text models, training data can be supplied as a CSV file or as an augmented manifest file created with Amazon SageMaker Ground Truth.
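As an illustration of the CSV option in single-label mode, the sketch below writes one row per email with the label in the first column and the document text in the second, with no header row. The label names `phishing` and `nonphishing`, the function name, and the whitespace-collapsing step are assumptions made for this example.

```python
import csv
import io
from collections import Counter

MIN_EXAMPLES_PER_CLASS = 10  # minimum per class stated in the text above


def build_training_csv(examples):
    """Return CSV text with one `label,document` row per email and no header."""
    counts = Counter(label for label, _ in examples)
    too_few = {label: n for label, n in counts.items() if n < MIN_EXAMPLES_PER_CLASS}
    if too_few:
        raise ValueError(f"classes below {MIN_EXAMPLES_PER_CLASS} examples: {too_few}")

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for label, text in examples:
        # Collapse line breaks and repeated spaces so each email stays on one row.
        writer.writerow([label, " ".join(text.split())])
    return buffer.getvalue()
```

Calling `build_training_csv` with a list of `(label, text)` tuples returns a string that can be written to a file and uploaded to S3. The function raises `ValueError` when any class has fewer than 10 examples, which catches the problem locally rather than in a failed training job.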
As part of model creation, Amazon Comprehend tests the trained model. If no test dataset is provided, Amazon Comprehend holds back 10% of the training data for testing. If a test dataset is provided, it must contain at least one sample for each unique label in the training data.
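The label-coverage requirement for a user-supplied test dataset is simple to check before upload. The helper below is a hypothetical pre-flight check written for this example, not part of Amazon Comprehend.

```python
def missing_test_labels(train_labels, test_labels):
    """Return the training labels that have no sample in the test set, sorted."""
    return sorted(set(train_labels) - set(test_labels))


# A test set containing only phishing emails fails the requirement.
print(missing_test_labels(["phishing", "nonphishing", "phishing"], ["phishing"]))
# prints ['nonphishing']
```

An empty list means every label seen in training has at least one test sample.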
The phishing detector can be integrated into real-world applications through an Amazon API Gateway REST API with AWS Lambda integration.
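A Lambda function behind such an API might look like the following sketch. It assumes a Lambda proxy integration whose request body is JSON of the form `{"text": "..."}`, and an environment variable named `ENDPOINT_ARN` holding the endpoint's ARN. Both are choices made for this example, as is the optional `comprehend` parameter, which exists only so a stub client can be injected.

```python
import json
import os


def classify_email(comprehend, endpoint_arn, text):
    """Return (label, score) for the highest-scoring class from the endpoint."""
    response = comprehend.classify_document(Text=text, EndpointArn=endpoint_arn)
    top = max(response["Classes"], key=lambda c: c["Score"])
    return top["Name"], top["Score"]


def lambda_handler(event, context, comprehend=None):
    """Handle an API Gateway proxy request whose JSON body is {"text": "..."}."""
    if comprehend is None:
        import boto3  # imported lazily so the module loads without the AWS SDK
        comprehend = boto3.client("comprehend")

    body = json.loads(event.get("body") or "{}")
    text = body.get("text", "")
    if not text:
        return {"statusCode": 400, "body": json.dumps({"error": "missing 'text'"})}

    # ENDPOINT_ARN is a hypothetical environment variable set on the function.
    label, score = classify_email(comprehend, os.environ["ENDPOINT_ARN"], text)
    return {"statusCode": 200, "body": json.dumps({"label": label, "score": score})}
```

The response carries the top label and its confidence score, leaving the decision about when to show a warning banner to the caller. A model trained in multi-label mode returns its results under `Labels` rather than `Classes`, so `classify_email` would need adjusting in that case.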
Finally, the post advises deleting the endpoint once it is no longer needed, to avoid incurring additional costs. The Amazon Comprehend Developer Guide, the GitHub repository, and other Amazon Comprehend resources are available for those who want to learn more about the service.