Spam email classification using convolutional neural networks with Hessian-Free optimisation

In this dissertation we attempt to solve the email classification problem with a novel method that combines a second-order optimisation method with a Convolutional Neural Network (CNN). To the best of our knowledge, no other method in the literature uses Hessian-Free Optimisation (HFO) with a CNN to solve the email classification problem. We use a CNN with HFO to distinguish between spam emails and ham (legitimate) emails. Word embedding is applied to the data to convert it to a numerical form that the neural network model can process. The word embedding we use is Word2Vec, and we achieve very satisfactory results. Furthermore, we use five-fold cross-validation to verify the model’s accuracy, on six different datasets in total. We compare the model with other authors’ work and with a classic CNN trained with Gradient Descent (GD), which we also implement in this dissertation. We measure the efficacy of each model by calculating the Accuracy and the Spam and Ham Recall. Accuracy is used only in the comparison with the CNN with GD, since the aforementioned authors only provided Ham and Spam Recall measurements. We use the entire dataset for training when we compare this model with other authors’ work.
For the first dataset, we achieve an accuracy of 99.199%, with 97.39% Spam Recall and 99.94% Ham Recall. For the second dataset, we achieve an accuracy of 99.227%, with 96.98% Spam Recall and 100% Ham Recall. For the third dataset, the accuracy was 99.848%, with 99.59% Spam Recall and 99.94% Ham Recall. For the fourth dataset, the accuracy was 99.333%, with 99.58% Spam Recall and 98.59% Ham Recall. For the fifth dataset, the accuracy was 99.061%, with 98.69% Spam Recall and a perfect score (100%) for Ham Recall. Finally, for the sixth dataset, the accuracy was 98.997%, with 98.93% Spam Recall and 99.19% Ham Recall.
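The three measures reported here can be illustrated with a minimal sketch. It assumes the usual definitions (Accuracy as the fraction of all emails classified correctly, Spam Recall as the fraction of spam emails recognised as spam, Ham Recall as the fraction of ham emails recognised as ham). The function name, the label encoding (1 for spam, 0 for ham) and the toy data are hypothetical and are not taken from the dissertation.

```python
import math


def accuracy_and_recalls(y_true, y_pred, spam_label=1, ham_label=0):
    """Return (accuracy, spam_recall, ham_recall) as fractions in [0, 1].

    The label encoding is an assumption made for this sketch.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    pairs = list(zip(y_true, y_pred))
    correct = sum(1 for t, p in pairs if t == p)
    spam_total = sum(1 for t, _ in pairs if t == spam_label)
    ham_total = sum(1 for t, _ in pairs if t == ham_label)
    spam_hits = sum(1 for t, p in pairs if t == spam_label and p == spam_label)
    ham_hits = sum(1 for t, p in pairs if t == ham_label and p == ham_label)

    accuracy = correct / len(pairs)
    # Recall is undefined when a class is absent; report NaN in that case.
    spam_recall = spam_hits / spam_total if spam_total else math.nan
    ham_recall = ham_hits / ham_total if ham_total else math.nan
    return accuracy, spam_recall, ham_recall


# Toy example: 3 spam and 5 ham emails, with one spam email missed.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]
acc, spam_rec, ham_rec = accuracy_and_recalls(y_true, y_pred)
print(f"accuracy={acc:.3f} spam_recall={spam_rec:.3f} ham_recall={ham_rec:.3f}")
```

Reporting the two recalls separately matters because a classifier that labels every email as ham would still score 100% Ham Recall while its Spam Recall is 0%.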
Lastly, we performed cross-validation, and the average validation accuracy for each dataset was: Dataset 1 99.078%, Dataset 2 99.158%, Dataset 3 99.772%, Dataset 4 99.240%, Dataset 5 98.762% and Dataset 6 98.846%. The average Spam and Ham Recall for each dataset was similar to the values reported above, and we achieved two perfect Ham Recall scores, in the second and the fifth dataset. All other Spam and Ham Recalls from our implementation were between 96.72% and 99.88%. We also applied cross-validation to the CNN with Gradient Descent, but the highest accuracy achieved was 76%, in Dataset 4, and the lowest was in Dataset 5. For that model, the first three datasets have 0% Spam Recall and the last three datasets have 0% Ham Recall. The CNN with Hessian-Free Optimisation not only has better accuracy and Ham/Spam Recall in every dataset, it also converges faster than the Gradient Descent model. We measured this with our best model: the HFO model converges 2.5 times faster than the Gradient Descent model on the same dataset.
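As a rough illustration of how an average validation score over five folds can be obtained, the sketch below splits sample indices into five disjoint folds and averages a per-fold score. The helper names, the shuffling seed and the `evaluate_fold` callback (which would train a model on the training indices and return a validation score) are hypothetical placeholders, not the dissertation's implementation.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding gives k disjoint folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def mean_validation_score(n_samples, evaluate_fold, k=5, seed=0):
    """Average the score returned by evaluate_fold(train, validation)."""
    scores = [
        evaluate_fold(train, validation)
        for train, validation in k_fold_indices(n_samples, k, seed)
    ]
    return sum(scores) / len(scores)


# Placeholder scorer: returns the validation share instead of training a model.
print(mean_validation_score(10, lambda train, val: len(val) / 10))
```

Each sample appears in exactly one validation fold, so every email contributes to the averaged score once.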

URI

http://gnosis.library.ucy.ac.cy/handle/7/65119

Collections

Τμήμα Πληροφορικής / Department of Computer Science [110]

The following license files are associated with this item:

Creative Commons

Except where otherwise noted, this item's license is described as CC0 1.0 Universal