Enhancing Protein Allergenicity Prediction Using CNNs: A Study Of Encoding Methods
Date: 2024-12-20
Author: Saroglaki, Christina Ioanna
Publisher: Πανεπιστήμιο Κύπρου, Σχολή Θετικών και Εφαρμοσμένων Επιστημών / University of Cyprus, Faculty of Pure and Applied Sciences
Place of publication: Cyprus
Abstract
Over the last decades, the prevalence of allergies has risen, affecting millions of people worldwide, especially in developed countries [1-3]. As a result, accurate allergen identification has become increasingly important, particularly in domains such as food safety, pharmaceuticals and biotechnology [4]. To address this need, extensive research has investigated the potential of artificial intelligence (AI) algorithms to predict allergens quickly and accurately. This thesis contributes to this growing body of knowledge by investigating the effectiveness of several amino acid encoding methods for predicting protein allergenicity with convolutional neural network (CNN) architectures. Using an imbalanced dataset of allergenic and non-allergenic proteins, this work evaluates three distinct encoding methods: integer-based, hydrophobicity-based and transformer-based encodings.
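To make the first two encodings concrete, the following is a minimal sketch of how a protein sequence might be turned into a fixed-length numeric vector. The alphabet order, the zero-padding scheme, the use of the Kyte-Doolittle scale and the function names `integer_encode` and `hydrophobicity_encode` are all assumptions made for illustration; the abstract does not specify which index mapping or hydrophobicity scale the thesis used.

```python
# Illustrative only: alphabet order, padding scheme and hydrophobicity scale
# are assumptions, not details taken from the thesis.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Integer encoding: each residue maps to 1..20; 0 is reserved for
# padding and for unknown residues.
INT_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}

# Kyte-Doolittle hydropathy values, one commonly used hydrophobicity scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def integer_encode(seq, max_len):
    """Map residues to integer codes, truncating or zero-padding to max_len."""
    codes = [INT_INDEX.get(aa, 0) for aa in seq.upper()[:max_len]]
    return codes + [0] * (max_len - len(codes))


def hydrophobicity_encode(seq, max_len):
    """Map residues to hydropathy values, truncating or zero-padding to max_len."""
    values = [KYTE_DOOLITTLE.get(aa, 0.0) for aa in seq.upper()[:max_len]]
    return values + [0.0] * (max_len - len(values))


if __name__ == "__main__":
    print(integer_encode("MKV", 5))         # [11, 9, 18, 0, 0]
    print(hydrophobicity_encode("MKV", 5))  # [1.9, -3.9, 4.2, 0.0, 0.0]
```

A transformer-based encoding would instead replace each residue with a learned embedding vector from a pretrained protein language model, which is not sketched here because the abstract does not name the model.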
The research process involved developing a CNN predictive model and optimizing its performance over several development phases. To mitigate the class imbalance, an inherent limitation stemming from the restricted knowledge about allergenic proteins, this thesis implemented Focal Loss; k-fold cross-validation was applied to evaluate the models’ performance more reliably. To further refine the models’ classification capabilities, hyperparameter tuning and dropout regularization were also integrated into the process.
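As a rough illustration of why Focal Loss helps with imbalanced classes, the sketch below implements the standard binary form, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), in plain Python. The function name `binary_focal_loss` and the default values `gamma=2.0` and `alpha=0.25` are assumptions chosen for illustration; the abstract does not report the settings or the framework used in the thesis.

```python
import math


def binary_focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over a batch.

    y_true: iterable of 0/1 labels.
    p_pred: iterable of predicted probabilities for class 1.
    gamma, alpha: assumed defaults, not values taken from the thesis.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip to avoid log(0).
        p = min(max(p, eps), 1.0 - eps)
        # p_t is the probability the model assigns to the true class.
        p_t = p if y == 1 else 1.0 - p
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # (1 - p_t)^gamma shrinks the loss of well-classified examples,
        # so hard examples dominate the gradient.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(y_true)


if __name__ == "__main__":
    easy = binary_focal_loss([1], [0.9])  # confident and correct
    hard = binary_focal_loss([1], [0.1])  # confident and wrong
    print(easy < hard)
```

With `gamma=0.0` and `alpha=0.5` the expression reduces to half the ordinary binary cross-entropy, which is a convenient sanity check for any implementation.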
The experimental results showed that the transformer-based encodings attained the highest predictive performance, underscoring their potential to advance allergenicity prediction. Additionally, the implementation of Focal Loss improved Matthews correlation coefficient (MCC) values, indicating a more balanced predictive performance across the two classes. However, despite the high performance of both the transformer-based and hydrophobicity-based models, this study also identified challenges regarding computational resource requirements and model interpretability.
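The MCC is a useful metric on imbalanced data because it uses all four cells of the confusion matrix: MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below is a plain-Python illustration with a hypothetical 10:90 class split; the function name `mcc` and the toy labels are assumptions, not data from the thesis.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical imbalanced labels: 10 positives, 90 negatives.
    y_true = [1] * 10 + [0] * 90
    always_negative = [0] * 100
    accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
    print(accuracy)                      # 0.9
    print(mcc(y_true, always_negative))  # 0.0
```

A classifier that always predicts the majority class reaches 90% accuracy here yet scores an MCC of 0, which is why an improvement in MCC is read as evidence of better balance between the two classes.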
These findings highlight the impact of amino acid encodings, as well as the model’s architecture, on the accuracy of allergenicity prediction. In conclusion, this research demonstrated the effectiveness of CNN models combined with various data encoding techniques for predicting allergenic proteins, contributing valuable insights to bioinformatics and supporting developments in food safety and personalized medicine.