Phishing Detection Using Machine Learning: A Model Development and Integration

Valentine Adeyemi Onih
Department of Cybersecurity, University of Hertfordshire, Hertfordshire, United Kingdom
DOI – http://doi.org/10.37502/IJSMR.2024.7403

Abstract

This study aimed to develop a robust machine learning-based phishing detection system using K-nearest neighbour (KNN), artificial neural network (ANN), and random forest (RF) algorithms. It used datasets from Ariyadasa et al. (2021) and UNB (2016) to learn the patterns that distinguish legitimate from phishing websites. A further objective was to integrate the best-performing model into a Django-based web application for real-time phishing detection. A comprehensive literature review of phishing detection techniques was also undertaken.

The chosen datasets were pre-processed to address missing values and class imbalance. Features were selected both manually and automatically, the latter using mutual information for classification. Three machine learning algorithms (RF, KNN, and ANN) were explored, and their hyper-parameters were optimised using GridSearchCV.
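The feature selection and tuning steps described above can be sketched as follows. This is a minimal illustration, not the study's code: the data are synthetic (generated with make_classification), and the number of selected features, the parameter grid, and the fold count are assumptions chosen only to keep the example short. Missing-value handling and class rebalancing are omitted.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the phishing datasets (assumption: NOT the study's data).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Automatic feature selection by mutual information, followed by a random forest.
# k=10 is an illustrative choice, not a value reported in the abstract.
pipeline = Pipeline([
    ("select", SelectKBest(partial(mutual_info_classif, random_state=0), k=10)),
    ("rf", RandomForestClassifier(random_state=0)),
])

# Hypothetical search grid; the study's actual grid is not given in the abstract.
param_grid = {
    "rf__n_estimators": [50, 100],
    "rf__max_depth": [None, 10],
}

# Selection runs inside each cross-validation fold, so held-out folds do not
# influence which features are kept.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

Placing the selector inside the pipeline is a design choice made for this sketch; the same search could be repeated with KNeighborsClassifier or a neural network in place of the random forest.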

RF achieved an accuracy of 99.78%, KNN 99.67%, and ANN 99.11%. The RF and KNN models identified every legitimate website correctly, while the ANN detected every phishing website. The RF model, having the highest accuracy, was integrated into a Django application that provides a user interface for real-time phishing detection.
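The relationship between overall accuracy and per-class detection in the results above can be illustrated with a small calculation. The confusion-matrix counts below are hypothetical and are not the study's figures; they only show how one model can classify every legitimate site correctly while another catches every phishing site at a slightly lower overall accuracy.

```python
def summarize(tn, fp, fn, tp):
    """Return accuracy and per-class recall, treating phishing as the positive class."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "legitimate_recall": tn / (tn + fp),  # share of legitimate sites passed
        "phishing_recall": tp / (tp + fn),    # share of phishing sites caught
    }

# Hypothetical model A: no legitimate site is flagged, but 3 phishing sites slip through.
model_a = summarize(tn=500, fp=0, fn=3, tp=497)

# Hypothetical model B: every phishing site is caught, but 8 legitimate sites are flagged.
model_b = summarize(tn=492, fp=8, fn=0, tp=500)

for name, stats in (("A", model_a), ("B", model_b)):
    print(name, {key: round(value, 4) for key, value in stats.items()})
```

Which error type matters more depends on the deployment: a missed phishing site and a blocked legitimate site carry different costs, which is why accuracy alone does not settle the choice of model.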

All models achieved high accuracy, demonstrating their efficacy in phishing detection. Although RF was integrated into the web application for this study, the choice between models depends on specific user or business requirements and priorities. Feedback mechanisms within the Django application offer a means of refining future recommendations. The study provides a foundational step toward enhancing web safety through effective phishing detection.

Keywords: RF, ANN, KNN, datasets, web, Django.

References

  • Domingues, M. Filippone, P. Michiardi, and J. Zouaoui, “A comparative evaluation of outlier detection algorithms: Experiments and analyses,” Pattern Recognition, vol. 74, pp. 406–421, Feb. 2018, doi: https://doi.org/10.1016/j.patcog.2017.09.037.
  • Vabalas, E. Gowen, E. Poliakoff, and A. J. Casson, “Machine learning algorithm validation with a limited sample size,” PLOS ONE, vol. 14, no. 11, p. e0224365, Nov. 2019, doi: https://doi.org/10.1371/journal.pone.0224365.
  • Kulkarni and Leonard Brown, “Phishing Websites Detection using Machine Learning,” Computer Science Faculty Publications and Presentations, vol. 20, Aug. 2019, Available: https://scholarworks.uttyler.edu/compsci_fac/20/
  • Flath and N. Stein, “Towards a data science toolbox for industrial analytics applications,” Computers in Industry, vol. 94, pp. 16–25, Jan. 2018, doi: https://doi.org/10.1016/j.compind.2017.09.003.
  • Thabtah, S. Hammoud, F. Kamalov, and A. Gonsalves, “Data imbalance in classification: Experimental evaluation,” Information Sciences, vol. 513, pp. 429–441, Mar. 2020, doi: https://doi.org/10.1016/j.ins.2019.11.004.
  • Namvar, M. Siami, F. Rabhi, and M. Naderpour, “Credit risk prediction in an imbalanced social lending environment,” arxiv.org, Apr. 2018, Available: https://arxiv.org/abs/1805.00801
  • Blagus and L. Lusa, “SMOTE for high-dimensional class-imbalanced data,” BMC Bioinformatics, vol. 14, no. 1, Mar. 2013, doi: https://doi.org/10.1186/1471-2105-14-106.
  • Brownlee, “Random Oversampling and Undersampling for Imbalanced Classification,” Machine Learning Mastery, Jan. 14, 2020. https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/
  • Cai, J. Luo, S. Wang, and S. Yang, “Feature selection in machine learning: A new perspective,” Neurocomputing, vol. 300, pp. 70–79, Jul. 2018, doi: https://doi.org/10.1016/j.neucom.2017.11.077.
  • Srilatha, A. Ajith, and T. Johnson P., “Feature deduction and ensemble design of intrusion detection systems,” Computers & Security, vol. 24, no. 4, pp. 295–307, Jun. 2005, doi: https://doi.org/10.1016/j.cose.2004.09.008.
  • Abiodun, A. Alabdulatif, O. I. Abiodun, M. Alawida, A. Alabdulatif, and R. S. Alkhawaldeh, “A systematic review of emerging feature selection optimization methods for optimal text classification: the present state and prospective opportunities,” Neural Computing and Applications, vol. 33, no. 22, pp. 15091–15118, Aug. 2021, doi: https://doi.org/10.1007/s00521-021-06406-8.
  • Django, “The Web framework for perfectionists with deadlines | Django,” Djangoproject.com, 2019. https://www.djangoproject.com/
  • S. J. Rigatti, “Random Forest,” Journal of Insurance Medicine, vol. 47, no. 1, pp. 31–39, Jan. 2017, doi: https://doi.org/10.17849/insm-47-01-31-39.1.
  • Biau and E. Scornet, “A random forest guided tour,” TEST, vol. 25, no. 2, pp. 197–227, Apr. 2016, doi: https://doi.org/10.1007/s11749-016-0481-7.
  • Scikit-Learn, “scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation,” Scikit-learn.org, 2019. https://scikit-learn.org/
  • Angione, E. S. Silverman, and E. Yaneske, “Using machine learning as a surrogate model for agent-based simulations,” PLOS ONE, vol. 17, no. 2, p. e0263150, Feb. 2022, doi: https://doi.org/10.1371/journal.pone.0263150.
  • Belete and M. D. Huchaiah, “Grid search in hyperparameter optimization of machine learning models for prediction of HIV/AIDS test results,” International Journal of Computers and Applications, pp. 1–12, Sep. 2021, doi: https://doi.org/10.1080/1206212x.2021.1974663.
  • Eko, B. Achmad, and B. Fitra, “Optimization of K Value in KNN Algorithm for Spam and Ham Email Classification | Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi),” www.jurnal.iaii.or.id, vol. 4, no. 2, Apr. 2020, Accessed: Jul. 24, 2023. [Online]. Available: http://www.jurnal.iaii.or.id/index.php/RESTI/article/view/1845
  • Guo, H. Wang, D. Bell, Y. Bi, and K. Greer, “KNN Model-Based Approach in Classification,” On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE, vol. 2888, pp. 986–996, 2003, doi: https://doi.org/10.1007/978-3-540-39964-3_62.
  • Scikit-Learn, “sklearn.neighbors.KNeighborsClassifier,” scikit-learn. http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html (accessed Jul. 24, 2023).
  • Zou, Y. Han, and S.-S. So, “Overview of Artificial Neural Networks,” Methods in Molecular Biology, vol. 458, pp. 14–22, 2008, doi: https://doi.org/10.1007/978-1-60327-101-1_2.
  • Mishra and M. Srivastava, “A view of Artificial Neural Network,” 2014 International Conference on Advances in Engineering & Technology Research (ICAETR – 2014), Aug. 2014, doi: https://doi.org/10.1109/icaetr.2014.7012785.
  • Keras, “Home – Keras Documentation,” Keras.io, 2019. https://keras.io/
  • Google, “TensorFlow,” TensorFlow, 2019. https://www.tensorflow.org/
  • Liu, J. Bernstein, M. Meister, and Y. Yue, “Learning by Turning: Neural Architecture Aware Optimisation,” proceedings.mlr.press, Jul. 01, 2021. http://proceedings.mlr.press/v139/liu21c.html
  • Zhou, A. H. Gandomi, F. Chen, and A. Holzinger, “Evaluating the Quality of Machine Learning Explanations: A Survey on Methods and Metrics,” Electronics, vol. 10, no. 5, p. 593, Mar. 2021, doi: https://doi.org/10.3390/electronics10050593.
  • Paturi, L. Swathi, K. Sai. Pavithra, R. Mounika, and Ch. Alekhya, “Detection of Phishing Attacks using Visual Similarity Model,” IEEE Xplore, May 01, 2022. https://ieeexplore.ieee.org/document/9793231 (accessed Jul. 22, 2023).
  • Cheng, F. Liu, and D. D. Yao, “Enterprise data breach: causes, challenges, prevention, and future directions,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 7, no. 5, p. e1211, 2017, doi: https://doi.org/10.1002/widm.1211.
  • Tang and Q. H. Mahmoud, “A Survey of Machine Learning-Based Solutions for Phishing Website Detection,” Machine Learning and Knowledge Extraction, vol. 3, no. 3, pp. 672–694, Aug. 2021, doi: https://doi.org/10.3390/make3030034.
  • Shewan, “10 Companies Using Machine Learning in Cool Ways,” Wordstream.com, 2017. https://www.wordstream.com/blog/ws/2017/07/28/machine-learning-applications
  • Hinneburg, C. C. Aggarwal, and D. A. Keim, “What is the nearest neighbor in high dimensional spaces?,” 2000. http://nbn-resolving.de/urn:nbn:de:bsz:352-opus-70224
  • Cui, “Introduction to the K-Means Clustering Algorithm Based on the Elbow Method,” Introduction to the k-means clustering algorithm based on the elbow method, 2020, doi: https://doi.org/10.23977/accaf.2020.010102.
  • UNB, “URL 2016 | Datasets | Research | Canadian Institute for Cybersecurity | UNB,” www.unb.ca, 2016. https://www.unb.ca/cic/datasets/url-2016.html
  • Subhash, S. Fernando, and S. Fernando, “Phishing websites dataset,” 2021.
  • Shahrivari, M. M. Darabi, and M. Izadi, “Phishing Detection Using Machine Learning Techniques,” arXiv:2009.11116 [cs, stat], Sep. 2020, Available: https://arxiv.org/abs/2009.11116
  • K. Sahingoz, E. Buber, O. Demir, and B. Diri, “Machine learning based phishing detection from URLs,” Expert Systems with Applications, vol. 117, pp. 345–357, Mar. 2019, doi: https://doi.org/10.1016/j.eswa.2018.09.029.
  • Jain and B. B. Gupta, “A machine learning based approach for phishing detection using hyperlinks information,” Journal of Ambient Intelligence and Humanized Computing, vol. 10, no. 5, pp. 2015–2028, Apr. 2018, doi: https://doi.org/10.1007/s12652-018-0798-z.
  • Gandotra and D. Gupta, “An Efficient Approach for Phishing Detection using Machine Learning,” Multimedia Security, pp. 239–253, 2021, doi: https://doi.org/10.1007/978-981-15-8711-5_12.
  • Basit, M. Zafar, X. Liu, A. R. Javed, Z. Jalil, and K. Kifayat, “A comprehensive survey of AI-enabled phishing attacks detection techniques,” Telecommunication Systems, vol. 76, no. 1, Oct. 2020, doi: https://doi.org/10.1007/s11235-020-00733-2.
  • ZScaler, “2023 Phishing Report Reveals 47.2% Surge in Phishing Attacks Last Year,” Zscaler, 2023. https://www.zscaler.com/blogs/security-research/2023-phishing-report-reveals-472-surge-phishing-attacks-last-year
  • Jain and B. B. Gupta, “PHISH-SAFE: URL Features-Based Phishing Detection System Using Machine Learning,” Advances in Intelligent Systems and Computing, pp. 467–474, 2018, doi: https://doi.org/10.1007/978-981-10-8536-9_44.
  • Safi and S. Singh, “A systematic literature review on phishing website detection techniques,” Journal of King Saud University – Computer and Information Sciences, Jan. 2023, doi: https://doi.org/10.1016/j.jksuci.2023.01.004.
  • L. Chiew, C. L. Tan, K. Wong, K. S. C. Yong, and W. K. Tiong, “A new hybrid ensemble feature selection framework for machine learning-based phishing detection system,” Information Sciences, vol. 484, pp. 153–166, May 2019, doi: https://doi.org/10.1016/j.ins.2019.01.064.
  • Atlam and O. Oluwatimilehin, “Business Email Compromise Phishing Detection Based on Machine Learning: A Systematic Literature Review,” Electronics, vol. 12, no. 1, p. 42, Dec. 2022, doi: https://doi.org/10.3390/electronics12010042.
  • -Y. Wu, C.-C. Kuo, and C.-S. Yang, “A Phishing Detection System based on Machine Learning,” IEEE Xplore, Aug. 01, 2019. https://ieeexplore.ieee.org/document/8858325 (accessed Mar. 17, 2022).
  • A. P. Delzell, S. Magnuson, T. Peter, M. Smith, and B. J. Smith, “Machine Learning and Feature Selection Methods for Disease Classification With Application to Lung Cancer Screening Image Data,” Frontiers in Oncology, vol. 9, Dec. 2019, doi: https://doi.org/10.3389/fonc.2019.01393.
  • Hannousse and S. Yahiouche, “Towards benchmark datasets for machine learning based website phishing detection: An experimental study,” Engineering Applications of Artificial Intelligence, vol. 104, p. 104347, Sep. 2021, doi: https://doi.org/10.1016/j.engappai.2021.104347.
  • M. Yadollahi, F. Shoeleh, E. Serkani, A. Madani, and H. Gharaee, “An Adaptive Machine Learning Based Approach for Phishing Detection Using Hybrid Features,” IEEE Xplore, Apr. 01, 2019. https://ieeexplore.ieee.org/document/8765265
  • Almseidin, M. Alkasassbeh, M. Alzubi, and J. Al-Sawwa, “Cyber-Phishing Website Detection Using Fuzzy Rule Interpolation,” Cryptography, vol. 6, no. 2, p. 24, May 2022, doi: https://doi.org/10.3390/cryptography6020024.
  • Khadidos, S. Shitharth, A. O. Khadidos, K. Sangeetha, and K. H. Alyoubi, “Healthcare Data Security Using IoT Sensors Based on Random Hashing Mechanism,” Journal of Sensors, vol. 2022, pp. 1–17, Jun. 2022, doi: https://doi.org/10.1155/2022/8457116.
  • Ozker and O. K. Sahingoz, “Content Based Phishing Detection with Machine Learning,” 2020 International Conference on Electrical Engineering (ICEE), Sep. 2020, doi: https://doi.org/10.1109/icee49691.2020.9249892.
  • Puri, P. Saggar, A. Kaur, and P. Garg, “Application of ensemble Machine Learning models for phishing detection on web networks,” IEEE Xplore, Jul. 01, 2022. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9913599 (accessed May 11, 2023).
  • S. Zaini et al., “Phishing detection system using machine learning classifiers,” Indonesian Journal of Electrical Engineering and Computer Science, vol. 17, no. 3, p. 1165, Mar. 2020, doi: https://doi.org/10.11591/ijeecs.v17.i3.pp1165-1171.
  • Cuzzocrea, F. Martinelli, and F. Mercaldo, “A machine-learning framework for supporting intelligent web-phishing detection and analysis,” Proceedings of the 23rd International Database Applications & Engineering Symposium on – IDEAS ’19, 2019, doi: https://doi.org/10.1145/3331076.3331087.
  • Orunsolu, A. S. Sodiya, and A. T. Akinwale, “A predictive model for phishing detection,” Journal of King Saud University – Computer and Information Sciences, Dec. 2019, doi: https://doi.org/10.1016/j.jksuci.2019.12.005.
  • M. Uddin, K. Arfatul Islam, M. Mamun, V. K. Tiwari, and J. Park, “A Comparative Analysis of Machine Learning-Based Website Phishing Detection Using URL Information,” IEEE Xplore, Aug. 01, 2022. https://ieeexplore.ieee.org/document/9904055 (accessed Dec. 14, 2022).
  • Mughaid, S. AlZu’bi, A. Hnaif, S. Taamneh, A. Alnajjar, and E. A. Elsoud, “An intelligent cyber security phishing detection system using deep learning techniques,” Cluster Computing, May 2022, doi: https://doi.org/10.1007/s10586-022-03604-4.
  • Abdelhamid, F. Thabtah, and H. Abdel-jaber, “Phishing detection: A recent intelligent machine learning comparison based on models content and features,” 2017 IEEE International Conference on Intelligence and Security Informatics (ISI), Jul. 2017, doi: https://doi.org/10.1109/isi.2017.8004877.
  • Singh, I. Al-Mahmood, and M. Al-Tahsin, “Novel Approach to Secure Websites with Machine Learning Classifiers,” Asian Journal of Social Science and Management Technology, vol. 4, no. 2, pp. 2313–7410, 2022, Accessed: Mar. 04, 2024. [Online]. Available: http://www.ajssmt.com/Papers/42152176.pdf
  • K. Gyamfi and J.-D. Abdulai, “Bank Fraud Detection Using Support Vector Machine,” 2018 IEEE 9th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Nov. 2018, doi: https://doi.org/10.1109/iemcon.2018.8614994.
  • Zscaler, “2023 Zscaler ThreatLabz State of Phishing Report | Zscaler,” info.zscaler.com, 2023. https://info.zscaler.com/resources-industry-reports-threatlabz-phishing-report
  • Proofpoint, “2023 State of the Phish Report – Stats, Trends & More | Proofpoint AU,” Proofpoint, Feb. 22, 2021. https://www.proofpoint.com/au/resources/threat-reports/state-of-phish
  • Internet Crime Complaint Center, “2020 Internet Crime Report,” FBI, 2020. Available: https://www.ic3.gov/Media/PDF/AnnualReport/2020_IC3Report.pdf