Detecting phishing attacks using a combined model of LSTM and CNN

Article history: Received 10 December 2019; received in revised form 30 March 2020; accepted 1 April 2020.

Phishing, a social engineering crime that has existed for more than two decades, has gained significant research attention aimed at finding better solutions to counter its highly dynamic strategies. The financial sector is the primary target of phishing, and there are many different approaches to combat phishing attacks. Software-based detection approaches are the most prominent in phishing detection; however, there is still no robust solution that remains stable over a long period. The primary purpose of this paper is to propose a novel solution to detect phishing attacks using a combined model of LSTM and CNN deep networks that uses both URLs and HTML pages. The URLs are learned using an LSTM network with a 1D convolutional layer, and another 1D convolutional network is used to learn the HTML features. These two networks were trained separately and then combined through a sigmoid layer, after dropping the last layer of each model, to form the proposed model. The proposed model reached an accuracy of 98.34%, which is above the previously recorded highest accuracy of 97.3% among the detection models that use both URL and HTML features in the explored literature. The solution requires feature extraction only for HTML pages; URLs are fed directly with minimal pre-processing. Although the proposed solution uses extracted HTML features, these do not depend on third-party services. Therefore, an efficient real-time application can be implemented using the proposed model to detect phishing attacks and safeguard Internet users.


Introduction
Phishing, which originated from the term fishing, is defined as impersonating a trusted third party to steal personal and confidential information from a victim (Whittaker et al., 2010). It started in 1995 with the America Online (AOL) attack (Chiew et al., 2018b) and still exists as a significant cyber threat, holding a top rank in the cyber threat landscape (ENISA, 2019). Phishing is highly associated with human intellect (Nirmal et al., 2015), and financial gain is the primary motivation for this kind of attack. However, fame and notoriety are also a notable psychological aspect of phishing (Weider et al., 2008). Phishing is a severe security problem today, and phishers are smart, economically motivated, and adaptable. The European Union Agency for Cybersecurity (ENISA) ranked phishing within the top 4 of its 15 top cyber threats (ENISA, 2019).
Further, the Anti-Phishing Working Group (APWG) identified more than 180,000 unique new phishing sites in the second quarter of 2019 (APWG, 2019). According to the APWG, nearly 22% of phishing attacks targeted online payment systems, followed by the financial sector with 18%. That means about 40% of phishing attacks were reported against payment processors and banks. However, as a new trend, nearly 39% of attacks were also reported against Software as a Service (SaaS) and cloud storage. All these facts indicate that phishing is still an active threat.
In the literature, there are different approaches to combat phishing attacks, and these are mainly categorized into two groups, namely, improving user-awareness and software-based detection. The second approach, software-based detection, which is also used in this study, attracts high interest because it is a human-centric approach. There are different software-based detection approaches; among those, machine learning performs well due to its unique advantages.
Deep learning, a representation learning approach, has dominated the Artificial Intelligence (AI) field for the past few years (LeCun et al., 2015). It is very good at discovering complex structures in high-dimensional data; therefore, deep learning applies to many domains, and it requires only minimal engineering by hand (LeCun et al., 2015). This study is based on two well-known deep learning techniques: Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN). The proposed model uses these two techniques to detect phishing attacks using both HTML and URL based features. The LSTM and a 1D convolutional network are used to learn abstract-level features in URLs, taking the website URL as the input. The distinctive point here is that the URLs are used without any manual feature extraction; they are fed directly to the network with minimal pre-processing. The HTML features, which are extracted from HTML pages through a feature extraction model, are trained separately in a 1D convolutional network. Finally, the knowledge of these two networks is combined through a dense layer with a sigmoid activation function to make the final predictions. The proposed solution shows an average accuracy of 98.3% in detecting phishing attacks, which is the highest recorded accuracy for a model implemented using both URL and HTML features in the explored literature. The main contribution of this paper is a new deep network that detects phishing attacks with higher accuracy using both HTML features and URLs.
The rest of the paper is organized as follows. Section 2 gives an overview of phishing and the detection approaches used in the past. Section 3 describes the proposed solution, and Section 4 explains how the experiment was done. The results obtained and the performance of the proposed solution are presented in Section 5. Finally, Section 6 concludes the paper and mentions some future directions.

State of the art
Phishing, an Internet-based attack that has existed for more than two decades (Chiew et al., 2018b), is an attempt by an individual or a group of people to steal personal or confidential information from a victim (Nguyen et al., 2014a). It is a social engineering crime (Whittaker et al., 2010) that has shown a growing tendency during the last two decades (Li et al., 2019; APWG, 2019). Li et al. (2019) mentioned 1609 phishing attacks per month, a figure that had increased to more than 50,000 attacks per month by the second quarter of 2019 (APWG, 2019). The main reason behind such a tendency is the nature of phishing attacks: they do not remain active for extended periods; they appear suddenly, get the work done, and then disappear. However, the complexity, deceptiveness, and noisiness of these attacks make them hard to detect and challenge researchers to find a robust solution.

Overview of phishing attacks
Phishing attacks mainly use three strategies, namely, the mimicking attack, the forward attack, and the popup attack (Chiew et al., 2018b). Mimicking attacks are frequent and use emails to send a fake URL to the victim as bait (Chiew et al., 2018b). Generally, phishing attacks start with an impersonated web page (Li et al., 2019), which is very similar to the legitimate web page (Adebowale et al., 2019). Further, phishing attacks consist of three main components: the medium of phishing, the attack vector, and the technical approaches (Chiew et al., 2018b). The medium of phishing can be the Internet, which is the most popular, SMS, or voice. Attack vectors include email, Instant Message (IM), social networking, websites, and more. The technical approaches used to enhance the attack further are mainly of two types: vulnerability exploitation in hardware or software, and website-related techniques, which are more prevalent in phishing (Chiew et al., 2018b).
Generally, a phishing attack is executed in six main steps: 1) the attacker constructs a fake website after choosing a target brand and audience, 2) the URL of the fake website is distributed to the audience through numerous spam emails, 3) the user reads the email and acts on it (i.e., clicks on the link), 4) the user interacts with the fake website, 5) the attacker collects sensitive information, and 6) the collected information is used to satisfy the attacker's intention. However, the life cycle of a phishing attack is short, and half of the phishing attacks are shut down in less than a day. Further, the average uptime of a phishing web page is 32.5 hours, as stated in the literature (Li et al., 2019).

Phishing detection approaches
Many methods have been developed to safeguard users from phishing attacks. Email filtering and web page (deceptive phishing) detection are standard methods for such attack detection (Dou et al., 2017). However, the current study is primarily focused on web page phishing detection; therefore, email-based phishing filtering is not covered in this paper as a phishing detection approach. In the past two decades, different technical and non-technical anti-phishing solutions have been introduced to the community, and those solutions fall mainly into two categories: improving user-awareness and software-based detection (Khonji et al., 2013).

Improving user-awareness
Phishers always take advantage of inexperienced users to accomplish their intentions, and improving user-awareness is one solution to overcome this (Khonji et al., 2013). Dong et al. (2008) proposed a visual user phishing interaction model, which helps to identify the failures of users when interacting with websites. Anti-Phishing Phil (Sheng et al., 2007) was another solution, introduced to teach good user habits in an interactive gaming environment. Similarly, Smells Phishy (Baslyman and Chiasson, 2016) was a game-based attempt to improve user-awareness. Displaying warnings and notifications to users is common in many browsers today, and the use of active rather than passive warnings gives superior results in improving user-awareness (Egelman et al., 2008; Wu et al., 2006). Further, training materials can be used to improve user-awareness (Khonji et al., 2013). Although improving user-awareness shows some success, it is a machine-centric approach which is not practical and effective in the phishing domain (Khonji et al., 2013).

Software-based detection
Software-based detection is a human-centric approach that can be divided into four categories, namely, blacklisting/whitelisting, rule-based heuristic, visual similarity, and machine learning (Khonji et al., 2013).
Blacklisting/whitelisting Techniques: This simple and commonly used approach depends on a list of phishing or legitimate website URLs. A list of known phishing URLs is referred to as a blacklist, whereas a whitelist stores legitimate ones (El-Alfy, 2017). The Google Safe Browsing API (https://safebrowsing.google.com/) is one such blacklist currently in use. Even though this is a simple approach, maintaining a blacklist or whitelist mainly depends on the reporting and confirmation of suspicious websites, which requires more time and effort (Jain and Gupta, 2016). Further, practical limitations such as the need for exact matching, failures in detecting zero-hour attacks, and the difficulty of maintaining an up-to-date list (Khonji et al., 2013; Jain and Gupta, 2016; El-Alfy, 2017) make this approach ineffective. The PhishNet tool (Prakash et al., 2010), Automated Individual White-List (AIWL) (Cao et al., 2008), and White-List maintainer (Jain and Gupta, 2016) are a few approaches used to overcome some of the mentioned issues.
Rule-based Heuristic Techniques: This technique can detect zero-hour attacks (Khonji et al., 2013). However, as Khonji et al. (2013) stated, the risk of misclassifying legitimate websites is also high with this technique. SpoofGuard (Chou et al., 2004) uses a set of rules based on features like the domain name, URL, links, and images to detect phishing attacks. CANTINA (Zhang et al., 2007), a content-based approach, used the TF-IDF algorithm with six heuristics such as the age of the domain, known images, IP address, and a few more. CANTINA performs well compared to SpoofGuard, with 90% accuracy and a false-positive rate of only 1% (Zhang et al., 2007). PhishGuard (Joshi et al., 2008), another heuristic approach, which is based on the HTTP digest authentication concept, used the HTTP 200 OK and 401 Unauthorized statuses when detecting phishing attacks. Similarly, Mohammad et al. (2014a) proposed an intelligent rule-based technique with 17 selected features. Although the rule-based heuristic approaches have good detection accuracy, problems such as a high False Positive (FP) rate, predefined rules, the cost of updating rules, and the rapidly changing nature of phishing attacks (Khonji et al., 2013; Gupta et al., 2017) make this approach ineffective as well.
Visual Similarity Techniques: Visual similarity techniques use the appearance of the web page, mostly through features like text content, text format, HTML tags, CSS, images, and more. DOMAntiPhish (Rosiello et al., 2007), one such technique, uses the Document Object Model (DOM) similarity between two pages, computed through a defined function, in detection. Nguyen et al. (2014b) proposed another DOM tree-based approach to overcome the issues that arose in Rosiello's approach through a two-way similarity comparison technique. PhishZoo (Afroz and Greenstadt, 2011), a profile-based technique with an accuracy of 96.1%, uses the URL of the website, the SSL certificate, and web content like HTML, images, and scripts.
Similarly, Huang et al. (2010) proposed a site signature approach, which creates a unique web-based signature using text and image-based features and shows 94% accuracy with a low error rate. Goldphish (Dunlop et al., 2010) can detect zero-hour phishing attacks and shows better results compared to previous solutions. However, this solution is unstable because it depends on the logo image, OCR, and Google ranking (Adebowale et al., 2019; Jain and Gupta, 2017). Phishing-Alarm, a Cascading Style Sheet (CSS) based solution (Mao et al., 2017), uses CSS as the basis to measure visual similarity. Likewise, several other approaches in the visual similarity area are also mentioned in the literature, such as discriminative key point features, which have a high degree of accuracy between 95% and 97% (Chen et al., 2009), Earth Mover's Distance (EMD), which works at the pixel level of the web pages with significant precision (Fu et al., 2006), and hybrid approaches in phishing detection. However, problems like accuracy issues, use of databases, failures in zero-hour attacks, embedded object detection issues, and the use of threshold values are mentioned as drawbacks of this technique.
Machine Learning Techniques: An association rule mining approach to phishing detection was proposed by Jeeva and Rajsingh (2016). They used fourteen heuristic rules to extract features from URLs, and a total of 18 rules were generated to achieve 93% accuracy. Nguyen et al. (2014a) used six heuristics with a single-layer neural network to achieve 98% accuracy. Although Nguyen et al. (2014a) achieved good accuracy, some of the heuristics used depend highly on third-party services. Phish-Safe (Jain and Gupta, 2018), which is based on a Support Vector Machine (SVM), used 14 features and achieved a best detection accuracy of 90%. Sahingoz et al. (2019) compared seven different machine learning algorithms with three different feature vectors (word-based, Natural Language Processing (NLP) based, and hybrid) to detect phishing URLs. The result shows that the Random Forest (RF) algorithm with NLP based features gives the best accuracy of 97.98%. Further, Probabilistic Neural Networks (PNNs) were used by El-Alfy (2017) to implement a classifier with 96.74% detection accuracy. Although these approaches show accuracy above 90%, they all depend only on URLs and suffer from manual feature extraction.
As a solution to this manual feature extraction, deep learning techniques have been tried out in the past to implement automated feature extraction processes. HTMLPhish (Opara et al., 2019) was one such attempt, which used a Recurrent Neural Network (RNN) to automate the feature extraction process from HTML pages. It used only HTML pages in the detection process and achieved 97.2% detection accuracy. Further, Bahnsen et al. (2017) proposed an LSTM network-based solution with high precision. The solution used only URLs, and no manual feature extraction was required. The URLs were fed to the LSTM network after an encoding process, which reduces the detection time. That was the first time LSTM was used in phishing detection, and it achieved 98.7% accuracy. After that, Chen et al. (2018) also used LSTM to detect phishing URLs and achieved 99.1% accuracy.
Further, Chen et al. (2018) reported that the CNN approach with URLs has lower accuracy than the LSTM. However, Pham et al. (2018) stated that a combination of CNN and LSTM could give better results in detecting malicious URLs than using only LSTM. Although high accuracy is maintained in these automated malicious URL detection systems, URL shortening services that can hide malicious URLs, benign URLs becoming malicious in the future, and tools that can simulate URLs to bypass these models pose challenges to effective phishing detection in the long run (Sahoo et al., 2017).
To overcome such challenges, incorporating HTML features extracted from the web page content together with URL features is a strategic approach that has also been studied in the literature. A self-structuring multilayer perceptron network was proposed to detect phishing attacks by Mohammad et al. (2014b) with 17 input features, including both HTML and URL features. The solution achieved 92.5% detection accuracy. Similarly, Pratiwi et al. (2018) proposed a neural network architecture with 18 input features, which reached a low accuracy rate of 83.38%. Li et al. (2019) used Gradient Boosting Decision Tree (GBDT), XGBoost, and LightGBM in multiple layers with 8 URL and 12 HTML based features. That is the first stacking model to detect phishing attacks, and it achieved 97.3% accuracy. Further, Subasi et al. (2017) used several machine learning algorithms in phishing detection, and out of all of them, RF performed best with an accuracy of 97.36%. However, no one in the explored literature has tried to incorporate HTML features with the LSTM approach introduced by Bahnsen et al. (2017) to examine whether it can provide a robust solution to overcome this social engineering crime.

Proposed solution
An overview of the proposed solution to detect phishing attacks is shown in Fig. 1. The data source contains URLs and the HTML code of web pages. The URLs are used directly as inputs to the model with minimal pre-processing, which is discussed separately in a subsection below. However, HTML features need to be extracted from the web pages. Therefore, a feature extraction model is used for the extraction before finalizing the model input features. After the relevant features are extracted from the web pages, the HTML features and URLs are concatenated to form the input feature vectors for the detection model. Finally, the detection model uses the input feature vector and produces an output of legitimate or phishing. The detection model is a combination of two deep networks. It can analyze URLs and HTML features separately and combine both decisions in making the final output of the model. The major components included in the solution, namely, the feature extraction model and the detection model, are introduced in the following subsections.

Feature extraction model
The URLs are used directly as inputs to the detection model after minimal pre-processing. Therefore, the feature extraction model is used only to extract HTML features. However, Fig. 1 shows that the URL is also used as an input to the feature extraction model. That is only to extract the website domain name to support the HTML feature extraction process. The model extracts 15 HTML features from a given web page, and those features are described below:
• Number of hyperlinks (Jain and Gupta, 2016): Number of 'href' attributes relevant to <a> tags in a web page.
• Number of null pointers (Jain and Gupta, 2016; Gu et al., 2013): Number of 'href' attributes whose value is empty or '#' on a web page.
• External link ratio (Gu et al., 2013; Jain and Gupta, 2016; Li et al., 2019): Ratio between the total number of available hyperlinks and the external links.
• Personal data forms (Li et al., 2019; Gupta et al., 2017): Binary value indicating whether a <form> tag with one or more <input> child tags is available in a page.
• Length of the HTML page (Li et al., 2019): The HTML code is taken as a string, and its length is calculated.
• Internal form ratio (Chiew et al., 2019): Ratio between the available <form> tags and the number of forms whose action attribute has the same domain or a relative path.
• Abnormal form ratio (Chiew et al., 2019): Ratio between the available <form> tags and the number of forms whose action attribute contains a '#', 'about: blank', or an empty string.
• External form ratio (Chiew et al., 2019): Ratio between the available <form> tags and the number of forms whose action attribute contains a URL from an external domain.
• Title tag (Chiew et al., 2019): Binary value indicating whether the <title> tag is used exactly once on the page, inside the head area.
• Title tag and brand name (Li et al., 2019): Binary value indicating whether the <title> tag contains the URL brand name.
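As an illustration of how a few of the listed features could be computed, the following is a minimal sketch using only the Python standard library. It is not the feature extraction model used in the study: the function and key names are hypothetical, the external link ratio is assumed to be external links divided by all hyperlinks, and a link is assumed to be external when its host differs from the host of the page URL.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkFormCounter(HTMLParser):
    """Collects the raw observations needed for a few HTML features."""

    def __init__(self):
        super().__init__()
        self.hrefs = []              # href values found on <a> tags
        self.in_form = False         # True while inside a <form> element
        self.personal_data_form = 0  # 1 if any <form> contains an <input>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            # A bare `href` attribute is reported as None; treat it as empty.
            self.hrefs.append((attrs["href"] or "").strip())
        elif tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form:
            self.personal_data_form = 1

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def extract_html_features(html, page_url):
    """Return a dict with five of the HTML features (hypothetical key names)."""
    domain = urlparse(page_url).netloc.lower()
    parser = LinkFormCounter()
    parser.feed(html)

    hrefs = parser.hrefs
    null_pointers = sum(1 for h in hrefs if h in ("", "#"))
    # Assumption: a link is external when it names a host other than the page's.
    external = sum(
        1 for h in hrefs
        if urlparse(h).netloc and urlparse(h).netloc.lower() != domain
    )
    return {
        "num_hyperlinks": len(hrefs),
        "num_null_pointers": null_pointers,
        "external_link_ratio": external / len(hrefs) if hrefs else 0.0,
        "personal_data_form": parser.personal_data_form,
        "html_length": len(html),
    }
```

The page URL is passed in only to obtain the domain name, which mirrors the role the URL plays in the feature extraction model described above.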

Detection model
The detection model consists of three sub-models, as shown in Fig. 2. The two sets of features mentioned above, URL and HTML features, are used in the detection model. These two sets are trained separately with two deep learning models, and the outputs of the models are merged using the concept of transfer learning to build the final model. Then the final model is trained again with both sets of features and used directly to identify phishing and legitimate web pages. The procedure of the proposed detection model is summarized in Table 1, and the three sub-models are described in detail in the following subsections.

Table 1: Steps of the proposed model to detect phishing attacks

Step 1: Construction of the data for the model
• The URL is taken as one input feature
• HTML features are extracted by passing the page through the feature extraction model
• The URL and HTML features are combined to construct the final input feature vector
• The output label associated with the input feature vector is merged with it to create an input to the model

Step 2: Division of the model input into input vectors
• Input vector one is created with the URL and the associated output label
• Input vector two is created with the HTML features and the output label

Step 3: Model A training
• Input vector one is used with the 1D convolutional and LSTM model
• URLs are pre-processed and used to train the model
• The model is trained and saved to disk

Step 4: Model B training
• Input vector two is used with the 1D convolutional model
• The model is trained and saved to disk

Step 5: Model C training
• Model A is loaded from disk, and its last sigmoid layer is removed
• Model B is loaded from disk, and its last sigmoid layer is removed
• The last output layers of Model A and Model B are concatenated and used as the input for Model C
• Model C is trained, and a test set is used to evaluate the model

Step 6: Making predictions from the model
• The model input is created from the unseen web page by following the first three procedures of Step 1
• The input is passed to Model C
• Model C outputs whether the web page is phishing or legitimate

Model A: 1D convolutional and LSTM model
LSTM has proven to be a powerful technique for detecting phishing URLs (Bahnsen et al., 2017; Chen et al., 2018). Further, Pham et al. (2018) have shown that the combination of a 1D convolution layer and an LSTM layer improves the accuracy in malicious URL detection compared to models that use only LSTM layers. Therefore, this study selected a 1D convolutional and LSTM architecture to train the URL features when designing Model A. In this work, the URL must first be pre-processed. Each character of the URL was treated as a word and assigned a unique integer value using the printable constant of Python's string module. This is sufficient at this level since all the selected URLs are in English. Then, to make all URLs the same size, URLs were cut to a fixed length, which was decided by analyzing the character length distribution of the URLs. Fig. 3 shows the URL character length distribution for legitimate and phishing URLs. Based on this distribution, the maximum URL character length was selected as 150, and the URLs that had fewer characters were padded with 0.
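The URL pre-processing described above can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not say which end of the URL is truncated or padded, so the sketch keeps the first 150 characters and pads on the right; characters outside string.printable are simply skipped; and the names MAX_URL_LEN, CHAR_TO_INT, and encode_url are hypothetical.

```python
import string

MAX_URL_LEN = 150  # maximum URL length selected in the study

# Map each printable character to a unique integer starting at 1,
# so that 0 stays reserved for padding.
CHAR_TO_INT = {ch: i + 1 for i, ch in enumerate(string.printable)}


def encode_url(url, max_len=MAX_URL_LEN):
    """Encode a URL as a fixed-length list of integers.

    Assumptions (not stated in the text): unknown characters are dropped,
    long URLs keep their first `max_len` characters, and short URLs are
    padded with 0 on the right.
    """
    codes = [CHAR_TO_INT[ch] for ch in url if ch in CHAR_TO_INT]
    codes = codes[:max_len]
    return codes + [0] * (max_len - len(codes))
```

For example, encode_url("http://example.com/login") returns a list of 150 integers whose trailing entries are all 0.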

Fig. 3: Character length distribution of the URLs
Model A was designed as a feed-forward network, and it contains an input layer, an embedding layer, a 1D convolution layer, a pooling layer, an LSTM layer, and an output layer. Pre-processed URLs are passed as inputs to the model, which defines the initial input shape. Then each input character is translated into a 256-dimensional embedding in the embedding layer. Next, the translated URLs are fed into the 1D convolution layer through a chaining approach, and the layer uses ReLU as the activation function. Then, as is common practice, a pooling layer is used at the end of the convolution part. The output of the convolution part is fed to the LSTM layer, which has a hyperbolic tangent (tanh) activation function and an output size of 32. The output layer of the model is a dense layer with one neuron and a sigmoid activation function, and it is where the actual classification takes place; therefore, the LSTM layer output is fed to the output layer to perform the classification task. The network uses binary cross-entropy as the loss function with the Adam optimizer, and dropout is used in each hidden layer. Fig. 4 shows a summary of Model A.
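A Keras sketch of an architecture matching this description is given below. It is an approximation rather than the exact network of Fig. 4: the number of filters, kernel size, pool size, and dropout rate are not stated in the text, so the default values here are placeholders, as are the layer name url_lstm and the function name build_model_a. The vocabulary size assumes the string.printable encoding with 0 reserved for padding, and the learning rate of 0.001 is the one reported in the training section.

```python
import string

MAX_URL_LEN = 150                       # URL length used in the study
VOCAB_SIZE = len(string.printable) + 1  # printable characters + padding value 0
EMBED_DIM = 256                         # embedding dimension from the text
LSTM_UNITS = 32                         # LSTM output size from the text


def build_model_a(filters=64, kernel_size=5, pool_size=4, dropout=0.5):
    """Build a Conv1D + LSTM URL classifier.

    `filters`, `kernel_size`, `pool_size`, and `dropout` are illustrative
    guesses; the text does not specify them.
    """
    # Imported lazily so the constants above can be used without TensorFlow.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(MAX_URL_LEN,)),
        layers.Embedding(VOCAB_SIZE, EMBED_DIM),
        layers.Conv1D(filters, kernel_size, activation="relu"),
        layers.MaxPooling1D(pool_size),
        layers.Dropout(dropout),
        layers.LSTM(LSTM_UNITS, name="url_lstm"),  # tanh is the default activation
        layers.Dropout(dropout),
        layers.Dense(1, activation="sigmoid"),     # phishing probability
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.001),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Naming the LSTM layer makes it easier to retrieve its 32-value output later, when the final sigmoid layer is dropped for the combined model.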

Model B: 1D convolutional model
Model B is designed to train on the HTML features, and it is a simple 1D convolutional network. It also uses a multilayer perceptron approach and contains an input layer, two 1D convolution layers, a pooling layer, a flatten layer, a dense layer, and an output layer. The inputs are first converted to floating-point values and passed to the model for shaping. Then the input goes through two 1D convolution layers, which use ReLU as the activation function. Then the pooling and flatten layers are applied, and their output is passed to a dense layer, which has 32 neurons. The dense layer uses ReLU as the activation function, and the output of the layer is fed to the output layer of the model, which is also a dense layer with one neuron and a sigmoid activation function. Similar to Model A, Model B uses binary cross-entropy as the loss function with the Adam optimizer, and dropout is used after each convolution layer. Fig. 5 shows the summary of Model B.
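The following Keras sketch follows the layer sequence described above. As with any reconstruction from prose, it is an assumption-laden approximation of Fig. 5: the filter count, kernel size, pool size, and dropout rate are placeholders, and the names to_conv_input, build_model_b, and html_dense are hypothetical. The 15 HTML features are treated as a sequence of length 15 with one channel so that a 1D convolution can be applied.

```python
NUM_HTML_FEATURES = 15  # number of HTML features extracted per page
DENSE_UNITS = 32        # size of the dense layer from the text


def to_conv_input(rows):
    """Convert feature rows to floats with shape (n, 15, 1) as nested lists.

    Wrap the result with numpy.asarray(..., dtype="float32") before
    passing it to Keras.
    """
    return [[[float(v)] for v in row] for row in rows]


def build_model_b(filters=32, kernel_size=3, pool_size=2, dropout=0.5):
    """Build a 1D convolutional classifier for the HTML features.

    `filters`, `kernel_size`, `pool_size`, and `dropout` are illustrative
    guesses; the text does not specify them.
    """
    # Imported lazily so the helper above can be used without TensorFlow.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(NUM_HTML_FEATURES, 1)),
        layers.Conv1D(filters, kernel_size, activation="relu"),
        layers.Dropout(dropout),
        layers.Conv1D(filters, kernel_size, activation="relu"),
        layers.Dropout(dropout),
        layers.MaxPooling1D(pool_size),
        layers.Flatten(),
        layers.Dense(DENSE_UNITS, activation="relu", name="html_dense"),
        layers.Dense(1, activation="sigmoid"),  # phishing probability
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.001),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

With the placeholder kernel and pool sizes, the sequence length shrinks from 15 to 13 to 11 across the two convolutions and to 5 after pooling, which is still large enough to flatten and feed to the dense layer.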

Model C: Prediction model
Model C is designed with the concept of transfer learning. Model A and Model B are trained separately and then loaded into Model C. Next, the output layers of Model A and Model B are removed. After this removal, the final layer of Model A is the LSTM layer, and the final layer of Model B is the dense layer. Both final layers have 32 outputs each, and those outputs are concatenated and used as the input to Model C. Model C is a simple network with one dense layer. The layer has one neuron, and it uses the sigmoid activation function. After sufficient training, Model C is used for the prediction task.
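To make the arithmetic of this final layer concrete, the sketch below computes the output of a single sigmoid neuron over the concatenated 32 + 32 feature values in plain Python. It is a framework-free illustration, not the Keras implementation used in the study; the function name and the weights passed to it are hypothetical, and in practice the 64 weights and the bias would be learned during the training of Model C.

```python
import math


def model_c_predict(url_features, html_features, weights, bias):
    """Return the sigmoid output of one dense neuron over the merged features.

    `url_features` stands for the 32 LSTM outputs of Model A and
    `html_features` for the 32 dense outputs of Model B.
    """
    merged = list(url_features) + list(html_features)  # concatenation: 64 values
    if len(merged) != len(weights):
        raise ValueError("expected one weight per merged feature")
    z = sum(w * x for w, x in zip(weights, merged)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```

The returned value lies between 0 and 1; since phishing pages are the positive class in the confusion matrix, a value near 1 would indicate phishing. The cut-off for the final label (for example 0.5) is an assumption, as the text does not state it.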

Experiment and evaluation
The experiment was performed on an HP ProBook machine with 8 GB of memory and an Intel Core i5-7200U CPU @ 2.50GHz x2 processor. The Keras neural-network library on top of TensorFlow and the Python programming language were used in all implementation tasks.

Data source
The experiment used a self-constructed data source with 40000 records. The data source consisted of 20000 legitimate and 20000 phishing web pages with their URLs. The legitimate web pages were collected from the Google search engine through a Python script. The script handles duplicates, and the top-ranked web pages were selected based on the Google page ranking to obtain a trusted, legitimate set. Further, the script used a word list from GitHub and a self-generated list while executing the searching task. The phishing web pages with URLs were collected from several sources, mainly PhishTank (https://www.phishtank.com/) and the phishing website data source (Chiew et al., 2018a) of Universiti Malaysia Sarawak, available at the official University link (http://www.fcsit.unimas.my/research/legit-phish-set/). Further, the data collected from sources other than PhishTank were verified using either PhishTank or the Google Safe Browsing API to construct an accurate phishing data source. Therefore, all the data used in the phishing data source are available in PhishTank, the Google Safe Browsing API, or both. The final data source was constructed in CSV format after the feature extraction model extracted the 15 HTML features, by merging them with the relevant URLs and class labels. Then the CSV file, which contains 17 columns (15 HTML features + URL + class label) and 40000 rows, was divided randomly using the scikit-learn Python library into three separate data sources for training, testing, and validation. The proportions used for training, testing, and validation are 70%, 20%, and 10%, respectively.
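The 70/20/10 division can be illustrated with the short sketch below. The study used scikit-learn for this step; the sketch instead uses only Python's random module to show the proportions, so it is a stand-in rather than the code that was run. The function name split_rows and the seed parameter are hypothetical, and no stratification by class label is applied because the text does not mention one.

```python
import random


def split_rows(rows, seed=0, train_pct=70, test_pct=20):
    """Shuffle rows and split them into training, testing, and validation sets.

    The validation set receives whatever remains after the training and
    testing portions are taken (10% with the default percentages).
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded for reproducibility
    n_train = len(rows) * train_pct // 100
    n_test = len(rows) * test_pct // 100
    train = rows[:n_train]
    test = rows[n_train:n_train + n_test]
    validation = rows[n_train + n_test:]
    return train, test, validation
```

Applied to 40000 rows, this yields 28000 training, 8000 testing, and 4000 validation rows; changing the seed produces a different random division with the same proportions.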

Performance metrics
Phishing detection is a classification problem. Therefore, the confusion matrix is a suitable way to summarize the predictions and evaluate the performance of the proposed solution. The confusion matrix relevant to the study is shown in Fig. 6.

Fig. 6: Confusion matrix used during the study

Each feature vector falls into one of the four possible categories shown in Fig. 6. The True Positive (TP) category contains the correctly predicted phishing pages, and True Negative (TN) is for correctly predicted legitimate pages. False Negative (FN) and False Positive (FP) are the categories where incorrect classification happens. FP contains legitimate pages predicted as phishing, and in FN, phishing pages are predicted as legitimate. Phishing detection is highly sensitive to false positives because even a single prediction that falls into that category may be costly due to the nature of phishing attacks.
The standard measures of accuracy, precision, recall, and F1-score are used in this study to evaluate the proposed solution's performance. These metrics are defined in Eqs. 1-4.
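For reference, the standard definitions of these four metrics in terms of the confusion matrix counts TP, TN, FP, and FN are given below. They are the textbook formulas and are assumed to correspond to Eqs. 1-4 in the order accuracy, precision, recall, and F1-score.

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \\
\text{Precision} &= \frac{TP}{TP + FP} \tag{2} \\
\text{Recall}    &= \frac{TP}{TP + FN} \tag{3} \\
\text{F1-score}  &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}
\end{align}
```

With phishing as the positive class, precision measures how many pages flagged as phishing really are phishing, and recall measures how many phishing pages are caught.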
Further, the Receiver Operating Characteristic (ROC) curve, which is useful when predicting the probability of a binary classification task, is also used with Area Under the Curve (AUC) to evaluate the proposed solution's performance statistically.

Training and evaluation
Model A and Model B were trained separately for 100 epochs with a batch size of 64 and a learning rate of 0.001, and then saved to disk. Then the training of Model C was started. It was trained in a 50-step sequence with a learning rate of 0.001. The three data sources mentioned above were used in the experiment: the training source was used for training, and the test source was used for internal validation. Fig. 7 shows the final model accuracy and loss in each epoch for both the training and testing data sources. The graphs show that the performance on the data held out during training starts to degrade before ten epochs, which indicates overfitting. Therefore, the early stopping technique was used to stop the training of the model before it overfitted the training data set. After the model was successfully fitted, the 10% of data reserved for validation was used to evaluate the model performance. Model C was trained and evaluated three times, each time using a different division of the data in the same proportions as mentioned above for training, testing, and validation, to obtain a less biased model at the end. The scikit-learn model selection module was used with different random states for this task. The results obtained through the experiments are discussed in the next section.

Fig. 7: Model accuracy and loss in each epoch before early stopping was used, showing that the model overfitted before ten epochs were completed
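The early stopping rule can be sketched in plain Python as follows. The text does not state the monitored quantity or the patience, so this sketch assumes the held-out loss is monitored and uses a hypothetical patience of three epochs; the function name is also hypothetical. Keras provides a comparable mechanism through its EarlyStopping callback, though the text does not say which implementation was used.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops at the first epoch where the loss has not improved by
    more than `min_delta` for `patience` consecutive epochs. If that never
    happens, the stop epoch is the last epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses), best_epoch
```

For a loss curve that bottoms out at the fourth epoch and rises afterwards, the rule stops training at the seventh epoch and identifies the fourth epoch as the best one, which matches the pattern of degradation before ten epochs described above.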

Results and discussion
The results obtained during the study are shown in Table 2, based on the performance metrics described above. As shown in Table 2, the average accuracy, precision, recall, and F1-score are 98.34%, 98.45%, 98.23%, and 98.29%, respectively. Further, the model achieved an average AUC of 99.8% for the ROC curve. These values indicate that the model is well suited for detecting phishing attacks. To put the accuracy of the proposed solution in context, several other methods with different feature sets were applied to the experimental data source. The results of this experiment are shown in Table 3. They show that the proposed model outperforms the other methods on this data source by achieving higher prediction accuracy.
(2019) to detect phishing attacks using both HTML and URL features, and it had an accuracy of 97.3%. That is the most suitable model found in the literature for comparison with the model presented in this paper, since both use HTML features and URLs in phishing detection. The model presented here has several advantages over the benchmarked model. First, the detection accuracy is improved by about 1.0 percentage point (98.34% versus 97.3%). Second, although both models include an HTML feature extraction process, the presented model does not rely on expert knowledge for URL feature extraction. The most recent approach introduced to the phishing area is HTMLPhish (Opara et al., 2019). It achieved a detection accuracy of 97.2%, which is also lower than the accuracy of the proposed model. However, HTMLPhish does not use any manual feature extraction, whereas the proposed solution extracts features manually from the HTML pages, which is a drawback. Despite the manual HTML feature extraction, incorporating URLs into the solution gave the model better accuracy than HTMLPhish. URL attempts, which can be produced by smart phishers.
Further, Table 3 shows how the experimental data source performed with the different types of detection methods that are possible in phishing detection. It indicates that using both URL and HTML content analysis yields higher detection accuracy than using only URLs or only HTML features.

Conclusion and future works
In this work, a novel approach to detecting phishing attacks was introduced. The solution depends on the HTML content and the URL of a web site. The URLs were learned using an LSTM network combined with a 1D convolutional network. This network takes URLs as input, so expert knowledge is not required for URL feature extraction. Another 1D convolutional model was used to learn the HTML features, which were extracted using a feature extraction model. Finally, these two networks were trained separately and combined through a sigmoid layer, after dropping the last layer of each model, to form the proposed classifier. The experiment used a self-constructed data source with 20,000 phishing and 20,000 legitimate samples. The phishing data were mainly collected from PhishTank and the phishing web site data source of the University Malaysia Sarawak. Except for the PhishTank data, the collected phishing data were validated by either PhishTank or the Google Safe Browsing API to obtain an accurate phishing data source. Legitimate data were collected through the Google search engine by running a Python script. The experiment used three partitions of the data source for training, testing, and validation, in proportions of 70%, 20%, and 10%, respectively. The scikit-learn Python library was used for data partitioning, and the experiment was repeated three times to obtain a less biased model.
The proposed model reached an accuracy of 98.34% and an AUC of 99.8% for the ROC curve. This is the highest accuracy achieved by a phishing detection solution using both HTML and URLs in the explored literature. Further, the experimental data source was used with several other possible detection methods, and the proposed solution performed best, which emphasizes that using both HTML features and URLs is essential in phishing detection. One significant advantage of the solution is that it eliminates expert interaction for URL feature extraction. However, HTML feature extraction still depends on expert knowledge, which should be eliminated in the future to obtain a robust phishing detection model. Future studies are therefore needed to overcome that drawback. If they succeed, a self-learning model could be implemented to detect phishing attacks without human interaction, and such a model could update its detection criteria automatically from time to time to remain useful against the rapidly changing nature of phishing. Nevertheless, the HTML features used do not depend on third-party services, so real-time applications can be implemented using the proposed model to detect phishing attacks. As future work, several optimization techniques can be used to improve the accuracy, and different HTML feature sets can be used to check whether the proposed architecture can be fine-tuned further.