Nominate of significant features for unknown internet traffic applications filtering based on a neural network algorithm

The evolution of the internet into a large, complex, service-based network has posed tremendous challenges for network monitoring and control: massive volumes of data must be collected, and newly emerging applications, such as peer-to-peer networks, streaming content, and online games, must be classified accurately. In this work, machine learning algorithms are used to classify traffic into its corresponding applications. The research uses our customized training data set collected from the campuses of three institutions. The effect of the size of the training data set was considered before examining the accuracy of the classification algorithms. Selecting the best algorithm over a large amount of network traffic leads to delays in performance; to solve this problem, we propose an approach that combines multiple neural networks with feature selection in order to predict and identify known and unknown applications. The proposed method achieves a classification accuracy of up to 99.11% for network traffic, which helps improve traffic handling in the network and avoid delays.


Introduction
Internet traffic describes the sum of data available on a site; in other words, it is the flow of data across the web.
The classification of internet traffic can help to handle different types of network problems (Namdev et al., 2015), such as the management of internet traffic applications by identifying the basic functionality of each application to which a policy applies. Its uses include advanced network monitoring, management of network resources, detection of anomalies, device-specific strategies, and network audits. Awareness of internet traffic at the application level is therefore extremely useful to those who engineer internet traffic and study long-term internet changes and conditions. More than 20 years ago, internet measurements found that 70-75% of traffic was online. Accurate identification and prediction of internet traffic are important for various network operations, such as network management and security control, traffic analysis and network planning, performance, and the delivery of accounting services (Gunnar et al., 2005). Today, sharing files and applications across the web is often the major influence on data traffic in the network. To determine the known and unknown requests in a given traffic dataset (Nguyen and Armitage, 2008), machine learning can automatically search for and identify useful structural patterns. We investigate the use of multiple neural network algorithms to classify internet traffic.
The advantage of the proposed method is that the model can predict incoming applications and classify them as known or unknown, which helps to reduce web traffic, with higher accuracy than previous research in the literature.
This article is structured into five parts. Part One gives the motivation and the introduction. Part Two addresses the literature related to the research and to process enhancement. Part Three explains the fundamental structure of the proposed model. Part Four presents the experimental design and the research results with comprehensive descriptions. Part Five lists the conclusions of the work.

Related works
Several traffic clustering methods based on flow statistical features have been proposed and tested (Zhang et al., 2013a). McGregor et al. (2004) suggested aggregating traffic flows into a small number of clusters using expectation maximization, working from several statistical characteristics computed on a full-flow basis. It was found that the algorithm separates traffic into a small number of clusters based on the type of traffic rather than the application. Zander et al. (2005) used AutoClass to cluster traffic flows and proposed a metric for cluster evaluation called intra-class homogeneity. Training was conducted on a randomly sampled subset of the traffic data, and the clustering results were evaluated in terms of accuracy. Bernaille et al. (2006) applied the K-means algorithm to cluster traffic and used a payload analysis tool to label the clusters with applications. The researchers used the first few packets of a transmission control protocol (TCP) connection and the complete flow to describe traffic flows as numerical features. Erman et al. (2006) evaluated the traffic clustering algorithms K-means, density-based spatial clustering of applications with noise (DBSCAN), and AutoClass on two empirical data traces. The authors concluded that the K-means algorithm was better suited to traffic clustering because of its strong overall accuracy and short model construction time. Previous research has shown that when the number of clusters is considerably greater than the number of applications, traffic clustering can produce high-purity clusters. However, this also creates a key problem: the gap between clusters and applications. Payload-based manual mapping, or a payload analysis tool (Bernaille et al., 2006), partly bridges this gap. An automated mapping system has also been suggested by Erman et al. (2006).
In their system, a number of flows are first pre-labeled manually based on payload. Next, the pre-labeled flows are fed into the K-means clustering algorithm together with unlabeled flows. Then, a large number of traffic clusters are mapped to several known applications using the available labeled flows. Finally, a new traffic flow is assigned to the closest cluster. Their experimental results showed that the mapping system achieves high accuracy with a reasonable number of labeled flows. These mapping methods can merge the traffic clusters of known applications, but they cannot merge the traffic clusters of unknown applications. In recent years, several clustering methods that use payload have also been introduced. Ma et al. (2006) developed three models to capture the statistical and structural properties of flow content. The researchers introduced a clustering approach for sets of flows and tested it on their traffic data with more than 10 applications. Zhang et al. (2013b) combined a statistical signature of content with the K-means algorithm to identify unknown traffic clusters based on payload content. Wang et al. (2013) suggested using clustering for the automatic creation of application signatures on which a classifier is built. They evaluated several supervised classifiers on the clustered traffic generated by X-means from the first 32 bytes of payload content. Such experiments highlight the utility of flow payload for identifying specific traffic clusters; however, it remains unclear how to represent the content and how to compute the similarity of traffic clusters. Several other works apply supervised learning to payload-based traffic classification. While searching for application signatures in the payload is a more reliable way to classify traffic, deriving the signatures manually takes a lot of time.
To address this issue, Moore and Zuev (2005) used a Naïve Bayes classifier with kernel estimation together with the Fast Correlation-Based Filter (FCBF) for classification. They used a wide range of 248 features, including packet series information and TCP protocol information. Moore used nine methods to classify the data by hand, including port number, payload header, single-packet signature and protocol attribute, first K-byte payload, host history, etc. In these studies, application signatures are obtained through in-depth packet-level trace analysis or from protocol documentation (where available). Recent developments have attempted to free people from this laborious pre-processing phase. Wang et al. (2010) suggested using supervised machine learning to derive signatures for a variety of applications automatically. Finamore et al. (2011) suggested application signatures for traffic analysis based on a statistical characterization of the payload and applied supervised algorithms, such as the support vector machine (SVM). Existing payload-based classification methods, however, cannot deal with "unknown applications." Zhao et al. (2008) suggested real-time feature selection for traffic classification. They discussed the different types of features used in traffic classification, and they evaluated and compared the accuracy of various feature selection algorithms, in particular for the classification of Peer-to-Peer (P2P) traffic. They proposed a real-time feature subset for online traffic classification. Cao et al. (2015) reported that the classification performance of the SVM is better after scaling, but that a high feature dimension gives the SVM classifier a longer training time and higher computational complexity.
With this method, the accuracy of each flow is obtained for each number of features, and the number of features at which the accuracy of each traffic flow is highest is recorded. The features selected at these numbers compose the best feature subset. After feature selection, the average accuracy over all flows reaches 98.69%. Lotfollahi et al. (2020) proposed a deep learning approach that combines the feature extraction and classification phases into a single system. The proposed scheme, known as "Deep Packet," can handle both traffic classification, in which network traffic is divided into major groups (e.g., File Transfer Protocol (FTP) and P2P), and application recognition, in which end-user applications (e.g., BitTorrent and Skype) are identified. Unlike most current approaches, Deep Packet can recognize encrypted traffic and can even differentiate between kinds of network traffic. The Deep Packet framework employs two deep neural network architectures for network traffic classification.

Proposed model
The essential aim of this paper is to classify and predict network traffic data. The proposed method has four primary steps:
A. Pre-processing, such as data cleansing and the removal of outliers and missing values.
B. Dividing the dataset into learning and testing data.
C. Applying the multiple neural network algorithm and the feature selection method to the network traffic dataset.
D. Classifying the known and unknown applications that affect the network and determining the accuracy.
The steps of the proposed model are demonstrated in Fig. 1.

Data pre-processing
Data pre-processing is one of the most important steps in preparing the data set before the mining process. In our research, IBM SPSS, a statistical software platform that offers advanced statistical analysis, was used to analyze the data and to apply the hybrid method consisting of neural networks and a feature selection method. We used the ten ready-made datasets collected by a high-performance network monitor (Auld et al., 2007). The data were classified according to the entry for each category. The input parameters of the neural network and of feature selection are Interactive, Database, Games, Services, Mail, WWW, P2P, Attack, and Media. These are the same 10 datasets that were extracted by Moore and Zuev (2005). In our research, 249 different discriminators were used to describe traffic flows, including statistics on flow length, TCP port data, statistics on payload size, and the Fourier transform of packet inter-arrival times. The authors of the datasets constructed the flow collection by tracing a day and breaking it into ten blocks of roughly 1,680 seconds (28 min) each. Random selection of the samples gave a wider variety of traffic mixes during the day. Each data block comprises a specific number of flows, and each block covers a single 28-minute capture owing to the high traffic level. We split every dataset in three ways, with 50%, 60%, and 70% of the data for training and 50%, 40%, and 30% for testing, respectively, to determine the accuracy for each split. Accuracy results were extracted from each dataset, and the average of these results was calculated.
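The three train/test splits described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name split_flows, the fixed random seed, and the dummy blocks of 100 flow identifiers are hypothetical and stand in for the real datasets, which were processed in IBM SPSS.

```python
import random

def split_flows(flows, train_fraction, seed=0):
    """Shuffle a copy of the flows and cut it into (train, test) parts."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(flows)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical stand-in for the ten data blocks: 100 dummy flow ids each.
blocks = [list(range(100)) for _ in range(10)]

for fraction in (0.5, 0.6, 0.7):
    sizes = [tuple(len(part) for part in split_flows(block, fraction))
             for block in blocks]
    print(f"train fraction {fraction}: (train, test) sizes per block = {sizes[0]}")
```

In a full experiment, a classifier would be trained and scored on each of the ten splits, and the ten accuracies would then be averaged for each train fraction.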
In this phase, the text pre-processing stage contained three sub-stages: text chunking, stop-word removal, and term stemming. Text chunking partitioned a text archive into sub-sentences. Several studies that concentrate on text preparation strategies in various fields include intrusion detection (Sharma et al., 2007). The stop-word removal step was used to erase meaningless terms. A stemming procedure was additionally applied to delete the affixes (suffixes and prefixes) of a term in order to obtain its root. This step separated the critical terms from the text and disregarded the remaining terms, which may have affected the similarity between texts unfavorably.

Combination stage
In this stage, we combined two techniques, the neural network approach and the feature selection method, to train the classifiers.

Artificial neural networks
Neural networks are the usual representation of the brain: natural neurons are connected to other neurons to form a network that produces actions like "move the hand to pick up the cup." An artificial neural network is normally arranged in layers, such that a neuron in layer n connects only to neurons in layers n-1 and n+1 (Arnx, 2018). An artificial neural network can be depicted as in Fig. 2.
Usually, neural networks are read from left to right. The first layer here is the one that receives the inputs. There are two internal layers that do some algebra (known as hidden layers) and one final layer that contains all possible outputs. The "+1" at the bottom of each layer is not a neuron; it is the "bias." The operations of every neuron are quite simple. First, the neuron adds up the values of the neurons of the previous layer to which it is connected. There are three inputs in Fig. 2 (x1, x2, and x3), so our neuron is connected to three neurons of the previous layer. Before it is added, each value is multiplied by another variable named "weight" (w1, w2, w3), which determines the strength of the connection between the two neurons. Each connection between neurons has its own weight, and the weights are the only values that change in the course of learning. In addition, a bias value can be added to the computed total. The bias is not an output of a particular neuron; it is chosen before the learning process, and it can be helpful to the network. Finally, the neuron applies an activation function to the total. This is all a neuron does: it takes the values of the connected neurons multiplied by their respective weights, adds them, and applies an activation function to the sum. The neuron then passes the new value on to other neurons.
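The computation performed by a single neuron, as described above, can be sketched as follows. The function names, the example weights, and the choice of a sigmoid as the default activation are illustrative assumptions, not values taken from the experiments.

```python
import math

def sigmoid(z):
    """Smooth activation that maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias_weight, bias=1.0, activation=sigmoid):
    """Weighted sum of the inputs plus the weighted bias, then the activation."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias * bias_weight
    return activation(total)

# Three inputs (x1, x2, x3) with hypothetical weights (w1, w2, w3).
value = neuron_output([1.0, 0.0, 1.0], [0.5, -0.3, 0.2], bias_weight=0.1)
print(value)
```

Passing a step function instead, for example activation=lambda z: 1 if z > 0 else 0, turns the same neuron into the threshold unit used by a perceptron.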

Fig. 2: Representation of multiple neural networks
The parameters of the neural network algorithm were selected based on the nature of the data on the network. The inputs are the features of the applications on the network, while the output is the classification target, that is, whether an application is known or unknown, computed from the weighted parameters in the hidden layers.
After every neuron of a layer has done this, the neural network moves on to the next layer. The last values obtained must eventually be usable to determine the desired output. The learning cycle works as follows. First, note that the neural network returns an output whenever an input is provided. At the first attempt, it cannot produce the right output on its own (except by luck), which is why each input comes with a label during the learning phase, indicating what the output of the neural network should be. If the output is the right one, the current parameters are kept and the next input is given. If the output does not match the label, however, the weights are modified. The weights are the only values that can be modified in the learning phase. This mechanism can be pictured as a set of knobs that are turned to different positions each time an input is not classified correctly.
A process called "backpropagation" is used to decide which weight is best to change. We do not dwell on it here, because the neural network built in the example does not use this exact process. In short, it consists of going back through the neural network and inspecting each connection to see how the output would respond to a change in its weight.
Finally, there is one last parameter that controls how the neural network learns: the "learning rate." It defines how quickly the neural network learns or, more precisely, whether a weight is changed slowly or in bigger steps. A value of 1 is used in the example that follows. Now that the fundamentals are clear, we can look at the neural network to be created. Its design allows two classes to be separated by a simple classifier. A quick example (which has little value beyond illustration) helps to understand the possibilities and limitations.
When the "true" values are replaced by 1 and the "false" values by 0, and the four cases are placed on a graph as coordinate points, it is clear that the two final classes, "false" and "true," can be separated by a single line. This is what a perceptron can do. A neural network can be built from scratch with Python (3.x in the following example):
    import numpy, random, os
    lr = 1    # learning rate
    bias = 1  # value of bias
    # weights generated in a list (3 weights in total for 2 neurons and the bias)
    weights = [random.random(), random.random(), random.random()]
Put simply, the libraries and parameter values are defined at the start of the program, and a list containing the weights to be changed is created at random.
Next, a function named Perceptron is defined that computes the output neuron. It takes three arguments (the two input values and the expected output). "outputP" is the variable that holds the output of the perceptron. The error is then computed and used straight afterward to change the weight of each connection to the output neuron (Arnx, 2018).
    for i in range(50):
        Perceptron(1, 1, 1)  # True or true
        Perceptron(1, 0, 1)  # True or false
        Perceptron(0, 1, 1)  # False or true
        Perceptron(0, 0, 0)  # False or false
This loop makes the neural network repeat every case several times; this part is the learning phase. The number of iterations is selected based on the reliability we need. We have to be mindful, however, that too many iterations could overfit the network, so that it concentrates too much on the training cases and cannot produce the right output for a case that it did not see during its training phase.
Nonetheless, our situation here is special because there are only four cases, and all of them are given to the neural network during its learning phase. In general, a perceptron is expected to give the right output without ever having seen the case that it is treating (Arnx, 2018).
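Because the body of the perceptron function is not reproduced in the text, the following is a minimal reconstruction of the training procedure described above, not the authors' original code. It assumes a step (Heaviside) activation and the update rule "weight += error * input * learning rate"; the names train_or_perceptron and predict and the fixed random seed are hypothetical, and the weights are kept local rather than global so that the sketch is repeatable.

```python
import random

def train_or_perceptron(epochs=50, lr=1, bias=1, seed=0):
    """Train a two-input perceptron on the four cases of the logical OR."""
    rng = random.Random(seed)  # fixed seed (assumption) for repeatable runs
    # Three weights in total: one per input neuron and one for the bias.
    weights = [rng.random(), rng.random(), rng.random()]
    samples = [(1, 1, 1), (1, 0, 1), (0, 1, 1), (0, 0, 0)]
    for _ in range(epochs):
        for x1, x2, expected in samples:
            total = x1 * weights[0] + x2 * weights[1] + bias * weights[2]
            output = 1 if total > 0 else 0   # Heaviside activation
            error = expected - output        # 0 when the guess is right
            weights[0] += error * x1 * lr
            weights[1] += error * x2 * lr
            weights[2] += error * bias * lr
    return weights

def predict(weights, x1, x2, bias=1):
    """Apply the trained weights to one input pair."""
    total = x1 * weights[0] + x2 * weights[1] + bias * weights[2]
    return 1 if total > 0 else 0

weights = train_or_perceptron()
for a, b in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    print(a, "or", b, "->", predict(weights, a, b))
```

Since the OR problem is linearly separable, the perceptron update converges after a small number of corrections, so 50 passes over the four cases are more than sufficient here.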
In this case, it is useful to use the Heaviside activation function. All values are brought back to exactly 0 or 1, as we are looking for a false or a true value. Alternatively, a sigmoid function yields a decimal number between 0 and 1, typically very close to one of those limits. The weights determined by the neural network can also be saved in a file and used later without another learning phase. This is useful for larger projects, in which the learning phase may last days or weeks.
The study proposes a multiple neural network technique to predict and filter data traffic on the network in order to identify unknown applications through the physical network. Multiple neural network algorithms were used in an experiment to assess the accuracy of internet traffic classification for potential enhancement. As shown later, multiple neural network models can be used to predict and filter network data traffic with high accuracy.

Feature selection algorithms
In this section, we present the classical selection algorithm: forward selection of features (Mao, 2002). Then, we examine variants of the greedy forward algorithm that improve computational efficiency without losing much accuracy.
The feature selection process begins by evaluating all feature subsets that consist of a single input attribute. In other words, we start by measuring the Leave-One-Out Cross-Validation (LOOCV) error of the one-component subsets [X1], [X2], ..., [XM], where M is the input dimension, so that we can find the best individual feature, X(1). The complete process for selecting up to m features is as follows. First, forward selection finds the best two-component subset consisting of X(1) and one other feature from the remaining M-1 input attributes; there are therefore M-1 pairs in total. Suppose X(2) is the other attribute besides X(1) in the best pair. Then, the input subsets with three, four, and more features are evaluated in the same way. According to forward selection, the best m-feature subset is the m-tuple composed of X(1), X(2), ..., X(m), while the overall best feature set is the winner over all M steps. If the cost of a LOOCV evaluation with i features is C(i), then the computational cost of choosing a subset of size m out of the total M input attributes is:

M C(1) + (M-1) C(2) + ... + (M-m+1) C(m)
Liu and Motoda (2007) estimated that the cost of a one-nearest-neighbor prediction with j inputs, using a kd-tree, is O(j log N), where N is the number of data points. Therefore, the cost of computing the mean leave-one-out error, which involves N predictions, is O(j N log N). Substituting this cost into the expression above gives a total cost of O(m^2 M N log N) for selecting the features.
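The forward selection procedure described above can be sketched as follows. This is an illustration under stated assumptions: the one-nearest-neighbor search is done by brute force rather than with a kd-tree, the function names are hypothetical, and the six-point toy dataset is invented so that feature 0 is informative and feature 1 is misleading.

```python
def loocv_error(X, y, features):
    """Leave-one-out error of a 1-nearest-neighbor classifier that only
    looks at the given feature indices (brute force, no kd-tree)."""
    errors = 0
    for i in range(len(X)):
        best_j, best_d = None, float("inf")
        for j in range(len(X)):
            if j == i:
                continue  # leave the query point out
            d = sum((X[i][f] - X[j][f]) ** 2 for f in features)
            if d < best_d:
                best_j, best_d = j, d
        errors += y[best_j] != y[i]
    return errors / len(X)

def forward_selection(X, y, m):
    """Greedily add the feature that gives the lowest LOOCV error."""
    selected, remaining, history = [], list(range(len(X[0]))), []
    for _ in range(m):
        best = min(remaining, key=lambda f: loocv_error(X, y, selected + [f]))
        selected.append(best)
        remaining.remove(best)
        history.append((list(selected), loocv_error(X, y, selected)))
    return history

# Toy data (assumption): feature 0 separates the classes, feature 1 does not.
X = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0], [1.0, 4.8], [1.1, 1.2], [1.2, 9.1]]
y = [0, 0, 0, 1, 1, 1]
for subset, error in forward_selection(X, y, 2):
    print(subset, error)
```

On this toy data the first step picks feature 0 with zero leave-one-out error, and adding feature 1 makes the error worse, which is why the overall best subset is taken as the winner over all steps rather than simply the largest one.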
We can also use an exhaustive search to find the overall best feature set. The exhaustive search starts by searching for the best one-component subset of the input features, exactly as the forward selection algorithm does. Next, however, it identifies the best two-component subset, which may consist of any pair of input features. It then moves on to find the best triple out of all combinations of three features, and so on. The exhaustive search is defined as follows (Arnx, 2018):
Nevertheless, forward selection can suffer because of its greediness. For example, if X(1) is the best individual feature, there is no assurance that either [X(1), X(2)] or [X(1), X(3)] is better than [X(2), X(3)]. Thus, a forward selection algorithm may pick a feature set other than the one selected by the exhaustive search. With a poor feature set, the estimate for a query Xq = [x1, x2, ..., xM] can differ significantly from the true Yq.
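The difference between the two strategies can be made concrete with a small sketch. The error table below is entirely hypothetical: it is constructed so that X1 is the best single feature while the pair [X2, X3] is the best pair, which is exactly the situation described above. The function names are also assumptions.

```python
from itertools import combinations

def exhaustive_search(features, size, error_of):
    """Evaluate every subset of the given size and keep the best one."""
    return min(combinations(features, size), key=error_of)

def greedy_search(features, size, error_of):
    """Forward selection: extend the current subset by one feature at a time."""
    chosen = ()
    for _ in range(size):
        candidates = [tuple(sorted(chosen + (f,)))
                      for f in features if f not in chosen]
        chosen = min(candidates, key=error_of)
    return chosen

# Hypothetical LOOCV errors for each subset (not measured values).
toy_errors = {
    ("X1",): 0.20, ("X2",): 0.30, ("X3",): 0.35,
    ("X1", "X2"): 0.15, ("X1", "X3"): 0.18, ("X2", "X3"): 0.05,
}
error_of = toy_errors.__getitem__
features = ["X1", "X2", "X3"]

print("greedy:    ", greedy_search(features, 2, error_of))
print("exhaustive:", exhaustive_search(features, 2, error_of))
```

The greedy search is locked into X1 after its first step and ends with [X1, X2], whereas the exhaustive search finds [X2, X3]. The price is that the exhaustive search must evaluate every combination, which grows quickly with the number of features.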

Experimental design
This experiment was aimed at identifying and filtering unknown internet traffic applications. We used the ready-made dataset that was gathered with a high-performance network monitor. The monitor provides minimal-loss capture of the complete payload to disk, with timestamps at a resolution better than 35 nanoseconds. The data were taken from one site over several different periods of time. This site is a research facility hosting approximately 1,000 internet-connected users through a full-duplex Gigabit Ethernet link. For each traffic collection, the full-duplex traffic on this link was monitored. The site houses many biology-related buildings, collectively known as a Genome Campus. Three organizations on the site employ about 1,000 researchers, managers, and professional personnel. The monitor was placed on the campus internet connection. For each traffic set, traffic was tracked for a complete 24-hour weekday period and for all connections.
Appropriate input data are needed to analyze data with the neural network technique. To this end, we used the trace data that had already been identified and categorized. This classified data was further reduced into ten data sets covering periods of equal length, each with about 25,000-65,000 items (flows). In addition, each data set was used as a training set and tested against the remaining data sets to determine the efficiency of the neural network methodology, allowing the average classification accuracy to be estimated. In each round, the data were divided in three ways (Osman and Aljahdali, 2017): 70%, 60%, and 50% for training and 30%, 40%, and 50% for testing. Each training and testing run takes the following traffic categories into account.

Traffic categories
One of the fundamental matters in classification is the selection of the categories into which the flowing data are classified. In this research, we use the categories most popular among users (BULK, DATABASE, INTERACTIVE, MAIL, SERVICES, WWW, P2P, ATTACK, GAMES, MULTIMEDIA), examples of which are given in Table 1. These categories do not cover all traffic data; they are the popular categories, and therefore we ran experiments on them. Each category has unique characteristics and features (such as the source and destination ports), a certain amount of information, and its own behavior. Together, this information forms the important input values for classifying data traffic within the network. A probabilistic classification approach is used: from the flow characteristics, the classifier determines a probability for each class. For example, a flow may be classified with a probability of 0.9 that it is a game, 0.1 that it is bulk, and 0.2 that it is www. The flow is assigned to the class with the highest likelihood; in the example, the flow is classified as a game because that category has the highest probability. Table 1 shows the network traffic allocation to each category (Moore and Zuev, 2005).
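The decision rule in this example, assigning the flow to the class with the highest score, amounts to a single lookup. In the sketch below the scores are the illustrative values quoted in the text, and the function name classify_flow is hypothetical.

```python
def classify_flow(class_scores):
    """Return the class with the highest score together with that score."""
    best_class = max(class_scores, key=class_scores.get)
    return best_class, class_scores[best_class]

# Scores taken from the example in the text.
scores = {"GAMES": 0.9, "BULK": 0.1, "WWW": 0.2}
print(classify_flow(scores))
```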
Our object of classification is the flow, and for the research addressed here, we restricted our definition of a flow to a complete TCP flow, that is, all the packets between two hosts for a specific tuple. We limit ourselves to complete flows, those that validly start and finish. Examples of the discriminators used to describe each object are:
- Flow duration
- TCP and User Datagram Protocol (UDP) port
- Packet inter-arrival time (mean, variance)
- Payload size (mean, variance)
- Effective bandwidth based on entropy
- Fourier transform of the packet inter-arrival time
These examples illustrate how each object is characterized for the classification scheme. The classifier uses these parameters to assign an object to a class because they make it possible to discriminate between classes. Such object-describing parameters are called discriminators. In our research, 249 different discriminators (Moore and Zuev, 2005) were used to describe traffic flows, including statistics on flow length, TCP port data, statistics on payload size, and the Fourier transform of packet inter-arrival times.
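A few of the discriminators listed above can be computed from per-packet records as sketched below. The input format (a list of timestamp and payload-size pairs), the function name, and the use of the population variance are assumptions made for illustration; the actual discriminator set is the one defined by Moore and Zuev (2005).

```python
from statistics import mean, pvariance

def flow_discriminators(packets):
    """Compute simple per-flow statistics.

    packets: list of (timestamp_seconds, payload_bytes) in arrival order.
    """
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    # Inter-arrival time: gap between each packet and the one before it.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "duration": times[-1] - times[0],
        "interarrival_mean": mean(gaps),
        "interarrival_variance": pvariance(gaps),
        "payload_mean": mean(sizes),
        "payload_variance": pvariance(sizes),
    }

# Hypothetical four-packet flow.
flow = [(0.0, 100), (0.1, 200), (0.3, 300), (0.6, 400)]
for name, value in flow_discriminators(flow).items():
    print(name, value)
```

Each flow then becomes one row of numeric features, which is the form of input that the neural network and the feature selection step operate on.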

First experiment
The experiment described above was run before and after the combination with the neural network method in order to evaluate the improvement achieved by the proposed method. Before the enhancement, the filtering results obtained with the neural network alone were 98.54%, 98.55%, and 98.60% for the training data and 98.66%, 98.48%, and 98.58% for the testing data, respectively (Table 2). After the enhancement, the filtering results obtained with the combination of the neural network and the feature selection algorithm were 98.68%, 98.98%, and 98.93% for the training data and 99.10%, 99.11%, and 99.04% for the testing data.
The outcomes of the tests are as follows. Figs. 4, 5, and 6 show the filtering accuracies of the multilayer perceptron (MLP) neural network for data sizes of 50%, 60%, and 70%, respectively. They present the training and testing performance of the neural network alone, that is, before the feature selection technique was used (Table 2). With a data size of 50%, classification using the neural network achieved a filtering accuracy of 98.54% on the training data and 98.66% on the testing data. Overall, the classification accuracies obtained by the neural network were 98.54%, 98.55%, and 98.60% for training with data sizes of 50%, 60%, and 70%, and 98.66%, 98.48%, and 98.58% for testing with data sizes of 50%, 60%, and 70%, respectively.

Second experiment
The performance obtained after integrating the neural network with feature selection is described in Table 3, which covers the training and testing experiments. The values of the results show the accuracy of the filtering process. The results obtained with the neural network were improved by using the feature selection method: the neural network with feature selection achieved better accuracy than the neural network alone. The figures represent the test results of the neural network with the first collection of data after the feature selection algorithm was applied to keep the important features. For the testing data, classification using the neural network without feature selection achieved filtering accuracies, averaged over the 10 datasets, of 98.66%, 98.48%, and 98.58% with data sizes of 50%, 60%, and 70%, respectively (Table 4). Classification using the hybrid method (neural network with feature selection) achieved, for the testing data, filtering accuracies averaged over the 10 datasets of 99.10%, 99.11%, and 99.04% with data sizes of 50%, 60%, and 70%, respectively (Table 5). From this, we conclude that the hybrid method led to better results than the neural network method alone. For comparing approaches, most authors use the t-test to establish statistical significance. Here, a t-test was used to assess the statistical significance of the hybrid approach. The findings of Experiment 1, which used the MLP neural network, and of Experiment 2, which used the hybrid approach of neural network and feature selection, revealed improvements. A small t-test significance value, typically less than 0.05, indicates that the two variables differ. The outcome of the t-test is 0.0183, which satisfies this condition and indicates that the hybrid methodology (neural network with feature selection) obtained a significant improvement in accuracy.
This difference is considered statistically significant by conventional standards. Table 6 shows the t-test comparison between the neural network algorithm before and after feature selection. The proposed approach applied the feature selection algorithm to boost the neural network. In the classification process, only the most appropriate features, as selected by the feature selection method, were used. The findings on the experimental test dataset showed that the proposed method obtained better overall performance. The hypothesis was that the selection technique can improve the quality of classification. The analysis therefore focused on the results before and after the combination phase in order to assess the changes made by the proposed method.
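A before/after comparison of this kind can be sketched as a paired t-test on the three average testing accuracies reported above. This is only an illustration: the text does not state which samples or which t-test variant produced the reported value of 0.0183, so the result of this sketch need not match it. The function names are hypothetical, and the closed-form p-value used here is valid only for 2 degrees of freedom (three pairs).

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    """Paired t statistic and degrees of freedom for matched samples."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

def two_sided_p_df2(t):
    """Exact two-sided p-value of Student's t with 2 degrees of freedom."""
    return 1.0 - abs(t) / math.sqrt(t * t + 2.0)

# Average testing accuracies (%) for data sizes of 50%, 60%, and 70%.
nn_only = [98.66, 98.48, 98.58]
hybrid = [99.10, 99.11, 99.04]

t, df = paired_t(nn_only, hybrid)
print("t =", round(t, 3), "df =", df, "p =", round(two_sided_p_df2(t), 4))
```

With only three pairs the test has very little power, so a comparison over the ten individual datasets would be the more informative design if the per-dataset accuracies are available.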
The comparison between the proposed method and the state of the art is illustrated in Table 7. The proposed method achieved better results in terms of classification accuracy.
A shortcoming of the proposed method is that the prediction model can classify applications only offline rather than online. The classification time needs to be improved if the model is to be upgraded to work online.

Conclusion
This research addressed the problem of identifying internet traffic. The study proposed that a hybrid of the multiple neural network technique and the feature selection method can be used to predict and filter network data traffic in order to classify unknown applications through the physical network. An experiment was conducted using multiple neural network algorithms to determine the reliability of internet traffic classification for future enhancement. It was shown that multiple neural network models can be used to predict and filter network data traffic with high accuracy.
The data were collected and divided into 10 groups, and each group was split by percentage for the experiment as follows: 50%, 60%, and 70% for training and, correspondingly, 50%, 40%, and 30% of the same data for testing. With 50% of the data used for testing, the results showed a clear improvement in classification and verification, with an accuracy of 99.10% and an error rate not exceeding 0.9%. The accuracy of classifying unknown internet traffic applications was thus improved by using multiple neural network algorithms with the feature selection method.

Conflict of interest
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.