A novel approach to detecting code clones in Java using DFS

Code is the basic element of any software. Code clones are segments of a program that are similar to one another, either syntactically or semantically. Cloning is easy to do but hard to detect. Much research has been carried out on methods for detecting clones, because clones cause problems during maintenance and thereby increase its cost. The objective of our work is to detect code clones precisely. We propose an approach based on the Abstract Syntax Tree (AST) method. We adopt the AST because it gives better detection results than other techniques and is considered to be the best approach for detecting Type-3 code clones. Furthermore, the AST offers syntactic knowledge that can be leveraged to filter certain types of clones. The results obtained show that the adopted technique detects near-miss clones more precisely than the tools NICAD and CLAN.


Introduction
Code cloning is the act of copying a segment of code and pasting it in another place. At first glance it seems an attractive idea, since the programmer does not need to write the same code again when two code segments must behave similarly, but the copy-paste strategy is a short-term win.
Copying code from one position and pasting it in another has various pitfalls that surface during maintenance and testing of the software. Any defect in the original code is propagated to the cloned segment as well. Conversely, if a programmer makes a slight modification to the code and the same change is not made in the cloned part, inconsistencies may result. In large software systems it becomes really difficult to discover where code has been reused, and searching the entire program is time consuming and practically infeasible. Clones have a bad impact on the design and on the improvement and modification of the system, since quite commonly the person who developed the original system is not the one who maintains it. In the long run, the software may become so complex that even minor changes are hard to make. Clone detection came into existence to solve this problem.
With the help of clone detection techniques, we can find where clones exist and remove them beforehand so that they do not create problems in the future.
Studies reveal that roughly 5-10% of the source of large computer programs is duplicated code (Baxter et al., 1998).

Types of code clones
There are various levels of clones, as identified by Bellon et al. (2007). They are:
TYPE-1: code segments that are exactly the same as one another, without any difference in the source code, are placed under Type-1 clones. They may also be termed syntactically similar code.
TYPE-2: code segments that are the same as each other except for changes in white space, variable names, data types, arguments etc. are placed under Type-2 clones. They are also syntactically similar code.
TYPE-3: code segments with further modifications, where some lines may be added to one segment or be missing from the other while both perform the same function, are placed under Type-3 clones.
TYPE-4: code segments that are semantically or behaviorally similar. They have nothing in common in the source code, but the functions they perform are exactly the same.
An example of each kind of clone is given in Table 1.

Root causes for code clones
A study by Kontogiannis et al. (1996) reveals why programmers copy and paste code. They identified the following reasons by observing programmers in their daily practice:
- Sometimes cloning is due to the short time limits given to the programmers by the client for the development of the software.
- Systems are modularized based on principles such as minimizing coupling, information hiding and maximizing cohesion. In the end, at least for systems written in ordinary programming languages, the system is composed of a fixed set of modules (Koschke, 2007). Ideally, if the system needs to be updated, only a few modifications will be required.
- Programmers often reuse the copied text as a template and then customize the template in the pasted context (Koschke, 2007). Other potential reasons such as time pressure, educational deficiencies, development process, and short-sightedness must also be investigated.
- Fear of fresh code.
- Complexity of the system.

Clone detection methods
There are various methods for detecting clones, including the following:

Text based
Text-based methods are language independent and provide an easy way to detect clones across various programming languages. Their major shortcoming is that they can detect only Type-1 clones, along with some Type-2 clones with minor changes such as a different formatting style.
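To make the limitation concrete, the following is a minimal sketch of a text-based comparison, assuming that two fragments count as clones when their lines match after whitespace is normalized. The function names `normalize_lines` and `text_clone` are hypothetical and are not taken from any tool discussed in this paper.

```python
def normalize_lines(source: str) -> list[str]:
    """Collapse runs of whitespace and drop blank lines (illustrative assumption)."""
    return [" ".join(line.split()) for line in source.splitlines() if line.strip()]


def text_clone(a: str, b: str) -> bool:
    """Two fragments are text-level clones if their normalized lines are identical."""
    return normalize_lines(a) == normalize_lines(b)


original = "int sum = 0;\n  for (int i = 0; i < n; i++)   sum += i;\n"
reformatted = "int sum = 0;\n\nfor (int i = 0; i < n; i++) sum += i;"
renamed = "int total = 0;\nfor (int k = 0; k < n; k++) total += k;"

print(text_clone(original, reformatted))  # formatting-only change
print(text_clone(original, renamed))      # identifiers renamed
```

Under these assumptions the reformatted fragment is matched while the renamed one is not, which is why a purely textual comparison misses most Type-2 clones.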

Token based
In this technique, the code is first transformed into a sequence of tokens. Subsequences of these tokens are then compared to find the clones. The major advantage of the token-based technique is that it is fast, with higher recall values.
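A token-based comparison can be sketched as follows. This is only an illustration: the regular expression, the small keyword set and the placeholder tokens `ID` and `NUM` are assumptions made here, not the lexer of any real clone detector, and whole fragments are compared rather than subsequences.

```python
import re

# Deliberately tiny keyword set for the example; a real Java lexer knows many more.
KEYWORDS = {"int", "long", "for", "while", "if", "else", "return"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(source: str) -> list[str]:
    """Turn source text into tokens, abstracting identifiers and numbers."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok in KEYWORDS:
            tokens.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("ID")    # every identifier becomes the same token
        elif tok[0].isdigit():
            tokens.append("NUM")   # every integer literal becomes the same token
        else:
            tokens.append(tok)     # operators and punctuation, one character each
    return tokens


a = "int total = 0; total += x;"
b = "int   count = 0;\ncount += y;"
c = "int total = 0; total -= x;"

print(tokenize(a) == tokenize(b))  # renamed identifiers, same token sequence
print(tokenize(a) == tokenize(c))  # different operator, different sequence
```

Because identifiers are abstracted before comparison, a renamed fragment yields the same token sequence as the original, which a text-based comparison would not report.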

Syntax tree based
Here, a parser is used to build parse trees or abstract syntax trees from the source code. The trees thus obtained can be processed further using tree matching to find the clones. Roy et al. (2009) explained that the abstract syntax tree or parse tree contains the complete information about the source code. To find clones with the syntax tree approach, subtrees are compared, and those that turn out to be similar are considered clones. The code corresponding to these subtrees is returned as clone pairs.

Graph based
A program dependency graph (PDG) represents control and data flow dependencies of a function of source code (Rattan et al., 2013). In other words, it considers the semantic information encoded in the dependency graph. Clones may be identified as isomorphic sub-graphs in a program dependency graph (Krinke, 2001).

Metrics based
In the metrics-based approach, a number of metrics are computed for each code segment, such as the number of lines, input statements, output statements, return statements, function calls etc. The metric values are then compared instead of the source code itself. Two segments whose metric values turn out to be similar to each other are considered a clone pair.
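The idea can be sketched as below. The four metrics chosen, the regular expressions used to count them and the `tolerance` parameter are all assumptions made for illustration; published metrics-based tools use richer metric sets.

```python
import re

# Control keywords that are followed by "(" but are not function calls.
CONTROL = {"if", "for", "while", "switch", "catch"}


def metrics(segment: str) -> tuple[int, int, int, int]:
    """Return (non-blank lines, return statements, function calls, loops)."""
    lines = sum(1 for ln in segment.splitlines() if ln.strip())
    returns = len(re.findall(r"\breturn\b", segment))
    names = re.findall(r"\b([A-Za-z_]\w*)\s*\(", segment)
    calls = sum(1 for name in names if name not in CONTROL)
    loops = len(re.findall(r"\b(?:for|while)\b", segment))
    return lines, returns, calls, loops


def similar(a: str, b: str, tolerance: int = 0) -> bool:
    """Segments are candidate clones if every metric differs by at most `tolerance`."""
    return all(abs(x - y) <= tolerance for x, y in zip(metrics(a), metrics(b)))


seg_a = "int s = 0;\nfor (int i = 0; i < n; i++) {\n  s += f(i);\n}\nreturn s;"
seg_b = "int t = 0;\nfor (int k = 0; k < m; k++) {\n  t += g(k);\n}\nreturn t;"
seg_c = "return 0;"

print(metrics(seg_a))
print(similar(seg_a, seg_b), similar(seg_a, seg_c))
```

Since only the metric vectors are compared, unrelated segments that happen to share the same counts would also be reported, so a sketch like this is best read as a candidate filter.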

Proposed approach
Having weighed the advantages and disadvantages of the techniques developed so far, we use an Abstract Syntax Tree based approach to detect code clones. Our approach finds syntactic clones in linear time and space.
Here we use the Depth First Search (DFS) algorithm, an algorithm for searching a tree: one starts at the root and explores as far as possible along each branch before backtracking. The approach adopted is as follows:
1. First, the code is passed to the ANTLR parser. ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL(*) for parsing (https://en.wikipedia.org/wiki/ANTLR). ANTLR can generate lexers, parsers, tree parsers and combined lexer-parsers (https://en.wikipedia.org/wiki/ANTLR). The purpose of this step is to obtain the syntax tree representation of the code. An example of the AST formed for a particular piece of code is shown in Fig. 1.
2. DFS (Depth First Search) is applied to both trees in parallel.
3. Each node of the tree is converted into a template. The procedure for template conversion is as follows:
   - Template conversion is the procedure of converting the source code into a new form, a uniform intermediate representation of the source code.
   - Type-1 clones are exactly the same as each other, so there is no need to convert them into templates.
   - For Type-2 clones, the cloned methods may differ in the names of variables, identifiers, data types, white space etc. To convert them into templates, we replace all identifier names with a common name 'X' and all data types with a common data type 'DATA'.
   - For Type-3 and Type-4 clones, various constructs such as branches and iterations can also be changed. Therefore we need a general method for converting them into a common form. The method for the conversion is given in Table 2.
4. For each node (converted into a template), check whether the node has children in the tree. If it does, store them in prefix order in an array. This procedure is applied to both trees, whose nodes are now in the form of templates.
5. Compare the elements of the two arrays. If similar elements exist, store them in a separate list.
6. For all the elements/nodes in the list, apply the Levenshtein (1966) distance algorithm to find the distance between the nodes. It is applied to two nodes at a time, comparing them element by element.
7. If the two nodes turn out to be exactly the same, their cost is set to 0; otherwise it is set to 1.
8. All pairs of nodes in the tree whose Levenshtein (1966) distance/cost is 0 are stored in an array and marked as clone pairs.
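The steps above can be illustrated with a small sketch. It is one possible reading of the procedure and not the authors' tool: the `Node` class stands in for an ANTLR parse tree, the node kinds are hypothetical, only the Type-2 template rules ('X' and 'DATA') are applied, and the Levenshtein distance is computed over whole prefix-order sequences.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                      # hypothetical kinds: "type", "identifier", ...
    text: str
    children: list["Node"] = field(default_factory=list)


def template(node: Node) -> str:
    """Type-2 template rule: identifiers become 'X', data types become 'DATA'."""
    if node.kind == "identifier":
        return "X"
    if node.kind == "type":
        return "DATA"
    return node.text


def preorder(root: Node) -> list[str]:
    """Iterative depth-first traversal that emits templates in prefix order."""
    out, stack = [], [root]
    while stack:
        cur = stack.pop()
        out.append(template(cur))
        stack.extend(reversed(cur.children))  # leftmost child is visited first
    return out


def levenshtein(a: list[str], b: list[str]) -> int:
    """Edit distance with unit costs (0 for equal elements, 1 otherwise)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def var_decl(type_name: str, target: str, left: str, op: str, right: str) -> Node:
    """Hand-built tree for a declaration such as `int a = b + 1;`."""
    expr = Node("operator", op, [Node("identifier", left), Node("literal", right)])
    return Node("statement", "VarDecl",
                [Node("type", type_name), Node("identifier", target), expr])


t1 = var_decl("int", "a", "b", "+", "1")   # int a = b + 1;
t2 = var_decl("long", "x", "y", "+", "1")  # long x = y + 1;
t3 = var_decl("int", "a", "b", "*", "2")   # int a = b * 2;

print(preorder(t1))
print(levenshtein(preorder(t1), preorder(t2)))  # distance 0: marked as a clone pair
print(levenshtein(preorder(t1), preorder(t3)))
```

In this sketch the first two declarations differ only in names and data type, so their template sequences coincide and the distance is 0, while the third differs in operator and literal and is not marked as a clone.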

Results
The proposed approach has been tested on various available open source software projects. The implementation uses a self-created tool that takes Java project files as input. The tool precisely finds Type-1, Type-2 and Type-3 code clones. The project sources used are shown in Table 3. The source code of these projects is fed into our system and the clones in it are detected. The results obtained are given in Table 4, and the results in the form of clone pairs are given in Table 5.

Comparison with existing tools
The tool developed using the proposed approach is compared with two existing tools, NICAD and CLAN. All three are applied to the projects. The results obtained are shown in Figs. 2 and 3.

Conclusion and future work
In this paper, we have proposed an approach to detect Type-1, Type-2 and Type-3 code clones. The proposed approach quickly detects Type-2 and Type-3 clones, which not all existing approaches detect, and those that do detect them do so less precisely than the proposed approach.
In this approach we are able to feed only a single source code file at a time. In future work, the detection may be applied at the directory level, to a directory containing multiple files, detecting the clone pairs among them.