Implementation of early and late fusion methods for content-based image retrieval

Content-Based Image Retrieval (CBIR) systems retrieve images from an image repository or database that are visually similar to a query image. CBIR plays an important role in various fields such as medical diagnosis, crime prevention, web-based searching, and architecture. CBIR consists mainly of two stages: feature extraction and similarity matching. There are several ways to improve the efficiency and performance of CBIR, such as segmentation, relevance feedback, query expansion, and fusion-based methods. The literature has suggested several methods for combining and fusing various image descriptors. Fusion strategies are typically divided into two groups, namely early and late fusion. Early fusion is the combination of image features from more than one descriptor into a single vector before the similarity computation, while late fusion refers either to the combination of outputs produced by various retrieval systems or to the combination of different similarity rankings. In this study, a group of color and texture features is proposed for use with both fusion strategies. First, eighteen color features and twelve texture features are combined into a single vector representation (early fusion); second, the outputs of three of the most common distance measures are combined in the late fusion stage. Our experimental results on two common image datasets show that the proposed method achieves better retrieval performance than the traditional approach of using a single feature descriptor, and acceptable retrieval performance compared to some state-of-the-art methods. The overall accuracy of our proposed method is 60.6% and 39.07% for the Corel-1K and GHIM-10K datasets, respectively.


Introduction
There are two possible solutions for searching large image databases. The first, older and insufficient solution is text-based image retrieval (TBIR) (Sumathi et al., 2011), in which images are manually annotated with suitable names; this metadata is attached to the images, and the end user later applies keywords to search for and obtain the necessary images. This approach has two major disadvantages: a significant amount of time is required for the manual annotation process, and a considerable burden is placed on end users to form their own queries. An alternative and effective method is content-based image retrieval (CBIR), which addresses the major problems of the previous method. CBIR has become the main method and an area of much research in the last two decades due to the rapid growth of multimedia data from modern sources such as the Internet, smart devices, Internet of Things (IoT) devices, social networks, and medical image sources (Seetharaman and Kamarasan, 2014;Alsmadi, 2020). Most of the earlier CBIR studies were based on individual feature descriptors extracted from color, texture, or shape content (Mistry, 2020); however, single feature spaces are not always sufficient to provide the best retrieval results. Therefore, most recent research has combined more than one of these descriptors. An early survey of the numerous and diverse application areas of CBIR was reported by Veltkamp and Tanase (2002). Readers may refer to Latif et al. (2019) for a more recent comprehensive review of CBIR.
In this paper, we propose a retrieval method that exploits color and texture information, implements and compares the two common types of fusion, and then suggests the best approach for retrieving similar images. The main contributions of our work are as follows: 1. A fully automatic CBIR method is proposed for retrieving similar images based on the fusion of color and texture characteristics. 2. The proposed approach achieves good retrieval results when representative extracted features of both color and texture are combined into a single vector. The more accurate the features used by the system, the higher the retrieval performance obtained. 3. The method also supports merging the results of more than one similarity measure. The user can use any group of similarity measures whose fused value gives better precision than the individual similarity values.
The rest of the paper is organized as follows. Related work is reviewed in Section 2. In Section 3, the proposed method is described in detail, including the methods for extracting features from the color and texture information of images. In Section 4, the image datasets and experimental setup are discussed along with implementation details and an evaluation of the results. Section 5 concludes the paper.

Related works
Over the years, numerous approaches have been suggested in the literature to combine and fuse various image descriptors (Bhowmik et al., 2014). Fusion approaches are usually categorized into two classes: early and late fusion (Snoek et al., 2005;Atrey et al., 2010). The different fusion strategies that have been developed can be classified into five groups: traditional early fusion, feature reweighting for early fusion, representation by multi-feature spaces for late fusion, relevance feedback approaches, and multimodal retrieval. In the following, we give a short definition of each group and some examples of recent studies that applied these ideas. In traditional early fusion, multiple image representations (such as color, texture, and shape) are combined into a single feature vector. This approach is straightforward and normally assigns an equal weight to each of the feature spaces. Several studies have followed this method, such as Singh and Hemachandran (2012) and Alsmadi (2020). Feature weighting for early fusion is suitable when the retrieval process is treated as a classification task in which pools of images are assigned to a set of labels. In this case, images with the same labels are similar to each other; a technique such as principal component analysis (PCA) could be used for the extraction of a new feature space rather than capturing the similarities (Piras and Giacinto, 2017). Several studies that address the different weight estimation approaches and their related problems can be found in Rui et al. (1997), and reweighting schemes for similar retrieval problems can be found in Abdelrahim (2013) and Ahmed et al. (2013;2014;.
Over the past years, multi-feature spaces for late fusion have been enhanced; authors in the pattern recognition area have suggested a number of solutions for fusing information based on combinations of the outputs of different classifiers that use different feature spaces, as shown in Kuncheva (2014). The most popular and effective techniques for merging similar outputs are late fusion rules such as the mean, maximum and minimum rules (Alhassan and Alfaki, 2017;Ibrahim et al., 2018;Ahmed and Malebary, 2019). Relevance feedback has an obvious role in enhancing the retrieval task, where an image query is reformulated according to the top-ranked images retrieved (Karamti et al., 2018;Ahmed, 2020). Similarly, query expansion has significantly enhanced retrieval performance (Houle et al., 2017;Ahmed and Malebary, 2020). The main idea of multimodal retrieval is to combine keywords with low-level features in order to use a combined input space. This suggestion was followed by some works such as that described by Zhou and Huang (2002), in which the authors use the word association via relevance feedback (WARF) method to learn the keyword similarity matrix during user interaction (Sclaroff et al., 1999); several other researchers have addressed this problem from different points of view (Barnard and Forsyth, 2001).

Proposed method
In this paper, normal early fusion with no weighting scheme and a multi-feature space with multiple similarity measures for late fusion are proposed. As we will discuss in detail, these two fusion methods are simple approaches that enhance the retrieval performance. The overall CBIR framework is shown in Fig. 1.

Feature extraction and early fusion
As shown in the proposed framework, a group of color and texture feature extraction functions is first applied to extract the most important representative image features. After extraction, these features are combined into a single vector; this process is performed for the query as well as for every image in the database. For the color feature descriptor, a total of 18 features are extracted using common color moment functions.
First, each image in the database, as well as each query image, is converted from the RGB color space to the HSV color space (Erkut et al., 2019); then each of the "H", "S" and "V" channels is used as an input to the color moment functions, generating six values per channel.
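As a rough illustration of the per-channel statistics, the sketch below converts RGB pixels to HSV with Python's standard `colorsys` module and computes four standard moments (mean, variance, skewness, and kurtosis) for each channel. The population-moment definitions and the function names `channel_moments` and `hsv_color_features` are assumptions made for this example; smoothness and contrast are left out because their exact definitions are not given in the text, so the sketch yields 12 values rather than the 18 used in this work.

```python
import colorsys
import math


def channel_moments(values):
    """Return (mean, variance, skewness, kurtosis) of a flat list of values."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4 (population form, divided by n).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        # A constant channel has no spread; report zero shape statistics.
        return mean, 0.0, 0.0, 0.0
    std = math.sqrt(m2)
    return mean, m2, m3 / std ** 3, m4 / std ** 4


def hsv_color_features(rgb_pixels):
    """rgb_pixels: iterable of (r, g, b) tuples with components in 0..255."""
    hsv = [colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
           for r, g, b in rgb_pixels]
    features = []
    for channel in range(3):  # 0 = H, 1 = S, 2 = V
        features.extend(channel_moments([p[channel] for p in hsv]))
    return features
```

Calling `hsv_color_features` on the flattened pixel list of an image returns the moments in the order H, S, V; a full implementation would append the two remaining functions per channel.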

Fig. 1: Workflow of our proposed CBIR method
All color features for this descriptor are combined into a single vector. The group of six moment functions used here is: mean, variance, skewness, kurtosis, smoothness, and contrast. Equations (1) to (6) give the formula for each of these functions; they were proposed by Maheshwary and Srivastava (2009) and were successfully used in recent studies (Zenggang et al., 2019). Here, v_ij denotes the pixel value at the i-th row and j-th column, and the image has dimensions (M, N) pixels.
Similarly, a total of 12 texture features were extracted using the grey-level co-occurrence matrix (GLCM); this method was introduced by Weszka et al. (1976). As with the color features, each RGB image is first converted to the HSV color space and every channel is used as an input to the GLCM method; the texture features are then computed from the resulting co-occurrence matrix. In these functions, p(i,j) = P(i,j)/R is the (i,j)-th entry of the normalized matrix, where P(i,j) is the corresponding unnormalized entry and R is the normalizing constant, Ng is the number of distinct grey levels, and µ and σ are the mean and standard deviation of the image intensity, respectively.
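To make the normalization p(i,j) = P(i,j)/R concrete, the following sketch builds a co-occurrence matrix for one pixel offset and derives two commonly used GLCM statistics from it. It is only an illustration under stated assumptions: the quantization of a channel into integer grey levels, the single horizontal offset, and the choice of contrast and energy as example statistics are not taken from this work, and the helper names are hypothetical.

```python
def quantize(channel, levels):
    """Map float values in [0, 1] to integer grey levels 0..levels-1."""
    return [[min(int(v * levels), levels - 1) for v in row] for row in channel]


def glcm(image, levels, dx=1, dy=0):
    """Normalized co-occurrence matrix for the pixel offset (dx, dy).

    image: 2-D list of integers in 0..levels-1.
    Returns p with p[i][j] = P[i][j] / R, where R is the number of counted pairs.
    """
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    total = sum(map(sum, counts))
    if total == 0:
        raise ValueError("offset leaves no pixel pairs inside the image")
    return [[v / total for v in row] for row in counts]


def glcm_contrast(p):
    """Sum of (i - j)^2 * p(i, j): large when neighbouring levels differ."""
    size = len(p)
    return sum((i - j) ** 2 * p[i][j] for i in range(size) for j in range(size))


def glcm_energy(p):
    """Sum of p(i, j)^2: large when few level pairs dominate."""
    return sum(v * v for row in p for v in row)
```

An H, S, or V channel with values in [0, 1] would first be passed through `quantize` so that the number of grey levels Ng stays small and the matrix remains Ng × Ng.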

Similarity measures and late fusion
After constructing the feature vectors from the color and texture descriptors, the next stage is to perform late fusion based on three common similarity measures: the Bray-Curtis, city block, and Euclidean distances. These distance metrics have shown good performance (Shirkhorshidi et al., 2015). Each measure is computed between two feature vectors X and Y, where n is the number of dimensions of the X and Y vectors. For the late fusion strategy, the minimum distance (maximum similarity) rule is applied, and the combined values are then used for ranking, as discussed in the following section.
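The three distances and the minimum rule can be sketched as follows. The distance formulas are the standard textbook definitions (the Bray-Curtis form follows the one used by SciPy) and are assumed to match those intended here. The min-max rescaling of each measure over the database is an additional assumption, introduced because the three distances live on different scales; it is not described in the text.

```python
import math


def braycurtis(x, y):
    """sum(|x_i - y_i|) / sum(|x_i + y_i|) over the n dimensions."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(abs(a + b) for a, b in zip(x, y))
    return num / den if den else 0.0


def cityblock(x, y):
    """sum(|x_i - y_i|) over the n dimensions."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """sqrt(sum((x_i - y_i)^2)) over the n dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def min_max(values):
    """Rescale a list of distances to [0, 1] (assumed pre-processing step)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def late_fusion_min(query, database):
    """Fused distance from the query to each database vector (minimum rule)."""
    per_measure = [
        min_max([dist(query, item) for item in database])
        for dist in (braycurtis, cityblock, euclidean)
    ]
    # For every database item, keep the smallest of its three rescaled distances.
    return [min(scores) for scores in zip(*per_measure)]
```

Sorting the database indices by the fused value in ascending order gives the final ranking. Without some rescaling, the Bray-Curtis value, which lies in [0, 1] for non-negative features, would tend to win the minimum regardless of the other two measures.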

Image datasets
In this study, the Corel-1K (Li and Wang, 2008) and GHIM-10K (Liu et al., 2015) datasets were used. These two image datasets were recently used in Varish et al. (2020). The images in the first dataset are divided into ten categories: Africans, Beaches, Buildings, Buses, Dinosaurs, Elephants, Flowers, Horses, Mountains, and Food. Each category consists of one hundred images with a resolution of 256×384 or 384×256 pixels. The second dataset, GHIM-10K, is ten times larger than Corel-1K and is considered more challenging and diverse, as it contains various objects. It consists of 10,000 images divided into 20 categories, each with 500 images at a resolution of 300×400 or 400×300 pixels; the semantic names of the categories include sunsets, bikes, forts, ships, flies, cars, etc. Figs. 2 and 3 show sample images from each dataset, with a single image taken from each class.

Performance evaluation
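For evaluation, recall and precision were used; these are the most common performance measures for retrieval methods. Both measures are calculated over the top ten retrieved images: precision is the number of relevant images retrieved divided by the total number of images retrieved, and recall is the number of relevant images retrieved divided by the total number of relevant images in the database.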
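A minimal sketch of these two measures at a cut-off of ten is given below. It assumes that an image counts as relevant when its class label equals that of the query, which is the usual convention for Corel-1K and GHIM-10K; the function name and arguments are hypothetical.

```python
def precision_recall_at_k(retrieved_labels, query_label, relevant_total, k=10):
    """Precision and recall over the top-k retrieved images.

    retrieved_labels: class labels of the ranked results, best match first.
    relevant_total:   number of images of the query's class in the database.
    """
    top_k = retrieved_labels[:k]
    hits = sum(1 for label in top_k if label == query_label)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / relevant_total if relevant_total else 0.0
    return precision, recall
```

With 100 relevant images per class, as in Corel-1K, recall at ten retrieved images cannot exceed 0.1, so the recall values are best read relative to that ceiling.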

Results and discussion
In this study, three different experiments were conducted: (1) retrieval based on the individual color and texture feature descriptors only, (2) early fusion of both feature descriptors, and (3) late fusion using the combined feature vectors and the three similarity measures described in our methodology. For the three experiments, the same ten random image queries selected from each class of both datasets were used, and recall and precision were calculated over the top ten retrieved images, as in many related studies. The following subsections give more detail about our experimental results. The first experiment used the eighteen color features and the twelve texture features separately on both datasets, with each of the three individual distance measures. Average recall and precision for each class were computed over the top ten retrieved images, as shown in Tables 1 and 2 for the two datasets; the results in these two tables are the baseline that our proposed methods aim to improve, as shown in the next scenarios. From the two tables, we can see that color features give better retrieval performance than texture features, and that the Euclidean distance performs best of the three distance measures.
Early fusion is implemented in the second experiment, in which the color and texture feature vectors are concatenated horizontally, resulting in a single feature vector with thirty features (eighteen color features and twelve texture features). The same random image queries were used for each class and the average recall and precision were calculated. In the late fusion scenario, the combined vector of thirty features was used again and the similarity values of the three similarity measures were grouped together; then, the minimum of these three values was taken. Finally, the average recall and precision were computed from the fused similarity values; the results of the early and late fusion experiments are shown in Tables 3 and 4 for the two datasets, respectively. Figs. 4 and 5 show the combined results of the three retrieval methods for the two datasets, which makes their recall and precision values easy to compare.
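The early fusion step described above amounts to a plain concatenation followed by ranking. The sketch below illustrates it with equal weights and no rescaling of the two descriptors; the Euclidean distance is used for ranking only as an example, and the function names are hypothetical.

```python
import math


def early_fusion(color_features, texture_features):
    """Concatenate the two descriptors into one vector (equal weights)."""
    return list(color_features) + list(texture_features)


def rank_database(query_vec, database_vecs):
    """Database indices ordered from most to least similar to the query."""
    def dist(v):
        # Euclidean distance between the query and one database vector.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(query_vec, v)))
    return sorted(range(len(database_vecs)), key=lambda i: dist(database_vecs[i]))
```

For eighteen color and twelve texture features, `early_fusion` returns a vector of length thirty, and the first ten indices returned by `rank_database` are the images on which precision and recall are evaluated.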
Figs. 6 and 7 show all of the recall and precision values for the three retrieval methods for the two datasets, respectively. The results in Tables 3 and 4 and Figs. 6 and 7 show that both early and late fusion enhanced the retrieval performance in terms of recall and precision. The benefit of fusion is clearly visible in Figs. 6 and 7, where the retrieval performance of the fusion approach (the red curve) is better than that of the traditional methods that apply either the color or the texture feature descriptor individually. Further analysis was carried out using the Kendall W concordance test (Siegel and Castellan Jr., 1988). This test can be used to evaluate the effectiveness of different retrieval approaches; it measures the level of agreement between multiple sets of rankings of the same set of objects. In the current study, the image classes were considered judges and the precision rates of the three different methods were considered objects. The input for this test is the precision rates of the three different retrieval methods; it has two outputs: the Kendall coefficient and the associated level of significance. The first output (W) ranges from 0 (no agreement between the sets of ranks) to 1 (complete agreement), while the second output (P) indicates whether this coefficient value could have occurred by chance. Here, since all the retrieval performances were taken for the top 10 images, the results are considered significant if the value of P is less than or equal to 0.01 for the Corel-1K dataset and less than or equal to 0.001 for the GHIM-10K dataset.
The results of this test are shown in Table 5, where the late fusion method has an agreement (confidence) of 61.1% for the Corel-1K dataset and 52.4% for the GHIM-10K dataset; the ranking of the different methods in the corresponding rows of Table 5 shows the good retrieval performance of the proposed methods. Finally, Figs. 8 and 9 show visual examples of queries and their top retrieved images; all ten of the top ten retrieved images were relevant for the Corel-1K query, while eight of the top ten were relevant for the GHIM-10K query.
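For readers unfamiliar with the statistic, the sketch below computes Kendall's W from a table of precision rates, with one row per image class (judge) and one column per retrieval method (object). It uses the standard formula W = 12S / (m²(n³ − n)) and the chi-square approximation m(n − 1)W with n − 1 degrees of freedom. Tied scores within a row are assumed not to occur, the P value itself is not computed, and the function names are hypothetical.

```python
def rank_row(scores):
    """Rank the scores in one row, 1 = lowest; assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def kendall_w(score_table):
    """score_table[j][i] = precision of method i on class (judge) j.

    Returns (W, chi_square), where chi_square = m * (n - 1) * W.
    """
    m = len(score_table)      # number of judges (image classes)
    n = len(score_table[0])   # number of objects (retrieval methods)
    rank_sums = [0] * n
    for row in score_table:
        for i, r in enumerate(rank_row(row)):
            rank_sums[i] += r
    mean_rank_sum = m * (n + 1) / 2
    # S: squared deviations of the rank sums from their mean.
    s = sum((rs - mean_rank_sum) ** 2 for rs in rank_sums)
    w = 12 * s / (m ** 2 * (n ** 3 - n))
    return w, m * (n - 1) * w
```

When every class ranks the three methods in the same order, the function returns W = 1; when the rank sums of the methods are all equal, it returns W = 0.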

Conclusion
This study proposed a retrieval method based on early and late fusion approaches. The proposed method is a simple process that enhances image retrieval performance by combining feature descriptors from the color and texture spaces and by combining the similarity values of three well-known distance coefficients. The proposed method has two main advantages: a fully automatic retrieval process and simplicity in both the early and late merging steps. It also achieves an acceptable level of enhancement. Further studies should use more accurate and representative features and could include other descriptors, such as shape features.