Data fusion in data federation using modified discriminative Markov logic networks

The quality of integrated data is crucial for the data mining process. Existing approaches resolve data conflicts using the "trust your friends" and "cry with the wolves" principles, which take the value of a preferred source or the most frequent value, respectively. However, choosing the most trustworthy data source is a challenge for data integration, and trusting only certain sources is arbitrary. To mitigate these issues, the Data Fusion in Data Federation using Modified Discriminative Markov Logic Networks (DF-MDMLN) approach is proposed. Data fusion resolves the conflicts among data from heterogeneous databases by utilizing multi-angle features and the knowledge of a discriminative Markov Logic Network (MLN), and thereby improves the precision and recall of the end users' data set. An e-shopping application for computer peripherals is considered for experimentation to analyze the performance of the DF-MDMLN approach. Experiments on the e-shopping data sets show its effectiveness: the precision and recall of data fusion are improved by 40% and 27% respectively.


Introduction
The need for accessing multiple, heterogeneous and distributed data sources is increasing for decision-making applications that require a comprehensive analysis and exploration of data. Data integration techniques address this requirement: data integration combines data that reside in different sources and provides a unified point of access for the end users (Lenzerini, 2002). Three types of data integration methods are: 1) data consolidation, 2) data propagation and 3) data federation (Hema and Chandramathi, 2011). Data federation is gaining recognition over the other two approaches because it neither duplicates nor consolidates the data. It is a three-step process consisting of schema mapping/matching, duplicate detection and data fusion. Data fusion, the ultimate step of the data integration process (Dong and Naumann, 2009), refers to fusing records of the same real-world entity into a single record by resolving possible data conflicts among the data sources and by detecting and removing dirty data (Singla and Domingos, 2006; Bleiholder and Naumann, 2008).
Data fusion uses mapping rules and heuristics to remove data conflicts. In conflict resolution, it is difficult to identify the correct data objects or to decide which data value is correct. Existing methodologies have used relational algebra operations to resolve data conflicts (Bleiholder and Naumann, 2009).
The contribution of this paper is the DF-MDMLN approach for data fusion in data federation using a modified discriminative MLN. The MLN is a simple approach to combining first-order logic and probabilistic graphical models in a single representation (Song et al., 2011). The DF-MDMLN approach provides a high-quality data set for the queries posted by end users to the federation system.
Besides this introduction, the article is organized as follows. Related research efforts are reviewed in Section 2. The DF-MDMLN approach is described in Section 3. Section 4 presents the results and discussion, and in Section 5 the conclusion is drawn and some future directions are pointed out.

Related works
Detailed surveys on data federation and ontology-based data federation are found in Sheth and Larson (1990), Hull and King (1987), Hema and Chandramathi (2012) and Scannapieco et al. (2004). An overall architecture was created to improve data quality in cooperative information systems: the quality of data was improved through query processing, and the improvement was communicated to interested data sources (Jarke et al., 1999). An explicit enterprise model was proposed to enrich data warehouse metadata and thereby improve the quality of the data warehouse. The precision-recall curve method was proposed for duplicate detection. An adaptive system was proposed for deduplication, in which deduplication accuracy depends on the similarity between the training set and the test data; static active learning and weakly labeled non-duplicates were used for training data (Singla and Domingos, 2005). An algorithm for discriminative learning of MLN parameters, combining the voted perceptron with a weighted satisfiability solver, was proposed by Bhattacharya and Getoor (2004). An iterative deduplication algorithm to detect and remove duplicate entities from heterogeneous data sources was proposed by Bilenko and Mooney (2003), along with a framework for duplicate detection using trainable measures of textual similarity; this adaptive approach learns the specific notion of similarity appropriate for a given domain. High-quality data sources have been selected for data integration while low-quality sources are pruned before integration; the result is processed based on the minimum timestamp, availability and accuracy values in the metadata (Motro and Anokhin, 2006). Data federation with QoS was proposed by Hema and Chandramathi (2013). A framework for online data fusion was proposed by Liu et al. (2011).
In that framework, an expectation level and minimum and maximum probabilities were defined for each query output, and a source ordering algorithm was proposed to reach the desired output quickly. A framework for entity resolution based on Markov Logic Networks was proposed by Singla and Domingos (2006), in which the similarity among records is found using the MLN. An approach for resolving data conflicts based on Markov Logic Networks was also proposed, in which the accuracy of the data is improved using multi-angle features and rules (Huang et al., 2009). The veracity problem was formulated to resolve conflicting facts from multiple websites and to find the true facts among them; an approach called TRUTHFINDER was proposed, which exploits the interdependency between website trustworthiness and fact confidence to identify trustable websites and true facts (Yin et al., 2007). The current data conflict resolution strategies and functions have been summarized and the HumMer and FeSum research prototypes proposed. Bayesian analysis was used to find dependence between data sources in truth discovery (Lowd and Domingos, 2007).
Many of the proposed approaches are lacking in handling data conflicts efficiently, so their results are often inaccurate. Moreover, the data quality problem is complex in a data federation environment: the quality of each data source varies, since the sources are autonomous. Assessing the data quality of the sources requires a benchmark data set, but benchmark data are not available for all domains. Hence additional approaches are needed to ensure the quality of the data provided to the users. To address these issues, this paper proposes the DF-MDMLN approach, which uses modified MLNs for data fusion to improve the quality of the resultant data delivered to the end users.

Proposed approach
The objective of the proposed DF-MDMLN approach is to detect and resolve data conflicts in order to provide quality results to the end user. The approach is shown in Fig. 1. Its input is the data sets from the different data sources of the data federation system. Distinct records and records with conflicting values are placed in separate groups. The output is a data set in which the data conflicts have been resolved using the modified discriminative MLN approach. Finally, the quality data set is delivered to the end users.

Data Fusion using modified MLNs
In the proposed approach, the evidence predicates and the query predicates are known a priori, so a discriminative MLN is used for conflict resolution (Yin et al., 2007). The predicates are partitioned into two sets: the query predicates Q and the evidence predicates X. The discriminative MLN defines a conditional distribution as shown in Eq. 1:

P(q | x) = (1 / Z_x(w)) exp( Σ_{i ∈ F_Q} w_i Σ_{j ∈ G_i} g_j(q, x) )    (1)

where Z_x(w) is the normalization factor, F_Q is the set of formulas with at least one grounding involving a query predicate, G_i is the set of ground formulas of the i-th first-order formula, and g_j(q, x) is a binary function that equals 1 if the j-th ground formula is true and 0 otherwise.
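As an illustrative sketch (not the authors' implementation), the conditional distribution of Eq. 1 can be evaluated for a toy query with a single ground atom; the function names and the toy formula below are hypothetical:

```python
import math

def conditional_prob(query_worlds, weights, features, x):
    """Evaluate Eq. 1: P(q | x) = exp(sum_i w_i * n_i(q, x)) / Z_x(w),
    where features[i](q, x) counts the true groundings of formula i."""
    def score(q):
        return math.exp(sum(w * f(q, x) for w, f in zip(weights, features)))
    z = sum(score(q) for q in query_worlds)   # normalization factor Z_x(w)
    return {q: score(q) / z for q in query_worlds}

# Toy query: one ground atom IsAccurate(f1) in {True, False}, with evidence
# x = MaxFreq(ta, f1) and one formula MaxFreq(ta, f1) => IsAccurate(f1)
features = [lambda q, x: 1 if (not x) or q else 0]   # truth of the implication
probs = conditional_prob([True, False], weights=[1.5], features=features, x=True)
assert probs[True] > probs[False]   # the most frequent fact is favored
```

With a positive weight on the rule, worlds that satisfy it receive exponentially more probability mass, which is exactly how the later conflict-resolution rules bias the inference.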
The proposed approach resolves data conflicts using a modified discriminative Markov Logic Network (Yin et al., 2007). The evidence x can be arbitrary useful features. With the predefined features, a set of rules is defined; from these rules, the MLN learns the weight of each rule and resolves the conflicts. The steps involved in the modified discriminative MLN for data fusion are feature extraction, rule formation, weight assignment and inference.

Feature extraction
The features of the data sets are extracted from four aspects, namely a) basic features, b) features of inter-dependency between sources and facts, c) features of mutual implication between facts and d) features of mutual dependency between sources (Yin et al., 2007). The dependency and basic features help in finding trustworthy sources as well as sources with a high degree of accuracy and completeness.
a) Basic features: The basic features describe the data sources, tables, table attributes, attribute values (facts) and the relationships between them. For example, the evidence that data source s1 provides fact f1 is represented by Provide(s1, f1). If f1 is the most frequent fact for the table attribute ta, this evidence is represented as MaxFreq(ta, f1). The evidence that f1 is a fact for a table attribute ta is represented as About(f1, ta).
b) Inter-dependency between data sources and facts (IDS): Trustworthy and complete data sources provide more accurate facts than other data sources, and a fact from such a source is likely to be true. The trustworthiness and completeness of a source and the accuracy of a fact are represented by IsTrustworthy(s1), IsComplete(s1) and IsAccurate(f1) respectively. The completeness of a data source is calculated using Eq. 2.

Completeness = (SC + TC + AC) / 3    (2)

where SC is source completeness, TC is tuple completeness and AC is attribute completeness. The source completeness is measured using Eq. 3.

Source Completeness (SC) = NRRS / TNRR    (3)

where NRRS is the number of records retrieved from a source and TNRR is the total number of records retrieved.
Tuple completeness is measured using Eq. 4 and attribute completeness using Eq. 5.
c) Mutual implication between facts (MIF): If two facts have related content, their completeness and accuracy are mutually implied; the mutual implication between facts f1 and f2 is represented as Imp(f1, f2), and the evidence that fact f1 contains fact f2 is represented by Contain(f1, f2).
d) Mutual dependency (MD) between sources: Two data sources are dependent on each other if they provide several similar facts for various table attributes; the facts they provide for other table attributes may then have the same accuracy and completeness. The mutual dependency between data sources is described as InterDep(s1, s2) and is defined by the condition in Eq. 6.
If two data sources s1 and s2 satisfy the condition of Eq. 6, then there exists a dependency between them. Here Fact1 and Fact2 represent the sets of facts provided by data sources s1 and s2 respectively, TA1 and TA2 represent the sets of table attributes for which s1 and s2 provide facts, and δ is a threshold between 0 and 1.
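The completeness measures and the dependency condition can be sketched as follows. The equal-weight average in `completeness` and the exact ratio in `inter_dep` are assumptions, since the paper does not spell out Eqs. 2, 4, 5 and 6 in full:

```python
def completeness(nrrs, tnrr, tuple_comp, attr_comp):
    """Combine source (Eq. 3), tuple and attribute completeness.
    NOTE: the equal-weight average is an assumption; the paper does not
    state how Eq. 2 aggregates SC, TC and AC."""
    sc = nrrs / tnrr                       # Eq. 3: SC = NRRS / TNRR
    return (sc + tuple_comp + attr_comp) / 3.0

def inter_dep(facts1, facts2, attrs1, attrs2, delta=0.5):
    """Sketch of the Eq. 6 condition: sources s1 and s2 are mutually
    dependent when the number of facts they share, relative to the table
    attributes they share, exceeds the threshold delta (0 < delta < 1)."""
    common_attrs = attrs1 & attrs2
    if not common_attrs:
        return False
    common_facts = facts1 & facts2
    return len(common_facts) / len(common_attrs) > delta
```

For example, two sources that agree on both shared facts for their two shared attributes would satisfy `inter_dep` at the default threshold, making InterDep(s1, s2) evidence for the MLN.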

Rule setting
Rules are framed to infer the true values and are represented as formulas in the discriminative MLN.
Rule 1: Voting. To identify the correct value among conflicting values, the voting methodology is used: the most frequent fact for a table attribute is inferred to be accurate, as represented in Eq. 7.

MaxFreq(ta, f1) ⇒ IsAccurate(f1)    (7)
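As a minimal illustration of the voting rule (a deterministic sketch, not the weighted MLN inference itself):

```python
from collections import Counter

def resolve_by_voting(values):
    """Rule 1 (voting): pick the most frequent fact for a table attribute,
    i.e. the value f with MaxFreq(ta, f) is taken to be accurate."""
    value, _count = Counter(values).most_common(1)[0]
    return value

# Three sources report conflicting prices for the same product
assert resolve_by_voting([499, 499, 520]) == 499
```

In the MLN, this rule is soft: it carries a learned weight and can be overridden by the trustworthiness and dependency rules below.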
Rule 2: Inter-dependency between facts and data sources. A data source that provides accurate and complete facts is trustworthy, and the facts provided by a trustworthy data source are accurate, as represented in Eqs. 8, 9 and 10 respectively.

IsAccurate(f1) Ʌ Provide(s1, f1) ⇒ IsTrustworthy(s1)    (8)
IsComplete(f1) Ʌ Provide(s1, f1) ⇒ IsTrustworthy(s1)    (9)
IsTrustworthy(s1) Ʌ Provide(s1, f1) ⇒ IsAccurate(f1)    (10)

Rule 3: Mutual implication between facts. If two facts have the same content for a table attribute ta, then they have the same completeness and accuracy. Thus, if the content of fact f1 contains that of another fact f2, and f2 is complete and accurate, then f1 is also complete and accurate, as represented in Eqs. 11 and 12 respectively.

Contain(f1, f2) Ʌ IsComplete(f2) ⇒ IsComplete(f1)    (11)
Contain(f1, f2) Ʌ IsAccurate(f2) ⇒ IsAccurate(f1)    (12)
Rule 4: Mutual implication between data sources. If two data sources provide several similar facts for many table attributes, then there exists a mutual dependency between those two sources. It is represented in Eqs. 13 and 14 respectively.

Weight learning and inference
The modified discriminative MLN learns a weight for each of these clauses. Weights are learned automatically with the voted perceptron algorithm proposed by Poon and Domingos (2006), the traditional weight learning algorithm for discriminative MLNs. The algorithm initializes all formula weights to zero and then updates the weight of each formula using the training data: in each iteration, a formula's weight is adjusted by the difference between the true and predicted counts of its groundings. Finally, the average weight over the iterations, rather than the final weight, is used to prevent over-fitting. The MC-SAT algorithm is used for the approximate inference required during learning. After the weights have been learned, inference is conducted: MC-SAT determines the values of the query predicates, all records referring to the same real-world entity are merged into a single record based on the true values, and the user receives a conflict-free data set.
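The averaged perceptron update described above can be sketched as follows (a simplified illustration, not the authors' code; `predict` is a hypothetical callback standing in for MC-SAT inference of the expected grounding counts):

```python
def voted_perceptron(n_formulas, examples, epochs=5, lr=0.1):
    """Averaged perceptron for MLN weight learning: weights start at zero
    and move by the difference between the true and predicted counts of
    each formula's groundings; the average over all iterations is returned
    to reduce over-fitting."""
    w = [0.0] * n_formulas
    total = [0.0] * n_formulas
    steps = 0
    for _ in range(epochs):
        for true_counts, predict in examples:
            pred = predict(w)                  # stands in for MC-SAT inference
            for i in range(n_formulas):
                w[i] += lr * (true_counts[i] - pred[i])
            for i in range(n_formulas):
                total[i] += w[i]               # accumulate for averaging
            steps += 1
    return [t / steps for t in total]          # averaged weights
```

When the predicted counts match the true counts, the update is zero and the weight stays put; otherwise the weight drifts toward formulas that hold more often in the training data than inference currently predicts.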

Results and discussion
For experimentation, the e-shopping data of a few enterprises are selected. These enterprises sell electronic gadgets such as computers, laptops and televisions, and their databases are heterogeneous and autonomously developed. A unified view is created using ontology to resolve the semantic conflicts among the different heterogeneous databases. This view is used by users for shopping and by business analysts for decision support.
To implement the prototype of the DF-MDMLN approach, the following tables have been autonomously created in the different enterprises:
• Category (category_id, category_name, category_description)
• Customer (customer_id, Customer_name, Customer_address, customer_phone_no, Cutomer_email_id)
• Products (product_id, category_id, model_name, product_desc, brand, price)
• Order (order_id, product_id, customer_id, no_of_products)
Three databases using MySQL, Oracle and SQL Server are considered. Across these databases the tables and attributes use different names and are schematically heterogeneous. For experimentation, 4000 records from each data source are taken. Local and global ontologies have been constructed using the Protégé 4.2 tool.
The MLN model is developed using the Alchemy system, an open-source software developed at the University of Washington that provides algorithms and interfaces for modeling MLNs (alchemy.cs.washington.edu). To measure the performance of DF-MDMLN, experiments are performed on the following aspects: 1) precision of data fusion; 2) recall of data fusion; 3) F-measure of data fusion; and 4) the effects of combinations of rules.

Precision of data fusion
The performance of data conflict resolution is measured via precision, calculated using the standard formula shown in Eq. 15:

Precision = TP / (TP + FP)    (15)

where TP is the number of true positives and FP the number of false positives. The precision comparison between the proposed MLN approach and the truth finder approach for duplicate detection is shown in Table 1.
Table 1 shows that the proposed MLN approach attains higher precision than the truth finder approach: the precision rate is improved by 40%. The MLN improves precision by utilizing multidimensional features.

Recall of data fusion
The performance of data conflict resolution is also measured via recall, calculated using the standard formula shown in Eq. 16:

Recall = TP / (TP + FN)    (16)

where FN is the number of false negatives. The recall comparison between the proposed MLN approach and the truth finder approach for duplicate detection is shown in Table 2.
Table 2 shows that the recall of data fusion is improved by 27% with MLNs compared to the truth finder.

F-Measure for data fusion
The F-measure is calculated using the formula shown in Eq. 17:

F-Measure = (2 × Precision × Recall) / (Precision + Recall)    (17)

The F-measure comparison between the proposed MLN approach and the truth finder is shown in Table 3. Table 3 illustrates the intrinsic trade-off between precision and recall; in DF-MDMLN the two are evenly balanced.
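Eqs. 15-17 amount to the standard classification metrics, which can be computed directly from the fusion outcome counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Eqs. 15-17 with the standard definitions: precision = TP/(TP+FP),
    recall = TP/(TP+FN), F-measure = 2PR/(P+R)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For instance, a fusion run that resolves 8 conflicts correctly with 2 spurious and 2 missed values yields precision, recall and F-measure of 0.8 each, the balanced situation Table 3 reports.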

Effects of rules and their combination
The rules proposed in DF-MDMLN are validated. The proposed approach includes all four rules: voting is denoted by V, inter-dependency between facts and data sources by IDFS, mutual implication between facts by MIF and mutual implication between data sources by MIDS. This experiment shows that the DF-MDMLN approach can combine the various rules, as shown in Fig. 2. Rules can be added and removed conveniently: because data federation is a dynamic process, new data conflicts may occur, and these can be handled using different combinations of the rules. This also shows the adaptability of the DF-MDMLN approach.

Conclusion
The DF-MDMLN approach was successfully implemented and extensively evaluated on synthetically generated e-shopping data. The proposed method addresses a well-known and important, yet frequently ignored, problem of data quality in data federation. Modified discriminative Markov Logic Networks are used for data conflict resolution in data fusion. DF-MDMLN is found to be a powerful and practical approach that performs better than the truth finder in data fusion.
Rules are defined and used to resolve the data conflicts effectively, achieving high precision and recall in data fusion. The results offer a solution to the data quality problem by ensuring the quality of the results before they are provided to the end user of the data federation. Experimental results show that the proposed approach improves the precision and recall of the end users' data by 40% and 27% respectively. In future work, the data quality can be further improved by taking into account additional data quality factors, such as consistency, data freshness and availability, together with methods to analyze and process the results; hence the quality of the data federation system can be further improved.