Markov logic for statistical relation extraction

The Internet has become a fast and efficient information provider, although the relevance and accuracy of the information found are not guaranteed. The heterogeneous nature of the web makes finding a required piece of information difficult. Extracting information from the web therefore remains a challenging task, despite the large body of work reported in the literature. Extracting information in the form of entities and relations has been addressed by various techniques, including machine learning, natural language processing, and statistical methods. In this paper, we present a rule-based method hybridized with machine learning and statistical techniques for accurate domain-specific relation extraction. The rules are modeled in a Markov Logic Network to enable statistical relation extraction. Our results on two test domains show overall high precision.


Introduction
Because the web is the latest and fastest information provider, people prefer it for their information needs. However, finding a specific piece of information in a massive collection of web sources is a tedious, time-consuming task for a human being. Information extraction, concept definition from various web sources, and text mining are required processes for identifying categories of entities and relationships. Tasks such as question answering, populating knowledge bases, and generating and populating domain ontologies require automatic extraction of information from web resources in order to avoid this tedious manual search. We focus on extracting relations for domain ontology construction and population.
Despite the effort made by researchers, extracting relations from unstructured text remains a complicated task that requires new and refined techniques. Since rule-based systems are declarative, easy to comprehend and maintain, and able to incorporate domain knowledge (Chiticariu et al., 2013), they are widely used in information extraction. Some systems use hand-coded rules (Drumond and Girardi, 2009), and some use machine learning algorithms to induce rules from training data (Aitken, 2002;Celjuska and Vargas-Vera, 2004). The rules need to be weighted to reflect their strength, which contributes to the probability of an extracted relation instance. Finding the weight of a rule is therefore necessary for determining the accuracy of the extracted information. However, most previous rule-based information extraction systems either lack a weight learning process (Aitken, 2002;Ciravenga and Wills, 2003;Yildiz and Miksch, 2011;Lima et al., 2019) or employ a poor weight learning method (Celjuska and Vargas-Vera, 2004;Mooney and Nahm, 2002;Drumond and Girardi, 2009). This implies the need for a proper weight learning process in rule-based systems. Markov Logic, which accomplishes weight learning for first-order formulas, can be investigated as a means of weight learning in rule-based information extraction systems and hence for statistical relation extraction.
Current research on information extraction has turned to distant-supervision-based techniques that use entities and relations in a large knowledge base such as Freebase to annotate the sentences in a text corpus. The problem with this kind of annotation is that some sentences that contain the two entities do not express the expected relation; they might indicate a different relation or no relation at all. Piecewise convolutional neural networks (PCNN) are used in most distant-supervision-based work (Ji et al., 2017;Zeng et al., 2018) to represent sentences so that suitable sentences can be selected for training, making full use of the supervision information.
Although statistical machine learning has become the choice of many recent academic researchers in information extraction, rule-based methods find higher applicability in practical environments and dominate the commercial world (Chiticariu et al., 2013). Therefore, we create a rule-based system hybridized with statistical machine learning, working in a semi-supervised manner, to extract relations from unstructured text. Our system uses inductive logic programming to generate rules from language dependencies and a Markov Logic Network (MLN) to learn the weights of the rules during the statistical relation extraction process. The dependencies between the constituents of a sentence can be used effectively to identify the relationships existing within the sentence. We use these language characteristics in relation-extraction-rules to achieve high precision, and each rule is evaluated by the certainty of the extracted relation instances.

Related works
Relation extraction has been performed by statistical machine learning, rule-based techniques, and natural language processing over the years. Many supervised learning systems induce extraction rules based on identified patterns in the training data set. In identifying patterns, the learning algorithms use the features of labeled data and their neighborhood words. Therefore, the successful application of extraction rules depends on the identification of appropriate neighborhood and features of the language tokens.
PubMiner (Eom and Zhang, 2004), which was developed to extract entities and relationships from massive biological literature, extracts verbs from a sentence as events and finds the binary relation between two named entities identified in the sentence. Although PubMiner is capable of extracting both entities and relationships, it also extracts many false positives in the domain. In finding relations, PubMiner depends heavily on external resources such as public medical databases and thesauri. Wang et al. (2006) addressed hierarchical relation extraction using an SVM-based approach. They experimented on the Doddington et al. (2004) training data, which defines a hierarchy of relations with 7 top types and 22 subtypes. These relation types include the most commonly used general relationships. Pawar et al. (2016) used rules involving three maximum entropy classifiers based on entity features to predict entity types and relation types. Their work is also based on the Doddington et al. (2004) data set, and therefore the entity types and relation types are restricted to those of Doddington et al. (2004), except for the NONE and NULL types. Although their rules are modeled in a Markov Logic Network, no weight learning method is used; the weights of the rules are determined by weight assignment strategies based on the log odds ratio and a constant multiplier. Drumond and Girardi (2009) also modeled a set of rules in a Markov Logic Network, where the rules wrap terms extracted using the tf-idf measure in first-order logic. Three hand-coded rules, including one rule using language dependencies, infer a term into a concept. They use dependencies to identify a concept based on the fact that two terms bearing the same syntactic dependencies with the same term denote the same concept. Yao et al. (2011) used features from the dependency path between entity mentions in a generative probabilistic model in their unsupervised approach to relation extraction. This method can cluster textual surface forms that have similar meaning using latent relation models, but clustering does not provide reliable implications in generalization.
Recent neural network-based clinical relation extraction systems (Li et al., 2019;Ningthoujam et al., 2019) exploit the shortest dependency path (SDP) between target entities. Li et al. (2019) concatenated SDP features with word embeddings of sentence sequences and demonstrated better performance with the SDP, whereas the system of Ningthoujam et al. (2019) relied only on the SDP. In our approach, we process the dependencies of a sentence to eliminate those that are irrelevant to the relationship between the target entities. Then, in the rule generation process, rules are created from the shortest dependency path between the target entities.
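As a rough illustration of what a shortest dependency path between two target entities looks like, the following Python sketch runs a breadth-first search over dependency triples treated as an undirected graph. The function name, the (label, head, dependent) tuple layout, and the use of breadth-first search are assumptions made for this example; the triples are the reduced dependencies of the ostrich sentence discussed in section 3.1, plus its advcl link.

```python
from collections import deque


def shortest_dependency_path(dependencies, source, target):
    """Return the tokens on the shortest undirected path, or None if unreachable."""
    graph = {}
    for _label, head, dependent in dependencies:
        # Dependencies are directed, but the path search ignores direction.
        graph.setdefault(head, set()).add(dependent)
        graph.setdefault(dependent, set()).add(head)
    if source not in graph or target not in graph:
        return None
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbour in sorted(graph[path[-1]]):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Illustrative input: (label, head, dependent) triples.
dependencies = [
    ("nsubj", "consists-6", "diet_of_Ostrich-4"),
    ("prep_of", "consists-6", "plant_matter-9"),
    ("advcl", "consists-6", "eats-13"),
    ("nsubj", "eats-13", "it-12"),
    ("dobj", "eats-13", "insects-14"),
]
path = shortest_dependency_path(dependencies, "diet_of_Ostrich-4", "insects-14")
print(path)
```

Because a dependency parse is a tree, the path between two tokens is unique; the search only has to find it.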
Open Information Extraction (IE) systems are employed to capture all types of relations present in text documents. Open IE systems extract relational tuples from text without requiring a predefined vocabulary, by identifying relation phrases and associated arguments in arbitrary sentences (Schmitz et al., 2012). REVERB (Etzioni et al., 2011) and WOE (Wu and Weld, 2010) are such state-of-the-art open IE systems. OLLIE (Schmitz et al., 2012) is an improved open IE system developed to address the weaknesses of REVERB and WOE. One of the weaknesses pointed out by OLLIE is the extraction of relations that are not asserted as factual in the sentence. Our system addresses this, as explained in section 3.2, although there are limitations. Another weakness of state-of-the-art open IE systems is that they extract only relations mediated by verbs. Although we also focus on identifying the related verb in the sentence, some relations such as has_characteristics, which is usable in a wide range of domains, are captured through adjectives in our system. Carlson et al. (2010) employed a semi-supervised approach that uses seed instances and patterns on a large corpus to extract more instances and patterns for concepts and relations. They try to reduce the effect of semantic drift by ranking the extracted instances and newly found patterns to select information for further extraction. We also use a semi-supervised method, where the training corpus is updated with selected relation instances extracted in the process.
The novel contribution of Mintz et al. (2009) is the distant supervision method. They use relations in Freebase to extract lexical and syntactic features for the feature vectors of a multiclass logistic classifier. Aljamel et al. (2015) used distant supervision machine learning techniques for domain-specific relation extraction. They use GATE to extract entities and identify features to use in machine learning for relation extraction. Distant supervision methods have the limitations mentioned in section 1, and they can lead to false positives when automatically annotating a training corpus. Naturally, the predictions of distantly supervised methods are contingent on the availability of domain-specific information in the knowledge base.

Statistical relation extraction
We use inductive logic programming (ILP) to generate relation-extraction-rules from dependency clauses (Seneviratne and Ranasinghe, 2011). Therefore, language dependencies are preprocessed in order to reduce them by filtering irrelevant clauses out and collapsing nouns to form entities (Seneviratne and Ranasinghe, 2014). In extracting relations from sentences, we use conditional statements present in sentences for correct identification of relation instances. We use the Markov Logic Network (Domingos and Richardson, 2006) to model the relation-extraction-rules to find rule weights for statistical relation extraction.

Generation of relation extraction rules
A verb is a powerful lexical term that binds two adjacent syntactic categories, and a relation can be defined as a predicate expression of two nouns, i.e., a subject and an object wrapped in syntactic categories, as follows:
Verb (Subject, Object) or
Verb_Prep (Subject, Object)
where Verb_Prep is the form of the verb combined with a preposition. Therefore, a relation can be identified by the verb constituent of the sentence. When more than one verb is present in a sentence, the most suitable verb should be identified as the related verb.
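The two predicate forms can be illustrated with a small Python sketch that reads candidate Verb (Subject, Object) and Verb_Prep (Subject, Object) expressions directly off dependency triples. This is only a naive illustration, not the ILP-based rule learning used in this work; the function name and the (label, head, dependent) tuple layout are assumptions, and token indices are omitted for readability.

```python
def candidate_relations(dependencies):
    """Build (predicate, subject, object) candidates from dependency triples."""
    # Map each verb to its subject, taken from nsubj dependencies.
    subjects = {head: dep for label, head, dep in dependencies if label == "nsubj"}
    relations = []
    for label, head, dep in dependencies:
        if head not in subjects:
            continue
        if label == "dobj":
            # Verb (Subject, Object)
            relations.append((head, subjects[head], dep))
        elif label.startswith("prep_"):
            # Verb_Prep (Subject, Object): the verb combined with its preposition
            relations.append((head + "_" + label[len("prep_"):], subjects[head], dep))
    return relations


# Illustrative input: dependency triples with token indices dropped.
dependencies = [
    ("nsubj", "consists", "diet_of_Ostrich"),
    ("prep_of", "consists", "plant_matter"),
    ("nsubj", "eats", "it"),
    ("dobj", "eats", "insects"),
]
relations = candidate_relations(dependencies)
print(relations)
```

The candidates are expressed with the surface verbs; mapping them to a named relation such as feed_on() and resolving the pronoun are left to the learned rules.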
The process of rule generation employs the inductive logic programming (ILP) technique (Dzeroski and Lavrac, 1994) in the learning algorithm to derive a set of rules from the dependencies of text annotated with entities. The text processing tool GATE (Cunningham et al., 2002) is used to identify entities, and the sentences annotated with entities are parsed by the Stanford parser to produce dependencies. The rule learning process uses the output of the Stanford parser to learn rules that extract relation instances for known relations such as located_in, part_of, and feed_on, some of which are domain-specific.
For example, the sentence s(1),
The diet of Ostrich mainly consists of plant matter, though it eats insects    s(1)
gives the relation instances:
feed_on (Ostrich, plant_matter)
feed_on (Ostrich, insects)
The example sentence gives the following Stanford collapsed typed dependencies:
(consists-6, mainly-5)
nn (matter-9, plant-8)
prep_of (consists-6, matter-9)
mark (eats-13, though-11)
nsubj (eats-13, it-12)
advcl (consists-6, eats-13)
dobj (eats-13, insects-14)
The reduced dependencies are:
nsubj (consists-6, diet_of_Ostrich-4)
prep_of (consists-6, plant_matter-9)
nsubj (eats-13, it-12)
dobj (eats-13, insects-14)
The reduced dependencies of the sentences that contain positive or negative instances of a relation, together with the syntactic tags, provide the background knowledge for learning relation-extraction-rules. At the end of the process, we have a set of rules for each relation considered. Initially, we consider a set of predefined relations, but new relations can also be identified during the extraction process. Fig. 1 shows the preprocessing sequence of the training documents, and Fig. 2 shows the input and output of both the rule learning and relation extraction processes. In the training phase, a set of verbs, including synonyms, is identified as equivalent verbs for each relation. The negative verb set includes antonyms and verbs relevant to any other relation existing between the same entities. The sets of positive and negative verbs for a relation are obtained during the rule learning process. In the end, rules are presented in first-order predicate form, as shown in the following example: In the domain Bird, a relation-extraction-rule generated for the relation feed_on() is, where,

Identifying relative clauses and conjunctions
Extracting information accurately from compound sentences and from sentences that contain relative clauses or conjunctions requires attention to the connecting words used in each of those categories. In compound sentences, two independent clauses are connected by compound words. Therefore, sentence constituents connected by compound words can be processed separately, and all the constituents in the sentence can be considered true statements, as in simple sentences. For example, from the sentence s(1) shown in section 3.1, two instances of the relation feed_on() can be extracted independently. Since relative clauses describe nouns and are directly addressed by the dependencies, further processing is not needed in generating relation-extraction-rules. Sentence s(2) is an example of a sentence with a relative clause:
The West Indian whistling duck is a whistling duck that breeds in the Caribbean    s(2)
With conjunctions, however, extra effort is needed to retain the accuracy of the information given by the sentence, because the truth value of the sentence depends on some conjunctions. Some conjunctions are strictly conditional: the truth value of the information given by the sentence depends on the truth value of the constituent introduced by the conjunction. Other conjunctions make a sentence more informative, and the truth value of the sentence does not depend on the truth value of the conjunctive part. Such unconditional conjunctions can be handled in the same way as compound sentences. For example, the unconditional conjunction "and" in the sentence s(3) has no effect on the truth value of either part.
A standard fistball is hollow, filled with air, and is made of leather.    s(3)
We show commonly used conditional and unconditional conjunctions in Table 1. A condition that comes with a conjunction can be considered a restriction that must hold for the relationship to be true. Information wrapped in the sentence fragment with the conditional conjunction can be captured separately and presented as a condition/restriction on the information given by the other fragment of the sentence. For example, the sentence s(4) gives the information that ostriches can reach speeds in excess of 70 km/h under the condition "pursued by a predator."
When being pursued by a predator, ostriches have been known to reach speeds in excess of 70 km/h    s(4)
Then the relation instance has_speed(Ostrich, 70_km/h) will be true if the condition pursued_by (Ostrich, predator) holds. Fig. 3 shows the way the sentences are categorized with respect to relation extraction, accommodating four example sentences.

Fig. 3: Main sentence categories with respect to information extraction
In extracting the condition, the dependencies are searched to find the verb constituent in the conjunctive part. That verb is then taken as the relation predicate, and its arguments are the subject noun (or an entity, where applicable) and the noun closest to the conjunctive verb.
There is a limited number of conjunctions, shown in categorized form in Table 1. Conjunctions can therefore be identified by the lexical term itself.
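The treatment of conditional conjunctions described in this section can be sketched in Python as a relation instance that optionally carries another relation instance as its condition. The class and function names are hypothetical, and the two conjunction sets are illustrative assumptions: only "when" (conditional) and "and" (unconditional) are taken from the examples in the text, and the full lists belong to Table 1.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative subsets only; the authoritative lists are those of Table 1.
CONDITIONAL_CONJUNCTIONS = {"when", "if", "unless"}
UNCONDITIONAL_CONJUNCTIONS = {"and"}


@dataclass(frozen=True)
class RelationInstance:
    predicate: str
    subject: str
    obj: str
    condition: Optional["RelationInstance"] = None


def conjunction_category(word):
    """Identify a conjunction by the lexical term itself."""
    token = word.lower()
    if token in CONDITIONAL_CONJUNCTIONS:
        return "conditional"
    if token in UNCONDITIONAL_CONJUNCTIONS:
        return "unconditional"
    return None


def holds(instance, known_facts):
    """An instance holds if it is unconditional or its condition is a known fact."""
    return instance.condition is None or instance.condition in known_facts


# Sentence s(4): has_speed is true only if pursued_by holds.
condition = RelationInstance("pursued_by", "Ostrich", "predator")
speed = RelationInstance("has_speed", "Ostrich", "70_km/h", condition)
print(conjunction_category("When"), holds(speed, {condition}), holds(speed, set()))
```

Keeping the condition as a relation instance of its own means it can be extracted, stored, and checked with the same machinery as any other relation.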

Modelling relation-extraction-rules on Markov logic network
MLN combines first-order logic with a probabilistic model and requires grounding all the first-order clauses by substituting constants for all the variables in them (Domingos and Richardson, 2006). The probability distribution over possible worlds x specified by the ground Markov network is given by Eq. 1 and Eq. 2:
$$P(X = x) = \frac{1}{Z} \exp\left(\sum_{i} w_i n_i(x)\right) \tag{1}$$
where,
$$Z = \sum_{x'} \exp\left(\sum_{i} w_i n_i(x')\right) \tag{2}$$
where w_i is the weight of the i-th clause, and n_i(x) is the number of true occurrences of the i-th clause in x. Verbs and entity instances in the training data corpus are used to ground the relation-extraction-rules. Since the number of groundings becomes intractable with a large number of substitutions, reducing the number of clauses in the condition of the rules before MLN is applied to them is vital for efficient implementation. Sets of negative and positive verbs are obtained during the implementation of the ILP method for rule generation. Negative verb clauses (negative(VB)) can be omitted from the rules because all the verbs used in MLN are positive verbs. Although the atom neg(VB, not) does not contain any entity mentions, it is relevant to a particular pair of entity instances. In the example rule, it is relevant to Bird and Bird_food instances. But the atom itself does not contain Bird or Bird_food variables because the rules are normally applied to the reduced dependencies of a sentence. The negative literal prep_except(NN, y) is not necessarily relevant to one rule and can be added to all the rules to avoid the extraction of false positives. Therefore, omitting neg(VB, not) and prep_except(NN, y) does not have a significant impact on the rule weight. The example rule for the relation feed_on(Bird, Bird_food) in section 3.1 is then reduced to a rule with only three clauses as follows for the MLN weight learning process.
However, in the relation extraction process, the complete rule with all the clauses is applied.
Entities and verbs identified from the dependencies are used in grounding the relation-extraction-rules when modeling them in MLN. In addition, the knowledge base contains evidence, i.e., atoms from the training corpus that are considered known.
MLN requires counting the number of true groundings of a formula in a given world state. For the probabilistic state space created by a large database, this counting is intractable. The higher the number of objects in the MLN, the more difficult the computations become. In this situation, the state space can be reduced by removing the known true literals from the MLN. The negative verbs are not used in grounding the rules, but negative relation instances are used as evidence. Furthermore, the main relation verb can replace all the equivalent verbs generated during the ILP process. In this way, the number of atoms in the initial MLN can be reduced to approximately half of the initial number, depending on the number of evidence atoms available.
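To make the distribution of Eq. 1 and Eq. 2 concrete, the Python sketch below enumerates every possible world of a toy ground network and normalizes the exponentiated weighted counts. The two ground atoms, the single clause "verb_eats implies feed_on", and the weight 1.5 are all invented for illustration. Brute-force enumeration is feasible only for such toy cases, which is exactly why the reduction of the state space described above matters.

```python
import itertools
import math


def world_probabilities(atoms, weighted_formulas):
    """Return (world, probability) pairs for every truth assignment to `atoms`.

    Each weighted formula is a pair (w_i, n_i), where n_i(world) returns the
    number of true groundings of clause i in that world.
    """
    worlds = [dict(zip(atoms, values))
              for values in itertools.product([False, True], repeat=len(atoms))]
    scores = [math.exp(sum(w * n(world) for w, n in weighted_formulas))
              for world in worlds]
    z = sum(scores)  # partition function Z
    return [(world, score / z) for world, score in zip(worlds, scores)]


def implication_count(world):
    """One ground clause: verb_eats => feed_on (count is 1 if satisfied, else 0)."""
    return 1 if (not world["verb_eats"]) or world["feed_on"] else 0


distribution = world_probabilities(["verb_eats", "feed_on"], [(1.5, implication_count)])
for world, probability in distribution:
    print(world, round(probability, 4))
```

The only world that violates the clause (verb_eats true, feed_on false) is not impossible, merely less probable than the other three, which is the characteristic softening that MLN applies to first-order rules.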
We use discriminative learning to find weights for relation-extraction-rules. In discriminative learning, the conditional likelihood of query atoms is used (Singla and Domingos, 2005). The conditional likelihood of query atoms y given evidence atoms x is shown by Eq. 3:
$$P(y \mid x) = \frac{1}{Z_x} \exp\left(\sum_{i \in F_Y} w_i n_i(x, y)\right) \tag{3}$$
where F_Y is the set of all MLN clauses with at least one grounding involving a query atom, n_i(x, y) is the number of true groundings of the i-th clause involving query atoms, and Z_x is the normalization constant for the evidence x. The gradient of the conditional log-likelihood is given by Eq. 4:
$$\frac{\partial}{\partial w_i} \log P_w(y \mid x) = n_i(x, y) - E_w[n_i(x, y)] \tag{4}$$
Although the number of grounded atoms can be reduced as explained above, computing the expected counts E_w[n_i(x, y)] is intractable. The Closed World Assumption cannot be used with the dependency literals because the domain is infinite, although a limited amount of training data is used in the experiment. The expected counts can be approximated by the counts n_i(x, y_w^*) in the MAP (Maximum A Posteriori) state y_w^*. In the problem domain given under experimental results, however, finding a single MAP state is not guaranteed, because the same conditional probability value exists for a number of states. Therefore, Contrastive Divergence (CD) (Lowd and Domingos, 2007) is used in gradient calculations instead of the MAP state. CD approximates the expectations from a small number of Markov chain Monte Carlo (MCMC) samples. Gibbs sampling is chosen with CD in order to create samples of states. Each Gibbs step consists of sampling a ground atom given its Markov blanket. Gibbs sampling requires the weights of the rules in its sampling process. For Gibbs sampling, the weight of a rule is calculated as the log odds between a world where the rule is true and a world where the rule is false, other things being equal.
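The gradient of the conditional log-likelihood (observed counts minus expected counts) and the log-odds reading of a weight can be sketched in Python as follows. The sketch assumes the clause counts of a few sampled states are already available, for instance from a short Gibbs chain as in CD; the function names, the learning rate, and the toy counts are illustrative assumptions rather than values from the experiments.

```python
import math


def log_odds_weight(p_true, p_false):
    """Weight as the log odds between a world where the rule is true and one where it is false."""
    return math.log(p_true / p_false)


def cd_gradient(observed_counts, sampled_counts):
    """Approximate n_i(x, y) - E_w[n_i(x, y)] using the mean count over samples."""
    num_samples = len(sampled_counts)
    expected = [sum(sample[i] for sample in sampled_counts) / num_samples
                for i in range(len(observed_counts))]
    return [observed - estimate for observed, estimate in zip(observed_counts, expected)]


def gradient_ascent_step(weights, gradient, learning_rate=0.1):
    """Move each weight a small step along the gradient of the log-likelihood."""
    return [w + learning_rate * g for w, g in zip(weights, gradient)]


# Two rules: true-grounding counts in the training data and in two sampled states.
observed = [3, 1]
samples = [[2, 1], [4, 3]]
gradient = cd_gradient(observed, samples)
weights = gradient_ascent_step([0.5, 0.5], gradient)
print(gradient, weights, log_odds_weight(0.75, 0.25))
```

A rule whose sampled counts exceed its observed count receives a negative gradient, so its weight decreases; plain gradient ascent is used here only to keep the sketch short.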
Algorithm for the construction of all the groundings with respect to relation-extraction-rules:
For each variable x in Fi
    For each clause Fj(x)
        If the type of x is entity1
            Obtain the ground clauses substituting all the values from E1
        If the type of x is entity2
            ...
        f ← (f ∪ ...)
where c1, c2, c3, ... represent the members of E1, E2, PVB, NN.
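The grounding step can be pictured with a short Python sketch that substitutes every combination of type-compatible constants for the variables of a rule. The rule, its variable names, and the constants are hypothetical stand-ins (the type names entity1, entity2, and positive_verb loosely mirror E1, E2, and PVB), and the use of itertools.product is an implementation choice for this example, not necessarily the construction used in the original algorithm.

```python
from itertools import product


def ground_rule(rule_atoms, variable_types, constants):
    """Return every grounding of a rule.

    rule_atoms:     list of (predicate, argument_variables) pairs
    variable_types: maps each variable to its type
    constants:      maps each type to the constants of that type
    """
    variables = sorted(variable_types)
    domains = [constants[variable_types[variable]] for variable in variables]
    groundings = []
    for values in product(*domains):
        binding = dict(zip(variables, values))
        groundings.append(tuple(
            (predicate, tuple(binding[argument] for argument in arguments))
            for predicate, arguments in rule_atoms
        ))
    return groundings


# Hypothetical rule: nsubj(VB, x) ^ dobj(VB, y) => feed_on(x, y)
rule_atoms = [("nsubj", ("VB", "x")), ("dobj", ("VB", "y")), ("feed_on", ("x", "y"))]
variable_types = {"x": "entity1", "y": "entity2", "VB": "positive_verb"}
constants = {
    "entity1": ["Ostrich", "Barn_Swallow"],
    "entity2": ["insects", "plant_matter"],
    "positive_verb": ["eats"],
}
groundings = ground_rule(rule_atoms, variable_types, constants)
print(len(groundings))
```

The number of groundings is the product of the domain sizes, which is why dropping clauses and merging equivalent verbs before grounding has such a large effect on tractability.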
The probability of a ground atom X_l with respect to its Markov blanket B_l is given by Eq. 5 (Domingos and Richardson, 2006):
$$P(X_l = x_l \mid B_l = b_l) = \frac{\exp\left(\sum_{f_i \in F_l} w_i f_i(X_l = x_l, B_l = b_l)\right)}{\exp\left(\sum_{f_i \in F_l} w_i f_i(X_l = 0, B_l = b_l)\right) + \exp\left(\sum_{f_i \in F_l} w_i f_i(X_l = 1, B_l = b_l)\right)} \tag{5}$$
where F_l is the set of ground clauses in which X_l appears, and f_i(X_l = x_l, B_l = b_l) is the truth value (0 or 1) of the i-th ground clause when X_l = x_l and B_l = b_l.
Eq. 4 poses a multivariate weight optimization problem. Gradient Descent, Diagonal Newton, and Conjugate Gradient (Lowd and Domingos, 2007) are available multivariate optimization techniques for efficient weight learning for MLN. Gradient Descent is comparatively slow, and Diagonal Newton has limitations in uncorrelated clauses. Therefore, we prefer the Conjugate Gradient method for weight optimization. In the Conjugate Gradient method, search directions are constructed by conjugation of residuals, and the Polak-Ribiere method (Shewchuk, 1994) is used to find the conjugate direction, though there are several equivalent expressions for this. The Polak-Ribiere method often converges much more quickly. We use Java to generate relation-extraction-rules and MATLAB in the Windows environment to implement relation extraction and weight learning. An overview of the weight learning process is shown in Fig. 4.
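The conditional probability used in each Gibbs step can be sketched in Python as below. The toy clause "verb_eats implies feed_on" and its weight are invented for the example. For brevity the sketch sums over all ground clauses rather than only those containing the sampled atom; clauses that do not mention the atom contribute the same factor to the numerator and denominator and cancel out.

```python
import math


def gibbs_conditional(atom, world, weighted_formulas):
    """P(atom = True | all other atoms fixed as in `world`)."""
    scores = {}
    for value in (False, True):
        candidate = dict(world)
        candidate[atom] = value
        scores[value] = math.exp(sum(w * f(candidate) for w, f in weighted_formulas))
    return scores[True] / (scores[False] + scores[True])


def implication(world):
    """Ground clause verb_eats => feed_on: 1 if satisfied, 0 otherwise."""
    return 1 if (not world["verb_eats"]) or world["feed_on"] else 0


formulas = [(1.5, implication)]
p_with_verb = gibbs_conditional("feed_on", {"verb_eats": True, "feed_on": False}, formulas)
p_without_verb = gibbs_conditional("feed_on", {"verb_eats": False, "feed_on": False}, formulas)
print(round(p_with_verb, 4), p_without_verb)
```

When the verb atom is false the clause is satisfied regardless of feed_on, so the atom is sampled with probability 0.5; when the verb atom is true, the weight pushes the probability of feed_on above 0.5.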
In the weight learning process, the strength of a rule is determined by the number of false occurrences that can be covered by the preliminary rule as well as the number of true occurrences for the rule in the training corpus.
A set of rules for the relation feed_on() is given below along with the learned weights,

Extraction of relations
The Stanford dependencies of sentences of the known entities in the document are searched to find the compatibility of an extraction rule with the dependencies. Entity instances in sentences covered by the rules of a particular relation are extracted as the attribute values of that relation. When a sentence cannot be covered by extraction rules, the positive and negative verb sets are searched in order to find out whether the main verb constituent is equivalent to any of the verbs in the two sets. Sentences of entities not extracted as a relation by existing relation-extraction-rules can be processed in order to find whether the entities form a negative relation or a new relation.

Fig. 4: Overview of the weight learning process (evidence and nonevidence; ground the clauses in the set of extraction rules for a relation; use CD to find the expectation; find the gradient with respect to rule weights using Eq. 4; use scaled conjugate gradient to find the optimal weights of the relation-extraction-rules)
An ambiguous sentence with respect to extraction rules can be categorized into one of the following situations:
i. Verb unknown, but extraction rules cover the dependencies, e.g., The Cape Barren goose is a large goose resident in southern Australia.
ii. Verb known, but extraction rules cannot cover the dependencies, e.g., There are subspecies of Barn Swallow which breed across the Northern Hemisphere.
iii. Verb unknown and extraction rules cannot cover the dependencies.
Sentences in category (iii) are assumed to form a completely new relation, and they are used to formulate that relation. The relation is labeled by the main verb of the sentence, i.e., the verb constituent contained in the atomic formula "nsubj" when there is only one verb in the dependencies.
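The categorization of ambiguous sentences and the labeling of a new relation can be summarized in a small Python sketch. The function names, the category strings, and the convention of stripping the token index from the verb are assumptions made for this illustration; whether a verb is "known" and whether the rules "cover" the dependencies are taken as inputs computed elsewhere.

```python
def categorize_sentence(verb_known, rules_cover):
    """Map the two checks onto the sentence categories described in the text."""
    if verb_known and rules_cover:
        return "nonambiguous"
    if not verb_known and rules_cover:
        return "i"
    if verb_known and not rules_cover:
        return "ii"
    return "iii"


def new_relation_label(dependencies):
    """Label a category (iii) relation by the verb in nsubj, if there is exactly one verb."""
    verbs = {head for label, head, _dependent in dependencies if label == "nsubj"}
    if len(verbs) != 1:
        return None
    verb = next(iter(verbs))
    return verb.rsplit("-", 1)[0]  # drop the token index, e.g. "eats-13" -> "eats"


single_verb = [("nsubj", "eats-13", "it-12"), ("dobj", "eats-13", "insects-14")]
two_verbs = single_verb + [("nsubj", "consists-6", "diet_of_Ostrich-4")]
print(categorize_sentence(False, False), new_relation_label(single_verb), new_relation_label(two_verbs))
```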
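We calculate the probability of a rule F_1 in extracting relation instances with the given evidence F_2 on MLN by Eq. 6:
$$P(F_1 \mid F_2, L, C) = \frac{P(F_1 \wedge F_2 \mid M_{L,C})}{P(F_2 \mid M_{L,C})} = \frac{\sum_{x \in X_{F_1} \cap X_{F_2}} P(X = x \mid M_{L,C})}{\sum_{x \in X_{F_2}} P(X = x \mid M_{L,C})} \tag{6}$$
where L is the MLN, C is the set of constants, M_{L,C} is the ground Markov network defined by L and C, X_{F_1} is the set of states in which F_1 holds, and X_{F_2} is the set of states in which F_2 holds.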
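The conditional probability of Eq. 6 can be checked on a toy network with the Python sketch below, which sums unnormalized world scores over the states where the evidence holds and over the states where both the query and the evidence hold. The atoms, the single weighted clause, and the weight are invented for the example, and exhaustive enumeration is used only because the toy state space has four worlds.

```python
import itertools
import math


def conditional_probability(atoms, weighted_formulas, f1, f2):
    """P(F1 | F2): ratio of total score of worlds satisfying F1 and F2 to those satisfying F2."""
    numerator = 0.0
    denominator = 0.0
    for values in itertools.product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        # Unnormalized score; the partition function cancels in the ratio.
        score = math.exp(sum(w * n(world) for w, n in weighted_formulas))
        if f2(world):
            denominator += score
            if f1(world):
                numerator += score
    return numerator / denominator  # assumes the evidence F2 is satisfiable


def implication(world):
    """Ground clause verb_eats => feed_on."""
    return 1 if (not world["verb_eats"]) or world["feed_on"] else 0


probability = conditional_probability(
    ["verb_eats", "feed_on"],
    [(1.5, implication)],
    f1=lambda world: world["feed_on"],
    f2=lambda world: world["verb_eats"],
)
print(round(probability, 4))
```

With weight 1.5 the extracted instance gets a probability of about 0.82 given the evidence; a larger rule weight moves this value closer to 1.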

Experimental results
The proposed relation extraction system is used on the domains Bird and Sport to extract relation instances existing between annotated entities. Since the training data set is continuously updated by the system, a rather small number of Wikipedia pages (100) are used initially. The rule generation process is continued with the updated corpus to learn new relation-extraction-rules, which are added to the existing rule base. The pages are used as they are in the Wikipedia for entity extraction. Once the entities are annotated by the system, only language dependencies of the sentences annotated with entities are used to learn the rules for relation extraction. Then the reduced dependencies, sets of entities present, the relation verb, sets of other verbs and nouns, and adjectives/adverbs in the sentence are the data sources for the rule generation process. The statistics of the relation types in both domains are given in Fig. 5 and Fig. 6. The probability values shown are the values with respect to the best rule in each relation type. Table 2 shows the evaluation measures for the relations considered in the domains Bird and Sport.
Relation-extraction-rules are applied to test documents from Wikipedia, and relation instances are identified. Sentences that are nonambiguous with respect to the relation-extraction-rules give relation instances with higher probabilities. The certainty of the extracted relation instances is measured by probability calculations according to Eq. 6. The probability calculations here are based on the dependencies of individual sentences, not on the entire knowledge base, which is used for weight calculations. Each relation-extraction-rule is invoked independently, and the knowledge base has no impact on the relation extraction.
Playing method is identified as an entity, and "Played" is a common term for all the playing method relations such as striking, shooting, and passing. Since relation extraction is performed based on the entities identified in the entity extraction phase, both extraction processes are mutually exclusive events. If the entity identification is inaccurate, the relation identified between incorrect entities is bound to be false. However, relation extraction is evaluated independently of the accuracy of entity extraction because the techniques for the two extraction processes have been developed and used independently. Therefore, when evaluating relation extraction, 100% accuracy is assumed for the extracted entities. The same number of Wikipedia documents has been used in both entity and relation extraction. When analyzing the results, a few points can be readily identified. In the case of measurement relations such as has_length and has_weight, the main reason for low precision is the presence of measurements with comparative expressions such as "more than" and "less than". For example, the sentence "The ball weights approximately 100 grams more than the volleyball one" contains the entity types Tool and Weight but does not give the relation has_weight() correctly. Similarly, incorrect identification of equivalent verbs for a relation when the verb is unknown has an impact on the precision. In the domain Sport, many test documents contain sentences annotated with the entity type Sport more than once. These sentences can give the relations is_similar_to(Sport, Sport) or is_version_of(Sport, Sport), which have not been considered in the initial relation extraction task.
Since results for relations similar to those considered in our experimental domains are scarce in the literature, we first select two approaches to compare the overall performance of relation-extraction-rules. Both approaches (Wang et al., 2006;Pawar et al., 2016) use the Doddington et al. (2004) data set, with a comparatively higher number of predefined relations. They also represent two time periods along the research line of information extraction. Furthermore, Pawar et al. (2016) also used MLN to model the rules generated in their system. Table 3 shows the overall comparative performance of relation-extraction-rules. The evaluation measures reported for relation-extraction-rules in Table 3 are averages of all the individual measures. Secondly, we select two different approaches (Carlson et al., 2010;Yao et al., 2011) to compare individual relations from each domain. In the domain Bird, the relation located_in() can be compared with the relation liveIn from Yao et al.'s (2011) approach, though the arguments of the located_in relation are Bird and Location whereas the arguments of the liveIn relation are Person and Location. In the domain Sport, the relation play_with() can be compared with the relation SportUsesSportsEquipment from Carlson et al.'s (2010) approach, which uses constraints to couple semi-supervised learning. Three coupling algorithms, CPL, CSEAL, and MBL, have been developed in their approach to information extraction. The results are shown in Table 4. The CSEAL algorithm achieves 100% precision, but it is claimed in their publication that MBL gives the best overall performance and that CSEAL incurs some loss in recall.
Although a smaller number of training examples are used to initiate the system, it will not affect the performance of the system because any situation that cannot be covered by the extraction rules is considered as an instance for a new relation and a new extraction rule is generated for the relation accordingly. In addition to that, the training set is continuously expanded by the information relevant to extracted relation instances. Therefore, the use of a smaller training data set becomes an advantage here and has no adverse effect on the performance of the entire system. With the expanded training corpus, the performance of the system is expected to be improved further.

Conclusion
We have presented a method to use language dependency clauses successfully in a rule-based system for statistical relation extraction. Statistical relation extraction is enabled by modeling the rules in the Markov logic environment for weight learning. We have also discussed the extraction of relation instances from compound sentences, considering the conditions embedded in the sentence. The applicability of relation-extraction-rules is demonstrated in two different domains with two different sets of relations, although some relations are usable over a wide range of domains with different entities. We start the training with a rather small corpus of selected relation instances that cover a range of sentence structures. The training corpus is expanded with extracted instances, selected by a simple statistical method that can be enhanced further. We have also shown some limitations of the system, which can be extended to address them. Relation extraction currently relies on entity extraction done by a different method, but entity extraction can also be done by the same method, inducing rules for entity identification. This method can be used to verify entity extraction, so that errors propagating from incorrect entity extraction can be minimized.