WEBVTT 00:00:00.408 --> 00:00:03.991 In this segment I'm going to show you that dependency syntax 00:00:03.991 --> 00:00:09.040 is a very natural representation for relation extraction applications. 00:00:10.702 --> 00:00:16.504 One domain in which a lot of work has been done on relation extraction is in the biomedical text domain. 00:00:16.504 --> 00:00:19.410 So here for example, we have the sentence 00:00:19.410 --> 00:00:26.195 “The results demonstrated that KaiC interacts rhythmically with SasA, KaiA, and KaiB.” 00:00:26.195 --> 00:00:30.562 And what we’d like to get out of that is a protein interaction event. 00:00:30.562 --> 00:00:34.628 So here’s the “interacts” that indicates the relation, 00:00:34.628 --> 00:00:36.746 and these are the proteins involved. 00:00:36.746 --> 00:00:40.165 And there are a bunch of other proteins involved as well. 00:00:40.535 --> 00:00:48.219 Well, the point we get out of here is that if we can have this kind of dependency syntax, 00:00:48.219 --> 00:00:55.213 then it's very easy starting from here to follow along the arguments of the subject and the preposition “with” 00:00:55.213 --> 00:00:59.367 and to easily see the relation that we’d like to get out. 00:00:59.367 --> 00:01:01.714 And if we're just a little bit cleverer, 00:01:01.714 --> 00:01:05.811 we can then also follow along the conjunction relations 00:01:05.811 --> 00:01:12.967 and see that KaiC is also interacting with these other two proteins. 00:01:14.259 --> 00:01:17.362 And that's something that a lot of people have worked on. 00:01:17.362 --> 00:01:24.355 In particular, one representation that’s being widely used for relation extraction applications in biomedicine 00:01:24.355 --> 00:01:27.796 is the Stanford dependencies representation. 00:01:27.796 --> 00:01:33.639 So the basic form of this representation is as a projective dependency tree. 
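The path-following idea described above — start at the relation trigger, walk the subject arc and the "with" arc, then expand over conjunctions — can be sketched in a few lines of Python. This is an illustration, not the lecture's actual tooling: the edge list is a hand-written approximation of a Stanford-dependencies parse of the KaiC sentence, and the relation names (`nsubj`, `prep`, `pobj`, `conj_and`) follow the older collapsed-style conventions.

```python
# Hand-written (assumed) dependency edges for:
#   "KaiC interacts rhythmically with SasA, KaiA, and KaiB."
# Each edge is (head, relation, dependent).
EDGES = [
    ("interacts", "nsubj",    "KaiC"),
    ("interacts", "advmod",   "rhythmically"),
    ("interacts", "prep",     "with"),
    ("with",      "pobj",     "SasA"),
    ("SasA",      "conj_and", "KaiA"),
    ("SasA",      "conj_and", "KaiB"),
]

def deps(head, rel):
    """All dependents of `head` attached by relation `rel`."""
    return [d for (h, r, d) in EDGES if h == head and r == rel]

def interactions(trigger="interacts"):
    """Pair the subject of the trigger with each object of 'with',
    following conj_and arcs so coordinated proteins are found too."""
    pairs = []
    for subj in deps(trigger, "nsubj"):
        for prep in deps(trigger, "prep"):
            frontier = deps(prep, "pobj")
            seen = []
            while frontier:                      # expand over coordination
                protein = frontier.pop(0)
                if protein in seen:
                    continue
                seen.append(protein)
                frontier.extend(deps(protein, "conj_and"))
            pairs.extend((subj, protein) for protein in seen)
    return pairs
```

Following just two arcs recovers (KaiC, SasA), and the extra conjunction step recovers (KaiC, KaiA) and (KaiC, KaiB) as well.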
00:01:33.639 --> 00:01:40.699 And it was designed that way so it could be easily generated by postprocessing of phrase structure trees. 00:01:40.699 --> 00:01:44.077 So if you have a notion of headedness in the phrase structure tree, 00:01:44.077 --> 00:01:49.640 the Stanford dependency software provides a set of matching pattern rules 00:01:49.640 --> 00:01:55.291 that will then type the dependency relations and give you a Stanford dependency tree. 00:01:55.291 --> 00:02:01.998 But Stanford dependencies can also be, and now increasingly are, generated directly 00:02:01.998 --> 00:02:06.749 by dependency parsers such as the MaltParser that we looked at recently. 00:02:07.319 --> 00:02:11.470 Okay, so this is roughly what the representation looks like. 00:02:11.470 --> 00:02:13.299 So it's just as we saw before, 00:02:13.299 --> 00:02:17.855 with the words connected by typed dependency arcs. 00:02:19.655 --> 00:02:24.240 But something that has been explored in the Stanford dependencies framework 00:02:24.240 --> 00:02:27.772 is, starting from that basic dependencies representation, 00:02:27.772 --> 00:02:34.053 let’s make some changes to it to facilitate relation extraction applications. 00:02:34.053 --> 00:02:38.482 And the idea here is to emphasize the relationships 00:02:38.482 --> 00:02:43.302 between content words that are useful for relation extraction applications. 00:02:43.302 --> 00:02:45.387 Let me give a couple of examples. 00:02:45.387 --> 00:02:51.553 So, one example is that commonly you’ll have a content word like “based” 00:02:51.553 --> 00:02:56.599 and where the company here is based—Los Angeles— 00:02:56.599 --> 00:03:01.029 and it’s separated by this preposition “in”, a function word. 00:03:01.029 --> 00:03:07.101 And you can think of these function words as really functioning like case markers in a lot of other languages.
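The phrase-structure-to-dependency conversion mentioned above rests on one idea: once every phrase has a marked head child, you can read dependencies off the tree by attaching each non-head child's head word to the phrase's head word. The toy converter below illustrates just that core step; it is an assumption-laden sketch, not the actual Stanford converter, which additionally applies pattern rules to type each arc (the arcs here stay untyped).

```python
# A phrase node is (label, children, head_index); a leaf is just a word.
# Toy tree for "Bell makes products": (S (NP Bell) (VP makes (NP products)))
TREE = ("S", [("NP", ["Bell"], 0),
              ("VP", ["makes", ("NP", ["products"], 0)], 0)], 1)

def lexical_head(node):
    """Follow marked head children down to the head word of a phrase."""
    if isinstance(node, str):
        return node
    label, children, head_idx = node
    return lexical_head(children[head_idx])

def to_dependencies(node, arcs=None):
    """Attach each non-head child's lexical head to the phrase's head word."""
    if arcs is None:
        arcs = []
    if isinstance(node, str):
        return arcs
    label, children, head_idx = node
    head = lexical_head(children[head_idx])
    for i, child in enumerate(children):
        if i != head_idx:
            arcs.append((head, lexical_head(child)))  # untyped head -> dependent
        to_dependencies(child, arcs)
    return arcs
```

On the toy tree this yields the arcs makes → Bell and makes → products, i.e. exactly the dependency structure a parser like MaltParser would build directly.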
00:03:07.101 --> 00:03:11.410 So it’d seem more useful if we directly connected “based” and “LA”, 00:03:11.410 --> 00:03:15.034 and we introduced the relationship of “prep_in”. 00:03:15.911 --> 00:03:20.734 And so that’s what we do, and we simplify the structure. 00:03:20.734 --> 00:03:22.982 But there are some other places, too, 00:03:22.982 --> 00:03:29.649 in which we can do a better job at representing the semantics with some modifications of the graph structure. 00:03:29.649 --> 00:03:34.868 And a particular case of this is these coordination relationships. 00:03:34.868 --> 00:03:40.393 So we very directly got here that “Bell makes products”. 00:03:40.393 --> 00:03:44.158 But we’d also like to get out that Bell distributes products, 00:03:44.158 --> 00:03:51.819 and one way we could do that is by recognizing this “and” relationship 00:03:51.819 --> 00:04:01.820 and saying “Okay, well that means that ‘Bell’ should also be the subject of ‘distributing’ 00:04:03.159 --> 00:04:07.493 and what they distribute is ‘products.’” 00:04:09.432 --> 00:04:11.315 And similarly down here, 00:04:11.315 --> 00:04:21.104 we can recognize that they’re computer products as well as electronic products. 00:04:21.781 --> 00:04:24.606 So we can make those changes to the graph, 00:04:24.606 --> 00:04:28.118 and get a reduced graph representation. 00:04:28.595 --> 00:04:33.489 Now, once you do this, there are some things that are not as simple. 00:04:33.489 --> 00:04:38.857 In particular, if you look at this structure, it’s no longer a dependency tree 00:04:38.857 --> 00:04:43.019 because we have multiple arcs pointing at this node, 00:04:43.019 --> 00:04:46.128 and multiple arcs pointing at this node. 00:04:47.251 --> 00:04:48.569 But on the other hand, 00:04:48.569 --> 00:04:54.588 the relations that we’d like to extract are represented much more directly. 00:04:54.588 --> 00:04:58.006 And let me just show you one graph that gives an indication of this.
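The two graph transformations just described — collapsing a preposition into a single typed arc like prep_in, and copying subject/object arcs onto coordinated heads — can each be written as a small rewrite over an edge list. This is a sketch under assumed relation names (prep, pobj, conj_and, as in the older collapsed Stanford dependencies), not the library's actual implementation.

```python
def collapse_preps(edges):
    """Replace head -prep-> in -pobj-> LA with a single head -prep_in-> LA arc."""
    out = [e for e in edges if e[1] not in ("prep", "pobj")]
    for (h, r, p) in edges:
        if r == "prep":
            for (h2, r2, d) in edges:
                if h2 == p and r2 == "pobj":
                    out.append((h, "prep_" + p, d))
    return out

def propagate_conj(edges):
    """Copy subject/object arcs onto coordinated heads, so 'Bell' also becomes
    the subject of 'distributes' in 'Bell makes and distributes products'.
    Note the result is a graph, not a tree: 'Bell' now has two incoming arcs."""
    extra = []
    for (h, r, d) in edges:
        if r == "conj_and":
            for (h2, r2, d2) in edges:
                if h2 == h and r2 in ("nsubj", "dobj"):
                    extra.append((d, r2, d2))
    return edges + extra

# Assumed toy parses for the two examples in the lecture.
based = [("based", "prep", "in"), ("in", "pobj", "LA")]
bell = [("makes", "nsubj", "Bell"), ("makes", "dobj", "products"),
        ("makes", "conj_and", "distributes")]
```

Running `collapse_preps(based)` leaves the single arc ("based", "prep_in", "LA"); `propagate_conj(bell)` adds ("distributes", "nsubj", "Bell") and ("distributes", "dobj", "products"), which is exactly why the result is no longer a tree.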
00:04:58.652 --> 00:05:06.422 So, this was a graph that was originally put together by Jari Björne et al, 00:05:06.422 --> 00:05:12.465 who were the team that won the BioNLP 2009 shared task in relation extraction 00:05:12.465 --> 00:05:17.498 using, as the representational substrate, Stanford dependencies. 00:05:17.498 --> 00:05:20.677 And what they wanted to illustrate with this graph 00:05:20.677 --> 00:05:25.231 is how much more effective dependency structures were 00:05:25.231 --> 00:05:30.860 at linking up the words that you wanted to extract in a relation, 00:05:30.860 --> 00:05:34.757 than simply looking for words in the linear context. 00:05:35.434 --> 00:05:40.401 So, here what we have is that this is the distance, 00:05:40.925 --> 00:05:45.891 which can be measured either by just counting words to the left or right, 00:05:45.891 --> 00:05:50.042 or by counting the number of dependency arcs that you have to follow. 00:05:50.042 --> 00:05:53.324 And this is the percent of the time that it occurred. 00:05:53.324 --> 00:05:56.337 And so what you see is, if you just look at linear distance, 00:05:56.337 --> 00:06:02.892 there are lots of times that there are arguments and relations that you want to connect up 00:06:02.892 --> 00:06:06.223 that are four, five, six, seven, eight words away. 00:06:06.223 --> 00:06:11.726 In fact, there’s even a pretty large residue here of well over ten percent 00:06:11.726 --> 00:06:16.768 where the linear distance away in words is greater than ten words. 00:06:16.768 --> 00:06:21.176 If, on the other hand, you are trying to identify 00:06:21.176 --> 00:06:25.636 and relate the arguments of relations by looking at the dependency distance, 00:06:25.636 --> 00:06:30.460 then what you’d discover is that the vast majority of the arguments 00:06:30.460 --> 00:06:35.428 are very close-by neighbors in terms of dependency distance.
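The two distance measures being compared in that graph are easy to make concrete: linear distance counts word positions between the trigger and an argument, while dependency distance counts arcs on the shortest path between them in the parse graph. The sketch below uses an assumed, simplified parse of part of the KaiC sentence (it is not Björne et al.'s code) and a plain breadth-first search.

```python
from collections import deque

WORDS = ["results", "demonstrated", "that", "KaiC", "interacts",
         "rhythmically", "with", "SasA"]
# Assumed head -> dependent arcs for this fragment.
EDGES = [("demonstrated", "results"), ("demonstrated", "interacts"),
         ("interacts", "that"), ("interacts", "KaiC"),
         ("interacts", "rhythmically"), ("interacts", "SasA"),
         ("SasA", "with")]

def linear_distance(a, b):
    """Surface distance: how many word positions apart the two words are."""
    return abs(WORDS.index(a) - WORDS.index(b))

def dependency_distance(a, b):
    """Shortest-path length over the arcs, ignoring arc direction."""
    neighbors = {}
    for h, d in EDGES:
        neighbors.setdefault(h, []).append(d)
        neighbors.setdefault(d, []).append(h)
    queue, dist = deque([a]), {a: 0}
    while queue:
        u = queue.popleft()
        if u == b:
            return dist[u]
        for v in neighbors[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None  # not connected
```

Even in this short fragment, "KaiC" and "SasA" are four words apart on the surface but only two arcs apart in the parse (KaiC ← interacts → SasA), which is the pattern the Björne et al. graph shows at scale.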
00:06:35.428 --> 00:06:42.068 So, about 47 percent of them are direct dependencies and another 30 percent are at distance two. 00:06:42.068 --> 00:06:48.512 So take those together and that’s greater than three quarters of the dependencies that you want to find. 00:06:48.512 --> 00:06:51.537 And then this number trails away quickly. 00:06:51.537 --> 00:06:59.431 So there are virtually no arguments of relations that aren’t fairly close together in dependency distance, 00:06:59.431 --> 00:07:02.621 and it’s precisely because of this reason that you can get 00:07:02.621 --> 00:07:09.617 a lot of mileage in doing relation extraction by having a representation like dependency syntax. 00:07:11.447 --> 00:07:16.050 Okay, I hope that’s given you some idea of why knowing about syntax is useful 00:07:16.050 --> 99:59:59.999 when you want to do various semantic tasks in natural language processing.