In this segment I'm going to show you that dependency syntax is a very natural representation for relation extraction applications.

One domain in which a lot of work has been done on relation extraction is the biomedical text domain. So here, for example, we have the sentence "The results demonstrated that KaiC interacts rhythmically with SasA, KaiA, and KaiB," and what we'd like to get out of that is a protein interaction event. So here's the "interacts" that indicates the relation, and these are the proteins involved; and there are a bunch of other proteins involved as well.

Well, the point here is that if we have this kind of dependency syntax, then it's very easy, starting from "interacts," to follow along the arcs to the subject and to the argument of the preposition "with," and to easily read off the relation that we'd like to extract. And if we're just a little bit cleverer, we can then also follow along the conjunction relations and see that KaiC is also interacting with these other two proteins.

And that's something that a lot of people have worked on. In particular, one representation that has been widely used for relation extraction applications in biomedicine is the Stanford dependencies representation. The basic form of this representation is a projective dependency tree.
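The arc-following idea can be sketched in a few lines of Python. The edge list below is a hand-built approximation of the parse of the KaiC sentence, not the output of an actual parser, with labels in the Stanford dependencies style; `extract_interactions` and the hard-coded `interacts` trigger are illustrative, not part of any real toolkit.

```python
# Hand-built typed dependency arcs (head, relation, dependent) for
# "KaiC interacts rhythmically with SasA, KaiA and KaiB".
EDGES = [
    ("interacts", "nsubj", "KaiC"),
    ("interacts", "advmod", "rhythmically"),
    ("interacts", "prep_with", "SasA"),
    ("SasA", "conj_and", "KaiA"),
    ("SasA", "conj_and", "KaiB"),
]

def extract_interactions(edges, trigger="interacts"):
    """Pair the subject of the trigger word with each prep_with argument,
    also expanding conjuncts of that argument."""
    deps = {}
    for head, rel, dep in edges:
        deps.setdefault(head, []).append((rel, dep))
    subjects = [d for r, d in deps.get(trigger, []) if r == "nsubj"]
    objects = []
    for r, d in deps.get(trigger, []):
        if r == "prep_with":
            objects.append(d)
            # the "little bit cleverer" step: follow coordination arcs
            objects += [d2 for r2, d2 in deps.get(d, []) if r2.startswith("conj")]
    return [(s, trigger, o) for s in subjects for o in objects]

print(extract_interactions(EDGES))
# -> [('KaiC', 'interacts', 'SasA'), ('KaiC', 'interacts', 'KaiA'),
#     ('KaiC', 'interacts', 'KaiB')]
```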
And it was designed that way so it could be easily generated by postprocessing phrase structure trees. So if you have a notion of headedness in the phrase structure tree, the Stanford dependencies software provides a set of matching pattern rules that will then type the dependency relations and give you a Stanford dependency tree. But Stanford dependencies can also be, and now increasingly are, generated directly by dependency parsers such as the MaltParser that we looked at recently.

Okay, so this is roughly what the representation looks like. It's just as we saw before, with the words connected by typed dependency arcs.

But something that has been explored in the Stanford dependencies framework is, starting from that basic dependencies representation, making some changes to it to facilitate relation extraction applications. And the idea here is to emphasize the relationships between content words, which are the ones useful for relation extraction. Let me give a couple of examples. One example is that commonly you'll have a content word like "based," and where the company is based, Los Angeles, is separated from it by the preposition "in," a function word. And you can think of these function words as really functioning like case markers in a lot of other languages.
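A toy sketch of how dependencies can be read off a headed phrase structure tree: each non-head child of a phrase contributes an arc from the phrase's head word. The tree encoding and the tiny `HEAD_RULES` table below are made-up stand-ins for the real head-finding and pattern-matching machinery in the Stanford dependencies software, and the arcs here come out untyped.

```python
# A node is (label, children); a preterminal's children list holds one word.
TREE = ("S", [
    ("NP", [("NNP", ["Bell"])]),
    ("VP", [("VBZ", ["makes"]),
            ("NP", [("NNS", ["products"])])]),
])

# Simplified head rules: which child label supplies the head of each phrase.
HEAD_RULES = {"S": ["VP"], "VP": ["VBZ"], "NP": ["NNS", "NNP"]}

def lexical_head(node):
    """Percolate heads upward to find a subtree's head word."""
    label, children = node
    if isinstance(children[0], str):          # preterminal, e.g. ("NNP", ["Bell"])
        return children[0]
    for want in HEAD_RULES.get(label, []):
        for child in children:
            if child[0] == want:
                return lexical_head(child)
    return lexical_head(children[0])          # fallback: leftmost child

def dependencies(node, deps=None):
    """Each non-head child contributes a (head word, dependent word) arc."""
    label, children = node
    if isinstance(children[0], str):
        return deps or []
    if deps is None:
        deps = []
    h = lexical_head(node)
    for child in children:
        if lexical_head(child) != h:
            deps.append((h, lexical_head(child)))
        dependencies(child, deps)
    return deps

print(dependencies(TREE))   # arcs from "makes" to "Bell" and "products"
```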
So it would seem more useful if we directly connected "based" and "LA," introducing the relation "prep_in." And so that's what we do, and we simplify the structure.

But there are some other places, too, where we can do a better job of representing the semantics with some modifications to the graph structure. A particular case of that is coordination relationships. So here we very directly get that "Bell makes products." But we'd also like to get out that Bell distributes products, and one way we can do that is by recognizing this "and" relationship and saying, okay, well, that means "Bell" should also be the subject of "distributes," and what they distribute is "products." And similarly, down here, we can recognize that they're computer products as well as electronic products. So we can make those changes to the graph, and get a reduced graph representation.

Now, once you do this, there are some things that are not as simple. In particular, if you look at this structure, it's no longer a dependency tree, because we have multiple arcs pointing at this node, and multiple arcs pointing at this node. But on the other hand, the relations that we'd like to extract are represented much more directly.
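The two graph rewrites just described, folding a preposition into the arc label and copying subject and object arcs onto conjuncts, can be sketched as follows. The edge lists are hand-built for the "based in LA" and "Bell makes and distributes products" examples, and the function names are illustrative rather than the actual Stanford dependencies API.

```python
# Hand-built basic dependencies for the two examples in the text.
BASIC = [
    ("makes", "nsubj", "Bell"),
    ("makes", "dobj", "products"),
    ("makes", "conj_and", "distributes"),
    ("based", "prep", "in"),
    ("in", "pobj", "LA"),
]

def collapse_preps(edges):
    """Rewrite prep + pobj pairs into a single arc like prep_in."""
    out, prep_obj = [], {}
    for h, r, d in edges:
        if r == "pobj":
            prep_obj[h] = d                    # preposition -> its object
    for h, r, d in edges:
        if r == "prep" and d in prep_obj:
            out.append((h, "prep_" + d, prep_obj[d]))   # based -prep_in-> LA
        elif r != "pobj":
            out.append((h, r, d))
    return out

def propagate_conjuncts(edges):
    """Copy nsubj/dobj arcs of a head onto its conj_and conjuncts."""
    out = list(edges)
    for h, r, d in edges:
        if r.startswith("conj"):
            for h2, r2, d2 in edges:
                if h2 == h and r2 in ("nsubj", "dobj"):
                    out.append((d, r2, d2))    # distributes also gets the arc
    return out

graph = propagate_conjuncts(collapse_preps(BASIC))
```

Note that after the second rewrite, "products" has two incoming arcs, which is exactly why the result is a graph rather than a dependency tree.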
And let me just show you one graph that gives an indication of this. So, this was a graph that was originally put together by Jari Björne et al., the team that won the BioNLP 2009 shared task in relation extraction using Stanford dependencies as the representational substrate. And what they wanted to illustrate with this graph is how much more effective dependency structures were at linking up the words that you want to extract in a relation than simply looking for words in the linear context.

So, here what we have along this axis is the distance, which can be measured either by just counting words to the left or right, or by counting the number of dependency arcs that you have to follow; and this axis is the percentage of cases. And so what you see is, if you just look at linear distance, there are lots of cases where the arguments and relations you want to connect are four, five, six, seven, eight words away. In fact, there's even a pretty large residue here, well over ten percent, where the linear distance in words is greater than ten.
If, on the other hand, you try to relate the arguments of relations by looking at dependency distance, then what you discover is that the vast majority of the arguments are very close neighbors in terms of dependency distance. So, about 47 percent of them are direct dependencies, and another 30 percent are at distance two. Take those together, and that's greater than three quarters of the dependencies that you want to find, and the numbers trail away quickly after that. So there are virtually no arguments of relations that aren't fairly close together in dependency distance, and it's precisely because of this that you can get a lot of mileage in doing relation extraction by having a representation like dependency syntax.

Okay, I hope that's given you some idea of why knowing about syntax is useful when you want to do various semantic tasks in natural language processing.
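To make the linear-versus-dependency distance comparison concrete, here is a small sketch on a shortened version of the KaiC sentence. The token positions and (untyped) dependency edges are hand-built for illustration; dependency distance is just the shortest path in the undirected dependency graph.

```python
from collections import deque

WORDS = ["The", "results", "demonstrated", "that", "KaiC",
         "interacts", "rhythmically", "with", "SasA"]
EDGES = [("demonstrated", "results"), ("results", "The"),
         ("demonstrated", "interacts"), ("interacts", "that"),
         ("interacts", "KaiC"), ("interacts", "rhythmically"),
         ("interacts", "SasA"), ("SasA", "with")]

def linear_distance(a, b):
    """Distance counted in surface word positions."""
    return abs(WORDS.index(a) - WORDS.index(b))

def dependency_distance(a, b):
    """Number of dependency arcs on the shortest undirected path (BFS)."""
    adj = {}
    for h, d in EDGES:
        adj.setdefault(h, set()).add(d)
        adj.setdefault(d, set()).add(h)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nb in adj.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return None   # not connected

print(linear_distance("KaiC", "SasA"))       # 4 words apart on the surface
print(dependency_distance("KaiC", "SasA"))   # but only 2 arcs apart
```

Even in this toy sentence, the two relation arguments are twice as far apart linearly as they are in the dependency graph, which is the pattern the Björne et al. plot shows at scale.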