So, in this video, we'll graduate beyond the domain of just vanilla binary search trees, like we've been talking about before, and we'll start talking about balanced binary search trees. These are the search trees you'd really want to use when you want real-time guarantees on your operation time, because they're search trees which are guaranteed to stay balanced, which means the height is guaranteed to stay logarithmic, which means all of the operations search trees support that we know and love will also be logarithmic in the number of keys that they're storing. So, let's just quickly recap. What is the basic search tree property? It should be the case that at every single node of your search tree, if you go to the left, you'll only see keys that are smaller than where you started, and if you go to the right, you'll only see keys that are bigger than where you started. And a really important observation is that, given a set of keys, there are going to be lots and lots of legitimate, valid binary search trees with those keys. So, we've been having these running examples with the keys one, two, three, four, five. On the one hand, you can have a nice and balanced search tree that has height only two with the keys one through five. On the other hand, you can also have these crazy chains, basically devolved into linked lists, where the height for n elements could be as high as n - 1. So, in general, you could have an exponential difference in the height. It can be as small, in the best case, as logarithmic, and as big, in the worst case, as linear. So, this obviously motivates search trees that have the additional property that you never have to worry about their height. You know they're going to be well balanced. You know they're going to have logarithmic height. You're never worried about them having this really lousy linear height. Remember why it's so important to have a small height?
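To make that height gap concrete, here's a small sketch in Python (not from the lecture; the `Node` class and `height` helper are just illustrative) that builds both shapes on the keys one through five and measures their heights:

```python
# A minimal BST node; "height" counts edges on the longest root-to-null path.

class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def height(node):
    if node is None:
        return -1  # empty tree: height -1, so a single node has height 0
    return 1 + max(height(node.left), height(node.right))

# Balanced search tree on the keys 1..5: height 2.
balanced = Node(3, Node(1, None, Node(2)), Node(4, None, Node(5)))

# Degenerate "chain" on the same keys, basically a linked list: height n - 1 = 4.
chain = Node(1, None, Node(2, None, Node(3, None, Node(4, None, Node(5)))))

print(height(balanced))  # 2
print(height(chain))     # 4
```

Both trees satisfy the search tree property on the same keys; only the shape, and hence the height, differs.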
It's because the running time of all of the operations of search trees depends on the height. You want to do searches, you want to do insertions, you want to find predecessors or whatever, the height is going to be what governs the running time of all of those operations. So, the high-level idea behind balanced search trees is really exactly what you'd think. The height can't be any better than logarithmic in the number of things you're storing; that's because the trees are binary, so the number of nodes can at most double at each level, and you need a logarithmic number of levels to accommodate everything that you're storing. Since it's got to be at least logarithmic, let's make sure it stays logarithmic all the time, even as we do insertions and deletions. If we can do that, then we get a very rich collection of supported operations all running in logarithmic time. As usual, n denotes the number of keys being stored in the tree. There are many, many different balanced search trees, and most of them are not super different from each other. I'm going to talk about one of the more popular ones, which are called red-black trees. These were invented back in the '70s. They were not the first balanced binary search tree data structures; that honor belongs to AVL trees, which again are not very different from red-black trees, though the invariants are slightly different. Another thing you might want to look up and read about is a very cool data structure called splay trees, due to Sleator and Tarjan. Unlike red-black trees and AVL trees, which are only modified on insertions and deletions, which, if you think about it, is sort of what you'd expect, splay trees modify themselves even when you're doing lookups, even when you're doing searches. So, they're sometimes called self-adjusting trees for that reason. And they're super simple, but they still have kind of amazing guarantees.
And then finally, going beyond just the binary tree paradigm, many of you might want to look up examples of B-trees or also B+ trees. These are very relevant for implementing databases. Here the idea is that in a given node you're going to have not just one key but many keys, and from a node you have multiple branches that you can take, depending on where what you're searching for falls with respect to the multiple keys at that node. The motivation in a database context for going beyond the binary paradigm is to have a better match with the memory hierarchy. So, that's also very important, although a little bit out of scope here. That said, much of the intuition we discuss about red-black trees will translate to all of these other balanced tree data structures, if you ever find yourself in a position where you need to learn more about them. So, red-black trees are just the same as binary search trees, except they also always maintain a number of additional invariants. And what I'm going to focus on in this video is, first of all, what the invariants are, and then how the invariants guarantee that the height will be logarithmic. Time permitting, at some point, there will be optional videos more about the guts, more about the implementation of red-black trees, namely, how do you maintain these invariants under insertions and deletions. That's quite a bit more complicated, so it's appropriate for optional material. But understanding what the invariants are and what role they play in controlling the height is very accessible, and it's something I think every programmer should know. So, I'm going to write down four invariants, and really, the bite comes from the last two, okay, from the third and the fourth invariant. The first two invariants, you know, are really just cosmetic.
So, for the first one, we're going to store one additional bit of information at each node, beyond just the key, and we'll use this bit to indicate whether it's a red or a black node. You might be wondering, you know, why red-black? Well, I asked my colleague, Leo Guibas, about that a few years ago. And he told me that when he and Professor Sedgewick were writing up this article, the journals had just gotten access to a certain kind of new printing technology that allowed very limited color in the printed copies of the journals. And so, they were eager to use it, and so they named the data structure red-black, so they could have these nice red and black pictures in the journal article. Unfortunately, there was then some snafu, and at the end of the day, that technology wasn't actually available, so it wasn't actually printed the way they were envisioning, but the name has stuck. So, that's the rather idiosyncratic reason why these data structures got the name that they did, red-black trees. Secondly, we're going to maintain the invariant that the root of the search tree is always black; it can never be red. Okay. So, with the superficial pair of invariants out of the way, let's go to the two main ones. So, first of all, we're never going to allow two reds in a row. By which I mean, if you have a red node in the search tree, then its children must be black. If you think about it for a second, you realize this also implies that if a node is red and it has a parent, then that parent has to be a black node. So, in that sense, there are no two red nodes in a row anywhere in the tree. And the final invariant, which is also rather severe, is that every path you might take from the root to a null pointer passes through exactly the same number of black nodes. So, to be clear on what I mean by a root-null path, what you should think about is an unsuccessful search, right?
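As a sketch of how these four invariants might be checked in code (the node layout and helper names here are my own illustrative choices, not any standard library API), a single recursive pass can verify invariants three and four together:

```python
# Illustrative red-black invariant checker, assuming colors 'R' and 'B'.

class RBNode:
    def __init__(self, key, color, left=None, right=None):
        self.key = key
        self.color = color  # invariant 1: every node stores a color bit
        self.left = left
        self.right = right

def black_height(node):
    # Returns the common number of black nodes on every path from this
    # node down to a null pointer, or None if invariant 3 (no two reds
    # in a row) or invariant 4 (equal black counts) is violated below.
    if node is None:
        return 0
    for child in (node.left, node.right):
        if node.color == 'R' and child is not None and child.color == 'R':
            return None  # two reds in a row: invariant 3 broken
    hl, hr = black_height(node.left), black_height(node.right)
    if hl is None or hr is None or hl != hr:
        return None  # invariant 4 broken (or a failure propagated up)
    return hl + (1 if node.color == 'B' else 0)

def is_red_black(root):
    if root is not None and root.color != 'B':
        return False  # invariant 2: the root must be black
    return black_height(root) is not None

# The perfectly balanced tree on 3, 5, 7 with every node black is valid:
t = RBNode(5, 'B', RBNode(3, 'B'), RBNode(7, 'B'))
print(is_red_black(t))  # True
```

The checker assumes the search tree property already holds and only audits the coloring, which is exactly the extra structure the invariants impose.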
So, what happens in an unsuccessful search? You start at the root, and depending on whether you need to go smaller or bigger, you go left or right, respectively. You keep going left or right as appropriate, until eventually you hit a null pointer. So, I want you to think about the process in which you start at the root and, eventually, fall off the end of the tree. In doing so, you traverse some number of nodes. Some of those nodes will be black, some of those nodes will be red. And I want you to keep track of the number of black nodes. The constraint that a red-black tree, by definition, must satisfy is that no matter what path you take through the tree, starting from the root and terminating at a null pointer, the number of black nodes traversed has to be exactly the same. It cannot depend on the path; it has to be exactly the same on every single root-null path. Let's move on to some examples. So, here's a claim. And this is meant to, kind of, whet your appetite for the idea that red-black trees must be pretty balanced; they have to have height basically logarithmic. So, remember, what's the most unbalanced search tree? Well, that's these chains. So, the claim is, even a chain with three nodes cannot be a red-black tree. So, what's the proof? Well, consider such a search tree, say with the key values one, two, and three. So, the question that we're asking is, is there a way to color these three nodes red and black so that all four of the invariants are satisfied? So, we need to color each node red or black. Remember, invariant two says the root, the node one, has to be black. So, we have four possibilities for how to color two and three. But really, because of the third invariant, we only have three possibilities. We can't color two and three both red, because then we'd have two reds in a row. So, we can either make two red and three black, two black and three red, or both two and three black. And all of these cases are similar.
Just to give one example, suppose that we colored the node two red, and one and three are black. The claim is that invariant four has been broken, and invariant four is going to be broken no matter how we try to color two and three red and black. What does invariant four say? It says that on any unsuccessful search, you pass through the same number of black nodes. And so, one unsuccessful search would be: you search for zero. And if you search for zero, you go to the root, you immediately go left, and you hit a null pointer. So, you see exactly one black node, namely, the one. On the other hand, suppose you searched for four. Then you'd start at the root, you'd go right to two, you'd go right to three, you'd go right again, and only then would you get a null pointer. And on that unsuccessful search, you'd encounter two black nodes, both the one and the three. So, it's a violation of the fourth invariant, and therefore this would not be a red-black tree. I'll leave it for you to check that, no matter how you try to color two and three red or black, you're going to break one of the invariants. If they're both red, you'd break the third invariant. If at most one is red, you'd break the fourth invariant. So, that's a non-example of a red-black tree. Now let's look at an example of a red-black tree, a search tree where you can actually color the nodes red and black so that all four invariants are maintained. So, one search tree which is very easy to make red-black is a perfectly balanced one. For example, let's consider the three-node search tree with the keys three, five, and seven, and let's suppose the five is the root. So, it has one child on each side, the three and the seven. Can this be made a red-black tree? So, remember what that question really means. It's asking, can we color these three nodes some combination of red and black so that all four of the invariants are satisfied?
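The case check left to you above is small enough to brute-force. Here's a hedged sketch (the helper names and the hand-listed paths are my own, not from the lecture) that tries all eight colorings of the chain 1 → 2 → 3, where each node is the right child of the previous one:

```python
# Exhaustively verify: no coloring of the three-node chain satisfies
# all four red-black invariants.

from itertools import product

def black_counts(colors):
    # colors = (c1, c2, c3) for the chain nodes 1, 2, 3.
    # The distinct root-null paths, listed by the nodes they visit:
    #   search for 0:   visit 1        -> blacks among (c1,)
    #   search for 1.5: visit 1, 2     -> blacks among (c1, c2)
    #   search for 4:   visit 1, 2, 3  -> blacks among (c1, c2, c3)
    paths = [colors[:1], colors[:2], colors[:3]]
    return [sum(1 for c in p if c == 'B') for p in paths]

def valid(colors):
    c1, c2, c3 = colors
    if c1 != 'B':
        return False                        # invariant 2: black root
    if ('R', 'R') in [(c1, c2), (c2, c3)]:
        return False                        # invariant 3: no two reds in a row
    return len(set(black_counts(colors))) == 1  # invariant 4: equal black counts

print(any(valid(c) for c in product('RB', repeat=3)))  # False
```

Every coloring fails: both children red breaks invariant three, and every other choice leaves root-null paths with unequal black counts, breaking invariant four.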
If you think about it a little bit, you realize, yeah, you can definitely color these nodes red or black so as to satisfy all four invariants. In particular, suppose we color all three of the nodes black. We've satisfied invariant number one; we've colored all the nodes. We've satisfied invariant number two; in particular, the root is black. We've satisfied invariant number three; there are no reds at all, so there are certainly no two reds in a row. And, if you think about it, we've satisfied invariant four, because this tree is perfectly balanced. No matter what you unsuccessfully search for, you're going to encounter two black nodes. If you search for, say, one, you're going to encounter three and five. If you search for, say, six, you're going to encounter five and seven. So, all root-null paths have exactly two black nodes, and invariant number four is also satisfied. So, that's great. But, of course, the whole point of having a binary search tree data structure is that you want it to be dynamic. You want to accommodate insertions and deletions. Every time you have an insertion into a red-black tree, you get a new node, and you have to color it something. And now, all of a sudden, you've got to worry about breaking one of these four invariants. So, let me just show you some easy cases where you can accommodate insertions without too much work. Time permitting, we will include some optional videos on the notion of rotations, which do more fundamental restructuring of search trees so that they can maintain the four invariants and stay nearly perfectly balanced. So, if we have this red-black tree where everything's black, and we insert, say, a six, that's going to get inserted down here, as the left child of the seven. Now, if we try to color it black, it's no longer going to be a red-black tree.
And that's because, if we do an unsuccessful search now for, say, 5.5, we're going to encounter three black nodes, whereas if we do an unsuccessful search for one, we only encounter two black nodes. So, that's not going to work. But the way we can fix it is that instead of coloring the six black, we color it red. And now, this six is basically invisible to invariant number four; it doesn't add to the black-node count of any root-null path. So, because you had two black nodes on all root-null paths before the six was there, that's still true now that you have this red six. So, all four invariants are satisfied once you insert the six and color it red. If we then insert, say, an eight, we can pull exactly the same trick; we can color the eight red. Again, it doesn't participate in invariant four at all, so we haven't broken it. Moreover, we still don't have two reds in a row, so we haven't broken invariant number three either. So, this is yet another red-black tree. In fact, this is not the unique way to color the nodes of this search tree so that it satisfies all four of the invariants. If we, instead, recolor six and eight black, but at the same time recolor the node seven red, we're also golden. Clearly, the first three invariants are all satisfied. But also, in pushing the red upward, consolidating the red at six and eight and putting it at seven instead, we haven't changed the number of black nodes on any given path. Any path that previously went through six went through seven, and anything that went through eight went through seven, so there's exactly the same number of black nodes on each such path as there was before. So, all paths still have an equal number of black nodes, and invariant four remains satisfied. As I said, I've shown you here only simple examples, where you don't have to do much work on an insertion to retain the red-black properties.
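Both colorings from this example can be checked mechanically. Here's a small illustrative sketch (the `check` helper and the hand-listed paths are my own, not a general red-black implementation) for the five-node tree with root 5, children 3 and 7, and 6 and 8 as the children of 7:

```python
# Verify the two valid colorings of the tree 5 / {3, 7} / {6, 8}.

def check(colors):
    # colors: dict mapping key -> 'R' or 'B'
    if colors[5] != 'B':
        return False  # invariant 2: black root
    # Parent-child edges of this particular tree:
    edges = [(5, 3), (5, 7), (7, 6), (7, 8)]
    if any(colors[p] == colors[c] == 'R' for p, c in edges):
        return False  # invariant 3: no two reds in a row
    # Every root-null path, as the sequence of keys it passes through:
    paths = [[5, 3], [5, 7, 6], [5, 7, 8]]
    counts = {sum(1 for k in p if colors[k] == 'B') for p in paths}
    return len(counts) == 1  # invariant 4: equal black counts

# Coloring A: 6 and 8 red, everything else black.
print(check({5: 'B', 3: 'B', 7: 'B', 6: 'R', 8: 'R'}))  # True
# Coloring B: push the red up to 7; 6 and 8 black.
print(check({5: 'B', 3: 'B', 7: 'R', 6: 'B', 8: 'B'}))  # True
# All black fails: paths through 6 or 8 see three blacks, the path to 3 sees two.
print(check({5: 'B', 3: 'B', 7: 'B', 6: 'B', 8: 'B'}))  # False
```

Note that every root-null path sees exactly two black nodes under both valid colorings, which is precisely why pushing the red from {6, 8} up to 7 preserves invariant four.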
In general, if you keep inserting more and more stuff, and certainly if you do deletions, you have to work much harder to maintain those four invariants. Time permitting, we'll cover just a taste of it in some optional videos. So, what's the point of these seemingly arbitrary four invariants of a red-black tree? Well, the whole point is that if you satisfy these four invariants in your search tree, then your height is going to be small. And because your height is going to be small, all of your operations are going to be fast. So, let me give you a proof that if a search tree satisfies the four invariants, then it has super small height; in fact, no more than double the absolute minimum that you could conceivably have, about two times log base two of n. So, the formal claim is that every red-black tree with n nodes has height O(log n), or more precisely, at most 2 log base two of (n + 1). So, here's the proof. And what's nice about this proof is that it makes very obvious the role played by invariants three and four. Essentially, what the invariants guarantee is that a red-black tree has to look like a perfectly balanced tree, with at most a sort of factor-two inflation. So, let's see exactly what I mean. Let's begin with an observation, and this has nothing to do with red-black trees. Forget about the colors for a moment, and just think about the structure of binary trees. And let's suppose we have a lower bound on how long root-null paths are in the tree. So, for some parameter k, and go ahead and think of k as, like, ten if you want, suppose we have a tree where if you start from the root, no matter how you navigate left and right child pointers until you terminate at a null pointer, you have no choice but to see at least k nodes along the way. If that hypothesis is satisfied, then, if you think about it, the top of this tree has to be totally filled in.
So, the top of this tree has to include a perfectly balanced binary tree of depth k - 1. Let me draw a picture here of the case of k = 3. So, if no matter how you go from the root to a null pointer you have to see at least three nodes along the way, that means the top three levels of this tree have to be full. So, you have to have the root, it has to have both of its children, and they have to have all four of the grandchildren. The proof of this observation is by contradiction: if, in fact, you were missing some nodes in any of these top k levels, well, that would give you a way of hitting a null pointer while seeing fewer than k nodes. So, what's the point? The point is that this gives us a lower bound on the population of a search tree as a function of the lengths of its root-null paths. The size n of the tree must be at least the number of nodes in a perfectly balanced tree of depth k - 1, which is 2^k - 1. For example, when k = 3, it's 2^3 - 1, or 7. That's just a basic fact about binary trees, nothing about red-black trees. So, let's now combine that with the red-black tree invariants to see why red-black trees have to have small height. So again, to recap where we got to on the previous slide: the size n, the number of nodes in the tree, is at least 2^k - 1, where k is the fewest number of nodes you will ever see on a root-null path. So, let's rewrite this a little bit, and instead of having a lower bound on n in terms of k, let's have an upper bound on k in terms of n. So, the minimum number of nodes on a root-null path is bounded above by log base two of (n + 1). This is just adding one to both sides and taking the logarithm base two. So, what does this buy us? Well, now let's start thinking about red-black trees. So now, consider a red-black tree with n nodes. What does this say?
This says that the number of nodes, forget about red or black, just the number of nodes on some root-null path has to be at most log base two of (n + 1). At most, all of those nodes are black; maybe some of them are red. So, we can write: in a red-black tree with n nodes, there is a root-null path with at most log base two of (n + 1) black nodes. This is an even weaker statement than what we just proved. We proved that some root-null path must have at most log base two of (n + 1) total nodes, so certainly that path has at most log base two of (n + 1) black nodes. Now, let's apply the two knockout punches of our two invariants. Alright, so fundamentally, what is the fourth invariant telling us? It's telling us that if we look at any path in our red-black tree, we go from the root, we think about, let's say, an unsuccessful search, we go down to a null pointer. It says, if we think of the red nodes as invisible, if we don't count them in our tally, then we're only going to see basically a logarithmic number of nodes. But when we care about the height of the red-black tree, of course, we care about all of the nodes, the red nodes and the black nodes. So, so far we know that if we only count black nodes, then we're good; we only have log base two of (n + 1) nodes that we need to count. So, here's where the third invariant comes in. It says, well actually, black nodes are a majority of the nodes on any path, in a strong sense: there are no two reds in a row on any path. So, if we know the number of black nodes is small, then, because you can't have two reds in a row, the total number of nodes on the path is at most twice as large. In the worst case, you have a black root, then red, then black, then red, then black, then red, then black, et cetera.
In the worst case, the number of red nodes is equal to the number of black nodes, which doubles the length of the path once you start counting the red nodes as well. And this is exactly what it means for a tree to have logarithmic depth. So, this, in fact, proves the claim: if a search tree satisfies invariants one through four, in particular, if there are no two reds in a row and all root-null paths have an equal number of black nodes, then, knowing nothing else about this search tree, it's got to be almost balanced; it's perfectly balanced up to a factor of two. And again, the point is that operations on the search tree are going to run in logarithmic time, because the height is what governs the running time of those operations. Now, in some sense, I've only told you the easy part, which is that if it just so happens that your search tree satisfies these four invariants, then you're good: the height is guaranteed to be small, so the operations are guaranteed to be fast. Clearly, that's exactly what you want from this data structure. But for the poor soul who has to actually implement this data structure, the hard work is maintaining these invariants even as the data structure changes. Remember, the point here is to be dynamic, to accommodate insertions and deletions. And insertions and deletions can disrupt these four invariants, and then one actually has to do some work to make sure they're satisfied again, so that the tree stays balanced and has low height, even under arbitrary sequences of insertions and deletions. We're not going to cover that in this video. It can be done without significantly slowing down any of the operations; it's pretty tricky and takes some nice ideas. There are a couple of well-known algorithms textbooks that cover those details. Or, if you look at open-source implementations of balanced search trees, you can look at code that implements them.
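The chain of bounds in the argument above can be written out as a short derivation (a sketch, using n for the number of nodes and k for the minimum number of nodes on a root-null path):

```latex
% Height bound for a red-black tree with n nodes.
\begin{align*}
  n &\ge 2^{k} - 1
    && \text{(the top $k$ levels of the tree are full)} \\
  k &\le \log_2(n + 1)
    && \text{(add 1 to both sides, take $\log_2$)} \\
  \text{black nodes on the shortest root-null path} &\le \log_2(n + 1)
    && \text{(at most all $k$ of its nodes are black)} \\
  \text{black nodes on \emph{every} root-null path} &\le \log_2(n + 1)
    && \text{(invariant 4: equal black counts)} \\
  \text{total nodes on every root-null path} &\le 2\log_2(n + 1)
    && \text{(invariant 3: no two reds in a row)}
\end{align*}
% Hence the height is at most 2 log2(n + 1) = O(log n).
```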
But, because it can be done in a practical way, and because red-black trees support such a rich array of operations, you will find them used in a number of practical applications. And that's why balanced search trees should be part of your programmer toolbox.