So, in this video, we'll graduate beyond
the domain of just vanilla binary search
trees, like we've been talking about
before, and we'll start talking about
balanced binary search trees.
These are the search trees you'd really
want to use when you want to have real
time guarantees on your operation time.
Cuz they're search trees which are
guaranteed to stay balanced, which means
the height is guaranteed to stay
logarithmic, which means all of the
operations search trees support that we
know and love, will also be a logarithmic
in the number of keys that they're
storing.
So, let's just quickly recap.
What is the basic structure tree property?
It should be the case that at every single
node of your search tree, if you go to the
left, you'll only see keys that are
smaller than where you started and if you
go to the right you only see keys that
are bigger than where you started.
And a really important observation, which
is that, given a set of keys, there's
going to be lot and lots of legitimate,
valid, binary search trees with those
keys.
So, we've been having these running
examples where the keys one, two, three,
four, five.
On the one hand, you can have a nice and
balanced search tree that has height only
two, with the keys one through five.
On the other hand, you can also have these
crazy chains, basically devolved to link
lists where the heights for, and elements
could be as high as N - 1.
So, in general, you could have an
exponential difference in the height.
It can be as small, in the best case, as
logarithmic and as big, in the worst case,
as linear.
So, this obviously motivates search trees
that have the additional property that you
never have to worry about their height.
You know they're going to be well
balanced.
You know they're going to have height
logarithmic.
You're never worried about them having
this really lousy linear height.
Remember, why it's so important to have a
small height?
It's because the running time of all of
the operations of search trees depends on
the height.
You want to do search, you want to
insertions, you want to find predecessors
or whatever, the height is going to be
what governs the running time of all those
properties.
So, the high level idea behind balanced search
trees is really exactly what you think,
which is that, you know, because the
height can't be any better than
logarithmic in the number of things you're
storing, that's because the trees are
binary so the number of nodes can only
double each level so you need a
logarithmic number of levels to
accommodate everything that you are
storing.
But it's got to be logarithmic, lets make
sure it stays logarithmic all the time,
even as we do insertions and deletions.
If we can do that, then we get a very rich
collection of supported operations all
running in logarithmic time.
As usual, n denotes, the number of
keys being stored in the tree.
There are many, many, many different
balanced search trees.
They're not super, most of them are not
super different from each other.
I'm going to talk about one of the more
popular ones which are called Red Black
Trees.
So, these were invented back in the '70s.
These were not the first balanced binary
search tree data structures, that honor
belongs to AVL trees, which again are not
very different from red black trees,
though the invariants are slightly
different.
Another thing you might want to look up
and read about is a very cool data
structure called splay trees, due to
Sleator and Tarjan,
These, unlike red black trees and AVL
trees, which only are modified on
insertions and deletions, which, if you
think about it, is sort of what you'd
expect.
Splay trees modify themselves, even when
you're doing look ups, even when you're
doing searches.
So, they're sometimes called
self-adjusting trees for that reason.
And it's super simple, but they still have
kind of amazing guarantees.
And then finally, going beyond the, just
the binary tree paradigm many of you might
want to look up examples of B trees or
also B+ trees.
These are very relevant for implementing
databases.
Here what the idea is, in a given node
you're going to have not just one key but
many keys and from a node, you have
multiple branches that you can take
depending where you're searching for falls
with respect to the multiple keys that are
at that node.
The motivation in a database context for
going beyond the binary paradigm, is to
have a better match up with the memory
hierarchy.
So, that's also very important, although a
little bit out of the scope here.
That said, what we discuss about red-black
trees, much of the intuition will
translate to all of these other balance
tree data structures, if you ever find
yourself in a position where you need to
learn more about them.
So, red black trees are just the same as
binary search trees, except they also
always maintain a number of additional
invariants.
And so, what I'm going to focus on in this
video is, first of all, what the
invariants are, and then how the
invariants guarantee that the height will
be logarithmic.
Time permitting, at some point, there will
be optional videos more about the guts,
more about the implementations of red
black trees namely how do you maintain
these invariants under insertions and
deletions.
That's quite a bit more complicated, so
that's appropriate for, for optional
material.
But understanding what the invariants are
and what role they play in controlling the
height is very accessible, and it's
something I think every programmer should
know.
So, there, I'm going to write down four
invariants and really, the bite comes from
the second two, okay, from the third and
the fourth invariant.
The first two invariants you know, are
just really cosmetic.
So, the first one we're going to store one
bit of information additionally at each
node, beyond just the key and we're going
call this bit as indicating whether it's a
red or a black node.
You might be wondering, you know, why red
black?
Well, I asked my colleague, Leo Guibas
about that a few years ago.
And he told me that when he and Professor
Sedgewick were writing up this article the
journals were, just had access to a
certain kind of new printing technology
that allowed very limited color in the
printed copies of the journals.
And so, they were eager to use it, and so
they named the data structure red black,
so they could have these nice red and
black pictures in the journal article.
Unfortunately, there was then some snafu,
and at the end of the day, that technology
wasn't actually available, so it wasn't
actually printed the way they were
envisioning it but the name has stuck.
So, that's the rather idiosyncratic reason
why these data structures got the name
that they did, red black trees.
So, secondly we're going to maintain the
invariant that the roots of the search
tree is always black, it can never be red.
Okay.
So, with the superficial pair of
invariants out of the way, let's go to the
two main ones.
So, first of all, we're never going to
allow two reds in a row.
By which, I mean, if you have a red node
in the search tree, then its children must
be black.
If you think about for a second, you
realize this also implies that if a notice
red, and it has a parent, then that
parent has to be a black node.
So, in that sense, there are no two red
nodes in a row anywhere in the tree.
And the final invariant which is also
rather severe is that every path you might
take from a root to a null pointer, passes
through exactly the same number of black
nodes.
So, to be clear on what I mean by a root
null path, what you should think about is an
unsuccessful search, right?
So, what happens in an unsuccessful
search, you start at the root depending on
whether you need to go smaller or bigger,
you go left or right respectably.
You keep going left right as appropriate
until eventually you hit a null pointer.
So, I want you to think about the process
that which you start at the root and then,
eventually, fall off the end of the tree.
In doing so, you traverse some number of
nodes.
Some of those nodes will be black some of
those nodes will be red.
And I want you to keep track of the number
of black nodes and the constraints that a
red black tree, by definition, must
satisfy, is that no matter what path you
take through the tree starting from the
root terminating at a null pointer, the
number of black nodes traversed, has to be
exactly the same.
It cannot depend on the path, it has to be
exactly the same on every single root null path.
Let's move on to some examples.
So, here's a claim.
And this is meant to, kind of, whet your
appetite for the idea that red black trees
must be pretty balanced.
They have to have height, basically
logarithmic.
So, remember, what's the most unbalanced
search tree?
Well, that's these chains.
So, the claim is, even a chain with three
nodes can not be a red black tree.
So, what's the proof?
Well, consider such a search tree.
So, maybe, with the key values one, two and three.
So, the question that we're asking is, is
there a way to color the node, these three
nodes, red and black so that all four of
the invariants are satisfied.
So, we need to color each red or black.
Remember, variant two says, the root, the
one has to be black.
So, we have four possibilities for how to
use the color two and three.
But really, because of the third
invariant, we only have three possibilities.
We can't color two and three both red, cuz
then we'd have two reds in a row.
So, we can either make two red, three
black, two black, three red, or both two
and three black.
And all of the cases are the same.
Just to give one example, suppose that we
colored the node two, red, and one and
three are black.
The claim is invariant four has been
broken and invariant four is going to be
broken no matter how we try to color two
and three red and black.
What is invariant four says?
It says, really on any unsuccessful
search, you pass through the same number
of black nodes.
And so, one unsuccessful search would be,
you search for zero.
And if you search for a zero, you go to
the root, you immediately go left to hit a
null pointer.
So, you see exactly one black node.
Namely one.
On the other hand, suppose you searched
for four, then you'd start at the root,
and you'd go right, and you go to two,
you'd go right, and you go to three, you'd
go right again, and only then will you get
a null pointer.
And on that, unsuccessful search, you'd
encounter two black nodes, both the one
and the three.
So, it's a violation of the fourth
invariant, therefore, this would not be a
red black tree.
I'll leave that for you to check, that no
matter how you try to code two and three
red or black, you're going to break one of
the invariants.
If they're both red, you'd break the third
invariant.
If at most one is red, you'd break the
fourth invariant.
So, that's a non-example of a red-black tree.
So, let's look at an example of a red-black tree.
One, a search tree where you can actually
color the nodes red or black so that all
four invariants are maintained.
So, one search tree which is very easy to
make red black is a perfectly balanced one.
So, for example, let's consider this three
nodes search tree has the keys three,
five, and seven and let's suppose the five
is the root.
So, it has one child on each side, the
three and the seven.
So, can this be made a red black tree?
So, remember what that question really
means.
It's asking can we color theses three
nodes some combination of red and black so
that all four of the invariants are
satisfied?
If you think about it a little bit, you
realize, yeah, you can definitely color
these nodes red or black to make and
satisfy for the invariants.
In particular, suppose we color all three
of the nodes, black.
We've satisfied variant number one, we've
colored all the nodes.
We've satisfied variant number two, and
particularly, the root is black.
We've satisfied invariant number three.
There's no reds at all, so there's
certainly no two reds in a row.
And, if you think about it, we've
satisfied invariant four because this tree
is perfectly balanced.
No matter what you unsuccessfully search
for, you're going to encounter two black
nodes.
If you search for, say, one, you're going
to encounter three and five.
If you search for, say, six, you're going
to encounter five and seven.
So, all root null paths have exactly
two black nodes and variant number four is
also satisfied.
So, that's great.
But, of course, the whole point of having
a binary search tree data structure is you
want to be dynamic.
You want to accommodate insertions and
deletions.
Every time you have an insertion or a
deletion into a red black tree, you get a
new node.
Let's say, an insertion, you get a new
node, you have to color it something.
And now, all of a sudden, you got to worry
about breaking one of these four
invariants.
So, let me just show you some easy cases
where you can accommodate insertions
without too much work.
Time permitting we will include some
optional videos with the notion of
rotations which do more fundamental
restructuring of search trees so that they
can maintain the four invariants, and stay
nearly perfectly balanced.
So, if we have this red black tree where
everything's black, and we insert, say,
six, that's going to get inserted down
here.
Now, if we try to color it black, it's no
longer going to be a red black tree.
And that's because, if we do an
unsuccessful search now for, say, 5.5,
we're going to encounter three black
nodes, where if we do an unsuccessful
search for one, we only encounter two
black nodes.
So, that's not going to work.
But the way we can fix it is instead of
coloring the six black, we color it red.
And now, this six is basically invisible
to invariant number four.
It doesn't show up in any root null
paths.
So, because you have two black nodes in
all roots in all paths before, before the
six was there, that's still true now that
you have this red six.
So, all four invariants are satisfied once
you insert the six and color it red.
If we then insert, say, an eight, we can
pull exactly the same trick, we can call
it an eight red.
Again, it doesn't participate in invariant
four at all so we haven't broken it.
Moreover, we still don't have two reds in
a row, so we haven't broken invariant
number three either.
So, this is yet another red black tree.
In fact, this is not the unique way to
color the nodes of this search tree, so
that it satisfies all four of the
invariants.
If we, instead, recolor six and eight
black, but at the same time, recolor the
node seven, red, we're also golden.
Clearly, the first three invariants are
all satisfied.
But also, in pushing the red upward,
consolidating the red at six and eight,
and putting it at seven instead, we
haven't changed the number of black nodes
on any given path.
Any black, any path that previously went
through six, went through seven, anything
that went through eight, went through
seven so there's exactly the same number
of red and black nodes on each such path
as there was before.
So, all paths still have equal number of
black nodes and invariant four remains
satisfied.
As I said, I've shown you here only simple
examples, where you don't have to do much
work on an insertion to retain the red
black properties.
In general, if you keep inserting more and
more stuff and certainly if you do the
deletions, you have to work much harder to
maintain those four invariants.
Time permitting, we'll cover just a taste
of it in some optional videos.
So, what's the point of these seemingly
arbitrary four invariants of a red black
tree?
Well, the whole point is that if you
satisfy these four invariants in your
search tree, then your height is going to
be small.
And because your height's going to be
small, all your operations are going to be
fast.
So, let me give you a proof that if a
search tree satisfies the four invariants,
then it has super small height.
In fact, no more than double the absolute
minimum that we conceivably have, almost
two times log base two of N.
So, the formal claim, is that every
red-black tree with N nodes, has height O
of log N, were precisely in those two
times log base two of N + 1.
So, here's the proof.
And what's clear about this proof is it's
very obvious the role played by this
invariants three and four.
Essentially, what the invariants guarantee
is that, a red black tree has to look like
a perfectly balanced tree with at most a
sort of factor two inflation.
So, let's see exactly what I mean.
So, let's begin with an observation.
And this, this has nothing to do with red
black trees.
Forget about the colors for a moment, and
just think about the structure of binary
trees.
And let's suppose we have a lower bound on
how long root null paths are in the tree.
So, for some parameter k, and go ahead and
think of k as, like, ten if you want.
Suppose we have a tree where if you start
from the root, and no matter how it is you
navigate left and right, child pointers
until you terminate in a null pointer.
No matter how you do it, you have no
choice but to see at least k nodes along
the way.
If that hypothesis is satisfied, then if
you think about it, the top of this tree
has to be totally filled in.
So, the top of this tree has to include a
perfectly balanced search tree, binary
tree of depth k - 1.
So, let me draw a picture here of the case
of k = three.
So, if no matter how you go from the root
to a null pointer, you have to see at
least three nodes along the way.
That means the top three levels of this
tree have to be full.
So, you have to have the root.
It has to have both of its children.
It has to have all four of its
grandchildren.
The proof of this observation is by
contradiction.
If, in fact, you were missing some nodes
in any of these top k levels.
We'll that would give you a way of hitting
a null pointer seeing less then k nodes.
So, what's the point is, the point is this
gives us a lower bound on the population
of a search tree as a function of the
lengths of its root null paths.
So, the size N of the tree must include at
least the number of nodes in a perfectly
balanced tree of depth k - 1 which is
2^k - 1, So, for example,
when k = 3, it's 2^3 (two cubed) - 1,
or 7 that's just a basic fact about trees,
nothing about red black trees.
So, let's now combine that with a red
black tree invariant to see why red black
trees have to have small height.
So again, to recap where we got to on the
previous slide.
The size N, the number of nodes in a tree,
is at least 2^k - 1, where k is
the fewest number of nodes you will ever
see on a root null path.
So, let's rewrite this a little bit and
let's actually say, instead of having a
lower bound on N in terms of k, let's have
an upper bound on k in terms of N.
So, the length of every root null path,
the minimum length of every root null
path is bounded above by log base two of
quantity N + 1.
This is just adding one to both sides and
taking the logarithm base two.
So, what does this buy us?
Well, now, let's start thinking about red
black trees.
So now, red black tree with N nodes.
What does this say?
This says that the number of nodes, forget
about red or black, just the number of
nodes on some root null path has to be the
most log base two of N + 1.
In the best case, all of those are black.
Maybe some of them are red, but in the,
in, the maximum case, all of them are
black.
So, we can write in a red black tree with
N nodes, there is a root null path with at
most log base two of N + 1, black nodes.
This is an even weaker statement than what
we just proved.
We proved that it have some, somehow must
have at most log based two, n + 1 total
nodes.
So, certainly, that path has the most log
base two of N + 1 black nodes.
Now, let's, now let's apply the two
knockout punches of our two invariants.
Alright, so fundamentally, what is the
fourth invariant telling us?
It's telling us that if we look at a path
in our red black tree, we go from the
root, we think about, let's say, that's an
unsuccessful search, we go down to a null
pointer.
It says, if we think of the red nodes as
invisible, if we don't count them in our
tally, then we're only going to see log,
basically a logarithmic number of nodes.
But when we care about the height of the
red black tree, of course, we care about
all of the nodes, the red nodes and the
black nodes.
So, so far we know, that if we only count
black nodes then we're good, We only have
log base two of N + 1 nodes that we need
to count.
So, here's where the third invariant comes
in.
It says, well actually, black nodes are a
majority of nodes in the tree.
In a strong sense, there are no two reds
in a row, on any path.
So, if we know the number of black nodes
is small, then because you can't have two
reds in a row, the number of total nodes
on the path is at most twice as large.
In the worst case, you have a black route,
then red, then black, then red, then
black, then red, then black, et cetera.
At the worst case, the number of red nodes
is equal to the number of black nodes,
which doubles the length of the path once
you start counting the red nodes as well.
And this is exactly what it means for a
tree to have a logarithmic depth.
So, this, in fact, proves the claim, if
the search trees satisfies the invariants
one through four, in particular if there's
no two reds in a row and all root null
paths have an equal number of black nodes,
then, knowing nothing else about this
search tree, it's got to be almost
balanced.
It's perfectly balanced up to a factor of
two.
And again, the point then is that
operations in a search tree and the search
trees are going to run in logarithmic
time, because the height is what governs
the running time of those operations.
Now, in some sense, I've only told you the
easy part which is if it just so happens
that your search tree satisfies these
four invariants, then you're good.
The height is guaranteed to be small so
the operations are guaranteed to be fast.
Clearly that's exactly what you want from
this data structure.
But for the poor soul who has to actually
implement this data structure, the hard
work is maintaining these invariants even
as the data structure changes.
Remember, the point here is to be dynamic,
to accommodate insertions and deletions.
And searches and deletions can disrupt
these four invariants and then one has to
actually change the code to make sure
they're satisfied again, so that the tree
stays balanced, has low height, even under
arbitrary sequences of insertions and
deletions.
So, we're not going to cover that in this
video.
It can be done, without significantly
slowing down any of the operations.
It's pretty tricky, takes some nice ideas.
There's a couple well-known algorithms
textbooks that cover those details.
Or if you look at open source and
limitations of balanced search trees, you
can look at code that does that
implementations.
But, because it can be done in a practical
way and because Red Black Tree supports
such an original array of operations,
that's why you will find them used in a
number practical applications.
That's why balanced search trees should be
part of your programmer tool box.