EECS 311: Trees |
Home Course Info Links Grades |
Lectures Newsgroup Homework Exams |
There's a lot of terminology associated with trees. You should be familiar with the following basic terms:
There are also many kinds of trees, including:
This method is easy to understand and implement. It's very useful for certain kinds of tree applications, such as heaps, and fairly useless for others. It's typically used on dense binary trees.
The idea is simple:
Three simple formulae allow you to go from the index of the parent to the index of its children and vice versa:
The advantage of the linear representation is this easy traversal up and down, and efficient use of space if the tree is complete. The disadvantage is inefficient use of space if the tree is sparse.
Again, the idea is simple. A node in the tree has
The most important thing to remember about the linked representation is this:
A tree is represented by the pointer to the root node, not a node.
The empty tree is simply the NULL pointer, not an empty node.
Traversing a binary tree in any of the three orders, even in a linked representation without parent links, is trivial with recursion.
For example, here's the pseudo-code for inorder traversal:
inorderTraversal(bt, fn): if (bt != NULL) inorderTraversal(bt->leftChild) fn(bt->info) inorderTraversal(bt->rightChild)
Here,
bt
is a binary tree, i.e., a pointer to a binary
tree nodefn
is a function that operates on the kind of
data stored in the treebt->leftChild
, bt->rightChild
,
and bt->info
access the left child pointer, right
child pointer, and data field, respectivelyIt should be obvious how to code the other two traversals.
Binary search trees have the property that
Consider the combinatorial complexity of the data structures we've seen so far. We give best and worse case Big O values:
Add data | Retrieve data | |
---|---|---|
Array (unordered) | constant (just add to end) | O(N) (linear search) |
Array (ordered) | O(log2N) compares, O(N) data transfers (shift data) | O(log2N) (binary search) |
Linked list (unordered) | constant (assuming end pointer) | O(N) (linear search) |
Linked list (ordered) | O(log2N) compares, constant data transfers (change a few pointers), O(N) nodes visited sequentially | O(log2N) (binary search), visit O(N) nodes |
Binary search tree (best case) | O(log2N) compares, constant data transfers, O(log2N) nodes visited | O(log2N) |
Binary search tree (worst case) | O(N) compares, constant data transfers, O(N) nodes visited | O(N) |
On the average, then, a binary search tree will be
Binary search tree algorithms for adding and retrieving are also very simple.
In the worst case, however, a binary search tree will be as bad as a linked list. Many of the variations of binary search trees that we'll see will be attempts to get the best of both worlds: fast access and fast storage, albeit using more complex algorithms.
It's easy to add new data and "grow" the tree:
addData(&bt, x): if (bt == NULL) bt = new BtNode bt->info = x bt->left = bt->right = NULL else if (x < bt->info) addData(bt->left, x) else if (x > bt->info) addData(bt->right, x) else // already in tree
Adding a new node takes O(log2n) steps, where n is the number of nodes in the tree. This is best and average case behavior. If the tree is very unbalanced, e.g., we passed it a sorted list of numbers, then adding a new node will take O(n2) steps.
Deleting data is complicated when the data being removed is not in a leaf node. We can't just delete the node, because then our tree would "fall apart." We have to promote one of the children to become the new parent. The child has to be
There are at most two possible candidates:
It doesn't matter which one we pick. If neither subtree exists, we have a leaf node, which can be just deleted and it's pointer removed from the parent node.
The following algorithm deletes a node in a binary search tree:
Note that
Consider removing TALBOT from the tree in Figure 7.18. If we pick SELIGER, then we have to move SEFTON to where SELIGER was.
A heap is a binary tree (not a binary search tree) with the following properties:
Full and dense means that only the bottom level of the tree has gaps, and the gaps are all on the right.
Being dense means the linear array representation is the most appropriate. A linked representation would waste space.
A heap is a great data structure for implementing a priority queue, because
It also turns out we can use the routines for adding and removing items to create an N log2N sort algorithm called heap sort.
The algorithm for adding data to a heap is simple, with one unintuitve aspect: we add an item to the bottom first then move it upwards, if necessary:
while the data is larger than its parent, swap the item with its parent, and repeat, checking the data item with its new parent
Walking up the tree will take at most log2N steps (compares and data transfers), because the tree is always balanced, so the height of the tree is log2N.
Both uses of heaps (priority queues and heap sort) only need to delete the root, so that's the only case we'll consider here. Deleting is somewhat similar to adding in that we'll replaced the deleted root with another item from the tree and bubble it down. The algorithm is made only slightly more complex because we have two children to worry about instead of just one parent.
The new item is almost certainly in the wrong place because it's one of the smaller data elements, but this keeps the heap full and dense.
Note that
Walking down the tree will take at most log2N steps (compares and data transfers). We may have to do three comparisons (parent with each child and child against child) but 3 log2N is still O(log2N).
As mentioned above, heaps are best implemented using a linear array. The move up and down the tree are simple jumps around the array and the array is mostly filled. The only downside is that data has to be moved around, unless, of course, what we really have is an array of pointers to data, which is the best way to go with non-numeric data.
There are two versions of heap sort that I know about. Both
The versions differ on how the first phase is done.
The first phase takes no more than O(N log2N) swaps and compares, because each walk down takes log2N steps (N starting small and growing to the number of elements) and it's called N times. In fact, it's actually better than O(N log N), it's O(N), although proving this requires some mathematics.
Similarly, the second phase takes O(N log2N) swaps and compares for the same reason. Ths is both best and worst case behavior. In the best case, you have a heap when you start and no swaps are needed for the first phase, but they become necessary in the second phase.
The first phase takes O(N log2N) swaps and compares, because each walkUp takes log2N steps and it's called N times.
Version 2 is simpler, especially if you've already implemented walking up and down. Version 1, however, only needs walking down, and Phase 1 of Version 1 is faster (O(N) instead of O(N log N)).
Comments? Send mail to c-riesbeck@northwestern.edu.