The writings of Merlin Moncure, professional database developer, about work, life, family, and everything else.

Sunday, December 16, 2007

Managing Trees with Arrays, Part 2

Einhverfr writes:
Out of curiosity, have you looked at contrib/tablefunc and its connectby() function? We use it and the parent_id solution in LSMB and it works pretty well. It also allows for a cleaner semantics because we don't have to store the full path in the record (this and "level" are both generated by connectby!).

Thanks for your interest.

I have looked at connectby, and I have a lot of issues with it. It's a good approach and probably better if your data is highly volatile (updates are much cheaper), but less flexible. If it meets your requirements perfectly I'd use it, but I've tried a few times and end up bumping into limitations. The materialized path approach as advertised scales extremely well, except for the insert problem or if your trees are extremely deep because the index gets huge.

First, the planner knows zero about the connectby function. This can lead to planning problems for complicated joins that it interfaces with. The materialized path approach is mostly wide open to the planner. I find myself often writing complex queries, and absolutely detest pushing data issues into the application layer. This is why I favor views over functions generally when there is a choice. The view abstracts the interface a bit for application without hobbing the planner. The view abstraction approach I used is what really sells the idea, btw.

Secondly, connectby takes a single key as a parameter only. This can lead to very awkward situations if you need to generalize the input over a range of records (give me all the families for items in range of (a,b)), and the query will be slow because the table is unordered. Since the materialized path approach orders the table over the key, the query is simple and fast.

Third, the indexing strategies are different...all operations traversing the tree with the parent/child index relationship require recursion. The connectby function hides the recursion from you but the recursion is still there. This means that for many operations the net result will be less efficient than when the path is handed to you.

Fourth, while it's kind of a pain because you have to create an opclass (and other reasons), the array approach at least opens the door to composite keys.

Fifth, connectby gives you a delimited text string, which has to be parsed and wrapped in a view (my examples above do this already). You also need to wrap if you want to be able to traverse up the tree in a single operation. I would advise doing this...it's probably not a good idea to have the application invoke the connectby function directly.

In fact, sacrificing a little bit of cpu overhead, I could write the parent child relationship abstracted into a few recursive sql functions that traverse the tree. While this can get dangerous as you only want to do really heavy recursion in C as iteration, it's probably good enough for most circumstances and sidesteps some of the land mines the planner throws you with connectby.

Thanks again for commenting. I really should write some benchmarks demonstrating the various techniques. Maybe, if there is a enough interest, I'll take the time to do that.

merlin

2 comments:

Einhverfr said...

A few more points about how we use connectby() in LedgerSMB.

We have a menu_node table which is used to generate a navigation tree. The table has several elements including a label, an id, a parent_id, and a position. The primary key is the parent_id and the position. Joins exist to two other tables for purpose of menu generation (a menu_acl table and a menu_attribute table).

The planner doesn't have to know anything about connect_by because the function is joined once on the menu_node table. I would of course avoid joining on the output of connectby() however.

We use two different approaches for different interfaces using connectby. The first is a tree return, and the second returns only immediate children of a specific node.

I suppose that in the end, both approaches have limitations. connectby() allows for more flexibility with regard to data management but as you point out there are some issues with data volatility in some applications.

However, I am looking forward to support for WITH RECURSIVE ((currently slated for 8.4).

Einhverfr said...

Oops-- correcting a few errors:

There are some indexing issues when data is volatile, but these are pretty minor.

Also, there is no reason that this has to be done recursively. One *could* rework the system to do a scan the key, build a tree out of it and then do a full scan (presumably index scan) for all relevant subnodes, and so forth. This would naturally prejudice the performance away from small return sets to those of large sets.

The large issue as you say is with current generations of the system and where performance is critical. Moving somethng like this to a planner-transparent statement would allow for a lot better optimization.