The writings of Merlin Moncure, professional database developer, about work, life, family, and everything else.

Monday, December 17, 2007

Advisory Locks, an Update

Update: more testing has shown this to not work as well as I thought...details will follow
Update 2: this doesn't work at all. I need to give it some more thought...M

I've made no secret that 'advisory locks' is one one of the coolest features in PostgreSQL (see my posts here and here). As a general purpose database mutex, you can do all sorts of wonderful things with them...long term locks, cooperative locks, sub transaction locks, etc.

But what about sub statement locks?

Consider the gapless sequence as described here (normal sequences produce 'holes' in the number line when rollbacks or database crashes occur). In the example, an external record is used to track the counter and provide concurrency protection. Well, how about this:

create table t(id int primary key);

insert into t select coalesce(id, 1) from
select pg_advisory_lock(1),
id + 1 as id,
pg_advisory_unlock(1) from t
order by id desc limit 1
) q;

The above statement (ab)uses the left to right order of execution of the fields of the statement. No external object is locked, and the outer select statement could do a bunch of other time consuming things without tying up the lock and killing concurrency. There is no chance of two transactions grabbing the same value of id in read committed mode (there may be a small chance of a race to grab the first value of id). This may not be the best or even a good way to do this, but it's an interesting demonstration of the technique.

Sunday, December 16, 2007

12 Days of Christmas, PostgreSQL style

In case you have trouble remembering the lyrics...

select v || case
when v=1 then ' partridge in a pear tree'
when v=2 then ' turtle doves'
when v=3 then ' french hens'
when v=4 then ' calling birds'
when v=5 then ' GO-O-OLDEN RINGS!'
when v=6 then ' swans a swimming'
when v=7 then ' geese a laying'
when v=8 then ' maids a milking'
when v=9 then ' ladies dancing'
when v=10 then ' lords a leaping'
when v=11 then ' pipers piping'
when v=12 then ' drummers drumming'
end from
(select generate_series(generate_series(1,12), 1, -1) as v) q;

Happy Holidays!

[update: now the lyrics are in the right order for day #2 and beyond...also decided to correctly spell 'pear' against my better judgement]

Managing Trees with Arrays, Part 2

Einhverfr writes:
Out of curiosity, have you looked at contrib/tablefunc and its connectby() function? We use it and the parent_id solution in LSMB and it works pretty well. It also allows for a cleaner semantics because we don't have to store the full path in the record (this and "level" are both generated by connectby!).

Thanks for your interest.

I have looked at connectby, and I have a lot of issues with it. It's a good approach and probably better if your data is highly volatile (updates are much cheaper), but less flexible. If it meets your requirements perfectly I'd use it, but I've tried a few times and end up bumping into limitations. The materialized path approach as advertised scales extremely well, except for the insert problem or if your trees are extremely deep because the index gets huge.

First, the planner knows zero about the connectby function. This can lead to planning problems for complicated joins that it interfaces with. The materialized path approach is mostly wide open to the planner. I find myself often writing complex queries, and absolutely detest pushing data issues into the application layer. This is why I favor views over functions generally when there is a choice. The view abstracts the interface a bit for application without hobbing the planner. The view abstraction approach I used is what really sells the idea, btw.

Secondly, connectby takes a single key as a parameter only. This can lead to very awkward situations if you need to generalize the input over a range of records (give me all the families for items in range of (a,b)), and the query will be slow because the table is unordered. Since the materialized path approach orders the table over the key, the query is simple and fast.

Third, the indexing strategies are different...all operations traversing the tree with the parent/child index relationship require recursion. The connectby function hides the recursion from you but the recursion is still there. This means that for many operations the net result will be less efficient than when the path is handed to you.

Fourth, while it's kind of a pain because you have to create an opclass (and other reasons), the array approach at least opens the door to composite keys.

Fifth, connectby gives you a delimited text string, which has to be parsed and wrapped in a view (my examples above do this already). You also need to wrap if you want to be able to traverse up the tree in a single operation. I would advise doing's probably not a good idea to have the application invoke the connectby function directly.

In fact, sacrificing a little bit of cpu overhead, I could write the parent child relationship abstracted into a few recursive sql functions that traverse the tree. While this can get dangerous as you only want to do really heavy recursion in C as iteration, it's probably good enough for most circumstances and sidesteps some of the land mines the planner throws you with connectby.

Thanks again for commenting. I really should write some benchmarks demonstrating the various techniques. Maybe, if there is a enough interest, I'll take the time to do that.


Friday, October 19, 2007

A better psql with less

psql is a great tool, but not very good at browsing data. Or is it? The following settings will make psql much more usable.

in your user profile:
export PAGER=less
export LESS="-iMSx4 -FX"

in your .psqlrc (make it if it's not already there)

\pset pager always

try it!! now wide query results will paginate much better and searching is much easier.

update: fixed quotes for LESS, thanks depesz

Thursday, September 06, 2007

Managing Trees with Arrays

One of my favorite problems in databases is working with recursive structures. In particular, I've made it my quest to debunk the prevailing myth that recursive structures are not 'relational' and thus should be handled in application code or some alternative format. One strategy in this war is demonstrating how easy it is to deal with recursive structures with a little thought and some clever queries. What follows is a demonstration of how one might use the arrays of integers in PostgreSQL in a variation of the classic 'materialized path' method of organizing data in trees, along with some scaffolding to reduce the complexity from the point of view of the client application. This method is efficient, scalable, and useful for many distributions of data. Since trees usually store heterogeneous elements, we will make sure to allow for that in the examples.

-- 8.3 brings enum support. in 8.2, we use a foreign key or a check constraint
create type item_type as enum('root', 'foo', 'bar');

-- the classic way
create table item
item_id serial primary key,
parent_id int references item(item_id) on delete cascade
item_type item_type
create index item_parent_id_idx on item(parent_id);

insert into item values
(default, null, 'root'),
(default, 1, 'foo'),
(default, 1, 'foo'),
(default, 3, 'bar'); -- etc

While this is very elegant and expressive, the major problem is that to pull any data out of the system based on hierarchy requires recursion, either in the application or in a stored procedure framework. This scales poorly for large sets, especially with deep structures. Let's see how things might look using arrays to materialize the path to the parent:

-- arrays ftw
create table item
item_id serial primary key,
parents int[], -- one minor disadvantage is losing the RI constraint
item_type item_type

create index item_parents_idx on item(parents);

insert into item values
(default, '{1}', 'root'),
(default, '{1, 2}', 'foo'),
(default, '{1, 3}', 'foo'),
(default, '{1, 3, 4}', 'bar');

Here, we materialize the path (series of parents) to the item in an array, including the item's id. Looking up the family for the 'bar' element (id #4), is done by:

select p.* from item p
select explode_array(parents) as item_id
from item where item_id = 4
) q using (item_id);

Let's go ahead and generalize this into a view:

create view item_tree as
select q.item_lookup, p.* from item p
select item_id as item_lookup,
explode_array(parents) as item_id
from item
) q using (item_id);

We can use this to specifically look at item #4 by:

select item_id, type from item_tree where item_lookup = 4;
item_id | item_type
1 | root
3 | foo
4 | bar

That's pretty nifty. The above query will use the index on item_id for the id lookup as well as when it goes back to item to fetch the elements. Let's expand our view to include child items of the item of interest, and add a flag that can be queried in case that's desirable (this is where the array is particularly helpful):

create view item_tree as
select q.item_lookup, is_child, p.* from item p
item_id as item_lookup,
false as is_child,
explode_array(parents) as item_id
from item
union all
l.item_id as item_lookup,
true as is_child,
from item l
join item r on r.parents between (l.parents || 0)
and (l.parents || 2147483647)
) q using (item_id);

The above query creates a remarkably efficient plan and postgresql is smart enough to optimize the case where you are only interested in child or non-child items. The array term is properly utilizes index when necessary. Well, how do we add elements to the structure? While arrays are neat, dealing with them on the client can be a pain (all those strings) and it would be nice to not have to construct them on the fly. Let's expand our view to do this for us:

create view item_tree as
p.parents[array_upper(p.parents, 1) - 1] as parent_id, -- null on zero reference
p.* from item p

item_id as item_lookup,
false as is_child,
explode_array(parents) as item_id
from item
union all
l.item_id as item_lookup,
true as is_child,
from item l
join item r on r.parents between (l.parents || 0)
and (l.parents || 2147483647)
) q using (item_id);

create or replace rule insert_item_tree as on insert to item_tree do instead
insert into item (item_id, parents, item_type)
coalesce(new.item_id, nextval('item_item_id_seq')),
(select i.parents || coalesce(new.item_id, currval('item_item_id_seq'))::int from item i where item_id = new.parent_id),

insert into item_tree(item_id, parent_id, item_type) values (null, 4, 'bar'); -- we can pass in null to get the sequence to assign a value
insert into item_tree(item_id, parent_id, item_type) values (null, 5, 'foo'); -- other columns of view are ignored

select * from item_tree where item_lookup = 3;
item_lookup | parent_id | is_child | item_id | parents | item_type
3 | | f | 1 | {1} | root
3 | 1 | f | 3 | {1,3} | foo
3 | 3 | t | 4 | {1,3,4} | bar
3 | 4 | t | 5 | {1,3,4,5} | bar
3 | 5 | t | 6 | {1,3,4,5,6} | foo

This is a functional example that could be used to build real applications. Many portions are left as an exercise to the reader, including performance testing (it's pretty good), extending the base item table into properties and specific subtables, a more robust constraint system, and better error handling. Lately I am coming to the perspective that is better to try and preserve sql-ish interface to application facing structures as opposed to providing an API to build recursive structures that the application must interact with. While this is a bit more difficult and involves some hacky things, the complexity is neatly tucked away and the user is free to build recursive structures using familiar paradigms (insert, select, etc).


Tuesday, September 04, 2007

PostgreSQL 8.3 Features: Plan Invalidation

As previously stated, PostgreSQL 8.3 is shaping up to be a great release of the software. One of the biggest beneficiaries of the new feature set are pl/pgsql developers. 8.3 stands to be the best thing to happen to pl/pgsql since 8.0 brought dollar quoting to the table...before which some might argue serious development with the language bordered on the impractical. The major new features are vastly improved cursor handling (including UPDATE/DELETE), improved handling for set returning functions, arrays of composite types (for efficient passing of data to/from functions), and, especially, automatic invalidation of cached plans. While plan invalidation solves many tangential issues their greatest impact will certainly be in server side development of stored procedures.

When a pl/pgsql function executes for the first time in a session, the server 'compiles' it by parsing the static (not passed through EXECUTE) sql and generating plans for all the queries. This is an essential mechanism for fast repeated execution of server side functions because it allows many tedious, cpu intensive portions of query execution to be optimized. One part of this optimization involves looking up various database objects involved in the query and storing identifiers that are internal to the database. While this is useful and good, it has an unfortunate side effect in that if the structures the database were referencing in the plan were no longer valid, the function plan itself is no longer valid and may raise errors if executed. There are various workarounds to address this that are mostly obsolete (plan invalidation doesn't catch references to functions yet).

Here is the canonical example (and, arguably, the justification for the effort) of demonstrating how plan invalidation works:

create or replace function test_func() returns void as
create temp table tt as select 'hello world!'::text;
perform * from tt;
drop table tt;
$$ language plpgsql;

select test_func();
select test_func();

ERROR: relation with OID 87831 does not exist

The first invocation of test_func() generates a plan for the function. The second time around, the plan is pointing to a stale reference to 'tt' and the function fails. In PostgreSQL 8.3, this function succeeds without error. The database determines when tt is dropped that there are plans referencing it and throws them out. Historically, this was a huge impediment to pl/pgsql development because there was no particularly safe way to create and drop temporary tables from within a function. It is natural to want to create a table for temporary storage, do all kinds of things to it, and release it when done -- but sadly was not generally possible without various workarounds. For this reason, historically it was often better to use cursors inside pl/pgsql functions for holding working data which pushes code into a more procedural style. Now there is room for more set oriented style data processing which should be an attractive alternative to any dba.

Monday, September 03, 2007

PostgreSQL 8.3 Features: Arrays of Compound Types

This will be the first in what hopefully be a series of entries describing the new and interesting features in PostgreSQL 8.3. The database is entering in the final stages of the development cycle before going into beta with all the major patches having been submitted weeks ago (with one possible stunning exception, lazy xid assignment described here. Thus, aside from the hard decisions that are going to have to be made on the HOT patch (more on that later), 8.3 is shaping up nicely. On a scale of one to ten, this release is a strong eight -- and if all the outstanding patches get in -- a vicious '10'. Excited yet? If not, let me prime the pump with my favorite new feature, Arrays of Composite Types.

David Fetter laid the groundwork for this new feature and was ultimately accepted with the help from many others, including PostgreSQL core developers Tom Lane and Andrew Dunstan. The feature combines two neat features of PostgreSQL, the ability to build complex types out of Plain Old Data Types and arrays (follow the links for an overview of both features).
create table foo(id int, b text);
create table bar(id int, foos foo[]); -- create an array of the type 'foo' in bar
create table baz(id int, bars bar[]); -- create an array of the type 'bar' in baz

insert into foo values(1, 'abc');
insert into foo values(2, 'def');
insert into bar values(1, array(select foo from foo)); -- there are two elements in this array
insert into bar values(2, array(select foo from foo)); -- and this
insert into bar values(3, array(select foo from foo)); -- and this
insert into baz values(1, array(select bar from bar)); -- three elements in here

select ((bars[1]).foos[2]).b from baz; -- use parenthesis to get elements out of composites
postgres=# select ((bars[1]).foos[2]).b from baz;
(1 row)
Being able to nest composite types in array structures is a powerful feature. While abusive in the extreme to good normalization and relational principles it allows
flexible approaches to some difficult problems:

  • Input Data to Functions: pl/sql and pg/pgsql functions have some limitations in how you can get data into the function. There are various workarounds for this that all have application in specific circumstances, but composite arrays allow you to pass a heap of data to a function that doesn't happen to be in a table. While this could have scalability issues for large arrays (> 10k elements), it can be convenient.
  • Sometimes, Arrays are Just Better. Although this is rarely a good idea, sometimes its beneficial to store data in arrays and the ability to nest complex structures is helpful in this regard. Just remember that in most cases the entire array has to be read from and written back to disk (the server only understands the array as a field and treats it as such).
  • Targeted Denormalization: When passing data between functions or to the client, it may be efficient to accumulate a particular column into an array to reduce duplication of data in the other columns. We may now do this with composites:

drop table bar, baz cascade;       -- we would rarely rely directly on the generated types
alter table foo add column data text default 'data';
insert into foo values (1, 'ghi'); -- make some duplicated on foo by id
insert into foo values (2, 'jlk');
create type foo_data as (b text, data text);
select id, array_accum ((b, data)::foo_data) from foo group by id order by id;

id | array_accum
1 | {"(abc,data)","(ghi,data)"}
2 | {"(def,data)","(jlk,data)"}

Arrays perform best as a server side feature. Because it's only currently recommended to render them to the client in text, Arrays of Composites introduce some interesting parsing issues going back and forth from the server. While this is ok in limited cases like the above example, it's inefficient and error prone to build and properly escape a nested array structure. While PostgreSQL has the ability to transfer arrays in binary over the protocol, there is no API to access the custom format data on the client side. Until that changes, I expect this feature will be most useful to stored procedure developers.

While relatively short on substance, I hope that this writing provides a good introduction to this interesting feature. Some of you will probably seize on this feature for it's utility in solving a narrow, but important class of problems. It's one more weapon in the mighty arsenal of the PostgreSQL developer who is already armed with the most practical, versatile, and generally kick-ass piece of software for solving the world's data management issues.

Friday, August 31, 2007

Dell md1000 bonnie++ benchmarks

Following are the results of our testing of the Dell MD1000 with two raid controllers (Perc 5/e). Server was dell 2950, 8gb ram. Pay close attention to the seeks figure, as this important figure for databases (the sequential rates are not all that important). We tried various combinations of software raid and hardware raid in both active/active and active passive configurations.



Sequential OutputSequential InputRandom
DescriptionSize:Chunk SizeOutputRewriteInput
K/sec% CPUK/sec% CPUK/sec% CPUK/sec% CPUK/sec% CPU/ sec
single drive perc 5i ext2 wal16G566005320884903826444.9
dual controller 5 drive hardware raid 0 striped in software ext316G3617907517794833368836311115.2
singe controller 10 drive raid 0 hardware xfs16G4056024710953715510528341111.8
single controller 5 drive hardware raid 0 ext316G1624773411020319423957261062.0
dual controller 5 drive hardware raid 0 mirrored in software xfs16G2556603011053118379800321471.4
single controller 10 drive hardware raid 10 xfs16G2990393610469714402601271460.9
dual controller 10 drive hardware jbod software raid 0 xfs16G2430362512691316793798521081.5
dual controller 10 drive hardware jbod software raid 1016G244808279041012355593241509.9
RUN1 dual controller 5 drive raid 0 xfs separate volumes dual bonnie16G287469369540013423656311246.3
RUN1 single controller 5 drive raid 0 xfs dual bonnie same volume16G140861215368381291449506.6
single controller 5 drive raid 5 xfs16G292219348342511262651171282.4
RUN 1 dual controller 2x4 drive hardware raid 516G24525931836471122956116767.0
RUN 216G248950327447210261117181025.7
dual controller 2x5 drive hardware raid 5 striped in software16G423363448040911195092121063.6
dual controller 2x5 drive hardware raid 5 mirrored in software16G409208265154759755375.0