Tag finder:

Web developers, what can Postgresql do for you ?

 Posted by greg on March 22, 2011 10:29 PM |  Comments are deactivated for this post
Tags : geek  computing  postgresql 

Where does it come from ?

Why is there so mush fuss about the fact Postgresql is the most advanced free database system ? Because its code has roots deep in the 20th century, Postgresql has the same ancestor Sybase, MSQL also have: Ingres. This means being a database designed for professional has always been the idea behind the creator of Ingres and Post-ingres before it went open source. Postgresql has greatly evolved in 15 years of open source development and offers today many great features that ease web development. I propose a little tour of what Postgresql has to offer.

The structure is more important than the code

Well this quote is not from me but from Linus Torvald saying «I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important. Bad programmers worry about the code. Good programmers worry about data structures and their relationships.». He’is not the only one to think data structure is what makes the code simple, even if I am not an example, I do and so does Eric S.Raymond («Smart data structures and dumb code works alot better than the other way around.»).

This is especially true in web development where you have to mix several layers and technologies: javascript – PHPSQL. Mastering a database is the art of making PHP code simple and concentrate the complexity in the database. This has been rules for 20 years and still is in the COBOL and MVS world. Managing the model layer in the database is hard to maintain and to version. This is why web development took the complete opposite direction with trying to manage everything in the PHP layer and consider the database as a simple data pool: id => row. But PHP is a way slower than a database as it comes to data computing and this tends to create very messy code.

The database is not the kingdom of DBAs, they are here to allow developers to work

What about saying «let’s put in the database business rules that we know they are not going to change» ? Do you change your slugification mechanism often ? Do you change the user password encryption system often ? No ? Why then not setting them in the databse where it’s fast while making your PHP code clearer ? In fact, modern framework and ORMs have developed a «behavior» mechanism that hook with your code to automatically perform some task that help you. From my point of view, this solution has very often the downsides of both worlds: it’s slow and hard to maintain and debug. So the question is now

h1 What features of Postgresql make my PHP code clearer ?

Checks, constraints and transactions.

Data consistency is the main word of a relational database. Imagine you mess 5% of your code, that would be a very bad day of course but you know you will be able to get the situation as it was before. Imagine now you mess 5% of your data. How to make the difference between good and bad data ? Data consistency is so critical that managing it in the application layer means you will never accept other applications to use your database, fearing a bad piece of code to pollute it all slowly.
Big databases are used by many applications and thus consistency checks must be implemented inside the database to ovoid pollution. That would be even better if the system tells the developers when something goes wrong as soon as it is possible.

Check and constraints

Most of you know column constraints like NOT NULL or UNIQUE. The database will refuse the insertion or update if the row does not comply with the columns specifications. This feature is so important it is implemented by all relationnal database today. With Postgresql, you can also set multi-columns constraints known as checks:

db=> CREATE TABLE meteorological (
db(> station_id INTEGER NOT NULL,
db(> probed_day date NOT NULL,
db(> temp_min NUMERIC(3,1) NOT NULL,
db(> temp_max NUMERIC(3,1) NOT NULL,
db(> rainfall INTEGER NOT NULL DEFAULT 0,
db(> PRIMARY KEY (station_id, probed_at),
db(> FOREIGN KEY (station_id) REFERENCES station (id),
db(> CHECK rainfall >= 0,
db(> CHECK (temp_max > temp_min)
db(> );
CREATE TABLE

On the example above, all rows have constraints, they must be filled with value except the rainfall that default to 0 if no value is provided. Still this column can not be forced to null. The PRIMARY KEY contraints create a UNIQUE index on the couple (station_id, probed_day) to ensure we record one sample per station per day. The _station_id_ column has to reference an existing id in the station table. The CHECK clauses at the end also add extra constraints: rainfall must be positive or zero (which is different than null) and maximum temperature must be greater than minimal temperature in the same day. This is a very powerful table constraint as you can ensure the logical consistency of your data:

db(> CONSTRAINT email_address CHECK (email ~* '^([^@s]+)@((?:[-a-z0-9]+.)+[a-z]{2,})$'),

Better, if you provide a function that checks if an email address is valid:

db(> CONSTRAINT email_address CHECK (check_email(email)),
db(> CONSTRAINT credit_card_nr_check CHECK (check_credit_card(credit_card_nr)),
db(> CONSTRAINT valid_slug CHECK (check_slug(slug)),

Put your own example in the comments ;)

Transactions

Transactions are a way to ensure your actions are atomic from a database point of view. They exist in almost all other RDBMS but may not be as powerful as they are in Postgresql.

db=> BEGIN;
BEGIN
db=> ALTER TABLE ....
ALTER TABLE
db=> INSERT INTO ...
INSERT
db=> COMMIT;
COMMIT

Your actions only take place for others as soon as you enter the final COMMIT. If there is something wrong (error) Postrgresql is smart enough to tell you it cancels all actions until you ROLLBACK the current transaction. Transactions are so efficient, easy and powerful, you should consider entering a transaction everytime you have some maintainance work to do:

postgres=# BEGIN;
BEGIN
postgres=# ALTER DATABASE plop RENAME TO plop_bak;
ALTER DATABASE
postgres=# ALTER DATABASE plop_new RENAME TO plop;
ALTER DATABASE
postgres=# COMMIT;
COMMIT

If something goes wrong, your actions are canceled and you can ROLLBACK the whole thing without your users suspecting anything. But, as a developer you might need nested transactions to be able to rollback a part of a transaction. Nested transaction are not supported by Postgresql but it supports savepoint that permit you to ROLLBACK your transaction until a certain point.

$db->do('START');

try {
    // do a lot of things in the database here
} catch (SomeException $e) {
    $db->do('ROLLBACK'); // We cancel the actions database is as nothing happened
    exit;
}

$db->do("SAVEPOINT plop"); // Set a savepoint

try {
    // do some other stuff in the database here
} catch (SomeException $e) {
    $db->do('ROLLBACK TO SAVEPOINT plop'); // We cancel the last changes but keep what was made before the savepoint
    $mail->send('Something weird occure but changes are saved');
}

$db->do('COMMIT');

Applications of this pattern are immense in the web if you consider the number of query blocks you need to build a statistic or a reputation system.

Schemas

PHP developers are not used to namespaces. It has been supported only since PHP 5.3 and many developers do not use them yet. Namespaces are good and make your code cleaner while ovoiding name collisions. This is why Postgresql implements such feature (since version 7.3) named schemas.

A schema is a namespace where you can define all database objects you want: tables, sequences, views, functions etc… You can of course use the database wihtout bothering about schemas existence but you can really benefit from this features especially if you write bundles (or plugins). Using schemas is as simple as prefixing object names with the schema name:

db=> CREATE SCHEMA my_schema;
CREATE SCHEMA
db=> CREATE TABLE my_schema.my_table (id SERIAL PRIMARY KEY, some_data VARCHAR NOT NULL);
CREATE TABLE
=> SELECT * FROM my_schema.my_table;
id | some_data
---+----------
(0 rows)

By default the schema you are using is called public this is where your objects are stored when you do not specify a schema name. This can be changed by overriding the _search_path_ parameter:

=> SET search_path TO my_schema;
SET
=> SELECT * FROM my_table;
id | some_data
---+----------
(0 rows)

You can even cascade schemas by setting the order you want to look trough them as layers:

=> SET search_path TO my_schema, public;
SET

When doing this, if you do not precise the name of the schema when creating objects, they will be in the public schema. But when you query your objects, if two objects with the same name exist in each layer, the object of the first layer will be used overriding the objects in the layers after. This is useful if you want to create functions everybody can use (in CHECK clauses by example) you would set them in the public schema.

Sequences

Sequences are special counters that ensure all different database users obtain a different number. This is very useful for generating primary keys when they are INTEGER because they need to be unique but it can be useful to use them for other purposes. By the way, defining a simple INTEGER that needs to be auto generated by a sequence is declared as type SERIAL.

Sort by rank

Let’s consider the following table:

db=> CREATE TABLE favorite_links (id SERIAL PRIMARY KEY, url VARCHAR NOT NULL, title VARCHAR NOT NULL, rank SERIAL NOT NULL);
CREATE TABLE

You can check sequences have been created for you entering d:

db=> d
...
public | favorite_links_id_seq   | sequence | greg
public | favorite_links_rank_seq | sequence | greg
...

A SELECT with an ORDER BY on the rank column will sort the list of the links properly and you don’t have to worry about managing the rank, it is handled automatically. What about if you want to re order the list setting the row having the id $id to rank $rank ?

db=> BEGIN;
BEGIN
db=> UPDATE favorite_links SET rank = rank + 1 WHERE rank >= $rank;
UPDATE
db=> UPDATE favorite_links SET rank = $rank WHERE id = $id;
UPDATE 
db=> COMMIT;
COMMIT

Awsome types: Arrays, key-value stores and nested structures

That’s one greatest strenght of Postgresql, it has an extensible and growing set of types and operators. You can have a look to the basic types that include awsome date and time, geometrical, network addresses types with each time smart operators to deal with them. With Postgresql, you can also define arrays of any of theses types.

Arrays

Let’s imagine we manage a diving center. We need to know where is each air tank but also when it has been check by a specialized security center.

db=> CREATE TABLE air_tank (
db(> reference CHAR(5) PRIMARY KEY,
db(> state INTEGER NOT NULL,
db(> security_check dates[],
db(> is_nitrox BOOLEAN NOT NULL DEFAULT false );
CREATE TABLE

The «check_dates» column is an array containing the dates the tank has been performed a safety pressure test. We have to ensure every tank has a test at least every two years and we do the test every since months. As soon as I know the next test date, I can ask for air tank that would go over 2 years if I wait for the next safety test.

db=> INSERT INTO air_tank (reference, state, security_check, is_nitrox) VALUES
db-> ('SCB01', 1, ARRAY['2009-01-23'::date], false),
db-> ('SCB25', 1, ARRAY['2009-07-12'::date], false),
db-> ('ANP07', 1, ARRAY['2010-10-30'::date], true);
INSERT

Let’s say we know the next safety test will be performed today, we can easily select the tanks that will be more than 2 years old six months after this date:

db=> SELECT reference FROM air_tank WHERE (now() - '18 months'::interval > ALL(security_checks));
SCB25
SCB01

Once the tanks have been tested and certified, we can update the rows to add our new check at the end of the dates array.

db=> UPDATE air_tank SET security_checks = security_checks || now()::date WHERE (now() - '18 months'::interval > ALL(security_checks)); 
UPDATE

Key – value stores

If you install Postgresql’s contrib package, you will be able to install the Hstore extension. This allows the use of a key -> value array type in your tables and perform search on it. This is a killing feature when you need to have an extensible set of data in your tables. It allows PHP programmers to save and retreive associative arrays directly from the database. It is of course possible to query the database on the values stored in the HStore field.

Note: The HSTORE type is an extension of postgresql that comes with the postgresl-contrib package on Debian/Ubuntu. You need to source the file /usr/share/postgresql/8.4/contrib/hstore.sql as superuser to install it in your database.

CREATE TABLE reminder (slug VARCHAR PRIMARY KEY, remind_date TIMESTAMP NOT NULL, content TEXT, stat_markers HSTORE);
CREATE TABLE
db=> INSERT INTO reminder (slug, remind_date, content, stat_markers) VALUES ('think-about-it', now(), 'I have to think about this reminder.', '"sd" => 4, "viewed" => 5139, "scope" => "all" ');
INSERT
db=> SELECT count(slug) FROM reminder WHERE stat_markers ? 'scope';
1 row
db=> SELECT count(slug) FROM reminder WHERE stat_markers -> 'scope' = 'all';
1 row
db=> SELECT count(slug) FROM reminder WHERE (stat_markers -> 'viewed')::integer > 5000;
1 row
db=> SELECT key, SUM(value) AS total_viewed FROM (SELECT 'viewed'::text AS key, (stat_markers->'viewed')::integer AS value FROM reminder ) AS stat GROUP BY key;
  key   | total_viewed 
--------+--------------
 viewed |        5139

Nested structure … or not: materialized paths.

As you work with computers, you can’t miss hierarchical structure since it is present on your hard drive since the last 30 years. The web came and the first thing developers found to tidy contents or products is to fit them into a hierarchical set of categories. Even if they didn’t invent it, Google made a big jump with using tags in their mail system. Tags have an advantage on categories in the fact a content can have several tags. It is the same concept as a file would be in several directories on your hard drive.

Plain tags have downsides, they are hard to maintain over time and it requires an effort to keep them consistent. The tag system often ends up with tons of unused tags pointing on obsolete contents. A good way to categorize in an efficient manner a huge quantity of data would be to use hierarchical tags.

computing – linux – kernel – module
computing – editor – vim

A content tagged with “kernel” would automatically be tagged “linux” and “computing”. If a content is tagged “vim”, it would also be tagged with “computing” and “editor”. A search on “computing” would retreive our 2 contents.

Note: The LTREE type is an extension of postgresql that comes with the postgresl-contrib package on Debian/Ubuntu. You need to source the file /usr/share/postgresql/8.4/contrib/ltree.sql as superuser to install it in your database.

db=> CREATE TABLE raw_content(slug VARCHAR PRIMARY KEY, title VARCHAR NOT NULL, content TEXT, tags ltree[] NOT NULL, is_deleted boolean NOT NULL DEFAULT false);
CREATE

We have here a “content” table that has a field named “tags” which is an array of ltree.

db=> INSERT INTO raw_content VALUES ('this-is-a-content', 'This is a content', 'lorem ipsum', ARRAY['computing.editor.vim'::ltree]), ('This-is-another-content', 'This is another content', 'lorem ipsum', ARRAY['computing.linux.kernel'::ltree, 'computing.network'::ltree]);
CREATE
db=> SELECT slug FROM content WHERE '*.linux.*' ~ ANY(tags);
          slug           
-------------------------
 This-is-another-content
(1 row)

Rules, Function and Triggers: the behavior approach.

Modern ORMs in PHP all offer a feature named “behavior”. A behavior is a way to hook some code that will be triggered on some actions. Most of the time theses behaviors have no added value to be in PHP but make the code more complicated.

The rules system: soft delete

It is possible with postgresql to rewrite queries on the fly. Theses rewrites rules are called RULES. Creating a VIEW by exemple is exactly the same thing as creating an empty table and a rule that rewrite your select on the view to the select that define the view.

db=> CREATE VIEW content AS SELECT slug, content, tags FROM raw_content WHERE NOT is_deleted;
CREATE VIEW

Internally, this is exactly the same thing as creating an empty table “content” with the asked fields and a rule that rewrite SELECT queries on the content table to the SELECT that define the view. Now we just have to define new rules to be able to soft DELETE rows:

CREATE RULE "soft_delete" AS ON DELETE TO content DO INSTEAD UPDATE raw_content SET is_deleted = true WHERE slug=OLD.slug;
CREATE RULE
db=> SELECT count(slug) FROM content;
count
-----
   2
db=> DELETE FROM content WHERE '*.network.*' ~ ANY(tags);
DELETE
db=> SELECT count(slug) FROM content;
count
-----
   1
db=> SELECT count(slug) FROM raw_content;
count
-----
   2

The documentation is pretty complete and is a must read for anyone who want to dive into this very powerful feature.

Triggers

Let’s take by example the slugification, it automagically adds a new column named “slug” to your structure and will feed it with the filtered value entered in another field (most often a title). They are used to create a significant URL. Such behaviors use 99% of the time the same algorithme and therefor should be implemented at a database level to speed up processing.

db=> CREATE FUNCTION title_slugify_trig() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    NEW.slug := trim(both '-' from regexp_replace(lower(transliterate(NEW.title)), '[^a-z0-9]+', '-', 'g'));

    RETURN NEW;
END;
$$;
CREATE FUNCTION
db=> CREATE TRIGGER raw_content_slug_trigger
    BEFORE INSERT ON raw_content
    FOR EACH ROW
    EXECUTE PROCEDURE title_slugify_trig();
CREATE TRIGGER
db=> INSERT INTO raw_content (slug, title, content, tags) VALUES ('', 'Another BIG title.', 'Ipsum Lorem', ARRAY['computing.database.postgresql':ltree]);
INSERT
db=> SELECT * FROM content WHERE '*.postgresql.*' ~ ANY(tags);
       slug        |       title        |      content      |              tags
-------------------+--------------------+-------------------+---------------------------------
 another-big-title | Another BIG title. | Ipsum Lorem       | {computing.database.postgresql}
(1 row)

PommBundle, the way to use Postgresql and Symfony2

Of course they are heaps of features I didn’t mention here — like real table inheritance, user defined types and more — but I just can advise you to dive into the superbe Postgresql’s documentation.

The question is now how to take advantage of these tremendous feature using a modern PHP framework. Most of the ORMs I have used implement a database abstraction layer that reduce the range of accessible features as the smallest subset all databases propose. As I have been working with Symfony2 since the begining of the year, I have developed the PommBundle that is a PHP O®M specialized for Postgresql. This will be the subject of the next article on this blog, take care.

Recent Links 

 Bruce Momjian Pg blog

Bookmarked by greg on February 16, 2012 1:51 PM..
Tags: geek computing postgresql net blog

Follow link

 Richard Miller's blog

Bookmarked by greg on February 16, 2012 9:57 AM..
Tags: geek computing dev php symfony net blog

Follow link

 Online SQL explain interface

Bookmarked by greg on February 15, 2012 9:58 AM..
Tags: geek computing postgresql net

Follow link

 Ace online code editor

Bookmarked by greg on February 15, 2012 8:57 AM..
Tags: net geek computing dev javascript

Follow link

 Coding horror

Bookmarked by greg on February 14, 2012 12:42 PM..
Tags: geek computing dev net blog

Follow link

 telegraphCQ - Stream oriented database

Bookmarked by greg on February 13, 2012 2:44 PM..
Tags: geek computing postgresql

Follow link

 Relational Database Technologies

Bookmarked by greg on February 13, 2012 1:28 PM..
Tags: geek computing postgresql net blog

Follow link

 J.Davis blog «Ideas on Databases»

Bookmarked by greg on February 12, 2012 8:59 PM..
Tags: geek computing postgresql net blog

Follow link

 Flotr2, javascript graph library

Bookmarked by greg on February 6, 2012 8:17 AM..
Tags: geek computing dev javascript

Follow link

 Node.js toolbox

Bookmarked by greg on January 31, 2012 10:09 AM..
Tags: geek computing dev javascript nodejs

Follow link