Tag vite :

Databases and web developpment

 Posté par greg le 15 juillet 2010 23:48 |  Les commentaires sont désactivés pour ce post.
Tags : geek  computing  postgresql  dev 

It is not a news that With HTTP, you need a database if you want to store persistant data trough the multiple connections a client makes on your application. Since I have been working in web developpment — more than 10 years now — I have been wondering why the use of databases (relationnal or not) was so different from what I have been taught in school. I could say maybe because the teachers came from what is called «big systems».

Big Systems is an old kingdom where the monarchy is held by proprietary hardware and software. This is because at the begining, the hardware was so slow it was aimed at only one thing : business ! Applications (essentially written in COBOL) were all about displaying business oriented screens on text terminals. All data manipulation was done directly in the database in stored procedures encapsulated in transactions.

Theses Big systems are still running today and most of big companies such as banks — where billions transactions are treated each day — use this kind of systems because there are still no equivalents ! I have heard many projects about porting such applications to Java to fail because of the poor performances on the database side. Night batchs, that used to run for less than 8 hours on big systems, were taking more than 24 hours on multicore RAM boosted PC. And no, sorry, it won’t enhance anything to put several, hundreds, thousands PCs. Database won’t scale and a mainframe would be cheaper ! Million euros/dollars have been wasted in such migration projects and these softwares are still in Cobol & DB2 today.

Web applications started as simple interfaces with simple database. Databases are considered as simple data stores where information is picked up using SQL and processed with PHP or python or Java or whatever. This is perfect from a developper point of view because it has everything in hands and do not need a DBA to design complex queries. This is where MySQL comes from: simple tables with everything related to an ID integer auto_increment. No need for foreign keys, database constraints, transactions etc etc. MyISAM engine outperformed all existing other open source systems. I still have in mind an example from the NASA where, for small applications, teams switched to PHP & MySQL using very simple queries with performance gain and increased maintenability. From this perspective, Document Oriented databases arrive today with advantages on MyISAM : faster and schemaless.

But, as soon as web applications began to grow or even to become the main information system of some companies, lacking features as databases constraints is not acceptable: we need foreign keys, transactions and checks. The reason we need them is very simple. We can afford to have bugs in the application but we can NOT afford to let the database to become inconsistant. As soon as you notice discrepencies in your database, what piece of data can you trust ? Some developpers told me «with tests we can discover bugs early and ovoid to work with database constraints». That may be true but as the database you are designing is to become a main database, it is very likely other pieces of software will come using our database and you can not control what its developpers will do.

Web technologies slowly become mature, but why don’t we learn from the experience the venerable «big systems» developpers have ?

What do we need ?

Let’s go straight in the debate : what part of our application should we implement in the database ?

* The old wisedom said «put everything in stored procedures» * The “web” vision says «use your web language to process data»

What are the pros and the cons for each theory ?

in the database

You have performances and the features the database engine proposes you, no more, no less. Make your query to return everything you want formatted the way you want to keep treatments simples ! This is by example why Views are so useful !
The dark side of this practice is poor maintenability. Triggers by example are a hell to debug. They, with stored procedures, cannot be simply put in a VCS to be deployed, they need special operations to be changed on a database server.

You also need to learn a new language : the procedural language that comes with your database. This means people who are going to work on your application need to know a web language AND a database language (raise your hand out there). This automatically implies you are bound to a specific database with very little hopes to switch one day.

in the code

Implementing all data treatments in your model using methods and layers allows a developper who uses a performant editor to debug problems easily. SQL queries return raw results that often need additional queries and formatting to return a value set that can be manipulated simply. This makes the code more complex but OOP really helps us in this task.

In the other hand, you are conscious implementing such code in a high level (interpreted ?) language is … slow. I do not even talk about memory management with PHP wich ends up with apache processes consuming dozens of Megabytes each !
Furthermore, if other softwares come using your database, they will need to re-implement some of the processes you had in your web application.

Database abstraction is a legend

First of all, do we have to be database agnostic ? The answer fits in one other question: is it possible ? By example, with symfony you have 2 ORMs, Doctrine and Propel. Both aims at giving you an abstract database. You can change the backend without touching at the code. That would be perfect if it worked that way but a query like

SELECT a.title, a.created_at, a.author, count(b.id) FROM blog_post a JOIN blog_comment b ON a.id = b.blog_post_id GROUP BY a.title;

works only for MySQL ! If you give this to Postgresql, Oracle etc etc … it will raise an error saying that all columns in the SELECT must be GROUPed. This means it is possible to write queries that only work with MySQL if you use Propel or Doctrine and as lot of people are using MySQL it is very likely you find plugins that work only with this database.

The old dream about developping with a database and use another database in production is simply foolishness when you know INSERTing an empty string in an INTEGER column works with MySQL on Windows but fails on Linux. If you do so, expect to experience some painful discoveries when deploying on production.

At last, database abstraction means giving you the same reduced set of features whatever the database you use. At the end, everybody ends up using an ORM with MySQL which the poorest choice if you consider the features it provides when you compare it to Postgresql or even sqlite.

MySQL performances are a myth

All myth have a part of truth. MySQL offers really good performances when you use the MyISAM engine. But then, it becomes a simple data store you can query with SQL. InnoDB is the only choice you have as soon as you want to ensure database consistency but then, performances are far from the myth. Do not forget we are not talking about several megabytes databases but gigabytes or petabytes with tables containing million and million rows. We need performances with big volumes.

As a simple experience, I created a SQL file with 100 000 INSERTs on the following table to compare performances on my laptop.

CREATE TABLE test (id BIGINT PRIMARY KEY AUTO_INCREMENT, plop1 VARCHAR(255) NOT NULL, plop2 VARCHAR(255), plop3 INTEGER);

100 000 rows is nothing as I just said we want to deal with Gigabytes but it will give us an idea what we can do or not. Let’s do the tests on Postgresql and Sqlite in parallel. In all cases, logs are deactivated, InnoDB binary logs are deactivated, Postgresql fsync setting is set to OFF to be like it would be in a real production scenario.

 -- Postgresql
CREATE TABLE test (id SERIAL PRIMARY KEY, plop1 VARCHAR(255) NOT NULL, plop2 VARCHAR(255), plop3 INTEGER);

-- Sqlite3
CREATE TABLE test (id BIGINT PRIMARY KEY AUTO INCREMENT, plop1 VARCHAR(255) NOT NULL, plop2 VARCHAR(255), plop3 INTEGER);

Let’s go with the loading time

 # MySQL 5.1.41 with InnoDB table (utf8)
~$ time mysql tmp -u root -b < insert.sql

real    91m40.145s
user    0m8.693s
sys     0m6.228s

# Postgresql 8.4.4 (utf8)
~$ time psql tmp -q -U toto -f insert.sql

real    19m6.401s
user    0m12.573s
sys     0m5.112s

I could not manage to get the time for sqlite (version 3.6.22) because the sqlite3 client didn’t exit after importing the 100 000 rows.
Let’s go with MySQL, only the timing are reported here:

 mysql> SELECT count(*) FROM test;
1 row in set (0.14 sec)

mysql> CREATE TABLE test2 AS SELECT * FROM test;
Query OK, 100000 rows affected (1.29 sec)

mysql> SELECT count(*) FROM test2;
1 row in set (0.00 sec)

mysql> DELETE FROM test2 WHERE id % 3 = 1;
Query OK, 33334 rows affected (1.31 sec)

Huh, something weird here, we should have 33333 rows updated and MySQL finds one more row to match the condition here. ID starts from 1 which is the default for the AUTO_INCREMENT. We will sort that out later. Let’s go on with some funny queries:

SELECT a.id FROM test a LEFT JOIN test2 b ON a.plop1=b.plop1 WHERE b.id IS NULL;
^CCtrl-C -- sending "KILL QUERY 52" to server ...
Ctrl-C -- query aborted.
ERROR 1317 (70100): Query execution was interrupted

The query took more than 1400s (more than 20 minutes).

mysql>  CREATE INDEX plop1_idx ON test (plop1);
Query OK, 100000 rows affected (16.31 sec)

mysql> SELECT a.id FROM test a LEFT JOIN test2 b ON a.plop1=b.plop1 WHERE b.id IS NULL;
^CCtrl-C -- sending "KILL QUERY 52" to server ...
Ctrl-C -- query aborted.

The query took more than 400s

mysql>  CREATE INDEX plop2_idx ON test2 (plop1);
Query OK, 66666 rows affected (1.24 sec)

mysql> SELECT a.id FROM test a LEFT JOIN test2 b ON a.plop1=b.plop1 WHERE b.id IS NULL;
33334 rows in set (7.69 sec)

mysql> UPDATE test2 SET plop2=md5(plop1) WHERE id % 5 = 0;
Query OK, 13333 rows affected (1.41 sec)
Rows matched: 13333  Changed: 13333  Warnings: 0

mysql> SELECT count(a.id) FROM test a JOIN test2 b ON a.plop1 = b.plop1 WHERE b.plop2 IS NOT NULL;
1 row in set (0.88 sec)

mysql> DROP TABLE test2;
Query OK, 0 rows affected (0.01 sec)

mysql> DROP INDEX plop1_idx ON test;
Query OK, 100000 rows affected (5.41 sec)

mysql> DELETE FROM test;
Query OK, 100000 rows affected (3.70 sec)

Let’s now compare with Postgresql

tmp=#  SELECT count(*) FROM test;
Time : 89,992 ms

tmp=# CREATE TABLE test2 AS SELECT * FROM test;
Time : 590,567 ms

tmp=#  SELECT count(*) FROM test2;
Time : 103,141 ms

tmp=# DELETE FROM test2 WHERE id % 3 = 1;
DELETE 33333
Time : 681,639 ms

tmp=# SELECT a.id FROM test a LEFT JOIN test2 b ON a.plop1=b.plop1 WHERE b.id IS NULL;
Time : 928,687 ms

tmp=#  CREATE INDEX plop1_idx ON test (plop1);
CREATE INDEX
Time : 2284,973 ms

tmp=# SELECT a.id FROM test a LEFT JOIN test2 b ON a.plop1=b.plop1 WHERE b.id IS NULL;
Time : 927,227 ms

tmp=#  CREATE INDEX plop2_idx ON test2 (plop1);
CREATE INDEX
Temps : 1429,269 ms

tmp=# SELECT a.id FROM test a LEFT JOIN test2 b ON a.plop1=b.plop1 WHERE b.id IS NULL;
Time : 930,470 ms

tmp=# UPDATE test2 SET plop2=md5(plop1) WHERE id % 5 = 0;
UPDATE 13334
Time : 601,247 ms

tmp=# SELECT count(a.id) FROM test a JOIN test2 b ON a.plop1 = b.plop1 WHERE b.plop2 IS NOT NULL;
Time : 760,900 ms

tmp=# DROP TABLE test2;
DROP TABLE
Time : 61,846 ms

tmp=# DELETE FROM test;
DELETE 100000
Temps : 956,793 ms

and compare it with sqlite

sqlite> SELECT count(*) FROM test;
CPU Time: user 0.024001 sys 0.016001

sqlite> CREATE TABLE test2 AS SELECT * FROM TEST;
CPU Time: user 0.508032 sys 0.176011

sqlite> SELECT count(*) FROM test2;
CPU Time: user 0.008000 sys 0.040003

sqlite> DELETE FROM test2 WHERE id % 3 = 1;
CPU Time: user 0.400025 sys 0.520032

sqlite> CREATE INDEX plop1_idx ON test (plop1);
CPU Time: user 2.440153 sys 1.288081

sqlite> SELECT a.id FROM test a LEFT JOIN test2 b ON a.plop1=b.plop1 WHERE b.id IS NULL;
Error: interrupted (note: less than 2 results per second)

sqlite> CREATE INDEX plop2_idx ON test2 (plop1);
CPU Time: user 1.404088 sys 0.400025

sqlite> SELECT a.id FROM test a LEFT JOIN test2 b ON a.plop1=b.plop1 WHERE b.id IS NULL;
CPU Time: user 2.812175 sys 0.684043

sqlite> UPDATE test2 SET plop2=md5(plop1) WHERE id % 5 = 0;
CPU Time: user 0.000000 sys 0.000000
Error: no such function: md5

sqlite> DROP TABLE test2;
CPU Time: user 0.156010 sys 0.824051

sqlite> DELETE FROM test;
CPU Time: user 0.180011 sys 0.912057

In the examples above, I took 3 open source RDBMS the way they come with my distribution (Ubuntu 10.4), changed some config parameters to suppress log outputs and got the results. MySQL using InnoDB is the slowest database system of the pool and beleive me, peformances keep decreasing when you reach million rows tables. Of course all databases can be fine tuned for performances depending on your data structure, are you trying to convince me we need a DBA ?

Ok, what do we do ?

We need a DBA

Open source web developpment is a strange world. In lot of companies I have been working, people are working with windows because Linux is too complicated, they were using PHP because OO is too complicated and they ran MySQL because Postgresql is too complicated. If you need to build a bridge able to support heavy trucks, you don’t go with the same tools you would use to build a catwalk. This means we need at least someone in the developpment team who knows the database system:

  • he assists the developpment team to create efficient queries
  • he designs the database for performances and scale
  • he sets constraints to keep database consistent
  • he knows the tools the database offers you to implement business processes in it

business processes in the database

Finally ! What should we implement in the database and how ? This question reminds me a long discussion we had about using ENUMs in MySQL1. Some of the group thought ENUMs were good because it sets up a contraint and we do not have to care more than retreiving values from the database or receiving and exception if we tried to save a forbidden value. Others in the group thought ENUMs are a pain to change because of the work it implies during the deployment process. Both of them are true. If I refer to the Database Design Solutions2 an answer would be «use ENUMs for values who will not be subject to change».

We have the point here : we can put in the database the processes that are not supposed to change. Of course you can tell me everything is changing all time but you know you profit from a process in the database you change twice in several years compared to the same process in PHP or python or Java.

Let’s say by example, I manage user authentication in my database. I set up a TRIGGER that encrypts passwords on INSERT or UPDATE and I can create a stored procedure to check if password and login match or not.

  • This is not likely to change over time
  • I have a centralized mechanism that serves all applications using my database

Another example : products labelisation. Coffee can be granted Gold, Silver, Green, Fat Free etc etc labels depending an lot of different criterias. Unless you experience performance issues these treatments will probably be in high level languages because they offer more tools to manage workflows, exceptions etc etc. Postgresql proposes something different as it integrates directly Python or Perl as prodedural languages and it natively manages exceptions. I haven’t tested this solution yet on a real life project but it sounds promising.

Conclusion

As web developpment (and web developpers) becomes more mature, the projects become bigger and bigger. This implies different architectures from what was used and then, a different use of the database. Answers can be found in the Big Systems world where the database is part of the application server. This has several advantages over coding everything in the web server :

  • It is fast
  • It is centralized
  • It is consistent

But this practice makes you project a bit more complex:

  • Deployments
  • Debugging
  • Need new skills

One of the anwser here would be to code in the database everything that is not meant to change. This lowers the cost of deployment and debugging. One more word about debugging, PHP ORMs propose what we call behaviors to trigger processes on the database. They are as hard to debug than real database TRIGGERS and less performant by far.

This is big trouble for MySQL because it brings this database on a field it is not suited: using InnoDB makes it slow and inefficient on big queries. The lack of functionnalities (functions, stored procedures, triggers, partial commit) would make anyone who know about databases not to choose it. When using MyISAM, it is superseded by document oriented databases like MongoDB. Funny to see the NoSQL movement may be seen as NoMySQL.

Document oriented databases are great and I do not think they can replace the whole database stack in big applications. They can instead be used alongside with relationnal databases to handle cases where SQL queries would be too slow. One more time, Postgresql offers a key => value store data type that can be used to create “extensible” rows with data that are never used in searchs. With the 9.0 version coming, it will allow replication as simple as it is with mysql but it offers more performances and more features.

Big critical applications are the next challenge in the web developpment and postgresql is probably the golden choice for theses huge projects to come.

1 The ENUM problem occures in MySQL because it is defined on a column basis. In Postgresql, enums are user defined types that can be used in many columns you want. Just change the type and all the columns are updated.

2 Database Design Solution By Rod Stefens (Wrox)

Liens récents 

 Crew - code review for Git projects.

Marqué par greg le 6 mai 2012 07:57..
Tags: geek computing git

Suivre le lien

 responsive grid system

Marqué par greg le 5 mai 2012 12:38..
Tags: geek computing dev

Suivre le lien

 Vim "improved

Marqué par greg le 8 mars 2012 09:26..
Tags: geek computing vim

Suivre le lien

 Highly scalable NoSql blog

Marqué par greg le 3 mars 2012 09:12..
Tags: geek computing dev

Suivre le lien

 Bruce Momjian Pg blog

Marqué par greg le 16 février 2012 13:51..
Tags: geek computing postgresql net blog

Suivre le lien

 Richard Miller's blog

Marqué par greg le 16 février 2012 09:57..
Tags: geek computing dev php symfony net blog

Suivre le lien

 Online SQL explain interface

Marqué par greg le 15 février 2012 09:58..
Tags: geek computing postgresql net

Suivre le lien

 Ace online code editor

Marqué par greg le 15 février 2012 08:57..
Tags: net geek computing dev javascript

Suivre le lien

 Coding horror

Marqué par greg le 14 février 2012 12:42..
Tags: geek computing dev net blog

Suivre le lien

 telegraphCQ - Stream oriented database

Marqué par greg le 13 février 2012 14:44..
Tags: geek computing postgresql

Suivre le lien