The Wayback Machine - http://web.archive.org/web/20130516091401/https://lwn.net/Articles/499357/
 

Clustering, development, and galactic conquest at PGCon 2012

May 31, 2012

This article was contributed by Josh Berkus

Every year, the PostgreSQL community has a developer meeting in Ottawa called PGCon. This year's PGCon was the largest yet, with 210 contributors and users attending. As usual, though, the most interesting developments happened outside the regular conference sessions. This year's events included new replication and clustering projects, changes to the PostgreSQL development process, and a battle for galactic conquest.

PostgresXC 1.0 beta

There were two major announcements made at the one-day Postgres Clustering Summit before the main conference. The PostgresXC project announced its 1.0 beta release, and 2ndQuadrant announced its plans for a new replication method for PostgreSQL. The Clustering Summit, a meeting of developers working on replication and database clustering for PostgreSQL, was sponsored by NTT Open Source of Japan.

PostgresXC is a project to build a write-scalable cluster for PostgreSQL. A PostgresXC cluster is intended to be a co-located group of four to twenty servers, and is designed to be able to handle larger numbers of database connections and concurrent write queries than a single server. The goals of the project are to build something which satisfies a lot of the same use cases as Oracle Real Application Clusters (RAC), Oracle's database clustering technology. PostgresXC implements a shared-nothing architecture instead of RAC's shared-storage architecture in order to survive a larger variety of hardware failure situations.

As of the 1.0 beta release, PostgresXC supports most PostgreSQL syntax and functionality for a small cluster of database servers so you can theoretically deploy most PostgreSQL-based applications on PostgresXC. Notable limitations with 1.0 beta are the lack of triggers and savepoints, as well as the inability to add new nodes without rebuilding the cluster. These issues will be addressed in 1.1 and later versions.

One of the primary differences between regular PostgreSQL and PostgresXC is the notion of distributed tables. Each table you create can be distributed across the cluster based on round-robin, a hash, or a column value, instead of being replicated to all cluster nodes. This allows the database cluster to support more data than any single node can.
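As an illustration of the idea, here is what the distribution choices look like in DDL. The DISTRIBUTE BY clause reflects Postgres-XC 1.0 syntax, but the table and column names are invented for this example:

```sql
-- Hypothetical tables illustrating PostgresXC's DISTRIBUTE BY clause.
-- Rows of "orders" are spread across the cluster by hashing customer_id,
-- so no single node has to hold (or write) the whole table.
CREATE TABLE orders (
    order_id    bigint,
    customer_id bigint,
    total       numeric
) DISTRIBUTE BY HASH (customer_id);

-- A small lookup table is instead replicated in full to every node,
-- so joins against it never need to cross the network.
CREATE TABLE countries (
    code text,
    name text
) DISTRIBUTE BY REPLICATION;
```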

The PostgresXC project primarily consists of developers from NTT Open Source and EnterpriseDB, and it is licensed under The PostgreSQL license. Once PostgresXC is more stable, expect rapid adoption due to the needs of users for bigger, more fault-tolerant database servers.

New "logical" replication project

The second replication announcement was 2ndQuadrant's plan to create a new replication technology for PostgreSQL, which they call "Bi-Directional Replication", mostly to distinguish it from other forms of replication. They have secured sponsorships from companies like McAfee, have begun work on the patches, and plan to get the feature done for PostgreSQL 9.3 in 2013. 2ndQuadrant is a PostgreSQL consulting company headquartered in the United Kingdom, led by committer Simon Riggs.

At an evening pizza session, they explained how the new replication will work. PostgreSQL's streaming binary replication, introduced in version 9.0, allows a replication server to receive copies of data pages over a normal PostgreSQL database connection (usually port 5432). This connection is called the "walsender" and its associated listener process on the replica is called the "walreceiver" (WAL stands for Write-Ahead Log). This works well for whole-database replication, but does not permit users to replicate only specific parts of the database, or to run any kind of write queries on the replicas. As a result, other types of replication are also needed.
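For context, a minimal 9.0-era streaming-replication setup looks like the sketch below; the hostnames and values are illustrative, but the parameters themselves are the real ones:

```
# postgresql.conf on the primary
wal_level = hot_standby      # write enough WAL for a standby to replay
max_wal_senders = 3          # allow up to three walsender connections

# recovery.conf on the standby; its walreceiver connects back to the
# primary over an ordinary PostgreSQL connection
standby_mode = 'on'
primary_conninfo = 'host=primary.example.com port=5432 user=replicator'
```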

The new feature will allow clients to choose to receive either replicated query statements or changed rows over the same walsender connection, so that listeners can filter or modify the replication stream. More importantly, it would allow a replica to create database changes as well as receive them, provided that conflicts were handled by database or middleware code, thus implementing one form of multi-master replication for PostgreSQL.

This type of replication configuration is already possible with replication tools such as Slony-I, Londiste, and Bucardo. However, all of those tools suffer from administration and performance overhead which many users find unacceptable because they are implemented in Postgres user space instead of in the Postgres backend. Bi-Directional Replication is expected to be more efficient, and thus useful to a wider array of users. It may even become the underlying data transport mechanism for future versions of the older replication systems.

Bi-Directional Replication is also intended to solve a different use-case than PostgresXC. While PostgresXC is designed to provide scalability and consistency for database writes within a single data center, the new replication is intended to support asynchronous, geographically distributed replication, at the expense of consistency. Such "eventual consistency" systems have become popular among web startups recently, and PostgreSQL users are no exception.

CommitFests and the PostgreSQL development process

One of the other meetings at PGCon was the annual Developer Meeting, at which 20 of the largest code contributors to PostgreSQL meet in person and discuss current and future development of the database. At the Developer Meeting three years ago, the PostgreSQL project adopted the CommitFest model for managing patch review and annual development. At this one, the group reviewed how the CommitFest process was going, and decided to make some minor changes to try to improve it.

The central idea of the CommitFest model is to completely clear out the queue of pending patches every two months, in order to prevent catastrophic levels of unreviewed patch buildup. CommitFests have allowed the project to accept a large number of patch submissions and make sure that everything submitted got a serious review. At the developer meeting, contributor Noah Misch said that the guarantee of code review was one of the things which attracted him to the PostgreSQL project in the first place.

Unfortunately, over the last three releases, the last of the four CommitFests has been getting longer: one month for 9.0, two months for 9.1, and three for 9.2. Contributing to this problem is that the number of patch reviewers has not been growing as fast as the number of submissions, leading to a risk of reviewer burn-out.

The main reason the final CommitFest does not finish on schedule has been large, complex, not-quite-ready patches which need extensive review by core committers. In order to address this, PostgreSQL will add a planning period after the third, or penultimate, CommitFest in order to separate pending patches into large or small, and ready for committers or not quite ready. Then, in the third week of the final CommitFest, the team will be able to do a final triage of what's ready for release and what's not, and conclude the CommitFest after the expected 30 days.

Another cause of contributor pain has been submitting new features without an agreed specification, often causing those patches to need complete recoding. In order to add a more formal process around specification review, the project will be adding a "design specification" section to the CommitFests and encouraging people to submit specifications for review. The developers also adopted a "one patch, one review" policy, which means that patch contributors are obligated to review at least one patch contributed by someone else.

Conquering the galaxy with PostgreSQL

As a new event this year, PGCon included a Schemaverse tournament. Schemaverse is a text-mode galactic conquest wargame, played entirely in a PostgreSQL database. Users issue commands to build and move spaceships, engage in battles, and conquer and mine planets, all in the form of SQL queries. You win the game either by conquering the most planets, or by successfully hacking the database.
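To give a flavor of play, commands look something like the following; the table and function names here are loose approximations for illustration, not Schemaverse's actual schema:

```sql
-- Illustrative only: names approximate, not copy, the real Schemaverse schema.
INSERT INTO my_ships (name) VALUES ('dreadnought');          -- build a ship
UPDATE my_ships SET speed = 1000
    WHERE name = 'dreadnought';                              -- get it moving
SELECT id, location FROM planets
    WHERE conqueror_id IS NULL;                              -- scout unclaimed planets
```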

Schemaverse was invented by Josh McDougall as a fun data security exercise for the DEFCON 2011 security conference. Notably, no player at DEFCON, at PGCon, or in the open online version has yet managed to win by hacking the PostgreSQL database. The game also demonstrates how PostgreSQL allows sophisticated application functionality to be built either inside or outside the database.

The PGCon tournament, which ran for the entirety of the conference, was won by Chris Browne of the domain registrar Afilias. Chris spent weeks leading up to PGCon honing automated "artificial intelligence" scripts to control his fleet of spaceships, and rapidly seized over 500 planets. He won several prizes including a championship sweatshirt.

Other conference highlights

Peter van Hardenberg, database lead at platform-as-a-service company Heroku, delivered the keynote for PGCon. His talk was about Heroku's users, the cloud, NoSQL, and the future of PostgreSQL. First he discussed the large number of PostgreSQL databases (1.6 million) belonging to Heroku's hundreds of thousands of users, who are mostly small-team agile-development web application developers. Van Hardenberg appealed to the PostgreSQL developers to keep Heroku's users in mind when working on PostgreSQL and new features, particularly emphasizing the need for usability improvements.

In case studies at the conference, NASA scientists talked about using Postgres for data collection on the Aura spacecraft for ozone monitoring. 2ndQuadrant staff presented the migration of top website Fotolog from MySQL to PostgreSQL. A contributor to the open source library management system Evergreen demonstrated their application and explored its data architecture. And Mozilla staff talked about the Socorro crash collection system.

Other talks focused on many of the features and internals of PostgreSQL 9.2, currently in beta. This included Range Types, the new replication protocol, and the internals of how many of the performance improvements were implemented. As they did last year, PostgreSQL's Russian development team introduced another new indexing idea, this time for similarity searching.

PGCon 2012 was a busy, intense week for PostgreSQL developers. Users can expect a lot more great things coming out of the project because of it. If your job centers around PostgreSQL, you should consider attending next year. Later this year, you can attend Postgres Open, a business and user-oriented conference in Chicago, or PostgreSQL Europe in Prague.



Also, consider DRBD

Posted Jun 1, 2012 13:30 UTC (Fri) by Richard_J_Neill (subscriber, #23093) [Link]

One other way to do database replication (between just two servers, where one is a hot standby) is to use DRBD at the filesystem layer. This actually works quite well, including good performance.

We have two adjacent servers, each of which has paired RAID 1 (mirrored) SSDs for /var/lib/pgsql/data. The servers are connected by a dedicated gigabit ethernet cable, and run drbd with protocol B. Postgresql runs only on the primary machine. When the application is informed of a successful Commit, the data is on the primary SSD, and at least in RAM on the secondary server.
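A minimal drbd.conf resource along these lines might look like the sketch below; the device paths, hostnames, and addresses are invented for illustration. The key line is `protocol B;`, under which a write is acknowledged once it has reached the peer's RAM (not its disk):

```
resource pgdata {
    protocol B;                   # ack once the data is in the peer's RAM
    device    /dev/drbd0;
    disk      /dev/md0;           # the local RAID-1 SSD pair
    on primary-host {
        address 192.168.1.1:7789; # dedicated gigabit link
    }
    on secondary-host {
        address 192.168.1.2:7789;
    }
}
```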

It's possible to get completely seamless failover here, using e.g. heartbeat; though we prefer to require manual intervention because, for us, data loss is bad; 5 minutes' downtime is acceptable in an emergency; "split-brain" would be a real problem.

Also, consider DRBD

Posted Jun 1, 2012 16:28 UTC (Fri) by jberkus (subscriber, #55561) [Link]

Richard,

A fair number of people do this, and it certainly works as far as redundancy and failover are concerned. However, since DRBD replicates the whole filesystem, it has to copy a LOT more data than database replication does, and as a result performance is characteristically much worse, especially with less than ideal disk or network speeds. The Postgres+DRBD systems I've consulted on had response times on small writes which were at least 3X longer than a standalone system.

There's a commercial company (I've forgotten the name) which has enhancements to DRBD, including a robust asynchronous mode which improves on basic DRBD performance.

Also, consider DRBD

Posted Jun 1, 2012 16:44 UTC (Fri) by Richard_J_Neill (subscriber, #23093) [Link]

We had to deploy this into production ~ 3 years ago, when there were fewer alternatives. I agree that it hurt performance a bit, though in fact we still got pretty awesome performance: the secondary server was completely idle apart from the drbd slave, and the network was a single 6' cable between 2 dedicated gigabit cards. With protocol B, I think we found that DRBD's copying data to the secondary server's RAM took less time than for the primary server to write to its disk (at least for larger datasets).

I'm curious as to why replicating the whole filesystem has to copy much more data than other forms of database replication: I thought that typically the ext4 overhead was quite small?

As you say, for small writes, there are some problems, and there is some case for deciding on a per-transaction-type basis whether that transaction is critical, or slightly less critical. (i.e. if someone puts an axe through the server right now, how much do the last 20 ms of that type of data matter?). Either way, postgresql is an amazing product :-)

Also, consider DRBD

Posted Jun 1, 2012 17:52 UTC (Fri) by nix (subscriber, #2304) [Link]

When you copy the whole filesystem, you're copying across changes to the WAL (journal) as well as changes to the data/ files it is journalling. As the WAL is frequently (f)(data)sync()ed, this overhead is even higher. I am also sceptical as to its safety: it seems easy to me to synch across a change to the WAL on commit or WAL rollover but lose power before the corresponding change to the datafile is synched: on restart, the remote PostgreSQL will think that no WAL replay is necessary, when in fact one is needed.

PostgreSQL's native replication simply streams the WAL across, and replays it into the datafiles on the remote node. This is guaranteed safe.

Also, consider DRBD

Posted Jun 1, 2012 22:33 UTC (Fri) by jberkus (subscriber, #55561) [Link]

Nix,

We've had reason to destruction-test DRBD+Postgres. From a data safety perspective it works as well as one could hope, provided that you take steps to prevent split-brain. It's just performance which leaves something to be desired.

Also, consider DRBD

Posted Jun 4, 2012 12:36 UTC (Mon) by nix (subscriber, #2304) [Link]

Oh, I'm willing to believe it works most of the time. I just can't see how it's safe 100% of the time (when WAL rollover happens). However, it might be that the failure cases require long periods of downtime and very unluckily-timed powerdowns, in which case you might never see it.

But quite possibly I'm missing something.

Also, consider DRBD

Posted Jun 4, 2012 15:07 UTC (Mon) by andresfreund (subscriber, #69562) [Link]

As long as the snapshot is atomic it *has* to work. Otherwise the original purpose of the WAL - crash recovery - wouldn't be met.
Checkpoints are crash safe. What's the problem you're seeing there?
The checkpoint record is only written to the WAL *after* everything but the checkpoint information has been written out. Only after the checkpoint has been fsynced to disk are resources - like the WAL - reused.

Also, consider DRBD

Posted Jun 4, 2012 18:18 UTC (Mon) by nix (subscriber, #2304) [Link]

Agreed, this is perfectly fine. I now suspect that my memory is lying to me: it tells me faintly that DRBD may transmit data in arbitrary order and does not do a complete transmit on fsync(), but I now suspect I'm thinking of some other distributed block device and just mixed it up with DRBD. If DRBD respects fsync(), then everything works.

Also, consider DRBD

Posted Jun 4, 2012 18:25 UTC (Mon) by andresfreund (subscriber, #69562) [Link]

I think you can configure it in a way that not all required guarantees are met. Those configurations are not generally recommended, as far as I remember, though.
...
Yep: http://www.drbd.org/users-guide/re-drbdconf.html check the docs for disk-barrier.

Also, consider DRBD

Posted Jun 1, 2012 21:25 UTC (Fri) by andresfreund (subscriber, #69562) [Link]

The problem is that you cannot use the 2nd server for anything (running backups, reporting queries, ...) that way. Also you have the problem that starting with crash recovery (which is what will happen if you fail over to the 2nd server, because pg hasn't been cleanly shut down) can take a long time if you have a big database and allow some WAL to accumulate for performance reasons (checkpoint_segments).

Also, consider DRBD

Posted Jun 4, 2012 10:08 UTC (Mon) by niner (subscriber, #26151) [Link]

That's not entirely true. If your DRBD device is on top of LVM you can take a snapshot and mount this snapshot even on the secondary. We use such a setup for backups and it works just fine. You could even run PostgreSQL on this snapshot and do your reports from it.

Of course LVM snapshots have their own performance problems, and just using PostgreSQL's native replication might be easier and faster, but still, just using DRBD is a possibility which allows you to cover more than just the database.

Also, consider DRBD

Posted Jun 4, 2012 10:36 UTC (Mon) by andresfreund (subscriber, #69562) [Link]

Sure, you can do that or similar things (loads of SANs have that capability). Keeping the secondary pg instance up to date is really expensive in that scenario though. You need to shut down pg, drop the old snapshot, create a new one, and start pg, which will do recovery. If you have nontrivial amounts of writes the WAL replay upon startup will take quite some time....
I don't really see any reason to do so these days unless your database is gigantonormous and you cannot afford to have a full copy of the datadir for reporting.

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds