Pivotal Engineering Journal

Technical articles from Pivotal engineers.

ByteA versus TEXT in PostgreSQL (for textual data)

One of our customers switched from MongoDB to PostgreSQL, and the migration tool created all data fields as ByteA instead of TEXT. Makes one wonder, if there is a performance difference and if TEXT could be a wiser choice.

Posted on by
Categories:   PostgreSQL    Performance   
Edit this post on GitHub.

One of our customers switched from MongoDB to PostgreSQL, because the JSON performance in MongoDB was not sufficient.

The application is already running faster on PostgreSQL, but for now it must run in parallel on both platforms, and queries (using a framework) have to work equally on both databases. Therefore the superior JSON and JSONB functionality in PostgreSQL can’t be used here - for now. However the migration tool chose to make every data field a ByteA column in PostgreSQL - and I wondered if that is a good choice. Therefore I created a small test case to measure the read and write performance of ByteA and TEXT. A common recommendation is to encode data somehow (in Base64, as example) and then store it in TEXT instead of ByteA. So I measured the overhead of that approach as well.

Test approach

The actual test is written in Perl, but here is an abstract of what is measured.

The database

The single table used here is always an integer column (think of a primary key), and a data field:

CREATE TABLE test (id INT, data BYTEA);
CREATE TABLE test (id INT, data TEXT);

I decided not to create a Primary Key, in order to avoid the performance overhead.

The data

For every test I chose a string with 1000 random characters.

use String::Random;

my $data = "";
my $write_data = "";
for (my $a = 1; $a <= 100; $a++) {
    my $rand = new String::Random;
    if ($param2 eq 'BYTEA') {
        $data .= $rand->randpattern("bbbbbbbbbb");
        $write_data = $data;
    } elsif ($param2 eq 'TEXT') {
        $data .= $rand->randpattern("..........");
        $write_data = $data;
    } elsif ($param2 eq 'BASE64') {
        $data .= $rand->randpattern("bbbbbbbbbb");
        $write_data = encode_base64($data);
    }
}

The Perl module “String::Random” takes care of creating a random string. For ByteA, a truly random character, for TEXT the string is in the printable characters area. The Base64 encoding takes place here, but the original string is stored and compared later with what was read from the database (not part of the performance test).

The read test

To measure the read performance, a single string is written into the table, and then read multiple times.

for (my $a = 1; $a <= $param1; $a++) {
    $start_time = [gettimeofday];
    $query = "SELECT * FROM test";
    $st = $main::db->prepare($query);
    $st->execute;
    $row = $st->fetchrow_hashref;
    $st->finish;
    $end_time = [gettimeofday];
    $intermediate_time += tv_interval($start_time, $end_time);
}

No index overhead, just a very simple query and one single row in the table. The average over three runs is used. The data is compared which was written to disk, but the comparisation happens after the second time is taken.

The write test

To measure the write performance, the data is written into the table using a prepared statement, only the execution time is measured.

$query = "INSERT INTO test (id, data) VALUES (?, ?)";
$st = $main::db->prepare($query);
for (my $a = 1; $a <= $param1; $a++) {
    $start_time = [gettimeofday];
    $st->bind_param(1, '1');
    if ($param2 eq 'BYTEA') {
        $st->bind_param(2, $write_data, { pg_type => PG_BYTEA });
    } elsif ($param2 eq 'TEXT') {
        $st->bind_param(2, $write_data, { pg_type => PG_TEXT });
    } elsif ($param2 eq 'BASE64') {
        $st->bind_param(2, $write_data, { pg_type => PG_TEXT });
    }
    $st->execute;
    $end_time = [gettimeofday];
    $intermediate_time += tv_interval($start_time, $end_time);
}
$st->finish;

Again the results are averaged over three runs.

The results

Read performance

Many use cases rely on a good read performance. Not surprisingly, the performance for the TEXT datatype is around 15% better than the ByteA datatype.

It is a bit surprising that the overhead for reading base64 encoded data is around 10% better than reading the binary data in ByteA.

Write performance

At first glance, the numbers for writing into the different datatypes show no real difference.

Looking into the detail, there is a very small difference, and using TEXT is only slightly faster than using ByteA.

Conclusions

Switching from ByteA to TEXT will result in an instant performance improvement for read queries. That is a big gain, given that no application code has to be changed. The write performance will not change much.

Even switching from ByteA to encoded data (like Base64) will improve the read performance - this approach can be considered if the application can be changed, and truly binary data is stored in the database.

Transformation

In our case, the table is created automatically, and it is not a problem to redo the entire operation using TEXT instead of ByteA. However if you are in a similar situation, the following SQL command shows how to change the datatype:

ALTER TABLE <table name> ALTER COLUMN <column name> TYPE TEXT;