We recently rewrote an entire application for a client. The original had been running in production for over 10 years and was still in daily use when we started the rewrite, so it had accumulated a lot of data. Our client wanted to do a Big Bang release, which meant we needed to transfer all of that data into the new application in a short window.
Here are the constraints we were working under:
So we started with a simple script and iterated over a few weeks to reach our goal. To make that happen, it was important to run the ETL every day, and automating the process was essential for continuous feedback and for the acceptance of user stories. The rest of the article describes our automated process and what we learned during its development.
The extract step consisted of two parts: expdp and scp.
expdp and impdp are the Oracle-recommended tools for exporting and importing database schemas and data.
There was a lot of data in the old system, and we quickly realized we would not import all of it. A quick win was to limit the amount of data we exported, either through time ranges or by discarding some tables entirely. In our case we could exclude an entire audit table containing more than 16 million records.
Sample expdp command:
export ORACLE_HOME=/u01/oracle/product/11.1.0/db_1
export ORACLE_SID=MY_DB_NAME
export TABLES_TO_EXPORT=production.reservation,production.payment,production.user
/u01/oracle/product/11.1.0/db_1/bin/expdp DB_EXPORT_USER/amazingPassword@//dbserver:1521/MY_DB_NAME dumpfile=etl.dmp logfile=etl.dmp.log content=data_only directory=DMPDIR tables=$TABLES_TO_EXPORT \
"query=production.reservation:\"WHERE creation_time >= '01-JAN-15'\", production.payment:\"WHERE valid = 1\""
We ran the above code using ssh on the legacy production server.
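In the command above, the audit table is excluded simply by not listing it in TABLES_TO_EXPORT. If you prefer a schema-level export, Data Pump's exclude parameter can skip tables explicitly; putting the parameters in a parameter file (say etl.par) avoids shell-escaping headaches. The AUDIT_LOG table name below is only an example:
schemas=production
content=data_only
directory=DMPDIR
dumpfile=etl.dmp
logfile=etl.dmp.log
exclude=TABLE:"IN ('AUDIT_LOG')"
Then point expdp at it:
/u01/oracle/product/11.1.0/db_1/bin/expdp DB_EXPORT_USER/amazingPassword@//dbserver:1521/MY_DB_NAME parfile=etl.par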
(Note: DMPDIR here refers to a directory object that we had to manually create in the Oracle database.)
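Creating that directory object (and granting access to the export user) is roughly the following, run once as a privileged Oracle user; the filesystem path matches where we later pick up the dump, but yours will differ:
CREATE OR REPLACE DIRECTORY DMPDIR AS '/u01/oracle/myDumps';
GRANT READ, WRITE ON DIRECTORY DMPDIR TO DB_EXPORT_USER;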
The only thing worth noting about this step is to make sure you use compression: we saw file transfers complete close to four times faster.
scp -C john@myserver:/u01/oracle/myDumps/etl.dmp ~/Downloads/myDumpfile.dmp
Also make sure you can automate this using ssh keys for authentication.
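A minimal sketch of that setup, reusing the john@myserver host from the command above (the key name is arbitrary):
ssh-keygen -t rsa -b 4096 -f ~/.ssh/etl_key -N ""
ssh-copy-id -i ~/.ssh/etl_key.pub john@myserver
scp -C -i ~/.ssh/etl_key john@myserver:/u01/oracle/myDumps/etl.dmp ~/Downloads/myDumpfile.dmp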
CentOS is by far the easiest Linux distribution to set up with Oracle. It’s also easy to set up in VMware or VirtualBox. Make sure you run it on a beefy machine; an SSD will improve restore and read times when transferring to Postgres later.
We use impdp for the restore:
impdp production/production "DIRECTORY=downloads" "DUMPFILE=myDumpfile.dmp" "TABLE_EXISTS_ACTION=truncate"
The downloads directory is similar to the previous DMPDIR we used with expdp. It points at a shared folder we set up on our VM, which maps to our local OS X ~/Downloads folder.
Make sure you truncate tables before importing by using the TABLE_EXISTS_ACTION option; this will make it a lot easier to run the import over and over again.
Before we can easily perform all of the data transformations, we need to get an identical schema into our local Postgres database.
We tried multiple approaches; one of them was the ora2pg script, but we found that solution to be too slow.
What worked really well for us was using a Foreign Data Wrapper to extract the data from Oracle into an identical schema in our local Postgres database.
A Foreign Data Wrapper is not optimal for running complex queries against, but it works great for transferring data into Postgres using simple INSERT INTO ... SELECT ... statements.
For this you will need to set up a schema with foreign data tables mapping to the Oracle tables you care about, and a local schema with the same structure (a sketch of the foreign-table setup follows the example below). Then you can transfer the data like so:
INSERT INTO etl.payments (user_id, reservation_id, amount)
SELECT user_id, reservation_id, amount FROM foreign_data.payments
WHERE reservation_id IN (SELECT id FROM etl.reservation);
Here, etl is the local Postgres schema, and foreign_data is the one with the foreign tables.
As you can see in the WHERE condition, we were able to take advantage of data we had already imported to limit the records we imported.
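For illustration, with the oracle_fdw extension that foreign-table setup looks roughly like this; the connection details reuse the earlier placeholders, and the column list is only an example, not our actual schema:
CREATE EXTENSION oracle_fdw;

CREATE SERVER legacy_oracle FOREIGN DATA WRAPPER oracle_fdw
  OPTIONS (dbserver '//dbserver:1521/MY_DB_NAME');

CREATE USER MAPPING FOR CURRENT_USER SERVER legacy_oracle
  OPTIONS (user 'production', password 'production');

CREATE SCHEMA foreign_data;

-- One foreign table per Oracle table we care about
CREATE FOREIGN TABLE foreign_data.payments (
  id             bigint,
  user_id        bigint,
  reservation_id bigint,
  amount         numeric
) SERVER legacy_oracle OPTIONS (schema 'PRODUCTION', table 'PAYMENT');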
In our scenario, our client's team had grown familiar with Ruby, so we decided to use Ruby to both structure and test our data transformations. At first there was much debate about which language or framework we should use, but in the end the choice did not matter much, because most of our transformation code was written in pure SQL and ran over a single database connection.
We used the sequel gem to execute SQL directly on our Postgres database; it also gave us additional flexibility for cases that were too complex to maintain as raw queries.
Because of these choices, testing became easy: we could insert data into our etl schema, run our transformation service, then assert that the correct records were inserted into our public schema.
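A rough sketch of what running a transformation through sequel looked like; the connection URL, table, and column names here are illustrative, not our actual schema:
require "sequel"

DB = Sequel.connect(ENV.fetch("ETL_DATABASE_URL"))

# Most transformations were plain SQL executed over a single connection
DB.run <<~SQL
  INSERT INTO public.payment_notes (payment_id, note, type)
  SELECT payment_id, note, type FROM etl.payment_notes;
SQL

# The dataset API was handy for cases that were awkward as raw SQL
DB[:reservations].where(status: "cancelled").update(archived: true)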
Some general guidelines on performing the transformations:
Keep track of where each row came from by adding source_id columns containing x-urn:origin_table:origin_id style values to your target tables (see Uniform Resource Name).
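As a hypothetical example (the target table and columns here are made up), carrying the origin along in a transformation looks like:
INSERT INTO public.payments (source_id, user_id, amount)
SELECT 'x-urn:payments:' || id, user_id, amount
FROM etl.payments;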
Before we dive into some of the useful SQL techniques we used, we think it’s worth mentioning some of our first steps in transforming the data. There were a lot of tables in the old system, and much of that data still exists in the new system. It was intimidating at first, so we had to find a good starting point.
First we tried to tackle the easiest tables and work towards the hardest ones, those with the most relations. This approach proved difficult, because many of the easier tables were shown on pages centered around some of the harder tables. This meant the ETL stories could not be tested and accepted until we had more data.
After struggling through that approach, we took a step back and decided to do the exact opposite: we started with the most central table in the system. For our first iteration of the import, we associated it with “dummy” records, then iterated through the relationships until we no longer needed the “dummy” entries.
This allowed us to immediately and continuously see the results of our transformations in our application.
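As an illustration of the “dummy” record trick (the table and column names here are hypothetical), a first pass might point every reservation at a placeholder user until the real users are imported:
-- Placeholder owner, removed once the real users table is transformed
INSERT INTO public.users (id, name) VALUES (0, 'ETL placeholder');

INSERT INTO public.reservations (id, user_id, created_at)
SELECT id, 0, creation_time
FROM etl.reservation;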
Now onto some Postgres/SQL features we found most useful during this process!
Common table expressions are great for breaking the problem down into easily understood chunks, and they make queries much easier to follow. For example:
WITH payment_notes_xml AS (
  SELECT
    id AS payment_id,
    unnest(xpath('//NOTE', notes::XML)) AS note_xml
  FROM payments
)
INSERT INTO payment_notes (payment_id, note, type)
SELECT
  payment_id,
  (xpath('//TEXT/text()', note_xml))[1]::text AS note,
  (xpath('//TYPE/text()', note_xml))[1]::text AS type
FROM payment_notes_xml;
Common table expressions (CTEs) allow you to name a query for later use in your SQL. CTEs can be read from (and joined to) as if they were tables in your database.
In this example, we also show the use of unnest and xpath.
The xpath function allows us to extract the text and type from each NOTE node in the notes XML column. The unnest function then expands the given array into individual rows.
The following XML would result in three rows being inserted into the payment_notes table:
<NOTES>
<NOTE>
<TEXT>foo</TEXT>
<TYPE>Internal</TYPE>
</NOTE>
<NOTE>
<TEXT>bar</TEXT>
<TYPE>Internal</TYPE>
</NOTE>
<NOTE>
<TEXT>baz</TEXT>
<TYPE>Public</TYPE>
</NOTE>
</NOTES>
Window functions perform calculations across a set of related rows without collapsing those rows into a single result.
For example, you could find the most recent payment for each user using the row_number window function:
WITH payment_ranks AS (
  SELECT
    id,
    amount,
    user_id,
    paid_at,
    row_number() OVER (PARTITION BY user_id ORDER BY paid_at DESC) AS rank
  FROM payments
)
SELECT * FROM payment_ranks WHERE rank = 1;
Aggregate functions combine values from multiple rows into a single result.
WITH reservation_payment_users AS (
  SELECT
    reservation_id,
    array_agg(user_id) AS all_user_ids
  FROM payments
  GROUP BY reservation_id
)
SELECT * FROM reservation_payment_users;
This example shows how you can select all of the user_ids that paid for a given reservation in one row.
Throughout our transform step, we found the DISTINCT ON feature to be rather helpful, whether we were removing duplicate entries coming from the production database or grabbing the first associated record based on an ORDER BY clause.
SELECT DISTINCT ON (payments.user_id)
  users.id, payments.id
FROM users
LEFT JOIN payments ON payments.user_id = users.id
ORDER BY payments.user_id, payments.paid_at DESC;
This example shows how we can get the latest payment for each user.
Once the transformations are done, generate a dump of your local database, then upload it to S3 (we used the s3-bash tools).
We upload it to S3 so that we can perform the restore using Heroku’s run command, downloading the dump and then restoring it. Running the restore from a dyno is much faster, as the dyno is on the same network as the database itself.
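Generating that dump is a plain pg_dump in the custom format, which is what pg_restore expects (the local database name here is only an example):
pg_dump --format=custom --no-acl --no-owner --file=etl.dmp etl_local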
Using the pg_restore command provides more flexibility than using the built-in heroku pg:restore command.
heroku run -s standard-2x "wget https://s3.amazonaws.com/my-app/dumps/myDumpFile -O tmp/etl.dmp && pg_restore --clean --no-acl --no-owner --verbose --jobs=6 --username=my_user -h my_host --dbname=my_db -p 1234 tmp/etl.dmp"
To find the best performance here, you will want to play with the --jobs argument and the type of dyno you use (-s standard-2x here).
After a few weeks of development we were ready to run the process against the production database, and we helped our client switch over to the shiny new application in just a few hours!