We recently rewrote an entire application for a client. The original had been running in production for over 10 years; it was still in active use when we started the rewrite, and thus held a lot of data. Our client wanted a big bang release, so we needed to transfer all of the data into the new application in a small window of time.
Here are the constraints we were working under:
So we started with a simple script and iterated over a few weeks to reach our goal. To make that happen, it was important to run the ETL every day; automating the process supported continuous feedback and the acceptance of user stories. The rest of this article describes our automated process and what we learned while developing it.
There was a lot of data in the old system, and we quickly realized we would not import all of it. A quick win was to limit the amount of data we exported, either through time ranges or by discarding some tables entirely. In our case we could exclude an entire audit table containing more than 16 million records.
```sh
export ORACLE_HOME=/u01/oracle/product/11.1.0/db_1
export ORACLE_SID=MY_DB_NAME
export TABLES_TO_EXPORT=production.reservation,production.payment,production.user

/u01/oracle/product/11.1.0/db_1/bin/expdp DB_EXPORT_USER/amazingPassword@//dbserver:1521/MY_DB_NAME \
  dumpfile=etl.dmp \
  logfile=etl.dmp.log \
  content=data_only \
  directory=DMPDIR \
  tables=$TABLES_TO_EXPORT \
  "query=production.reservation:\"WHERE creation_time >= '01-JAN-15'\", production.payment:\"WHERE valid = 1\""
```
We ran the above code over `ssh` on the legacy production server. (`DMPDIR` here refers to a directory object that we had to manually create in the Oracle database.)
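Creating that directory object is a one-time step. A sketch of what it might look like (the path and grantee below are illustrative, not taken from our actual setup):

```sql
-- Run once as a privileged Oracle user; path and grantee are examples.
CREATE OR REPLACE DIRECTORY DMPDIR AS '/u01/oracle/myDumps';
GRANT READ, WRITE ON DIRECTORY DMPDIR TO DB_EXPORT_USER;
```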
The only thing worth noting about this step is to make sure you use compression: we saw file transfers complete roughly four times faster with it.
```sh
scp -C john@myserver:/u01/oracle/myDumps/etl.dmp ~/Downloads/myDumpfile.dmp
```
Also make sure this step can run unattended by using SSH keys for authentication.
CentOS is by far the easiest Linux distribution to set up with Oracle, and it's also easy to run in VMware or VirtualBox. Make sure you run it on a beefy machine; an SSD will improve restore and read times when transferring to Postgres later.
We used `impdp` for the restore:
```sh
impdp production/production "DIRECTORY=downloads" "DUMPFILE=myDumpfile.dmp" "TABLE_EXISTS_ACTION=truncate"
```
The `downloads` directory is similar to the `DMPDIR` we used with `expdp`: it points at a shared folder we set up on our VM, backed by a folder on our local Mac OS X machine. Make sure you truncate tables before importing by using the `TABLE_EXISTS_ACTION=truncate` option; this makes it a lot easier to run the import over and over again.
Before we could easily perform all of the data transformations, we needed an identical schema in our local Postgres database.
We tried multiple approaches; one of them was the `ora2pg` script, but we found that solution to be too slow.
What worked really well for us was using a Foreign Data Wrapper to extract the data from Oracle into an identical schema in our local Postgres database.
A Foreign Data Wrapper is not optimal to run complex queries against, but it works great for transferring data into Postgres using simple `INSERT INTO ... SELECT ...` statements.
For this you will need to set up a schema with foreign data tables mapping to the Oracle tables you care about, and a local schema with the same structure. Then you can transfer the data like so:
```sql
INSERT INTO etl.payments (user_id, reservation_id, amount)
SELECT user_id, reservation_id, amount
FROM foreign_data.payments
WHERE reservation_id IN (SELECT id FROM etl.reservation);
```
Here `etl` is the local Postgres schema, and `foreign_data` is the one with the foreign tables. As you can see in the `WHERE` condition, we were able to take advantage of already-imported data to limit the records we imported.
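For reference, the foreign-table setup might be sketched like this, assuming the `oracle_fdw` extension; the server name, credentials, and column list below are illustrative choices, not our client's actual details:

```sql
CREATE EXTENSION oracle_fdw;

-- Point Postgres at the local Oracle VM.
CREATE SERVER legacy_oracle FOREIGN DATA WRAPPER oracle_fdw
  OPTIONS (dbserver '//localhost:1521/MY_DB_NAME');

CREATE USER MAPPING FOR CURRENT_USER SERVER legacy_oracle
  OPTIONS (user 'production', password 'production');

CREATE SCHEMA foreign_data;
CREATE SCHEMA etl;

-- One foreign table per Oracle table we care about...
CREATE FOREIGN TABLE foreign_data.payments (
  id             bigint,
  user_id        bigint,
  reservation_id bigint,
  amount         numeric
) SERVER legacy_oracle OPTIONS (schema 'PRODUCTION', table 'PAYMENT');

-- ...and a local table with the same structure (no data copied yet).
CREATE TABLE etl.payments AS
  SELECT * FROM foreign_data.payments WITH NO DATA;
```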
In our scenario, our client's team had grown familiar with Ruby, so we decided to use Ruby to both structure and test our data transformations. At first there was much debate about which language and framework to use, but in the end the choice did not matter much, because most of our transformation code was written in pure SQL over a single database connection.
We used the `sequel` gem to execute SQL directly against our Postgres database; it also gave us additional flexibility for cases that were too complex to maintain as raw queries.
Because of these choices, testing became easy: we could insert data into our `etl` schema, run our transformation service, then assert that the correct records were inserted into our `public` schema.
Some general guidelines on performing the transformations:
Keep track of each record's origin by storing `x-urn:origin_table:origin_id`-style values in your target tables (see Uniform Resource Names).
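For example, when filling a target table, the origin of each row can be recorded alongside the data; the `legacy_urn` column here is a hypothetical name for illustration:

```sql
-- Hypothetical "legacy_urn" column records where each row came from,
-- making it easy to trace an imported record back to the old system.
INSERT INTO payments (user_id, amount, legacy_urn)
SELECT user_id, amount, 'x-urn:payment:' || id
FROM etl.payments;
```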
Before we dive into some of the useful SQL techniques we used, it's worth mentioning some of our first steps in transforming the data. There were a lot of tables in the old system, and much of that data still exists in the new system. It was intimidating at first, so we had to find a good starting point.
First we tried to tackle the easiest tables and work towards the hardest ones, those with more relations. This approach proved difficult, because many of the easier tables were shown on pages centered around some of the harder tables. This meant testing and accepting these ETL stories could not be done until we had more data.
After struggling through that approach, we took a step back and decided to do the exact opposite: we started with the most central table in the system. For our first iteration of the import, we associated it with “dummy” records, then iterated through the relationships until we no longer needed the “dummy” entries.
This allowed us to immediately and continuously see the effects of our transformations in our application.
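A sketch of the idea, with hypothetical table and column names: first point every imported row at a placeholder record, then swap in the real relationship in a later iteration.

```sql
-- Iteration 1: a placeholder user satisfies the NOT NULL foreign key.
INSERT INTO users (id, name) VALUES (-1, 'ETL placeholder');

INSERT INTO reservations (id, user_id, created_at)
SELECT id, -1, creation_time
FROM etl.reservation;

-- Later iteration, once real users have been imported:
UPDATE reservations r
SET user_id = e.user_id
FROM etl.reservation e
WHERE e.id = r.id
  AND r.user_id = -1;
```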
Now onto some Postgres/SQL features we found most useful during this process!
This feature is great for breaking the problem down into easily understood chunks, which makes queries easier to follow. For example:
```sql
WITH payment_notes_xml AS (
  SELECT id AS payment_id,
         unnest(xpath('//NOTE', notes::xml)) AS note_xml
  FROM payments
)
INSERT INTO payment_notes (payment_id, note, type)
SELECT payment_id,
       (xpath('//TEXT/text()', note_xml))[1]::text AS note,
       (xpath('//TYPE/text()', note_xml))[1]::text AS type
FROM payment_notes_xml;
```
Common table expressions (CTEs) allow you to name a query for later use in your SQL. CTEs can be read from (and joined to) as if they were tables in your database.
In this example, we also show the use of `unnest` and `xpath`. The `xpath` function allows us to extract the text and type from each `NOTE` node in the `notes` XML column. The `unnest` function then expands the given array into individual rows.
The following XML would result in three rows being inserted into the `payment_notes` table:
```xml
<NOTES>
  <NOTE>
    <TEXT>foo</TEXT>
    <TYPE>Internal</TYPE>
  </NOTE>
  <NOTE>
    <TEXT>bar</TEXT>
    <TYPE>Internal</TYPE>
  </NOTE>
  <NOTE>
    <TEXT>baz</TEXT>
    <TYPE>Public</TYPE>
  </NOTE>
</NOTES>
```
Window functions perform a calculation across a set of rows related to the current row, without collapsing those rows into a single result the way aggregates do.
For example, you could find the most recent payment for each user using the `row_number` window function:
```sql
WITH payment_ranks AS (
  SELECT id, amount, user_id, paid_at,
         row_number() OVER (PARTITION BY user_id ORDER BY paid_at DESC) AS rank
  FROM payments
)
SELECT *
FROM payment_ranks
WHERE rank = 1;
```
Aggregate functions can be used to combine multiple rows into a single result.
```sql
WITH reservation_payment_users AS (
  SELECT reservation_id, array_agg(user_id) AS all_user_ids
  FROM payments
  GROUP BY reservation_id
)
SELECT * FROM reservation_payment_users;
```
This example shows how you can select, in a single row, the ids of all users that paid for a given reservation.
Throughout our transform step, we found the `DISTINCT ON` feature rather helpful, whether we were removing duplicate entries coming from the production database or grabbing the first associated record based on an `ORDER BY` clause.
```sql
SELECT DISTINCT ON (payments.user_id) users.id, payments.id
FROM users
LEFT JOIN payments ON payments.user_id = users.id
ORDER BY payments.user_id, payments.paid_at DESC;
```
This example shows how we can get the latest payment for each user.
Once the transformations are done, generate a dump of your local database, then upload it to S3 (we used the s3-bash tools).
We upload the dump to S3 so that we can perform the restore using Heroku's `run` command, downloading the dump and then restoring it. Running the restore from a dyno is much faster, as the dyno is on the same network as the database itself. The `run` command also provides more flexibility than the `heroku pg:restore` command.
```sh
heroku run -s standard-2x "wget https://s3.amazonaws.com/my-app/dumps/myDumpFile -O tmp/etl.dmp && pg_restore --clean --no-acl --no-owner --verbose --jobs=6 --username=my_user -h my_host --dbname=my_db -p 1234 tmp/etl.dmp"
```
To find the best performance here, you will want to play with the `--jobs` argument and the type of dyno you use (`-s standard-2x` in the example above).
After a few weeks of development we were ready to run the process against the production database, and we helped our client switch over to the shiny new application in just a few hours!