Many web applications show some sort of data analysis based on the state of the database. These analyses are sometimes used for dashboards, to power charts, graphs, and tables, or in more advanced cases to determine decisions that the application will make (smart apps).
Far too often application developers query the database for the raw data and perform aggregates and other basic analysis in the programming language of choice (aka our “comfort zone”). This not only wastes valuable CPU cycles (especially for larger data sets), but introduces more opportunities for bugs.
In this article we'll learn a few nice sql and postgres features that make the analysis easy to understand and doable in a single query! We'll start with some basics such as joins and aggregates, and then move on to less known functionality such as window functions. Check out all the code in the github repo.
Let's go ahead and create a couple of tables to work with:
CREATE TABLE trail ( id SERIAL PRIMARY KEY, name VARCHAR(255) ); CREATE TABLE ride ( id SERIAL PRIMARY KEY, user_id INTEGER, trail_id INTEGER, started_at TIMESTAMP, completed_at TIMESTAMP );
This gives us two relations that we'll base all of our exercises off of. The
trail table is quite straight forward with just a name and id (primary key). The
ride table is used to store records that represent a person (user_id) completing a ride on a given trail (trail_id). We also know how long the user took to complete the ride given the start and end timestamps.
Let's use postgres to create some test data for our 2 tables:
INSERT INTO trail (name) VALUES ('Dirty Bismark'), ('Betasso'), ('Walker Ranch'), ('Picture Rock'), ('Hall Ranch'); INSERT INTO ride (user_id, trail_id, started_at) SELECT (random() * 9) :: INTEGER + 1 AS user_id, (random() * 4) :: INTEGER + 1 AS trail_id, TIMESTAMP '2014-01-01 00:00:00' + random() * (now() - TIMESTAMP '2014-01-01 00:00:00') AS started_at FROM generate_series(1, 10000); UPDATE ride SET completed_at = (started_at + (60*60 + (random() * 180 * 60) :: INTEGER) * '1 second' :: INTERVAL);
generate_series() is a pretty cool function that belongs to a class of postgres’ set returning functions, which is why it's used in our FROM clause when we create rides. In this case we create 10,000 ride records with a
user_id between 1 and 10, and a
trail_id between 1 and 5.
Then we update each ride, setting the completed_at (which we didn't set during the insert) anywhere between 1 and 3 hours after the started_at.
Let's kick things off by finding out how many times each trail has been ridden each month. Let's aim for our result set to look like this:
|4||Picture Rock||Jan 2014||127|
|3||Walker Ranch||Aug 2015||124|
We can see that information from both of our tables (trail and ride) will be needed here, so we'll probably need a join. Also the times_ridden column show the
count over the month and trail, so we'll need to group by both of those. We can use
date_trunc() on the ride.started_at to remove the day and time information, followed by
to_char() to format the month and year to be human readable.
SELECT trail_id AS id, trail.name AS trail_name, to_char(date_trunc('month', completed_at), 'Mon YYYY') AS month, count(*) AS count FROM ride JOIN trail ON trail_id = trail.id GROUP BY 1, 2, 3 ORDER BY 4 DESC;
And that should do it! Notice we need to group by both the trail_id (or trail.id) and the trail.name. Once
GROUP BY is used in conjunction with our aggregate function (count), everything else that we want to select needs to be squashed or collapsed.
Also notice the ordinal reference used in both
GROUP BY and
ORDER BY. This is purely for convenience, but be careful: if you reorder your selects and forget to change your ordinal references, it's game over.
There is a subtle issue with the previous query. If a trail wasn't ridden during any month the result will not include a row at all. It would be much nicer to include the row with a 0 value for times_ridden. One solution is to approach the query from the
months perspective: first let's generate a set of all months that we would like to consider, then we can join from there to our previous query:
WITH ride_range AS ( SELECT min(completed_at) AS first_ride, max(completed_at) AS last_ride FROM ride ), months AS ( SELECT date_trunc('month', generate_series(ride_range.first_ride, ride_range.last_ride, '1 month')) AS month FROM ride_range ) SELECT trail.id AS trail_id, trail.name AS trail_name, to_char(months.month, 'Mon YYYY') AS month, count(ride.id) AS times_ridden FROM months JOIN trail ON TRUE LEFT JOIN ride ON ride.trail_id = trail.id AND months.month = date_trunc('month', ride.completed_at) GROUP BY 1, 2, 3 ORDER BY 4;
Whoa! A couple of new concepts here. If you haven't seen common table expressions (more commonly referred to as “with queries”) before, this query will look a strange. CTEs allow us to store a temporary relation by name (ride_range) and then use that relation in a subsequent query. Keep in mind you could always use a subquery in place, but CTEs greatly improve code readability as they allow you to think through problems sequentially. This tool can be very powerful; we can now solve some tough problems by breaking it down step by step, massaging the data along the way.
Additionally, you should recognize the generate_series function from earlier. In addition to integers
generate_series() can also be used with timestamps and an interval (1 month). This gives us the full set of months between the 2 dates.
The final query should look pretty similar. It's nearly identical to the previous section, but we select from months first followed by joining to trail and ride. The
LEFT JOIN is pretty important here as we still want a row even if no ride exists for that trail and month.
Now let's see who the top 3 fastest riders are for each trail:
We can accomplish this with 2 new tools:
While tremendously powerful, window functions are one of the less understood features of SQL. They allow you to perform aggregate like functions over a set of rows, but they do not force you to squash or collapse those rows into one. We can use this feature to assign an ordered rank by “partitioning” the data into buckets (one bucket per trail).
DISTINCT lets you find a unique set of values,
DISTINCT ON works a bit differently. It will help to find a distinct row in a grouping while maintaining identity, unlike GROUP BY. Since our leaderboard shouldn't show the same user more than once, even if they have the three fastest times, DISTINCT ON will be our friend.
Ok, let's take a look at the query now:
WITH fastest_ride_per_user AS ( SELECT DISTINCT ON (user_id, trail_id) user_id, trail_id, ride.completed_at - ride.started_at AS ride_time FROM ride ORDER BY user_id, trail_id, ride_time ) , all_ranks as ( SELECT trail_id AS id, user_id, to_char(ride_time, 'HH24:MI:SS') AS ride_time, rank() OVER (PARTITION BY trail_id ORDER BY ride_time) AS rank FROM fastest_ride_per_user ) select * from all_ranks where rank <= 3 ORDER BY trail_id, rank;
The fastest_ride_per_user CTE uses DISTINCT ON to find a single record for each user and trail combination. How does it determine which record to choose if there are more than one per combo? It uses the
ORDER BY clause to choose the smallest
ride_time. Please note that the initial order by columns must include all columns passed to DISTINCT ON. If no additional columns are passed to order by postgres will choose a record at random; no specific order is guaranteed.
The all_ranks CTE is where things get a bit more interesting. As previously mentioned, window functions allow you to perform aggregate like functions over a “group” without the need to collapse or squash those rows together. In this case (
PARTITION BY trail_id ORDER BY ride_time) our “group” (or partition) is the trail_id, and we determine rank based on the smallest ride_time. This will give us a full leaderboard for all trails and users. The last step is to simply limit the results to the top three. This outlines the iterative approach that CTEs give us, and the sql is much easier to read versus sub queries.
What if we wanted to find out the longest consecutive number of days each trail has been ridden? Let's aim for the following result:
This shows us that Walker Ranch trail has the longest streak at 128 consecutive days ridden (by all users), while the Dirty Bismark trail has the shortest streak at 21 days.
Let's start off with a building block and determine all days that a trail has been ridden at least once:
SELECT trail_id, date_trunc('day', completed_at) :: DATE AS day FROM ride GROUP BY 1, 2
Now that we have the unique combos for trail_id and day, we need to find a way to figure out if days are consecutive and group them together. We can accomplish this by subtracting the
row_number() within each trail (partition) from the day:
WITH trail_day_ridden AS ( SELECT trail_id, date_trunc('day', completed_at) :: DATE AS day FROM ride GROUP BY 1, 2 ) SELECT trail_id, day, day - now() :: DATE - row_number() OVER (PARTITION BY trail_id ORDER BY day) AS grouping FROM trail_day_ridden
This is very cool! Consecutive days get the same
Now that we have a grouping key for consecutive days, the remainder of the query becomes simpler. We can calculate the streaks with the count function and
GROUP BY our trail_id and grouping key:
WITH trail_day_ridden AS ( SELECT trail_id, date_trunc('day', completed_at) :: DATE AS day FROM ride GROUP BY 1, 2 ) , consecutive_groups AS ( SELECT trail_id, day, day - now() :: DATE - row_number() OVER (PARTITION BY trail_id ORDER BY day) AS grouping FROM trail_day_ridden ) , all_streak_counts AS ( SELECT trail_id, count(*) streak FROM consecutive_groups GROUP BY trail_id, grouping ) SELECT trail.id as id, trail.name as trail_name, max(streak) as streak FROM all_streak_counts join trail on trail_id = trail.id GROUP BY 1, 2 ORDER BY streak DESC;
And finally we select the
max() streak only (per trail_id) and voila!
Credit goes out to Erwin Brandstetter on this stackoverflow answer. Thanks Erwin!
After learning some of the tools like DISTINCT ON, CTEs, and window functions, you will feel much more comfortable using SQL to manipulate medium to large datasets. While SQL and relational databases aren't necessarily the right tool for “big data” or advanced analysis, they do serve most business applications quite well for most use cases and reporting needs.
I hope that next time you feel the need to pull raw data back from the database and perform some type of manipulation in your comfort zone, you'll first try and do it in a faster, simpler, and less bug prone sql query!