SQL Hack: How to get first value based on time without a JOIN or a Window Function?
This is my favorite trick in SQL!
If you are like me and have been working with data for a while, you have googled “how to get the first value without a join or window function” multiple times in your life. We often hate having to write complex queries to get a seemingly simple answer.
For example, you want to answer questions like…
- Give me the first coupon code the customer used
- Give me the last page a customer visited
- Give me the first invoice amount for the customer
Common ways of answering these questions
- self JOIN on the MIN(timestamp)
- use a ROW_NUMBER() window function
- use FIRST_VALUE() window function
…Yuck. These approaches are slow and messy. 😩
Problems with these approaches…
- SQL query gets really long with multiple sub-queries or CTEs (Common Table Expressions)
- self JOINs and window functions are slow
The BETTER and FASTER WAY
In my day job at Narrator, I have to auto-generate complex queries, meaning that I have to get the first value of multiple columns at the same time.
If the columns were timestamps then this would be EASY
If you had a list of timestamps and you want the first one (based on the timestamp), this would not be a problem since you can just use MIN.
You can do a bunch of these in the same query SELECT and be done.
SELECT
[Group Columns],
MIN( [Time Column]) as first_value,
MAX( [Time Column]) as last_value
FROM table
GROUP by [Group Columns]
What if it’s not a timestamp?
We still take the MIN (or MAX) of the timestamp column, but we do a neat trick to bring along the actual column value we want with it.
SELECT
[Group Columns],
LTRIM(MIN(CONCAT([TIME COLUMN]),{value_column})), [TIME COLUMN]) AS first_value
FROM table
GROUP by [Group Columns]
This takes advantage of MIN/MAX and GROUP BY so you don’t have to JOIN or use a window function! I think this is faster, simpler, and cleaner to read than other methods.
Why does this work?
We concatenate the timestamp with the value of the column we’re looking for. After the warehouse does a normal MIN / MAX on that string we can remove the timestamp, leaving us with the data we want.
PROS
- Fast!
- Works inline (doesn’t require CTEs or self JOINs)
- Works with GROUP BY
CONS
- Kind of annoying to type
How it works
Let’s assume you have an orders
table and you want to find the timestamp of the first order based on category.
SELECT
category,
min(completed_at) as first_order_at
FROM orders
group by category
Now, what if you want the name
of that order?
Well, the trick is to CONCAT the name to the timestamp then compute the MIN then remove the timestamp.
The SQL:
LTRIM(MIN(CONCAT(completed_at), name)),MIN(completed_at))
1. Concatenate the timestamp and the column you want to find the first value for
CONCAT(completed_at, name)
This returns the following rows:
2018-01-01T01:48:02The Best Name
2018-01-01T02:12:06Some Name
2. Find the MIN of that string (the timestamp is at the beginning of the string so it will grab the first value)
MIN(CONCAT(completed_at, name))
This returns the following rows:
2018-01-01T01:48:02The Best Name
4. Remove the Timestamp from the result
LTRIM(MIN(CONCAT(completed_at, name)), MIN(completed_at))
This returns the following rows:
The Best Name
Note that LTRIM, when given a string as a second argument, will remove that string from the front of its first argument.
DONE!
To handle NULL values, use the query below
NULLIF(LTRIM(MIN(CONCAT({time_column}), NVL({value_column},'')), MIN({time_column}) ,'')
Here is the EXACT one I use (at Narrator). It’s a bit uglier but a bit more efficient (saves an aggregation)
NULLIF(SUBSTRING(MIN(CONCAT(DATE_TRUNC('second', {time_column}), NVL({value_column},''))),20, 1000),'')"
I hope you find this as helpful as I did! I have been doing data science for a while and love finding new hacks to make complex data tasks easier. These tricks really come in handy when dealing with event-based data (which is the structure of the data we use at Narrator).
Do you have SQL tricks? Share them here! I’d love to hear them.