Building funnels with SQL doesn’t need to be hard. It’s often touted as the hardest thing to do in SQL and that you would need hundreds of lines with complex joins and window functions.
This isn’t true, there’s an easier way and it involves some regular expressions and a nice function called
string_agg in BigQuery. This function is also known as
listagg in Redshift or Snowflake. We will use the BigQuery syntax for all examples here.
Say you have a table of events called
course.google_analytics. This table has one row for every event fired off by the user. The first step is to aggregate this table for every session and use the function
string_agg to create a comma separated list of events this user performed in the order that they occurred.
select fullvisitorid, visitstarttime, string_agg(event_category order by hits_hit_number asc) as events_in_session, max(hits_hit_number) as hits_count from course.google_analytics group by fullvisitorid, visitstarttime;
We’ve now created the column
events_in_session which summarises the events that occurred in each session. Each session is defined by a user id (
fullvisitorid) and a session start time (
The next step is to write a simple regular expression to determine which sessions had the events
Article performed in that order.
We will use the function
regexp_contains and the regular expression
Home,Search,Article. In the query below we count the number of sessions where the funnel step defined was completed.
select regexp_contains(events_in_session, 'Home,Search,Article') as funnel_step, count(distinct concat(fullvisitorid, visitstarttime)) as number_of_sessions from session_sequence group by funnel_step;
This is the most simple funnel as it has only one step, we can easily extend this by adding in more steps with more grouping columns as shown:
select regexp_contains(events_in_session, 'Home') as step_1, regexp_contains(events_in_session, 'Home,Search') as step_2, regexp_contains(events_in_session, 'Home,Search,Article') as step_3, count(distinct concat(fullvisitorid, visitstarttime)) as number_of_sessions from session_sequence group by step_1, step_2, step_3;
By representing the problem in this way, we can also answer more complex questions:
- Do we want to count sessions only if they start with these three events?
This is useful if we want to understand how users are typically starting out their sessions on our product.
- Do we care if there are any events in between each funnel step?
Sometimes we might really care if two events occur immediately after each other. Most times, we want to ignore events if they aren’t part of the funnel.
- Do we want to count sessions only if they end with these three events?
This is useful if we want to understand what users are doing immediately before ending their session.
Any time I want to answer questions regarding sequences in structured data, my go-to is to use regular expressions. It simplifies the problem tremendously and it’s much less likely that self-joins and window functions will be needed. Incidently, this method also seems to be more performant when I last tested it on Redshift and Snowflake on a sizeable transaction table compared to other methods using self-joins and window functions.
On Snowflake, there’s a function called
match_recognize which helps with building funnels in SQL and it seems to have a bit more functionality, it might be worth giving it a go too.
If you’re a product manager or marketer and you’d like to learn how to become awesome at SQL analytics to solve real world problems, check out supercooldata.