Building funnels with SQL doesn’t need to be hard. It’s often touted as the hardest thing to do in SQL and that you would need hundreds of lines with complex joins and window functions.

This isn’t true, there’s an easier way and it involves some regular expressions and a nice function called string_agg in BigQuery. This function is also known as listagg in Redshift or Snowflake. We will use the BigQuery syntax for all examples here.

Say you have a table of events called course.google_analytics. This table has one row for every event fired off by the user. The first step is to aggregate this table for every session and use the function string_agg to create a comma separated list of events this user performed in the order that they occurred.

select
   fullvisitorid,
   visitstarttime,
   string_agg(event_category order by hits_hit_number asc) as events_in_session,
   max(hits_hit_number) as hits_count
from
   course.google_analytics
group by
   fullvisitorid,
   visitstarttime;

Create comma separated events column

We’ve now created the column events_in_session which summarises the events that occurred in each session. Each session is defined by a user id (fullvisitorid) and a session start time (visitstarttime) pair.

The next step is to write a simple regular expression to determine which sessions had the events Home, Search and Article performed in that order.

We will use the function regexp_contains and the regular expression Home,Search,Article. In the query below we count the number of sessions where the funnel step defined was completed.

select
   regexp_contains(events_in_session, 'Home,Search,Article') as funnel_step,
   count(distinct concat(fullvisitorid, visitstarttime)) as number_of_sessions
from
   session_sequence
group by
   funnel_step;

Create comma separated events column

This is the most simple funnel as it has only one step, we can easily extend this by adding in more steps with more grouping columns as shown:

select
   regexp_contains(events_in_session, 'Home') as step_1,
   regexp_contains(events_in_session, 'Home,Search') as step_2,
   regexp_contains(events_in_session, 'Home,Search,Article') as step_3,
   count(distinct concat(fullvisitorid, visitstarttime)) as number_of_sessions
from
   session_sequence
group by
   step_1,
   step_2,
   step_3;

By representing the problem in this way, we can also answer more complex questions:

  • Do we want to count sessions only if they start with these three events?
    • ^Home,Search,Article

This is useful if we want to understand how users are typically starting out their sessions on our product.

  • Do we care if there are any events in between each funnel step?
    • Home.*Search.*Article

Sometimes we might really care if two events occur immediately after each other. Most times, we want to ignore events if they aren’t part of the funnel.

  • Do we want to count sessions only if they end with these three events?
    • Home,Search,Article$

This is useful if we want to understand what users are doing immediately before ending their session.

Any time I want to answer questions regarding sequences in structured data, my go-to is to use regular expressions. It simplifies the problem tremendously and it’s much less likely that self-joins and window functions will be needed. Incidently, this method also seems to be more performant when I last tested it on Redshift and Snowflake on a sizeable transaction table compared to other methods using self-joins and window functions.

On Snowflake, there’s a function called match_recognize which helps with building funnels in SQL and it seems to have a bit more functionality, it might be worth giving it a go too.

If you’re a product manager or marketer and you’d like to learn how to become awesome at SQL analytics to solve real world problems, check out supercooldata.