User Churn Analysis

Codeflix, a streaming video startup, is interested in measuring their user churn rate. Four months into launching Codeflix, management asked to look into subscription churn rates. It’s early on in the business and people are excited to know how the company is doing.

The marketing department is particularly interested in how the churn compares between two segments of users. They provide a dataset containing subscription data for users who were acquired through two distinct channels.

The dataset provided contains one SQL table, subscriptions. Within the table, there are 4 columns:

  1. id - the subscription id
  2. subscription_start - the start date of the subscription
  3. subscription_end - the end date of the subscription
  4. segment - this identifies which segment the subscription owner belongs to
Codeflix requires a minimum subscription length of 31 days, so a user can never start and end their subscription in the same month.

In this project, I focused on answering these questions about their churn:
  1. How many months has the company been operating?
  2. Which months do I have enough information to calculate a churn rate?
  3. What segments of users exist?
  4. What is the overall churn trend since the company started?
  5. Churn rates comparison between user segments?
  6. Which segment of users should the company focus on expanding?

1. Take a look at the first 100 rows of data in the subscriptions table. How many different segments?
SELECT * FROM subscriptions LIMIT 100;
SELECT DISTINCT segment FROM subscriptions; -- 2, segment 87 and 30

2. Determine the range of months of data provided. Which months to calculate churn for?
-- start 
SELECT DISTINCT subscription_start FROM subscriptions ORDER BY 1; -- From 2016-12-01 to 2017-3-30
-- end
SELECT DISTINCT subscription_end FROM subscriptions ORDER BY 1; -- From 2017-01-01 to 2017-03-31 We can calculate churn rate from January 2017.

3. Calculate the churn rate for both segments (87 and 30) over the first 3 months of 2017. To get started, create a temporary table of months.
WITH months AS(
  SELECT '2017-01-01' AS first_day,
  '2017-01-31' AS last_day
  UNION
  SELECT '2017-02-01' AS first_day,
  '2017-02-28' AS last_day
  UNION
  SELECT '2017-03-01' AS first_day,
  '2017-03-31' AS last_day
) SELECT * FROM months;

-- Create a temporary table, cross_join, from subscriptions and the months.
WITH months AS(
  SELECT '2017-01-01' AS first_day,
  '2017-01-31' AS last_day
  UNION
  SELECT '2017-02-01' AS first_day,
  '2017-02-28' AS last_day
  UNION
  SELECT '2017-03-01' AS first_day,
  '2017-03-31' AS last_day
),
    cross_join AS(
  SELECT * FROM subscriptions
  CROSS JOIN
  months
) SELECT * FROM cross_join;

/* Create a temporary table, status, from the cross_join table that I created. This table should contain:
1. id selected from cross_join
2. month as an alias of first_day
3. is_active_87 created using a CASE WHEN to find any users from segment 87 who existed prior to the beginning of the month. This is 1 if true and 0 otherwise.
5. is_active_30 created using a CASE WHEN to find any users from segment 30 who existed prior to the beginning of the month. This is 1 if true and 0 otherwise.

Create a status_aggregate temporary table that is a SUM of the active and canceled subscriptions for each segment, for each month.

The resulting columns should be:
1. sum_active_87
2. sum_active_30
3. sum_canceled_87
4. sum_canceled_30
*/
/*
My Result(during 3 months):
1. sum_active_87 --> 1271
2. sum_active_30 --> 1525
3. sum_canceled_87 --> 476
4. sum_canceled_30 --> 144

Result(by month):
January:
sum_active_87 --> 278
sum_active_30 --> 291
sum_canceled_87 --> 70
sum_canceled_30 --> 22
February:
sum_active_87 --> 462
sum_active_30 --> 518
sum_canceled_87 --> 148
sum_canceled_30 --> 38
March:
sum_active_87 --> 531
sum_active_30 --> 716
sum_canceled_87 --> 258
sum_canceled_30 --> 84 
*/

WITH
months AS(
  SELECT '2017-01-01' AS first_day,
  '2017-01-31' AS last_day
  UNION
  SELECT '2017-02-01' AS first_day,
  '2017-02-28' AS last_day
  UNION
  SELECT '2017-03-01' AS first_day,
  '2017-03-31' AS last_day
),
cross_join AS(
  SELECT * FROM subscriptions
  CROSS JOIN
  months
), 
status AS(
  SELECT id, first_day AS month,
  CASE
    WHEN 
      subscription_start < first_day AND 
      segment == 87 AND 
      (subscription_end > first_day OR subscription_end IS NULL)
    THEN 1 ELSE 0
    END AS is_active_87,
  CASE
    WHEN
      subscription_start < first_day AND 
      segment == 30 AND
      (subscription_end > first_day OR subscription_end IS NULL)
      THEN 1 ELSE 0
    END AS is_active_30,
  CASE
    WHEN
      subscription_end BETWEEN first_day AND last_day
      AND segment == 87
      THEN 1 ELSE 0
    END AS is_canceled_87,
  CASE
    WHEN
      subscription_end BETWEEN first_day AND last_day
      AND segment == 30
      THEN 1 ELSE 0
    END AS is_canceled_30
  FROM cross_join),
status_aggregate AS (
  SELECT
    month,
    SUM(is_active_87) AS sum_active_87,
    SUM(is_active_30) AS sum_active_30,
    SUM(is_canceled_87) AS sum_canceled_87,
    SUM(is_canceled_30) AS sum_canceled_30
  FROM status GROUP BY 1),
churn_rate AS(
  SELECT
    month,
    ROUND(sum_canceled_87*1.0/sum_active_87,2) AS churn_rate_87,
    ROUND(sum_canceled_30*1.0/sum_active_30,2) AS churn_rate_30
  FROM status_aggregate)
SELECT * FROM churn_rate;

4. Calculate the churn rates for the two segments over the three month period. Which segment has a lower churn rate?

/*
Result(by month): segment 30 has lower churn rate than segment 87.
January:
churn_rate_87 = 0.25
churn_rate_30 = 0.08
February:
churn_rate_87 = 0.32
churn_rate_30 = 0.07
March:
churn_rate_87 = 0.49
churn_rate_30 = 0.12
*/