Despite having done it countless times, I regularly forget how to build a cohort analysis with Python and pandas. I’ve decided it’s a good idea to finally write it out, step by step, so I can refer back to this post later on. Hopefully others find it useful as well.
I’ll start by walking through what cohort analysis is and why it’s commonly used in startups and other growth businesses. Then, we’ll create one from a standard purchase dataset.
What is cohort analysis?
A cohort is a group of users who share something in common, be it their sign-up date, first purchase month, birth date, acquisition channel, etc. Cohort analysis is the method by which these groups are tracked over time, helping you spot trends, understand repeat behaviors (purchases, engagement, amount spent, etc.), and monitor your customer and revenue retention.
It’s common for cohorts to be created based on a customer’s first usage of the platform, where "usage" is dependent on your business’ key metrics. For Uber or Lyft, usage would be booking a trip through one of their apps. For GrubHub, it’s ordering some food. For AirBnB, it’s booking a stay.
With these companies, a purchase is at their core, be it taking a trip or ordering dinner; their revenues are tied to their users’ purchase behavior.
In others, a purchase is not central to the business model and the business is more interested in "engagement" with the platform. Facebook and Twitter are examples of this: are you visiting their sites every day? Are you performing some action on them, maybe a "like" on Facebook or a "favorite" on a tweet? [1]
When building a cohort analysis, it’s important to consider how the event or interaction you’re tracking relates to your business model.
Why is it valuable?
Cohort analysis can be helpful when it comes to understanding your business’ health and "stickiness", the loyalty of your customers. Stickiness is critical since it’s far cheaper and easier to keep a current customer than to acquire a new one. For startups, it’s also a key indicator of product-market fit.
Additionally, your product evolves over time. New features are added and removed, the design changes, etc. Observing individual groups over time is a starting point to understanding how these changes affect user behavior.
It’s also a good way to visualize your user retention/churn and to form a basic understanding of your users’ lifetime value.
An example
Imagine we have a dataset like the one below (you can find it here):
   OrderId   OrderDate  UserId  TotalCharges CommonId  PupId  PickupDate
0      262  2009-01-11      47         50.67    TRQKD      2  2009-01-12
1      278  2009-01-20      47         26.60    4HH2S      3  2009-01-20
2      294  2009-02-03      47         38.71    3TRDC      2  2009-02-04
3      301  2009-02-06      47         53.38    NGAZJ      2  2009-02-09
4      302  2009-02-06      47         14.28    FFYHD      2  2009-02-09
Pretty standard purchase data with IDs for the order and user, as well as the order date and purchase amount.
We want to go from the data above to something like this:
Here’s how we get there.
Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

# display and plotting defaults
pd.set_option('display.max_columns', 50)
mpl.rcParams['lines.linewidth'] = 2

%matplotlib inline

df = pd.read_excel('/Users/gjreda/Dropbox/datasets/relay-foods.xlsx')
df.head(3)
   OrderId   OrderDate  UserId  TotalCharges CommonId  PupId  PickupDate
0      262  2009-01-11      47         50.67    TRQKD      2  2009-01-12
1      278  2009-01-20      47         26.60    4HH2S      3  2009-01-20
2      294  2009-02-03      47         38.71    3TRDC      2  2009-02-04
1. Create a period column based on the OrderDate
Since we're doing monthly cohorts, we'll be looking at the total monthly behavior of our users. Therefore, we don't want granular OrderDate data (right now).
df['OrderPeriod'] = df.OrderDate.apply(lambda x: x.strftime('%Y-%m'))
df.head()
   OrderId   OrderDate  UserId  TotalCharges CommonId  PupId  PickupDate OrderPeriod
0      262  2009-01-11      47         50.67    TRQKD      2  2009-01-12     2009-01
1      278  2009-01-20      47         26.60    4HH2S      3  2009-01-20     2009-01
2      294  2009-02-03      47         38.71    3TRDC      2  2009-02-04     2009-02
3      301  2009-02-06      47         53.38    NGAZJ      2  2009-02-09     2009-02
4      302  2009-02-06      47         14.28    FFYHD      2  2009-02-09     2009-02
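For reference, here's a minimal sketch of two other ways to build the period column. The three-row `orders` frame is made up for illustration and is not the real dataset; `OrderMonth` is a hypothetical extra column. The `.dt` accessor should give the same strings as the `apply` call above, while `to_period('M')` keeps real calendar semantics, which can be handy if you later need to subtract months.

```python
import pandas as pd

# Hypothetical three-row sample; values are made up for illustration.
orders = pd.DataFrame({
    'UserId': [47, 47, 95],
    'OrderDate': pd.to_datetime(['2009-01-11', '2009-02-03', '2009-02-06']),
})

# Vectorized formatting: year-month strings, one per row.
orders['OrderPeriod'] = orders['OrderDate'].dt.strftime('%Y-%m')

# Period dtype: still "year-month", but pandas knows it is a calendar month.
orders['OrderMonth'] = orders['OrderDate'].dt.to_period('M')

print(orders)
```

Strings are enough for everything in this post, so treat the Period column as optional.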
2. Determine the user's cohort group (based on their first order)
Create a new column called CohortGroup, which is the year and month in which the user's first purchase occurred.
df.set_index('UserId', inplace=True)

# the earliest OrderDate per user, aligned back to every row via the UserId index
df['CohortGroup'] = df.groupby(level=0)['OrderDate'].min().apply(lambda x: x.strftime('%Y-%m'))
df.reset_index(inplace=True)
df.head()
   UserId  OrderId   OrderDate  TotalCharges CommonId  PupId  PickupDate OrderPeriod CohortGroup
0      47      262  2009-01-11         50.67    TRQKD      2  2009-01-12     2009-01     2009-01
1      47      278  2009-01-20         26.60    4HH2S      3  2009-01-20     2009-01     2009-01
2      47      294  2009-02-03         38.71    3TRDC      2  2009-02-04     2009-02     2009-01
3      47      301  2009-02-06         53.38    NGAZJ      2  2009-02-09     2009-02     2009-01
4      47      302  2009-02-06         14.28    FFYHD      2  2009-02-09     2009-02     2009-01
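If the set_index / reset_index dance feels roundabout, the same idea can be sketched with `transform`, which returns one value per original row. This is an alternative I'm assuming is equivalent for this purpose, shown on a made-up four-row frame rather than the real data.

```python
import pandas as pd

# Hypothetical sample; values are made up for illustration.
orders = pd.DataFrame({
    'UserId': [47, 47, 95, 95],
    'OrderDate': pd.to_datetime(
        ['2009-01-11', '2009-02-03', '2009-02-06', '2009-03-01']),
})

# transform('min') broadcasts each user's earliest order date to all of their rows,
# so no index juggling is needed.
first_order = orders.groupby('UserId')['OrderDate'].transform('min')
orders['CohortGroup'] = first_order.dt.strftime('%Y-%m')

print(orders)
```

User 47 lands in the 2009-01 cohort and user 95 in 2009-02, regardless of when their later orders happened.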
3. Rollup data by CohortGroup & OrderPeriod
Since we're looking at monthly cohorts, we need to aggregate users, orders, and amount spent by the CohortGroup within the month (OrderPeriod).
grouped = df.groupby(['CohortGroup', 'OrderPeriod'])

# count the unique users, orders, and total revenue per Group + Period
cohorts = grouped.agg({'OrderId': 'nunique',
                       'UserId': 'nunique',
                       'TotalCharges': 'sum'})

# make the column names more meaningful
cohorts.rename(columns={'OrderId': 'TotalOrders',
                        'UserId': 'TotalUsers'}, inplace=True)
cohorts.head()
                         TotalOrders  TotalUsers  TotalCharges
CohortGroup OrderPeriod
2009-01     2009-01               30          22      1850.255
            2009-02               25           8      1351.065
            2009-03               26          10      1357.360
            2009-04               28           9      1604.500
            2009-05               26          10      1575.625
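As a side note, pandas' named aggregation can do the aggregation and the renaming in one call. Here's a rough sketch on a made-up four-row frame (the rows and the extra user 95 are hypothetical, chosen so the numbers are easy to check by hand).

```python
import pandas as pd

# Hypothetical sample; values are made up for illustration.
orders = pd.DataFrame({
    'UserId': [47, 47, 95, 47],
    'OrderId': [262, 278, 300, 294],
    'TotalCharges': [50.67, 26.60, 10.00, 38.71],
    'CohortGroup': ['2009-01', '2009-01', '2009-01', '2009-01'],
    'OrderPeriod': ['2009-01', '2009-01', '2009-01', '2009-02'],
})

# Each keyword is an output column: (input column, aggregation).
cohorts = orders.groupby(['CohortGroup', 'OrderPeriod']).agg(
    TotalUsers=('UserId', 'nunique'),
    TotalOrders=('OrderId', 'nunique'),
    TotalCharges=('TotalCharges', 'sum'),
)

print(cohorts)
```

In the first period there are two distinct users and three orders; in the second, one user and one order.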
4. Label the CohortPeriod for each CohortGroup
We want to look at how each cohort has behaved in the months following their first purchase, so we'll need to index each cohort to their first purchase month. For example, CohortPeriod = 1 will be the cohort's first month, CohortPeriod = 2 is their second, and so on.
This allows us to compare cohorts across various stages of their lifetime.
def cohort_period(df):
    """
    Creates a `CohortPeriod` column, which is the Nth period based on the user's first purchase.

    Example
    -------
    Say you want to get the 3rd month for every user:
        df.sort_values(['UserId', 'OrderTime'], inplace=True)
        df = df.groupby('UserId').apply(cohort_period)
        df[df.CohortPeriod == 3]
    """
    # number the rows of this group 1, 2, 3, ... in their existing order
    df['CohortPeriod'] = np.arange(len(df)) + 1
    return df

# group_keys=False keeps the existing (CohortGroup, OrderPeriod) index as-is
cohorts = cohorts.groupby(level=0, group_keys=False).apply(cohort_period)
cohorts.head()
                         TotalOrders  TotalUsers  TotalCharges  CohortPeriod
CohortGroup OrderPeriod
2009-01     2009-01               30          22      1850.255             1
            2009-02               25           8      1351.065             2
            2009-03               26          10      1357.360             3
            2009-04               28           9      1604.500             4
            2009-05               26          10      1575.625             5
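One thing worth keeping in mind: numbering rows like this counts the periods in which a cohort had orders, so it only matches calendar months if no cohort skips a month. Below is a small sketch, on made-up data where the 2009-01 cohort has no orders in 2009-03, comparing the row-count label (via `cumcount`, which I'm assuming is equivalent to the `np.arange` approach when rows are sorted by OrderPeriod) with a calendar-based label. `CalendarPeriod` is a hypothetical column name, not something used elsewhere in this post.

```python
import pandas as pd

# Hypothetical aggregated data: the 2009-01 cohort has no row for 2009-03.
cohorts = pd.DataFrame({
    'CohortGroup': ['2009-01', '2009-01', '2009-01', '2009-02'],
    'OrderPeriod': ['2009-01', '2009-02', '2009-04', '2009-02'],
    'TotalUsers': [22, 8, 9, 15],
}).set_index(['CohortGroup', 'OrderPeriod'])

# Row-count label: 1, 2, 3, ... within each CohortGroup.
cohorts['CohortPeriod'] = cohorts.groupby(level=0).cumcount() + 1

# Calendar label: whole months between cohort month and order month, plus 1.
group = pd.PeriodIndex(cohorts.index.get_level_values('CohortGroup'), freq='M')
period = pd.PeriodIndex(cohorts.index.get_level_values('OrderPeriod'), freq='M')
months = (period.year - group.year) * 12 + (period.month - group.month) + 1
cohorts['CalendarPeriod'] = months.to_numpy()

print(cohorts)
```

For the 2009-04 row, the row-count label is 3 while the calendar label is 4. With a dataset where every cohort orders every month, the two agree.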
5. Make sure we did all that right
Let's test data points from the original DataFrame against their corresponding values in the new cohorts DataFrame to make sure all our data transformations worked as expected. As long as none of these raise an exception, we're good.
# first month of the 2009-01 cohort
x = df[(df.CohortGroup == '2009-01') & (df.OrderPeriod == '2009-01')]
y = cohorts.loc[('2009-01', '2009-01')]

assert(x['UserId'].nunique() == y['TotalUsers'])
assert(x['TotalCharges'].sum().round(2) == y['TotalCharges'].round(2))
assert(x['OrderId'].nunique() == y['TotalOrders'])

# a later month of the 2009-01 cohort
x = df[(df.CohortGroup == '2009-01') & (df.OrderPeriod == '2009-09')]
y = cohorts.loc[('2009-01', '2009-09')]

assert(x['UserId'].nunique() == y['TotalUsers'])
assert(x['TotalCharges'].sum().round(2) == y['TotalCharges'].round(2))
assert(x['OrderId'].nunique() == y['TotalOrders'])

# a different cohort
x = df[(df.CohortGroup == '2009-05') & (df.OrderPeriod == '2009-09')]
y = cohorts.loc[('2009-05', '2009-09')]

assert(x['UserId'].nunique() == y['TotalUsers'])
assert(x['TotalCharges'].sum().round(2) == y['TotalCharges'].round(2))
assert(x['OrderId'].nunique() == y['TotalOrders'])
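The three checks above repeat the same pattern, so they could be folded into a helper and run over every cell. Here's a rough sketch; `check_cell` is a hypothetical name, and the sample frame is made up so the snippet runs on its own.

```python
import pandas as pd

def check_cell(df, cohorts, group, period):
    """Assert that one (CohortGroup, OrderPeriod) row of `cohorts` matches the raw orders."""
    raw = df[(df['CohortGroup'] == group) & (df['OrderPeriod'] == period)]
    agg = cohorts.loc[(group, period)]
    assert raw['UserId'].nunique() == agg['TotalUsers']
    assert raw['OrderId'].nunique() == agg['TotalOrders']
    assert round(raw['TotalCharges'].sum(), 2) == round(agg['TotalCharges'], 2)

# Hypothetical sample data, made up for illustration.
df = pd.DataFrame({
    'UserId': [47, 47, 95, 47],
    'OrderId': [262, 278, 300, 294],
    'TotalCharges': [50.67, 26.60, 10.00, 38.71],
    'CohortGroup': ['2009-01', '2009-01', '2009-01', '2009-01'],
    'OrderPeriod': ['2009-01', '2009-01', '2009-01', '2009-02'],
})
cohorts = df.groupby(['CohortGroup', 'OrderPeriod']).agg(
    TotalUsers=('UserId', 'nunique'),
    TotalOrders=('OrderId', 'nunique'),
    TotalCharges=('TotalCharges', 'sum'),
)

# Check every cell instead of a hand-picked few.
for group, period in cohorts.index:
    check_cell(df, cohorts, group, period)
```

As before, silence means the aggregation lines up with the raw data.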
User Retention by Cohort Group
We want to look at the percentage change of each CohortGroup over time, not the absolute change.
To do this, we'll first need to create a pandas Series containing each CohortGroup and its size.
# reindex the DataFrame
cohorts.reset_index(inplace=True)
cohorts.set_index(['CohortGroup', 'CohortPeriod'], inplace=True)

# create a Series holding the total size of each CohortGroup
cohort_group_size = cohorts['TotalUsers'].groupby(level=0).first()
cohort_group_size.head()
CohortGroup
2009-01 22
2009-02 15
2009-03 13
2009-04 39
2009-05 50
Name: TotalUsers, dtype: int64
Now, we'll need to divide the TotalUsers values in cohorts by cohort_group_size. Since DataFrame operations are performed based on the indices of the objects, we'll use unstack on our cohorts DataFrame to create a matrix where each column represents a CohortGroup and each row is the CohortPeriod corresponding to that group.
To illustrate what unstack does, recall the first five TotalUsers values:
cohorts['TotalUsers'].head()
CohortGroup CohortPeriod
2009-01 1 22
2 8
3 10
4 9
5 10
Name: TotalUsers, dtype: int64
cohorts['TotalUsers'].unstack(0).head()
CohortGroup   2009-01  2009-02  2009-03  2009-04  2009-05  2009-06  2009-07  2009-08  2009-09  2009-10  2009-11  2009-12  2010-01  2010-02  2010-03
CohortPeriod
1                  22       15       13       39       50       32       50       31       37       54      130       65       95      100       24
2                   8        3        4       13       13       15       23       11       15       17       32       17       50       19      NaN
3                  10        5        5       10       12        9       13        9       14       12       26       18       26      NaN      NaN
4                   9        1        4       13        5        6       10        7        8       13       29        7      NaN      NaN      NaN
5                  10        4        1        6        4        7       11        6       13       13       13      NaN      NaN      NaN      NaN
Now, we can utilize broadcasting to divide each column by the corresponding cohort_group_size.
The resulting DataFrame, user_retention, contains the percentage of users from the cohort purchasing within the given period. For instance, 38.5% of users in the 2009-03 cohort purchased again in month 3 (which would be May 2009).
user_retention = cohorts['TotalUsers'].unstack(0).divide(cohort_group_size, axis=1)
user_retention.head(10)
CohortGroup    2009-01   2009-02   2009-03   2009-04   2009-05   2009-06   2009-07   2009-08   2009-09   2009-10   2009-11   2009-12   2010-01   2010-02   2010-03
CohortPeriod
1             1.000000  1.000000  1.000000  1.000000      1.00   1.00000      1.00  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000      1.00         1
2             0.363636  0.200000  0.307692  0.333333      0.26   0.46875      0.46  0.354839  0.405405  0.314815  0.246154  0.261538  0.526316      0.19       NaN
3             0.454545  0.333333  0.384615  0.256410      0.24   0.28125      0.26  0.290323  0.378378  0.222222  0.200000  0.276923  0.273684       NaN       NaN
4             0.409091  0.066667  0.307692  0.333333      0.10   0.18750      0.20  0.225806  0.216216  0.240741  0.223077  0.107692       NaN       NaN       NaN
5             0.454545  0.266667  0.076923  0.153846      0.08   0.21875      0.22  0.193548  0.351351  0.240741  0.100000       NaN       NaN       NaN       NaN
6             0.363636  0.266667  0.153846  0.179487      0.12   0.15625      0.20  0.258065  0.243243  0.129630       NaN       NaN       NaN       NaN       NaN
7             0.363636  0.266667  0.153846  0.102564      0.06   0.09375      0.22  0.129032  0.216216       NaN       NaN       NaN       NaN       NaN       NaN
8             0.318182  0.333333  0.230769  0.153846      0.10   0.09375      0.14  0.129032       NaN       NaN       NaN       NaN       NaN       NaN       NaN
9             0.318182  0.333333  0.153846  0.051282      0.10   0.31250      0.14       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN
10            0.318182  0.266667  0.076923  0.102564      0.08   0.09375       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN
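To make the alignment in that division concrete, here's a tiny sketch using only the first two periods of the 2009-01 and 2009-02 cohorts from the tables above. With `axis=1`, pandas matches the Series index (CohortGroup) against the column labels, so each column is divided by its own cohort's size.

```python
import pandas as pd

# First two periods of two cohorts, values taken from the tables above.
index = pd.MultiIndex.from_tuples(
    [('2009-01', 1), ('2009-01', 2), ('2009-02', 1), ('2009-02', 2)],
    names=['CohortGroup', 'CohortPeriod'],
)
total_users = pd.Series([22, 8, 15, 3], index=index, name='TotalUsers')

# Cohort size = users in the cohort's first period.
cohort_group_size = total_users.groupby(level=0).first()

# Rows: CohortPeriod, columns: CohortGroup.
wide = total_users.unstack(0)

# Column '2009-01' is divided by 22, column '2009-02' by 15.
user_retention = wide.divide(cohort_group_size, axis=1)

print(user_retention)
```

Period 1 comes out as 1.0 for both cohorts, and period 2 is 8/22 and 3/15 respectively, matching the larger table.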
Finally, we can plot the cohorts over time in an effort to spot behavioral differences or similarities. Two common cohort charts are line graphs and heatmaps, both of which are shown below.
Notice that the first period of each cohort is 100%. This is because our cohorts are based on each user's first purchase, meaning everyone in the cohort purchased in month 1.
# line graph: one line per cohort
user_retention[['2009-06', '2009-07', '2009-08']].plot(figsize=(10, 5))
plt.title('Cohorts: User Retention')
plt.xticks(np.arange(1, 12.1, 1))
plt.xlim(1, 12)
plt.ylabel('% of Cohort Purchasing');

# Creating heatmaps in matplotlib is more difficult than it should be.
# Thankfully, Seaborn makes them easy for us.
# http://stanford.edu/~mwaskom/software/seaborn/
import seaborn as sns
sns.set(style='white')

# heatmap: one row per cohort, one column per period; empty cells are masked
plt.figure(figsize=(12, 8))
plt.title('Cohorts: User Retention')
sns.heatmap(user_retention.T, mask=user_retention.T.isnull(), annot=True, fmt='.0%');
Unsurprisingly, we can see from the above chart that fewer users tend to purchase as time goes on.
However, we can also see that the 2009-01 cohort is the strongest, which enables us to ask targeted questions about this cohort compared to others: what other attributes (besides first purchase month) do these users share which might be causing them to stick around? How were the majority of these users acquired? Was there a specific marketing campaign that brought them in? Did they take advantage of a promotion at sign-up? The answers to these questions would inform future marketing and product efforts.
Further work
User retention is only one way of using cohorts to look at your business; we could have also looked at revenue retention. That is, the percentage of each cohort’s month 1 revenue returning in subsequent periods. User retention is important, but we shouldn’t lose sight of the revenue each cohort is bringing in (and how much of it is returning).
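As a starting point, revenue retention can be sketched the same way as user retention, swapping TotalUsers for TotalCharges. In the snippet below the 2009-01 figures come from the tables above, while the 2009-02 figures are made up for illustration; `first_month_revenue` and `revenue_retention` are hypothetical names.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('2009-01', 1), ('2009-01', 2), ('2009-02', 1), ('2009-02', 2)],
    names=['CohortGroup', 'CohortPeriod'],
)
# 2009-01 values are from the post; 2009-02 values are invented for the example.
cohorts = pd.DataFrame(
    {'TotalCharges': [1850.255, 1351.065, 600.0, 150.0]}, index=index)

# Each cohort's month 1 revenue is the baseline.
first_month_revenue = cohorts['TotalCharges'].groupby(level=0).first()

# Share of month 1 revenue that returns in each later period.
revenue_retention = (
    cohorts['TotalCharges'].unstack(0).divide(first_month_revenue, axis=1)
)

print(revenue_retention)
```

Unlike user retention, this ratio can exceed 100% if returning customers spend more than the whole cohort did in its first month.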
Hopefully you’ve found this post useful. If I’ve missed anything, let me know.
Additional Resources
● Cohort Analysis on Wikipedia
● Know Your User Cohorts by Christoph Janz
● The Cohort Analysis by Fred Wilson (Union Square Ventures)
● What exactly is cohort analysis? on Quora
[1] While a purchase might not be at the core of these businesses, purchases still might occur (e.g. "Buy" buttons on tweets are of value to Twitter, but users and engagement are what the platform is about).