(Dprep 2024) When Do Discounts Matter? An Investigation of Potential Drivers of Discounts Elasticities Across Brands, Categories and Store Formats
Tanetpong (Ned) Choungprayoon
This repository is a partial replication of the complementary repository of my empirical research project conducted as part of my doctoral dissertation. It serves as a reproducible, end-to-end workflow on GitHub, featuring a well-structured and fully automated pipeline for data exploration & cleaning, analysis, and deployment. This repository is submitted as a project for Data Preparation and Workflow Management course at Tilburg University for Spring 2024 instructed by Hannes Datta.
Most quantitative research papers model the effect of price promotions as the effect of changes in the final retail price while consumer behavior research suggests the potential framing phenomena in which customers can evaluate retail price and discounts differently. By taking this behavioral assumption in the sales-response model, we can estimate the discounts’ elasticities apart from price elasticities and investigate its systematic drivers from brand factors, category factors and store formats using second-stage regression. Using replicable simulated data (as I cannot share an actual data), I attempted to show that discounts are not equally effective across brands, categories and store format and I can estimate their differential effects.
The purpose of this repository is to serve as a reproducible, end-to-end workflow on GitHub, featuring a well-structured and fully automated pipeline for data exploration & cleaning, analysis, and deployment. Besides a walkthrough illustrated in this repository, I provide an R code that can generate pdf report of this project.
data
contains df.csv which is simulated data, similar to our data structure (i.e. scanner data), used for the walkthrough and report and Simulate_Data.R illustrating how data is simulated
src
contains R code used for transforming and aggregating data (data-preparation.R), auditing data (audit.R), running first-stage analysis (firststage.R) and second-stage analysis (secondstage.R) and generating report for the project (gen_report.R)
gen
contains data, images and results generated from source codes
img
contains visualization used in the walkthrough
├── data
├── Simulate_Data.R
├── df.csv
├── gen
├──firststage
├──secondstage
├── data-preparation
├── audit
└── report.pdf
└── src
├── analysis
├──firststage.R
├──secondstage.R
├──report.Rmd
├──gen_report.Rmd
├── data-preparation.R
└── audit.R
├── .gitignore
├── README.md
├── makefile
- R
- R Markdown, R script
- Install R required package
install.packages("data.table") install.packages("ggplot2")
- Gnu Make
- Makefile
- Git Bash
- GitHub
- Fork this repository
- Type "make" in the command prompt and run
- Run manually
- ../data/Simulated_Data.R to simulate data for this project
- ../src/data-preparation.R to clean, filter and aggregate data
- ../src/audit.R to audit raw and aggregate data
- ../src/analysis/firststage.R to run first-stage analysis
- ../src/analysis/secondstage.R to run second-stage analysis
- ../src/analysis/gen_report.R to generate final report in pdf format
For this project, we mainly use data.table
for data aggregation and ggplot2
for visualization. Honestly, most of the work involves data transformation and operationalization (i.e., transforming customer scanner data into brand, category and store-format data)
Load required package
library(data.table)
library(ggplot2)
Setting: We simulated scanner data from 300 customers shopping across formats (hypermarket, supermarket and conveniece store) imposing positive relationship between discounts and quantity sold and negative relationship between regular price and quantity sold. For simplicity, we walkthrough showing example of one category (category z). In category z, there are 5 brands available including "Brand A", "Brand B", "Brand C", "Brand D" and "Brand E".
Import data
df <- fread("../data/df.csv")
Variable | Description |
---|---|
cust | Customer id |
week_nr | Week number |
spend | Total money spent |
brand | Brand purchased |
format | Store format |
category | Category z |
regprice | Regular prices (list price) |
discount | Discounts offered |
quantity | Quantity purchased |
holiday | = 1 if week_nr is holiday, otherwise 0 |
As we are interested in aggregate aspect (effectiveness of brands' discounts and potential systematic determinants from brand factor, category factor and store format), we need to transform by aggregating and operationalizing relevant variables.
Generate (market-weighted average) brand variables across store formats
df[, finalspending := regprice - discount]
df[, regpriceperunit := regprice/quantity]
df[, finalpriceperunit := finalspending /quantity]
df[, discountperunit := discount/quantity]
# Calculate weekly sales by brand
weeklysalesbybrand <- df[, .(totalsales = sum(quantity), LL = .N), by = .(week_nr,holiday, brand, format)]
#LL can be calculated by uniqueN(product_id) to count distinct product id
# Merge df_complete with weeklysalesbybrand and calculate weights
df_weightcalc <- merge(df, weeklysalesbybrand, by = c("week_nr","holiday", "brand", "format"))
df_weightcalc[, wp := quantity/totalsales]
# Calculate weighted averages
df_weightcalc[, avgregpriceperunit := regpriceperunit * wp]
df_weightcalc[, avgfinalprice := finalpriceperunit * wp]
df_weightcalc[, avgdiscount := discountperunit * wp]
# Aggregate by week_nr, brand, and format
df_bybrand <- df_weightcalc[, .(totalvolume = sum(quantity),
totalvalue = sum(finalspending),
LL = mean(LL),
avgregprice = sum(avgregpriceperunit),
avgfinalprice = sum(avgfinalprice),
avgdiscount = sum(avgdiscount)), by = .(week_nr,holiday, brand, format)]
rm(weeklysalesbybrand,df_weightcalc)
df_bybrand[, depth := avgdiscount/avgregprice]
Calculate (1) (market-weighted average) competitor brand variables (2) lag variable (3) Gaussian-copula correction term and (4) first difference of variable for estimation across store formats
#Calculate (market-weighted average) related variables of competitors and lag variable
#We need for loop to construct competitor of each brand, specify further
distinctformat<-unique(as.factor(df_bybrand$format))
distinctbrand<-unique(as.factor(df_bybrand$brand))
distinctweek<-unique(as.factor(df_bybrand$week_nr))
nbrand<-length(distinctbrand)
nweek<-length(distinctweek)
nformat <- length(distinctformat)
#Set up function for calculate lag variable
lag_1 <- function(x, k = 1) head(c(rep(NA, k), x), length(x))
#Function for copula correction term following Park and Gupta 2012 see more https://github.com/hannesdatta/marketingtools
make_copula <- function(x) {
if (length(unique(x)) == 1) return(as.numeric(rep(NA, length(x))))
return(ifelse(ecdf(x)(x) == 1, qnorm(1 - .0000001), qnorm(ecdf(x)(x))))
}
#We calculate (market-weighted average) competitors info for each brand
gen_comp_lag_format <- function(df_bybrand, distinctbrand, format, nbrand, nweek) {
results_list <- list()
for (i in 1:length(distinctbrand)) {
# Creating a copy of the relevant subset of df_bybrand
df_own = copy(df_bybrand[brand == distinctbrand[i] & format == format])
df_competitor = df_bybrand[brand != distinctbrand[i] & format == format]
# create copula to mitigate potential endogeneity by format and brand
df_own[, `:=` (
cop_avgregprice = make_copula(avgregprice),
cop_avgdiscount = make_copula(avgdiscount),
cop_avgfinalprice = make_copula(avgfinalprice)
)]
# Calculate weekly sales by competing brand
weeklysalesbycompetingbrand = df_competitor[, .(totalmarketvolume = sum(totalvolume)), by = .(week_nr)]
# Calculate competitor weight and weighted averages
df_competitorweightcalc = df_competitor[weeklysalesbycompetingbrand, on = "week_nr"]
df_competitorweightcalc[, wp := totalvolume / totalmarketvolume]
df_competitorweightcalc[, `:=` (
avgcompLL = LL * wp,
avgcompregprice = avgregprice * wp,
avgcompfinalprice = avgfinalprice * wp,
avgcompdiscount = avgdiscount * wp
)]
# Summarize competitor info
df_competitorinfo = df_competitorweightcalc[, .(
avgcompLL = sum(avgcompLL),
avgcompregprice = sum(avgcompregprice),
avgcompfinalprice = sum(avgcompfinalprice),
avgcompdiscount = sum(avgcompdiscount)
), by = .(week_nr)]
# Merge with own brand data
df_own = df_own[df_competitorinfo, on = "week_nr"]
# Add lagged variables
df_own <- df_own[order(week_nr)] #Make sure that week_nr is ordered correctly before applying the lag
lag_vars = c("totalvolume", "totalvalue", "LL", "avgfinalprice", "avgregprice", "avgdiscount", "avgcompLL", "avgcompfinalprice", "avgcompregprice", "avgcompdiscount")
for (var in lag_vars) {
df_own[, paste0(var, "1") := shift(get(var), 1, type = "lag"), by = .(brand, format)]
}
df_own <- na.omit(df_own)
# Store the processed df_own in the results list
results_list[[i]] <- df_own
}
# Combine all processed df_own into one data.table
return(rbindlist(results_list, use.names = TRUE))
}
df_bybrand_formats <- gen_comp_lag_format(df_bybrand, distinctbrand, format, nbrand, nweek)
# calculate first difference for estimation
df_bybrand_formats[, `:=` (
dtotalvolume = totalvolume - totalvolume1,
davgfinalprice = avgfinalprice - avgfinalprice1,
davgregprice = avgregprice - avgregprice1,
davgdiscount =avgdiscount - avgdiscount1,
dLL = LL - LL1,
davgcompLL = avgcompLL - avgcompLL1,
davgcompfinalprice = avgcompfinalprice - avgcompfinalprice1,
davgcompdiscount = avgcompdiscount - avgcompdiscount1
)]
}
df_bybrand_formats <- gen_comp_lag_format(df_bybrand, distinctbrand, format, nbrand, nweek)
# calculate first difference for estimation
df_bybrand_formats[, `:=` (
dtotalvolume = totalvolume - totalvolume1,
davgfinalprice = avgfinalprice - avgfinalprice1,
davgregprice = avgregprice - avgregprice1,
davgdiscount =avgdiscount - avgdiscount1,
dLL = LL - LL1,
davgcompLL = avgcompLL - avgcompLL1,
davgcompfinalprice = avgcompfinalprice - avgcompfinalprice1,
davgcompdiscount = avgcompdiscount - avgcompdiscount1
)]
Plot relationship between quantity sold and regular prices and discounts by brand
ggplot(df_bybrand_formats[brand == "A",], aes(x = davgdiscount, y = dtotalvolume, color = format)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
theme_minimal() +
labs(title = "Effect of discounts on sales for brand A", x = "Change of discounts", y = "Change of volume sold", color = "Format")
ggplot(df_bybrand_formats[format == "supermarket",], aes(x = davgdiscount, y = dtotalvolume, color = brand)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
theme_minimal() +
labs(title = "Effect of discounts on sales in supermarket",x = "Change of discounts", y = "Change of volume sold", color = "Brand")
- Varying effect of price and discounts across brands and formats
As we are interested in how brand factors, category factors and store format influence discount elasticity (see framework below), we need to calculate relevant variables regarding brand and category.
FrameworkCalculate brand level factors
# Calculate weekly market by week
weeklymarketbyweek <- df[, .(totalformatsales = sum(quantity), totalformatLL = .N), by = .(week_nr, format)]
# Calculate weekly market share by brand
weeklymarketsharebybrand <- df[, .(totalsales = sum(quantity)), by = .(week_nr, brand, format)]
# Merge using data.table join
weeklymarketsharemerge <- weeklymarketsharebybrand[weeklymarketbyweek, on = .(week_nr, format)]
# Calculate brand market share format
brandmarketshareformat <- weeklymarketsharemerge[, .(avgbrandweeksales = mean(totalsales), avgmarketshare = mean(totalsales / totalformatsales)), by = .(brand, format)]
# Filter by brands with discounts
df_bybrand_discount <- df_bybrand[avgdiscount > 0]
# Calculate brand discount depth
branddiscountdepth <- df_bybrand_discount[, .(avgbranddiscount = mean(avgdiscount), avgbranddiscountdepth = mean(avgdiscount / avgregprice)), by = .(brand, format)]
# Filter by units with discounts
df_discount <- df[discountperunit > 0]
# Calculate LL discount
LLdiscount <- df_discount[, .(LLdiscount = .N), by = .(week_nr, brand, format)]
# Merge and calculate discount breadth
LLdiscountmerge <- LLdiscount[df_bybrand_discount, on = .(week_nr, brand, format)]
LLdiscountmerge[, discountbreadth := LLdiscount / LL]
# Calculate brand discount breadth
branddiscountbreadth <- LLdiscountmerge[, .(avgbranddiscountbreath = mean(discountbreadth)), by = .(brand, format)]
# Filter for high discounts
df_highdiscount <- df[discountperunit / regpriceperunit > 0.05]
# Count high discounts
counthighdiscount <- df_highdiscount[, .(SKUdiscount = .N), by = .(week_nr, brand, format)]
counthighdiscount[, weekdisc := 1]
# Calculate brand discount frequency
branddiscountfrequency <- counthighdiscount[, .(frequency = sum(weekdisc)), by = .(brand, format)]
# Merge to create brand characteristics
brandcharacteristic <- merge(merge(merge(brandmarketshareformat, branddiscountdepth, by = c("format", "brand")), branddiscountbreadth, by = c("format", "brand")), branddiscountfrequency, by = c("format", "brand"))
rm(weeklymarketsharebybrand, weeklymarketsharemerge, df_bybrand_discount, LLdiscount, LLdiscountmerge, df_highdiscount, counthighdiscount,brandmarketshareformat, branddiscountdepth,branddiscountbreadth,branddiscountfrequency)
format | brand | Average Discounts | Discount Depth | Discount Breadth | Frequency |
---|---|---|---|---|---|
convenience | A | 2.97 | 0.17 | 0.58 | 113 |
convenience | B | 2.96 | 0.16 | 0.60 | 97 |
convenience | C | 3.01 | 0.18 | 0.61 | 102 |
convenience | D | 3.07 | 0.18 | 0.65 | 104 |
convenience | E | 2.90 | 0.16 | 0.63 | 113 |
hypermarket | A | 2.93 | 0.17 | 0.66 | 108 |
hypermarket | B | 3.17 | 0.18 | 0.61 | 108 |
hypermarket | C | 3.05 | 0.18 | 0.67 | 99 |
hypermarket | D | 3.06 | 0.18 | 0.68 | 111 |
hypermarket | E | 3.21 | 0.19 | 0.66 | 115 |
supermarket | A | 3.23 | 0.20 | 0.66 | 112 |
supermarket | B | 3.11 | 0.19 | 0.67 | 112 |
supermarket | C | 3.14 | 0.19 | 0.65 | 114 |
supermarket | D | 2.94 | 0.17 | 0.64 | 106 |
supermarket | E | 2.94 | 0.17 | 0.58 | 111 |
Calculate category level factors across formats
# Calculate category depth
categorydepth <- df_discount[, .(avgcatediscdept = mean(discountperunit/ regpriceperunit)), by = .(format)]
# Calculate category deal
categorydeal <- df_discount[, .(totaldiscvolume = sum(quantity)), by = .(week_nr, format)]
# Calculate category total
categorytotal <- df[, .(totalvolume = sum(quantity)), by = .(week_nr, format)]
# Calculate category proportion using data.table join
categoryproportion <- categorytotal[categorydeal, on = .(format, week_nr)]
#categoryproportion[, totaldiscvolume := fcoalesce(totaldiscvolume, 0)] in case there is NA
# Calculate category proportion format
categoryproportionformat <- categoryproportion[, .(averagedealprop = mean(totaldiscvolume / totalvolume)), by = .(format)]
# Calculate category breadth week
categorybreadthweek <- df_discount[, .(Ncatediscount = .N), by = .(week_nr, format)]
# Merge and calculate category breadth using data.table join
categorybreadthweekmerge <- categorybreadthweek[weeklymarketbyweek, on = .(week_nr, format)]
categorybreadth <- categorybreadthweekmerge[, .(avgcatediscbreath = mean(Ncatediscount / totalformatLL)), by = .(format)]
# Calculate category competition structure
categorycompstructure <- brandcharacteristic[, .(varmarketshare = var(avgmarketshare)), by = .(format)]
# Combine all category characteristics
categorycharacteristic <- Reduce(function(x, y) merge(x, y, by = "format"), list(categorydepth, categorybreadth, categoryproportionformat, categorycompstructure))
# Optionally, remove intermediate variables
rm(weeklymarketbyweek,categorydeal, categorytotal, categoryproportion, categoryproportionformat, categorydepth, categorybreadthweek, categorybreadthweekmerge, categorybreadth, categorycompstructure, df, df_bybrand,df_discount)
Format | Average Category Discount Depth | Average Category Discount Breadth | Proportion of Discounts | Market Competitiveness |
---|---|---|---|---|
convenience | 0.29 | 0.49 | 0.65 | 0.00 |
hypermarket | 0.25 | 0.50 | 0.66 | 0.00 |
supermarket | 0.30 | 0.51 | 0.66 | 0.00 |
We estimate discount elasticity across brands, categories and formats by employing an error-correction specification (Datta, van Heerde, Dekimpe, & Steenkamp, 2022) as our sales-response model;
- The immediate effect of discounts is captured by
$\beta_{1'i,j,k}$
First-Stage regression
#Run analysis by format
# Function to perform linear regression and extract coefficients
MM_lm <- function(df, selected_brand, selected_format) {
lm_model <- lm(dtotalvolume ~ dLL + davgregprice + davgdiscount + davgcompLL +
davgcompfinalprice + davgcompdiscount + totalvolume1 + LL1 +
avgregprice1 + avgdiscount1 + holiday + cop_avgregprice + cop_avgdiscount,
data = df[brand == selected_brand & format == selected_format])
coeffs <- summary(lm_model)$coefficients
data.table(brand = selected_brand, format = selected_format,
Short_Coef_Discount = coeffs["davgdiscount", 1],
Short_Std_Discount = coeffs["davgdiscount", 2],
Long_Coef_Discount = coeffs["avgdiscount1", 1],
Long_Std_Discount = coeffs["avgdiscount1", 2])
}
# Initialize an empty data.table
df_linear_withdiscount <- data.table(brand = character(), format = character(),
Short_Coef_Discount = numeric(),
Short_Std_Discount = numeric(),
Long_Coef_Discount = numeric(),
Long_Std_Discount = numeric())
# Loop through each brand and format
for (brand in distinctbrand) {
for (format in distinctformat) {
df_linear_withdiscount <- rbindlist(list(df_linear_withdiscount, MM_lm(df_bybrand_formats, brand, format)), use.names = TRUE)
}
}
- To make the elasticity comparable across categories, we control for scale differences by converting this unit effectiveness of discounts into percentage elasticities at mean (Srinivasan et al., 2004)
Calculate elasticities short term effect by normalizing with brand sales and plot
df_elasticities <- df_linear_withdiscount[brandcharacteristic, on = .(brand, format)]
df_elasticities[, `:=` (
meanelasticity = `Short_Coef_Discount` * (avgbranddiscount / avgbrandweeksales) * 100,
sdelasticity = `Short_Std_Discount` * (avgbranddiscount / avgbrandweeksales) * 100
)]
#Merge with category characteristics
#We will finally obtain data from first-stage
df_firststage <- df_elasticities[categorycharacteristic, on = "format"]
# Create a scatter plot for results
df_firststage[, observation := .I]
ggplot(df_firststage, aes(x = observation, y = meanelasticity, color = brand, shape = format)) +
geom_point(size = 4) +
theme_minimal() +
labs(x = "Observation Number", y = "Mean Elasticity", color = "Brand", shape = "Format")
- For this simulated dataset, Brand A seems to have relatively high discount effectiveness over other brands (in supermarket and hypermarket) and most effective in supermarket
To see how brand factors, category factors and store formats influence discount effectiveness, we estimate discounts elasticities at mean
Second-Stage regression
df <- fread("../gen/df_firststage.csv")
#Assume brand C,E is private label
df[, PL := as.integer(brand %in% c('C', 'E'))]
#For consistency we multiplied all ratio to 100 so its easy to inpret to our coefficient
df[, `:=` (
avgmarketshare = avgmarketshare * 100,
avgbranddiscountdepth = avgbranddiscountdepth * 100,
avgbranddiscountbreath = avgbranddiscountbreath * 100,
avgcatediscdept = avgcatediscdept * 100,
avgcatediscbreath = avgcatediscbreath * 100,
averagedealprop = averagedealprop * 100,
varmarketshare = varmarketshare * 100,
format = as.factor(format),
PL = as.factor(PL)
)]
fixedSecondStageLS<- lm(meanelasticity ~ format+avgmarketshare+ avgbranddiscountdepth +avgbranddiscountbreath+ log(frequency) + PL, data = df, weights=(1/sdelasticity))
summary(fixedSecondStageLS)
As we used only one category for this walkthrough, we can't include category factor as there is no variance or it perfectly collinear with store format
Variable | Coefficient |
---|---|
Hypermarket | -19.1804 (12.3323) |
Supermarket | 5.1736 (13.3060) |
BrandDiscountDepth | 5.9182 (5.8236) |
BrandDiscountBreadth | 0.5242 (2.0421) |
log(frequency) | 27.3394 (94.4796) |
Private Label | -8.750 (6.873) |
Constant | -8.5414 (8.3549) |
Note:
*p<0.1; **p<0.05; ***p<0.01
- E.g. supermarket seems to be more effective in term of offering discounts compared to convenience store (and hypermarket)
- E.g. Brand that offers high discounts seem to be more effective
- None of these factors are significant