Customer Analysis

Customer Purchase Behavior Analysis for Retention and Value Optimization.

Business Context

Understanding customer purchase patterns is fundamental to retail success. Some customers make single purchases and never return, while others become loyal repeat buyers. This module analyzes the distribution of purchase frequency to identify customer behavior segments and inform retention strategies.

The Business Problem

Retailers need to understand the relationship between customer purchase frequency and business performance:

  • What percentage of customers are one-time buyers versus repeat customers?
  • How does purchase frequency relate to customer lifetime value?
  • Which customer segments offer the greatest growth opportunities?

Without this analysis, businesses may invest equally in all customers or fail to identify high-potential segments for targeted retention efforts.

Real-World Applications

Customer Retention Strategy

  • Identify the percentage of one-time buyers for targeted reactivation campaigns
  • Segment customers by purchase frequency for differentiated marketing approaches
  • Develop loyalty programs based on actual behavior patterns

Resource Allocation

  • Focus retention efforts on customers showing repeat purchase potential
  • Allocate customer service resources based on customer value segments
  • Optimize marketing spend by targeting high-frequency customer characteristics

Business Performance Monitoring

  • Track changes in purchase frequency distribution over time
  • Monitor the health of customer acquisition versus retention balance
  • Identify shifts in customer behavior that may indicate market changes

This module computes purchase-frequency statistics that can be visualized with the plotting helpers in openretailscience.plots.

DaysBetweenPurchases

Computes the average number of days between purchases per customer.

Attributes:

Name | Type | Description
purchase_dist_s | Series | The average number of days between purchases per customer.
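The calculation can be illustrated with plain pandas. This is a minimal sketch, not the library's code: the sample data is made up, the column names `customer_id` and `transaction_date` are assumed to match the library's defaults, and `groupby(...).diff()` is used in place of the shift-based mask in the source listing. Customers with only one purchase day have no gap to average, so they do not appear in the result.

```python
import pandas as pd

# Hypothetical sample data; column names are assumed, not taken from the library.
df = pd.DataFrame(
    {
        "customer_id": [1, 1, 1, 2, 2, 3],
        "transaction_date": pd.to_datetime(
            [
                "2024-01-01 09:00",
                "2024-01-11 15:30",
                "2024-01-31 08:00",
                "2024-02-01 10:00",
                "2024-02-05 10:00",
                "2024-03-01 12:00",
            ]
        ),
    }
)

# Keep one row per customer per calendar day, ordered by date.
visits = df[["customer_id", "transaction_date"]].copy()
visits["transaction_date"] = visits["transaction_date"].dt.floor("D")
visits = visits.drop_duplicates().sort_values(["customer_id", "transaction_date"])

# Days since the same customer's previous visit; the first visit has no gap (NaN).
visits["diff"] = visits.groupby("customer_id")["transaction_date"].diff().dt.days

# Average gap per customer, ignoring the rows without a previous visit.
avg_days = visits.dropna(subset=["diff"]).groupby("customer_id")["diff"].mean()
print(avg_days)
```

In this sample, customer 1 has gaps of 10 and 20 days (mean 15), customer 2 has a single gap of 4 days, and customer 3 is absent because there is only one visit.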

Source code in openretailscience/analysis/customer.py
class DaysBetweenPurchases:
    """Computes the average number of days between purchases per customer.

    Attributes:
        purchase_dist_s (pd.Series): The average number of days between purchases per customer.
    """

    def __init__(self, df: pd.DataFrame) -> None:
        """Initialize the DaysBetweenPurchases class.

        Args:
            df (pd.DataFrame): A dataframe with the transaction data. The dataframe must have the columns customer_id
                and transaction_date, which must be non-null.

        Raises:
            ValueError: If the dataframe doesn't contain the columns customer_id and transaction_date, or if the
                columns are null.

        """
        cols = ColumnHelper()
        required_cols = [cols.customer_id, cols.transaction_date]
        ensure_data_has_columns(df, required_cols)

        self.purchase_dist_s = self._calculate_days_between_purchases(df)

    @staticmethod
    def _calculate_days_between_purchases(df: pd.DataFrame) -> pd.Series:
        """Calculate the average number of days between purchases per customer.

        Args:
            df (pd.DataFrame): A dataframe with the transaction data. The dataframe must have the columns customer_id
                and transaction_date, which must be non-null.

        Returns:
            pd.Series: The average number of days between purchases per customer.
        """
        cols = ColumnHelper()
        # Keep one row per customer per calendar day so same-day purchases count once
        purchase_dist_df = df[[cols.customer_id, cols.transaction_date]].copy()
        purchase_dist_df[cols.transaction_date] = df[cols.transaction_date].dt.floor("D")
        purchase_dist_df = purchase_dist_df.drop_duplicates().sort_values([cols.customer_id, cols.transaction_date])
        # Gap to the previous row; rows that start a new customer hold a cross-customer gap and are dropped
        purchase_dist_df["diff"] = purchase_dist_df[cols.transaction_date].diff()
        new_cust_mask = purchase_dist_df[cols.customer_id] != purchase_dist_df[cols.customer_id].shift(1)
        purchase_dist_df = purchase_dist_df[~new_cust_mask]
        # Convert the timedeltas to whole days and average per customer
        purchase_dist_df["diff"] = purchase_dist_df["diff"].dt.days
        return purchase_dist_df.groupby(cols.customer_id)["diff"].mean()

    def purchases_percentile(self, percentile: float = 0.5) -> float:
        """Get the average number of days between purchases at a given percentile.

        Args:
            percentile (float): The quantile to evaluate, as a fraction between 0 and 1 (0.5 is the median).

        Returns:
            float: The average number of days between purchases at the given percentile.
        """
        return self.purchase_dist_s.quantile(percentile)

__init__(df)

Initialize the DaysBetweenPurchases class.

Parameters:

Name | Type | Description | Default
df | DataFrame | A dataframe with the transaction data. The dataframe must have the columns customer_id and transaction_date, which must be non-null. | required

Raises:

Type | Description
ValueError | If the dataframe doesn't contain the columns customer_id and transaction_date, or if the columns are null.

Source code in openretailscience/analysis/customer.py
def __init__(self, df: pd.DataFrame) -> None:
    """Initialize the DaysBetweenPurchases class.

    Args:
        df (pd.DataFrame): A dataframe with the transaction data. The dataframe must have the columns customer_id
            and transaction_date, which must be non-null.

    Raises:
        ValueError: If the dataframe doesn't contain the columns customer_id and transaction_date, or if the
            columns are null.

    """
    cols = ColumnHelper()
    required_cols = [cols.customer_id, cols.transaction_date]
    ensure_data_has_columns(df, required_cols)

    self.purchase_dist_s = self._calculate_days_between_purchases(df)

purchases_percentile(percentile=0.5)

Get the average number of days between purchases at a given percentile.

Parameters:

Name | Type | Description | Default
percentile | float | The quantile to evaluate, as a fraction between 0 and 1 (0.5 is the median). | 0.5

Returns:

Name | Type | Description
float | float | The average number of days between purchases at the given percentile.
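Because the method delegates to `Series.quantile`, the `percentile` argument is a fraction between 0 and 1 rather than a number between 0 and 100. A small sketch with made-up per-customer averages (the series below is hypothetical, not library output):

```python
import pandas as pd

# Hypothetical per-customer average gaps, in days.
avg_days = pd.Series([4.0, 10.0, 15.0, 30.0, 60.0], index=[1, 2, 3, 4, 5])

# 0.5 asks for the median customer; 0.9 for the 90th percentile.
median_gap = avg_days.quantile(0.5)
p90_gap = avg_days.quantile(0.9)

print(median_gap)  # 15.0
print(p90_gap)     # 48.0 (linear interpolation between 30 and 60)
```

Passing a value such as 50 instead of 0.5 would make pandas raise an error, since quantiles must lie in the range 0 to 1.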

Source code in openretailscience/analysis/customer.py
def purchases_percentile(self, percentile: float = 0.5) -> float:
    """Get the average number of days between purchases at a given percentile.

    Args:
        percentile (float): The quantile to evaluate, as a fraction between 0 and 1 (0.5 is the median).

    Returns:
        float: The average number of days between purchases at the given percentile.
    """
    return self.purchase_dist_s.quantile(percentile)

PurchasesPerCustomer

Computes the number of purchases per customer.

Attributes:

Name | Type | Description
cust_purchases_s | Series | The number of purchases per customer.
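The count is the number of distinct transaction IDs per customer, so a transaction that spans several rows (for example, one row per line item) is counted once. A minimal pandas sketch with made-up data, assuming the columns are named `customer_id` and `transaction_id`:

```python
import pandas as pd

# Hypothetical line-item data: transaction 101 appears on two rows.
df = pd.DataFrame(
    {
        "customer_id": [1, 1, 1, 2, 3, 3],
        "transaction_id": [101, 101, 102, 201, 301, 302],
    }
)

# Distinct transactions per customer, mirroring the groupby/nunique step in the source listing.
cust_purchases = df.groupby("customer_id")["transaction_id"].nunique()
print(cust_purchases.to_dict())  # {1: 2, 2: 1, 3: 2}
```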

Source code in openretailscience/analysis/customer.py
class PurchasesPerCustomer:
    """Computes the number of purchases per customer.

    Attributes:
        cust_purchases_s (pd.Series): The number of purchases per customer.
    """

    def __init__(self, df: pd.DataFrame) -> None:
        """Initialize the PurchasesPerCustomer class.

        Args:
            df (pd.DataFrame): A dataframe with the transaction data. The dataframe must contain the columns
                customer_id and transaction_id, which must be non-null.

        Raises:
            ValueError: If the dataframe doesn't contain the columns customer_id and transaction_id, or if the columns
                are null.

        """
        cols = ColumnHelper()
        required_cols = [cols.customer_id, cols.transaction_id]
        ensure_data_has_columns(df, required_cols)

        self.cust_purchases_s = df.groupby(cols.customer_id)[cols.transaction_id].nunique()

    def purchases_percentile(self, percentile: float = 0.5) -> float:
        """Get the number of purchases at a given percentile.

        Args:
            percentile (float): The quantile to evaluate, as a fraction between 0 and 1 (0.5 is the median).

        Returns:
            float: The number of purchases at the given percentile.
        """
        return self.cust_purchases_s.quantile(percentile)

    def find_purchase_percentile(self, number_of_purchases: int, comparison: str = "less_than_equal_to") -> float:
        """Find the percentile of the number of purchases.

        Args:
            number_of_purchases (int): The number of purchases to compare each customer's purchase count against.
            comparison (str, optional): The comparison to use. Defaults to "less_than_equal_to". Must be one of
                less_than, less_than_equal_to, equal_to, not_equal_to, greater_than, or greater_than_equal_to.

        Returns:
            float: The fraction of customers (between 0 and 1) whose number of purchases satisfies the comparison.

        Raises:
            ValueError: If comparison is not one of the supported values.
        """
        ops = {
            "less_than": operator.lt,
            "less_than_equal_to": operator.le,
            "equal_to": operator.eq,
            "not_equal_to": operator.ne,
            "greater_than": operator.gt,
            "greater_than_equal_to": operator.ge,
        }

        if comparison not in ops:
            msg = f"Comparison must be one of {', '.join(repr(k) for k in ops)}"
            raise ValueError(msg)

        return len(self.cust_purchases_s[ops[comparison](self.cust_purchases_s, number_of_purchases)]) / len(
            self.cust_purchases_s,
        )

__init__(df)

Initialize the PurchasesPerCustomer class.

Parameters:

Name | Type | Description | Default
df | DataFrame | A dataframe with the transaction data. The dataframe must contain the columns customer_id and transaction_id, which must be non-null. | required

Raises:

Type | Description
ValueError | If the dataframe doesn't contain the columns customer_id and transaction_id, or if the columns are null.

Source code in openretailscience/analysis/customer.py
def __init__(self, df: pd.DataFrame) -> None:
    """Initialize the PurchasesPerCustomer class.

    Args:
        df (pd.DataFrame): A dataframe with the transaction data. The dataframe must contain the columns
            customer_id and transaction_id, which must be non-null.

    Raises:
        ValueError: If the dataframe doesn't contain the columns customer_id and transaction_id, or if the columns
            are null.

    """
    cols = ColumnHelper()
    required_cols = [cols.customer_id, cols.transaction_id]
    ensure_data_has_columns(df, required_cols)

    self.cust_purchases_s = df.groupby(cols.customer_id)[cols.transaction_id].nunique()

find_purchase_percentile(number_of_purchases, comparison='less_than_equal_to')

Find the percentile of the number of purchases.

Parameters:

Name | Type | Description | Default
number_of_purchases | int | The number of purchases to compare each customer's purchase count against. | required
comparison | str | The comparison to use. Must be one of less_than, less_than_equal_to, equal_to, not_equal_to, greater_than, or greater_than_equal_to. | 'less_than_equal_to'

Returns:

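Name | Type | Description
float | float | The fraction of customers (between 0 and 1) whose number of purchases satisfies the comparison.

Raises:

Type | Description
ValueError | If comparison is not one of the supported values.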
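The returned value is a share of customers, which answers questions such as "what fraction of customers bought exactly once?". The sketch below reproduces the idea with plain pandas; the purchase counts are made up and `share_of_customers` is a hypothetical helper, not part of the library.

```python
import operator

import pandas as pd

# Hypothetical purchase counts for eight customers.
cust_purchases = pd.Series([1, 1, 1, 2, 3, 5, 8, 13], name="purchases")


def share_of_customers(s: pd.Series, n: int, op=operator.le) -> float:
    """Fraction of customers whose purchase count satisfies op(count, n)."""
    return len(s[op(s, n)]) / len(s)


one_time = share_of_customers(cust_purchases, 1, operator.eq)  # 3 of 8 customers
repeat = share_of_customers(cust_purchases, 1, operator.gt)    # 5 of 8 customers

print(one_time, repeat)  # 0.375 0.625
```

With the class itself, the same two questions would correspond to the `equal_to` and `greater_than` comparisons.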

Source code in openretailscience/analysis/customer.py
def find_purchase_percentile(self, number_of_purchases: int, comparison: str = "less_than_equal_to") -> float:
    """Find the percentile of the number of purchases.

    Args:
        number_of_purchases (int): The number of purchases to compare each customer's purchase count against.
        comparison (str, optional): The comparison to use. Defaults to "less_than_equal_to". Must be one of
            less_than, less_than_equal_to, equal_to, not_equal_to, greater_than, or greater_than_equal_to.

    Returns:
        float: The fraction of customers (between 0 and 1) whose number of purchases satisfies the comparison.

    Raises:
        ValueError: If comparison is not one of the supported values.
    """
    ops = {
        "less_than": operator.lt,
        "less_than_equal_to": operator.le,
        "equal_to": operator.eq,
        "not_equal_to": operator.ne,
        "greater_than": operator.gt,
        "greater_than_equal_to": operator.ge,
    }

    if comparison not in ops:
        msg = f"Comparison must be one of {', '.join(repr(k) for k in ops)}"
        raise ValueError(msg)

    return len(self.cust_purchases_s[ops[comparison](self.cust_purchases_s, number_of_purchases)]) / len(
        self.cust_purchases_s,
    )

purchases_percentile(percentile=0.5)

Get the number of purchases at a given percentile.

Parameters:

Name | Type | Description | Default
percentile | float | The quantile to evaluate, as a fraction between 0 and 1 (0.5 is the median). | 0.5

Returns:

Name | Type | Description
float | float | The number of purchases at the given percentile.

Source code in openretailscience/analysis/customer.py
def purchases_percentile(self, percentile: float = 0.5) -> float:
    """Get the number of purchases at a given percentile.

    Args:
        percentile (float): The quantile to evaluate, as a fraction between 0 and 1 (0.5 is the median).

    Returns:
        float: The number of purchases at the given percentile.
    """
    return self.cust_purchases_s.quantile(percentile)

TransactionChurn

Computes the churn rate by number of purchases.

Attributes:

Name | Type | Description
purchase_dist_df | DataFrame | The churn rate by number of purchases.
n_unique_customers | int | The number of unique customers in the dataframe.
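The churn logic can be sketched with plain pandas as follows. A purchase counts as churned when it is the customer's last one and it falls more than `churn_period` days before the latest date in the data; purchases inside that final window are left out because it is too early to judge them. The data, the column names, and the 30-day period are assumptions for illustration, and the summary is built with a named aggregation rather than the `value_counts().unstack()` step in the source listing.

```python
import pandas as pd

# Hypothetical purchase days for four customers.
df = pd.DataFrame(
    {
        "customer_id": ["A", "A", "A", "B", "C", "C", "D", "D"],
        "transaction_date": pd.to_datetime(
            [
                "2024-01-10", "2024-02-10", "2024-06-30",
                "2024-01-15",
                "2024-01-20", "2024-03-01",
                "2024-02-01", "2024-06-15",
            ]
        ),
    }
)
churn_period = 30  # days; an assumed value

visits = df.drop_duplicates().sort_values(["customer_id", "transaction_date"])
visits["transaction_number"] = visits.groupby("customer_id").cumcount() + 1

# A purchase is the customer's last one when no later purchase follows it.
visits["last_transaction"] = (
    visits.groupby("customer_id")["transaction_date"].shift(-1).isna()
)
# Only purchases older than the churn window can be judged.
cutoff = visits["transaction_date"].max() - pd.Timedelta(days=churn_period)
visits["before_window"] = visits["transaction_date"] < cutoff
visits["churned"] = visits["last_transaction"] & visits["before_window"]

# Churn rate by purchase number among the purchases that can be judged.
eligible = visits[visits["before_window"]]
summary = eligible.groupby("transaction_number")["churned"].agg(
    churned="sum", total="count"
)
summary["retained"] = summary["total"] - summary["churned"]
summary["churned_pct"] = summary["churned"] / summary["total"]
print(summary)
```

In this sample, one of the four first purchases is a final purchase (customer B), giving a churn rate of 0.25 after the first purchase, and one of the two eligible second purchases is final (customer C), giving 0.5.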

Source code in openretailscience/analysis/customer.py
class TransactionChurn:
    """Computes the churn rate by number of purchases.

    Attributes:
        purchase_dist_df (pd.DataFrame): The churn rate by number of purchases.
        n_unique_customers (int): The number of unique customers in the dataframe.
    """

    def __init__(self, df: pd.DataFrame, churn_period: float) -> None:
        """Initialize the TransactionChurn class.

        Args:
            df (pd.DataFrame): A dataframe with the transaction data. The dataframe must have the columns customer_id
                and transaction_date.
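            churn_period (float): The number of days, counted back from the latest transaction date in the data,
                after which a customer with no further purchases is considered churned.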

        Raises:
            ValueError: If the dataframe doesn't contain the columns customer_id and transaction_date.
        """
        cols = ColumnHelper()
        required_cols = [cols.customer_id, cols.transaction_date]
        ensure_data_has_columns(df, required_cols)

        purchase_dist_df = df[[cols.customer_id, cols.transaction_date]].copy()
        # Truncate the transaction_date to the day
        purchase_dist_df[cols.transaction_date] = df[cols.transaction_date].dt.floor("D")
        purchase_dist_df = purchase_dist_df.drop_duplicates()
        purchase_dist_df = purchase_dist_df.sort_values([cols.customer_id, cols.transaction_date])
        purchase_dist_df["transaction_number"] = purchase_dist_df.groupby(cols.customer_id).cumcount() + 1

        purchase_dist_df["last_transaction"] = (
            purchase_dist_df.groupby(cols.customer_id)[cols.transaction_date].shift(-1).isna()
        )
        purchase_dist_df["transaction_before_churn_window"] = purchase_dist_df[cols.transaction_date] < (
            purchase_dist_df[cols.transaction_date].max() - pd.Timedelta(days=churn_period)
        )
        purchase_dist_df["churned"] = (
            purchase_dist_df["last_transaction"] & purchase_dist_df["transaction_before_churn_window"]
        )

        purchase_dist_df = (
            purchase_dist_df[purchase_dist_df["transaction_before_churn_window"]]
            .groupby(["transaction_number"])["churned"]
            .value_counts()
            .unstack()
        )
        purchase_dist_df.columns = ["retained", "churned"]
        purchase_dist_df["churned_pct"] = purchase_dist_df["churned"].div(purchase_dist_df.sum(axis=1))
        self.purchase_dist_df = purchase_dist_df

        self.n_unique_customers = df[cols.customer_id].nunique()

__init__(df, churn_period)

Initialize the TransactionChurn class.

Parameters:

Name | Type | Description | Default
df | DataFrame | A dataframe with the transaction data. The dataframe must have the columns customer_id and transaction_date. | required
churn_period | float | The number of days, counted back from the latest transaction date in the data, after which a customer with no further purchases is considered churned. | required

Raises:

Type | Description
ValueError | If the dataframe doesn't contain the columns customer_id and transaction_date.

Source code in openretailscience/analysis/customer.py
def __init__(self, df: pd.DataFrame, churn_period: float) -> None:
    """Initialize the TransactionChurn class.

    Args:
        df (pd.DataFrame): A dataframe with the transaction data. The dataframe must have the columns customer_id
            and transaction_date.
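        churn_period (float): The number of days, counted back from the latest transaction date in the data,
            after which a customer with no further purchases is considered churned.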

    Raises:
        ValueError: If the dataframe doesn't contain the columns customer_id and transaction_date.
    """
    cols = ColumnHelper()
    required_cols = [cols.customer_id, cols.transaction_date]
    ensure_data_has_columns(df, required_cols)

    purchase_dist_df = df[[cols.customer_id, cols.transaction_date]].copy()
    # Truncate the transaction_date to the day
    purchase_dist_df[cols.transaction_date] = df[cols.transaction_date].dt.floor("D")
    purchase_dist_df = purchase_dist_df.drop_duplicates()
    purchase_dist_df = purchase_dist_df.sort_values([cols.customer_id, cols.transaction_date])
    purchase_dist_df["transaction_number"] = purchase_dist_df.groupby(cols.customer_id).cumcount() + 1

    purchase_dist_df["last_transaction"] = (
        purchase_dist_df.groupby(cols.customer_id)[cols.transaction_date].shift(-1).isna()
    )
    purchase_dist_df["transaction_before_churn_window"] = purchase_dist_df[cols.transaction_date] < (
        purchase_dist_df[cols.transaction_date].max() - pd.Timedelta(days=churn_period)
    )
    purchase_dist_df["churned"] = (
        purchase_dist_df["last_transaction"] & purchase_dist_df["transaction_before_churn_window"]
    )

    purchase_dist_df = (
        purchase_dist_df[purchase_dist_df["transaction_before_churn_window"]]
        .groupby(["transaction_number"])["churned"]
        .value_counts()
        .unstack()
    )
    purchase_dist_df.columns = ["retained", "churned"]
    purchase_dist_df["churned_pct"] = purchase_dist_df["churned"].div(purchase_dist_df.sum(axis=1))
    self.purchase_dist_df = purchase_dist_df

    self.n_unique_customers = df[cols.customer_id].nunique()