Analysis Modules
Analysis Modules
Cohort Analysis
The cohort analysis module provides functionality for analyzing customer retention patterns over time. It helps businesses understand customer behavior by tracking groups of users (cohorts) based on their first interaction and observing their activity over subsequent periods.
Cohort analysis is useful in multiple business applications:
- Customer Retention Analysis: Identifies how long users stay engaged with a product or service.
- Churn Rate Measurement: Helps determine at which stage customers tend to drop off.
- Marketing Performance Evaluation: Measures the long-term impact of marketing campaigns.
- Revenue Analysis: Tracks spending behavior over time to optimize pricing strategies.
- User Engagement Trends: Understands how different user segments behave based on their joining time.
This module calculates cohort tables using various aggregation functions such as nunique, sum, and mean, allowing
flexible analysis of customer data.
The following key metrics are used in the analysis:
- Aggregation Column: Defines the metric to track (e.g., unique customers, total spend).
- Aggregation Function: Determines how values are aggregated (e.g., sum, mean, count).
- Cohort Period: Defines the period granularity (year, quarter, month, week, or day).
- Retention Percentage: Calculates retention rates as a percentage of the first-period cohort.
Example:
import pandas as pd
from openretailscience.analysis.cohort import CohortAnalysis
df = pd.DataFrame({
"transaction_id": range(12),
"customer_id": [1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 5, 4],
"transaction_date": pd.to_datetime([
"2023-01-15", "2023-01-20", "2023-02-05", "2023-02-10",
"2023-03-01", "2023-03-15", "2023-03-20", "2023-04-10",
"2023-04-25", "2023-05-05", "2023-05-20", "2023-06-10",
]),
"unit_spend": [100, 150, 200, 120, 160, 210, 130, 170, 220, 140, 180, 230],
})
cohort = CohortAnalysis(
df=df,
aggregation_column="unit_spend",
agg_func="sum",
period="month",
percentage=True,
)
cohort.df.head()
| min_period_shopped | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 2023-01-01 | 1.00 | 1.00 | 1.00 | 1.00 |
| 2023-02-01 | 0.80 | 1.75 | 0.76 | 0.00 |
| 2023-03-01 | 0.00 | 0.00 | 0.00 | 0.00 |
| 2023-04-01 | 0.00 | 0.00 | 0.00 | 0.00 |
| 2023-05-01 | 1.28 | 1.92 | 0.00 | 0.00 |
Product Association Rules
The product association module implements functionality for generating product association rules, a powerful technique in retail analytics and market basket analysis.
Product Association Analysis (Market Basket Analysis) uncovers hidden relationships in customer purchasing behavior, transforming how retailers approach merchandising, marketing, and operations by revealing which products naturally sell together.
Business Problem:
Retailers lose revenue from missed cross-selling opportunities and poor product placement. Without understanding product associations, stores might place complementary items in different aisles, miss bundling opportunities, or stock out on associated items when promoting a product.
Real-World Applications:
- Strategic Merchandising: Place chips near beer when data shows strong association
- Dynamic Bundle Pricing: Create "Breakfast bundle" (Coffee + Pastry) when uplift shows synergy
- Personalized Recommendations: Power "Customers who bought X also bought Y" suggestions
- Inventory Optimization: Stock pasta sauce when pasta is promoted if association exists
- New Product Placement: Position private label next to associated national brands
Key Metrics Explained:
- Support: Proportion of transactions containing both products (frequency indicator)
- Confidence: Probability of buying B given A was purchased (predictive power)
- Uplift/Lift: How much more likely products are bought together than by chance
- Uplift > 1: Positive association (sell better together)
- Uplift = 1: Independent (no relationship)
- Uplift < 1: Negative association (rarely bought together)
Example:
from openretailscience.analysis.product_association import ProductAssociation
pa = ProductAssociation(
df,
value_col="product_name",
group_col="transaction_id",
)
pa.df.head()
| product_name_1 | product_name_2 | occurrences_1 | occurrences_2 | cooccurrences | support | confidence | uplift |
|---|---|---|---|---|---|---|---|
| 100 Animals Book | 100% Organic Cold-Pressed... | 78 | 78 | 1 | 0.000039 | 0.0128205 | 4.18 |
| 100 Animals Book | 20K Sousaphone | 78 | 81 | 3 | 0.000117 | 0.0384615 | 12.10 |
| 100 Animals Book | 360 Sport 2.0 Boxer Briefs | 78 | 79 | 1 | 0.000039 | 0.0128205 | 4.13 |
| 100 Animals Book | 4-Series 4K UHD | 78 | 82 | 1 | 0.000039 | 0.0128205 | 3.98 |
| 100 Animals Book | 700S Eterna Trumpet | 78 | 71 | 1 | 0.000039 | 0.0128205 | 4.60 |
Cross Shop
Cross Shop analysis reveals how customers navigate between different categories, brands, or store locations, replacing assumptions with data-driven insights about actual purchase patterns. The Venn diagram visualization immediately shows which products customers buy together versus separately.
Business Problem Solved:
Retailers often make incorrect assumptions about customer behavior. They might place baby products far from beer, not realizing these categories have high cross-shopping rates. This analysis reveals the truth about customer purchase patterns.
Real-World Applications:
- Store Layout Optimization: Place frequently cross-shopped categories near each other to reduce friction
- Promotional Strategy: Bundle products from highly cross-shopped categories for better lift
- Category Management: Understand interdependencies and identify opportunity categories
- Multi-Channel Strategy: Analyze cross-shopping between online and physical stores
- Competitive Analysis: Understand customer loyalty across competing brands
Example:
import pandas as pd
from openretailscience.analysis import cross_shop
df = pd.DataFrame({
"customer_id": [1, 2, 3, 4, 5, 5, 6, 9, 7, 7, 8, 9, 5, 8],
"category_name": [
"Electronics", "Clothing", "Home", "Sports", "Clothing", "Electronics", "Electronics",
"Clothing", "Home", "Electronics", "Clothing", "Electronics", "Home", "Home",
],
"unit_spend": [100, 200, 300, 400, 200, 500, 100, 200, 300, 350, 400, 500, 250, 360],
})
cs_customers = cross_shop.CrossShop(
df,
group_1_col="category_name",
group_1_val="Electronics",
group_2_val="Clothing",
group_3_val="Home",
labels=["Electronics", "Clothing", "Home"],
)
cs_customers.plot(
title="Customer Spend Overlap Across Categories",
source_text="Source: OpenRetailScience",
)
Gain Loss
The Gain Loss module (also known as switching analysis) helps analyze changes in customer behavior between two time periods. It breaks down revenue or customer movement between a focus group and a comparison group by:
- New customers: Customers who didn't purchase in period 1 but did in period 2
- Lost customers: Customers who purchased in period 1 but not in period 2
- Increased/decreased spending: Existing customers who changed their spending level
- Switching: Customers who moved between the focus and comparison groups
This module is particularly valuable for:
- Analyzing promotion cannibalization
- Understanding customer migration between brands or categories
- Evaluating the effectiveness of marketing campaigns
- Quantifying the sources of revenue changes
Example:
import pandas as pd
import numpy as np
from openretailscience.analysis.gain_loss import GainLoss
np.random.seed(42)
n_customers = 30
df = pd.DataFrame({
"customer_id": [f"C{i:03d}" for i in range(n_customers)] * 2,
"unit_spend": np.random.randint(10, 100, size=n_customers * 2),
"brand": np.random.choice(["Brand A", "Brand B"], size=n_customers * 2),
"period": ["p1"] * n_customers + ["p2"] * n_customers,
})
gain_loss = GainLoss(
df=df,
p1_index= df["period"] == "p1",
p2_index= df["period"] == "p2",
focus_group_index=df["brand"] == "Brand A",
focus_group_name="Brand A",
comparison_group_index=df["brand"] == "Brand B",
comparison_group_name="Brand B",
)
gain_loss.plot(
title="Brand A vs Brand B: Customer Movement Analysis",
x_label="Revenue Change",
source_text="Source: OpenRetailScience",
move_legend_outside=True,
)
Customer Decision Hierarchy
Customer Decision Hierarchy (CDH) analysis reveals how customers perceive products as substitutes or complements, enabling data-driven range planning decisions. By analyzing actual switching behavior rather than relying on product attributes or manager intuition, CDH identifies which products customers view as interchangeable versus essential variety.
Business Problem Solved:
Retailers struggle with range rationalization: Which products can be delisted without losing customers? When does variety add value versus create confusion? CDH answers these questions by examining customer purchase patterns to identify true substitutability.
How It Works:
- Products rarely bought by the same customer → likely substitutes
- Products often bought by the same customer → complements or variety-seeking
- Uses Yule's Q coefficient to measure substitutability strength
- Creates hierarchical clusters showing substitution relationships
Real-World Applications:
- Range Rationalization: Identify safe delisting candidates within substitute clusters
- New Product Introduction: Understand which existing products new items might cannibalize
- Private Label Strategy: Identify national brand products suitable for PL alternatives
- Space Optimization: Allocate more space to non-substitutable products
- Markdown Strategy: Clear substitute products sequentially, not simultaneously
Example:
from openretailscience.analysis.customer_decision_hierarchy import CustomerDecisionHierarchy
cdh = CustomerDecisionHierarchy(df)
ax = cdh.plot(
orientation="right",
source_text="Source: Transactions 2024",
title="Snack Food Substitutions",
)
Revenue Tree
The Revenue Tree is a hierarchical breakdown of factors contributing to overall revenue, allowing for detailed analysis of sales performance and identification of areas for improvement.
Key Components of the Revenue Tree:
-
Revenue: The top-level metric, calculated as Customers * Revenue per Customer.
-
Revenue per Customer: Average revenue generated per customer, calculated as: Orders per Customer * Average Order Value.
-
Orders per Customer: Average number of orders placed by each customer.
-
Average Order Value: Average monetary value of each order, calculated as: Items per Order * Price per Item.
-
Items per Order: Average number of items in each order.
-
Price per Item: Average price of each item sold.
Example:
import pandas as pd
import numpy as np
from openretailscience.analysis import revenue_tree
np.random.seed(42)
n = 100
df = pd.DataFrame({
"group_id": np.random.choice([1, 2], size=n),
"customer_id": np.random.randint(1, 31, size=n),
"transaction_id": np.arange(1, n + 1),
"unit_spend": np.random.uniform(50, 500, size=n).round(2),
"unit_quantity": np.random.randint(1, 6, size=n),
"transaction_date": pd.to_datetime(
np.random.choice(pd.date_range("2023-01-01", "2023-01-10"), size=n),
),
})
df["period"] = np.where(df["transaction_date"] < pd.Timestamp("2023-01-04"), "P1", "P2")
rev_tree = revenue_tree.RevenueTree(
df,
period_col="period",
p1_value="P1",
p2_value="P2",
)
HML Segmentation
Heavy, Medium, Light (HML) is a segmentation that places customers into groups based on their percentile of spend or the number of products they bought. Heavy customers are the top 20% of customers, medium are the next 30%, and light are the bottom 50% of customers. These values are chosen based on the proportions of the Pareto distribution. Often, purchase behavior follows this distribution, typified by the expression "20% of your customers generate 80% of your sales." HML segmentation helps answer questions such as:
- How much more are your best customers worth?
- How much more could you spend acquiring your best customers?
- What is the concentration of sales with your top (heavy) customers?
The module also handles customers with zero spend, with options to include them with light customers, exclude them entirely, or place them in a separate "Zero" segment.
Example:
from openretailscience.plots import bar
from openretailscience.segmentation.hml import HMLSegmentation
seg = HMLSegmentation(df, zero_value_customers="include_with_light")
bar.plot(
seg.df.groupby("segment_name")["unit_spend"].sum(),
value_col="unit_spend",
source_text="Source: OpenRetailScience",
sort_order="descending",
x_label="",
y_label="Segment Spend",
title="What's the value of a Heavy customer?",
rot=0,
)
Threshold Segmentation
Threshold Segmentation offers a flexible approach to customer grouping based on custom-defined percentile thresholds. Unlike the fixed 20/30/50 split in HML segmentation, Threshold Segmentation allows you to specify your own thresholds and segment names, making it adaptable to various business needs.
This flexibility enables you to:
- Create quartile segmentations (e.g., top 25%, next 25%, etc.)
- Define custom tiers based on your specific business model
- Segment customers based on alternative metrics beyond spend, such as visit frequency or product variety
Like HML segmentation, the module provides options for handling customers with zero values, allowing you to include them with the lowest segment, exclude them entirely, or place them in a separate segment.
Example:
from openretailscience.plots import bar
from openretailscience.segmentation.threshold import ThresholdSegmentation
# Create custom segmentation with quartiles
# Define thresholds at 25%, 50%, 75%, and 100% (quartiles)
thresholds = [0.25, 0.50, 0.75, 1.0]
segments = ["Bronze", "Silver", "Gold", "Platinum"]
seg = ThresholdSegmentation(
df=df,
thresholds=thresholds,
segments=segments,
zero_value_customers="separate_segment",
)
bar.plot(
seg.df.groupby("segment_name")["unit_spend"].sum(),
value_col="unit_spend",
source_text="Source: OpenRetailScience",
sort_order="descending",
x_label="",
y_label="Segment Spend",
title="Customer Value by Segment",
rot=0,
)
Segmentation Stats
The Segmentation Stats module provides functionality to calculate transaction statistics by segment for a particular segmentation. It makes it easy to compare key metrics across different segments, helping you understand how your customer (or transactions or promotions) groups differ in terms of spending behavior and transaction patterns. This module calculates metrics such as total spend, number of transactions, average spend per customer, and transactions per customer for each segment. It's particularly useful when combined with other segmentation approaches like HML segmentation.
Example:
from openretailscience.segmentation.segstats import SegTransactionStats
from openretailscience.segmentation.hml import HMLSegmentation
# First, segment customers using HML segmentation
segmentation = HMLSegmentation(my_table, zero_value_customers="include_with_light")
# Add segment labels to the transaction data using ibis join
table_with_segments = my_table.left_join(
segmentation.table,
"customer_id",
)
# Calculate transaction statistics by segment
segment_stats = SegTransactionStats(table_with_segments)
# Display the statistics
segment_stats.df
| segment_name | spend | transactions | customers | spend_per_customer | spend_per_transaction | transactions_per_customer | customers_pct |
|---|---|---|---|---|---|---|---|
| Heavy | 2927.21 | 30 | 10 | 292.721 | 97.5735 | 3 | 0.2 |
| Medium | 1014.97 | 45 | 15 | 67.6644 | 22.5548 | 3 | 0.3 |
| Light | 662.107 | 75 | 25 | 26.4843 | 8.82809 | 3 | 0.5 |
| Total | 4604.28 | 150 | 50 | 92.0856 | 30.6952 | 3 | 1 |
NLR (New-Lapsed-Repeating) Segmentation
NLR segmentation classifies customers into New, Repeating, and Lapsed based on their purchasing activity across two time periods. Given a baseline period (P1) and a comparison period (P2), customers are assigned:
- New: Positive spend in P2 only — acquired in the later period
- Repeating: Positive spend in both P1 and P2 — retained customers
- Lapsed: Positive spend in P1 only — stopped purchasing
This segmentation helps answer questions such as:
- How effective are your acquisition efforts at bringing in new customers?
- What proportion of your customer base is being retained period-over-period?
- How many customers have lapsed, and what is their value?
The module supports optional group-level segmentation (e.g., by store or category) so you can compare lifecycle dynamics across different parts of the business.
Example:
import pandas as pd
from openretailscience.segmentation.nlr import NLRSegmentation
df = pd.DataFrame({
"customer_id": [1, 1, 2, 2, 3, 4, 4, 5],
"unit_spend": [50.0, 75.0, 100.0, 120.0, 80.0, 60.0, 90.0, 110.0],
"period": ["P1", "P2", "P1", "P2", "P1", "P2", "P2", "P2"],
})
seg = NLRSegmentation(
df=df,
period_col="period",
p1_value="P1",
p2_value="P2",
)
seg.df
| customer_id | segment_name | unit_spend_p1 | unit_spend_p2 |
|---|---|---|---|
| 1 | Repeating | 50.0 | 75.0 |
| 2 | Repeating | 100.0 | 120.0 |
| 3 | Lapsed | 80.0 | 0.0 |
| 4 | New | 0.0 | 150.0 |
| 5 | New | 0.0 | 110.0 |
RFM Segmentation
Recency, Frequency, Monetary (RFM) segmentation categorizes customers based on their purchasing behavior:
- Recency (R): How recently a customer made a purchase
- Frequency (F): How often a customer makes purchases
- Monetary (M): How much a customer spends
Each metric is typically scored on a scale, and the combined RFM score helps businesses identify loyal customers, at-risk customers, and high-value buyers.
RFM segmentation helps answer questions such as:
- Who are your most valuable customers?
- Which customers are at risk of churn?
- Which customers should be targeted for re-engagement?
Example:
import pandas as pd
from openretailscience.segmentation.rfm import RFMSegmentation
df = pd.DataFrame({
"customer_id": [1, 1, 2, 2, 3, 3, 3],
"transaction_id": [101, 102, 201, 202, 301, 302, 303],
"transaction_date": pd.to_datetime([
"2024-03-01", "2024-03-10", "2024-02-20", "2024-02-25",
"2024-01-15", "2024-01-20", "2024-02-05",
]),
"unit_spend": [50, 75, 100, 150, 200, 250, 300],
})
rfm_segmenter = RFMSegmentation(df=df, current_date="2024-07-01")
rfm_results = rfm_segmenter.df
| customer_id | recency_days | frequency | monetary | r_score | f_score | m_score | rfm_segment | fm_segment |
|---|---|---|---|---|---|---|---|---|
| 1 | 113 | 2 | 125 | 2 | 0 | 0 | 200 | 0 |
| 2 | 127 | 2 | 250 | 1 | 1 | 1 | 111 | 11 |
| 3 | 147 | 3 | 750 | 0 | 2 | 2 | 22 | 22 |
Purchases Per Customer
The Purchases Per Customer module analyzes and visualizes the distribution of transaction frequency across your customer base. This module helps you understand customer purchasing patterns by percentile and is useful for determining values like your churn window.
Example:
from openretailscience.analysis.customer import PurchasesPerCustomer
from openretailscience.plots import histogram
ppc = PurchasesPerCustomer(transactions)
ax = histogram.plot(
df=ppc.cust_purchases_s,
title="Purchases per Customer",
x_label="Number of purchases",
y_label="Number of customers",
source_text="Source: OpenRetailScience",
)
ax.axvline(x=ppc.purchases_percentile(0.8), color="black", linestyle="--", lw=2)
Days Between Purchases
The Days Between Purchases module analyzes the time intervals between customer transactions, providing valuable insights into purchasing frequency and shopping patterns. This analysis helps you understand:
- How frequently your customers typically return to make purchases
- The distribution of purchase intervals across your customer base
- Which customer segments have shorter or longer repurchase cycles
- Where intervention might be needed to prevent customer churn
This information is critical for planning communication frequency, timing promotional campaigns, and developing effective retention strategies. The module can visualize both standard and cumulative distributions of days between purchases.
Example:
from openretailscience.analysis.customer import DaysBetweenPurchases
from openretailscience.plots import histogram
dbp = DaysBetweenPurchases(transactions)
ax = histogram.plot(
df=dbp.purchase_dist_s,
bins=15,
title="Average Days Between Customer Purchases",
x_label="Average Number of Days Between Purchases",
y_label="Number of Customers",
)
ax.axvline(x=dbp.purchases_percentile(0.5), color="black", linestyle="--", lw=2)
Transaction Churn
The Transaction Churn module analyzes how customer churn rates vary based on the number of purchases customers have made. This helps reveal critical retention thresholds in the customer lifecycle when setting a churn window
Example:
from openretailscience.analysis.customer import TransactionChurn
from openretailscience.plots import area
tc = TransactionChurn(transactions, churn_period=churn_period)
cumulative_churn_rate = (
tc.purchase_dist_df["churned"].cumsum().div(tc.n_unique_customers).to_frame(name="cumulative_churn_rate")
)
area.plot(
df=cumulative_churn_rate,
value_col="cumulative_churn_rate",
title="Churn Rate by Number of Purchases",
x_label="Number of Purchases",
y_label="% Churned (cumulative)",
source_text="Source: OpenRetailScience",
)
Composite Rank
The Composite Rank module enables data-driven multi-factor decision making by combining multiple performance metrics into a single actionable ranking. This is essential when no single metric tells the complete story - for instance, a product might have high sales but low margin, or a supplier might offer great prices but poor delivery reliability.
Real-World Applications:
- Product Range Optimization: Balance sales velocity, margin, stock turn, and ratings for listing decisions
- Supplier Performance Management: Evaluate based on price, quality, delivery, and payment terms
- Store Performance Assessment: Rank stores using sales per sq ft, conversion rates, labor productivity, and NPS
- Category Management: Prioritize categories for space allocation using growth, profitability, and market share
Group-Based Ranking:
The module supports both global ranking (across entire dataset) and group-based ranking (within categories):
- Global Ranking: Rank all products together regardless of category
- Group-Based Ranking: Rank products within each category (electronics vs electronics, apparel vs apparel)
- Use Cases: Category management, regional store performance, supplier evaluation by specialization
Business Value:
- Removes bias through systematic multi-factor evaluation
- Scales to thousands of products/stores/suppliers simultaneously
- Provides transparent methodology stakeholders can trust
- Enables clear cut-off decisions based on composite performance
- Supports fair comparison within relevant peer groups
Aggregation Strategies:
- Mean: Balanced scorecard approach, all factors equally important
- Min: Conservative approach, focus on worst-performing metric
- Max: Optimistic approach, highlight strength in any area
- Sum: Cumulative performance across all dimensions
Example:
import pandas as pd
from openretailscience.analysis.composite_rank import CompositeRank
# Create sample data for products with categories
df = pd.DataFrame({
"product_id": [1, 2, 3, 4, 5, 6],
"product_category": ["Electronics", "Electronics", "Electronics", "Apparel", "Apparel", "Apparel"],
"spend": [100, 150, 75, 200, 125, 80],
"customers": [20, 30, 15, 40, 25, 18],
"spend_per_customer": [5.0, 5.0, 5.0, 5.0, 5.0, 4.4],
})
# Create CompositeRank with multiple columns
cr = CompositeRank(
df=df,
rank_cols=[
("spend", "desc"), # Higher spend is better
("customers", "desc"), # Higher customer count is better
("spend_per_customer", "desc") # Higher spend per customer is better
],
agg_func="mean", # Use mean to aggregate ranks
ignore_ties=False, # Keep ties (rows with same values get same rank)
group_col="product_category" # Rank within categories
)
cr.df.sort_values(["product_category", "composite_rank"])
| product_id | product_category | spend | customers | spend_per_customer | spend_rank | customers_rank | spend_per_customer_rank | composite_rank |
|---|---|---|---|---|---|---|---|---|
| 4 | Apparel | 200 | 40 | 5.0 | 1 | 1 | 1 | 1.0 |
| 5 | Apparel | 125 | 25 | 5.0 | 2 | 2 | 1 | 1.67 |
| 6 | Apparel | 80 | 18 | 4.4 | 3 | 3 | 3 | 3.0 |
| 2 | Electronics | 150 | 30 | 5.0 | 1 | 1 | 1 | 1.0 |
| 1 | Electronics | 100 | 20 | 5.0 | 2 | 2 | 1 | 1.67 |
| 3 | Electronics | 75 | 15 | 5.0 | 3 | 3 | 1 | 2.33 |
Utils
Filter and Label by Periods
The Filter and Label by Periods module allows you to:
- Filter transaction data to specific time periods (e.g., quarters, months, promotional periods)
- Add period labels to your data for easy segmentation and comparison
- Analyze before-and-after performance for events or promotions
- Compare metrics across different time frames consistently
This functionality is particularly useful for:
- Comparing KPIs across fiscal quarters or years
- Analyzing seasonal performance patterns
- Measuring the impact of promotions or events
- Creating period-based visualizations with consistent data preparation
Example:
import pandas as pd
import ibis
from openretailscience.utils.date import filter_and_label_by_periods
# Create a sample transactions table
transactions = ibis.memtable(pd.DataFrame({
"transaction_id": range(1, 101),
"transaction_date": pd.date_range(start="2023-01-01", periods=100, freq="D"),
"customer_id": [f"C{i % 20 + 1}" for i in range(100)],
"amount": [float(i % 5 * 25 + 50) for i in range(100)],
}))
# Define period ranges for analysis
period_ranges = {
"Pre-Promotion": ("2023-01-01", "2023-01-31"),
"Promotion": ("2023-02-01", "2023-02-28"),
"Post-Promotion": ("2023-03-01", "2023-03-31")
}
# Filter transactions to the defined periods and add period labels
result_df = filter_and_label_by_periods(transactions, period_ranges).execute()
# Calculate KPIs by period
result_df.groupby("period_name").agg(
transaction_count=("transaction_id", "count"),
total_sales=("amount", "sum"),
avg_transaction_value=("amount", "mean")
)
| period_name | transaction_count | total_sales | avg_transaction_value |
|---|---|---|---|
| Pre-Promotion | 31 | 3050.0 | 98.39 |
| Promotion | 28 | 2800.0 | 100.00 |
| Post-Promotion | 31 | 3150.0 | 101.61 |
Find Overlapping Periods
The Find Overlapping Periods module allows you to:
- Identify overlapping periods between a given start and end date.
- Split the date range into yearly periods that start from the given start date for the first period and then yearly thereafter, ending on the provided end date.
- Return results either as ISO-formatted strings (
"YYYY-MM-DD") or asdatetimeobjects.
This functionality is particularly useful for:
- Analyzing seasonal or yearly patterns in datasets.
- Comparing data across specific date ranges.
- Structuring time-based segmentations efficiently.
Example:
from datetime import datetime
from openretailscience.utils.date import find_overlapping_periods
# Example with string input
overlapping_periods = find_overlapping_periods("2022-06-15", "2025-03-10")
print(overlapping_periods)
| Start Date | End Date |
|---|---|
| 2022-06-15 | 2023-03-10 |
| 2023-06-15 | 2024-03-10 |
| 2024-06-15 | 2025-03-10 |
Filter and Label by Condition
The Filter and Label by Condition module allows you to:
- Filter data based on arbitrary conditions (e.g., category, region, price range)
- Add descriptive labels to filtered rows for easier segmentation
- Prepare labeled subsets for downstream analysis or visualization
- Combine multiple Boolean conditions into a single, labeled dataset
This functionality is particularly useful for:
- Segmenting customers or products by custom-defined rules
- Categorizing transactions based on business logic
- Creating labeled training data for machine learning
- Analyzing metrics across different business segments
Example:
import pandas as pd
import ibis
from openretailscience.utils.filter_and_label import filter_and_label_by_condition
# Sample product table
products = ibis.memtable(pd.DataFrame({
"product_id": range(1, 9),
"category": ["toys", "shoes", "toys", "books", "electronics", "toys", "shoes", "books"],
"price": [15, 55, 25, 10, 200, 35, 60, 20],
}))
# Define filter conditions
conditions = {
"Toys": products["category"] == "toys",
"Shoes": products["category"] == "shoes",
"Premium Electronics": (products["category"] == "electronics") & (products["price"] > 100)
}
# Apply filtering and labeling
labeled_data = filter_and_label_by_condition(products, conditions).execute()
| product_id | category | price | label |
|---|---|---|---|
| 1 | toys | 15 | Toys |
| 2 | shoes | 55 | Shoes |
| 3 | toys | 25 | Toys |
| 5 | electronics | 200 | Premium Electronics |
| 6 | toys | 35 | Toys |
| 7 | shoes | 60 | Shoes |
Label by Condition
The Label by Condition module provides functionality to label groups (customers, transactions, stores, etc.) based on whether they contain items that meet specified conditions. This module is designed for group-level analysis where you want to classify entire entities rather than individual records. Unlike the Filter and Label by Condition function which filters and labels individual rows of data, this module aggregates data by groups and applies labels at the group level.
The Label by Condition module allows you to:
- Label groups in a table based on whether items in the group meet a specified condition
- Support both binary labeling (contains/not_contains) and extended labeling (contains/mixed/not_contains)
- Customize label names and return column names for flexible analysis
- Analyze group-level patterns for customer segmentation, product categorization, and promotional analysis
This functionality is particularly useful for:
- Tagging transactions as containing a product, product category, or promotion
- Tagging customers as having bought a product, product category, or promotion, or store_id
- Segmenting customers as new, repeating or lapsed
Example:
import pandas as pd
import ibis
from openretailscience.utils.label import label_by_condition
# Sample transaction data
transactions = ibis.memtable(pd.DataFrame({
"customer_id": [1, 1, 1, 2, 2, 3, 3],
"product_category": ["toys", "books", "toys", "books", "clothes", "clothes", "clothes"],
}))
# Binary labeling: Label customers who bought any toys
toy_customers = label_by_condition(
table=transactions,
label_col="customer_id",
condition=transactions["product_category"] == "toys",
labeling_strategy="binary"
).execute()
| customer_id | label_name |
|---|---|
| 1 | contains |
| 2 | not_contains |
| 3 | not_contains |