Script 819: Campaign Anomalies Amazon Music Global AMU Signups

Purpose:

The Python script identifies and reports campaign-level outliers in key conversion metrics, such as Total AMU Sign-Ups and Podcast First Stream, by calculating anomalies using adjustable thresholds.
To Elaborate

The script is designed to detect anomalies in campaign performance metrics, specifically focusing on Total AMU Sign-Ups and Podcast First Stream conversions. It uses statistical methods to identify outliers by calculating forecasted values and comparing them with actual results. The script applies adjustable thresholds for Interquartile Range (IQR) and deviation to determine significant deviations from expected performance. It processes data from the last eight weeks, generating a weekly report on Thursdays to account for conversion lag. The script filters campaigns to focus on high-traffic ones, ensuring that only those contributing significantly to the overall metrics are analyzed. It then calculates various metrics, such as conversion rates and cost per click, and aggregates data to identify trends and anomalies. The results are summarized in a report, highlighting campaigns with significant deviations and providing insights into performance trends across different categories and countries.
Walking Through the Code

Initialization and Setup
- The script begins by defining constants and user-configurable parameters, such as the metrics to analyze and thresholds for outlier detection.
- It checks whether the code is running on a server or locally, loading necessary data and libraries accordingly.
Data Preparation
- The script filters the input data to focus on high-traffic campaigns, using a specified metric and threshold to determine which campaigns to include.
- It calculates additional metrics like conversion rate and cost per click for each campaign.
Anomaly Detection
- The script defines functions to calculate forecasts using exponential smoothing and to determine interquartile ranges for historical data.
- It computes deviation ratios and anomaly scores for each metric, flagging significant deviations based on user-defined thresholds.
Aggregation and Reporting
- Data is aggregated by campaign, and anomaly scores are calculated to identify campaigns with significant deviations.
- The script generates a summary report, highlighting campaigns with notable performance changes and categorizing them by trend (positive, negative, or mixed).
Output and Configuration
- The script prepares the output data for reporting, including generating a prompt for email summaries and configuring deep links for further analysis.
- It handles local development scenarios by saving outputs to files for debugging purposes.
Vitals

Script ID : 819
Client ID / Customer ID: 197178269 / 13095968
Action Type: Email Report
Item Changed: None
Output Columns:
Linked Datasource: M1 Report
Reference Datasource: None
Owner: Michael Huang (mhuang@marinsoftware.com)
Created by Michael Huang on 2024-03-15 11:22
Last Updated by Michael Huang on 2024-12-23 07:02
> See it in Action

Python Code

##
## name: Conversion Outlier Detection Report with Summary
## description:
##  * identify Conversion outliers from key conversion types
##    - Total AMU Sign-Ups
##    - Podcast First Stream
##  * calculate anomaly via adjustable IQR and Deviation Thresholds
##  * require weekly report from last 8 weeks, generated on Thur to allow 4 days of conv lag
##
##
## author: Michael S. Huang
## created: 2024-03-15
## 

RPT_COL_CLIENT = 'Client'
RPT_COL_CURRENCY = 'Currency'
RPT_COL_CAMPAIGN = 'Campaign'
RPT_COL_DATE = 'Date'
RPT_COL_PUBLISHER = 'Publisher'
RPT_COL_ACCOUNT = 'Account'
RPT_COL_CAMPAIGN_CATEGORY = 'Campaign_Category'
RPT_COL_IMPR = 'Impr.'
RPT_COL_CLICKS = 'Clicks'
RPT_COL_PUB_COST = 'Pub. Cost'
RPT_COL_TOTAL_AMU_SIGNUPS_CONV = 'Total AMU Sign-Ups Conv.'
RPT_COL_PODCAST_FIRST_STREAM_CONV = 'Podcast First Stream Conv.'
RPT_COL_CONV = RPT_COL_TOTAL_AMU_SIGNUPS_CONV

COL_CONV_RATE = 'CVR'
COL_COST_PER_CLICK = 'CPC'
COL_CTR = 'CTR %'
COL_COST_PER_LEAD = 'CPL'

# column names
COL_FORECAST = 'forecast'
COL_ACTUAL = 'actual'
COL_TRAILING = 'trailing'
COL_IQR = 'z_iqr'
COL_DEVIATION = 'z_deviation'
COL_DEVIATION_PCT = 'deviation_pct'

COL_DEVIATION_RATIO = 'z_deviation_ratio'
COL_DEVIATION_RATIO_FLAG_COUNT = COL_DEVIATION_RATIO + '_flagged'

COL_OUTLIER_SCORE = 'z_outlier_score'
COL_OUTLIER_SCORE_FLAG_COUNT = COL_OUTLIER_SCORE + '_flagged'

COL_OUTLIER_DEVIATION_FLAG_COUNT = 'zz_outlier_deviation_flagged'
COL_OUTLIER_DEVIATION_FLAG_COUNT_SCALED = COL_OUTLIER_DEVIATION_FLAG_COUNT + '_scaled'

COL_MOST_UPWARD_OUTLIER_METRIC = 'zz_most_upward_outlier_metric'
COL_MOST_DOWNWARD_OUTLIER_METRIC = 'zz_most_downward_outlier_metric'

COL_TOTAL_FLAG_COUNT_SCALED = 'zz_total_flag_count_scaled'
COL_TRAILING_COST = RPT_COL_PUB_COST + '_' + COL_TRAILING

COL_TREND = 'Trend'
COL_COUNTRY = 'Country'

## NB. Row key columns must be 'mscripts_row_key' for backend to recognize
COL_ROW_KEY = 'mscripts_row_key'

########### START - User Params ###########

# only focus on high traffic campaigns
# note: to use Pub Cost/Revenue, need to currency convert first
TOP_CAMPAIGN_METRIC = RPT_COL_CONV 
FRACTION_OF_TOP_CAMPAIGNS_TO_INCLUDE = 0.85
MIN_THRESHOLD_METRIC = RPT_COL_CONV
MIN_THRESHOLD = 10


# lookback window for forecast
MIN_FORECAST_LOOKBACK_WEEKS = 7

# Metrics to include in Report
# Format: (Metric, Outlier Threshold, Deviation Threshold, Forecast Precision)
# Metric = metrics to analyze
# Outlier Threshold: IQR multiplier; 1.5 is equivalent to 97.5% percentile
# Deviation Threshold: deviation threshold (in decimal; 0.20 = 20%)
# Forecast Precison = number of decimal places; 0 for integer
REPORT_METRICS = [
    (RPT_COL_CONV,          1.5,    0.20, 0),
    (COL_CONV_RATE,         1.5,    0.20, 2),
]

########### END - User Params ###########

########### START - Local Mode Config ###########
# Step 1: Uncomment download_preview_input flag and run Preview successfully with the Datasources you want
download_preview_input=False
# Step 2: In MarinOne, go to Scripts -> Preview -> Logs, download 'dataSourceDict' pickle file, and update pickle_path below
# pickle_path = ''
pickle_path = '/Users/mhuang/Downloads/pickle/amazon_music_anomaly_weekly_20241223.pkl'
# Step 3: Copy this script into local IDE with Python virtual env loaded with pandas and numpy.
# Step 4: Run locally with below code to init dataSourceDict

# determine if code is running on server or locally
def is_executing_on_server():
    try:
        # Attempt to access a known restricted builtin
        dict_items = dataSourceDict.items()
        return True
    except NameError:
        # NameError: dataSourceDict object is missing (indicating not on server)
        return False

local_dev = False

if is_executing_on_server():
    print("Code is executing on server. Skip init.")
elif len(pickle_path) > 3:
    print("Code is NOT executing on server. Doing init.")
    local_dev = True
    # load dataSourceDict via pickled file
    import pickle
    dataSourceDict = pickle.load(open(pickle_path, 'rb'))

    # print shape and first 5 rows for each entry in dataSourceDict
    for key, value in dataSourceDict.items():
        print(f"Shape of dataSourceDict[{key}]: {value.shape}")
        # print(f"First 5 rows of dataSourceDict[{key}]:\n{value.head(5)}")

    # set outputDf same as inputDf
    inputDf = dataSourceDict["1"]
    outputDf = inputDf.copy()

    # setup timezone
    import datetime
    # Chicago Timezone is GMT-5. Adjust as needed.
    CLIENT_TIMEZONE = datetime.timezone(datetime.timedelta(hours=-5))

    # import pandas
    import pandas as pd
    import numpy as np

    # other imports
    import re
    import urllib

    # import Marin util functions
    # from marin_scripts_utils import tableize, select_changed

    # pandas settings
    pd.set_option('display.max_columns', None)  # Display all columns
    pd.set_option('display.max_colwidth', None)  # Display full content of each column

else:
    print("Running locally but no pickle path defined. dataSourceDict not loaded.")
    exit(1)
########### END - Local Mode Setup ###########


########### Anomaly Detection Libray Functions #############

### Forecast and Anomaly functions

# get forecast via simple exponential smoothing
# note: changed to handle weekly not daily data
def get_forecasts(data, decimals=0):
    if len(data) >= MIN_FORECAST_LOOKBACK_WEEKS:
        # print("data len: ", len(data))
        # print("data.index", data.index)
        # print("data", data)

        # exclude most recent data point
        hist_data = data.iloc[:-1]

        # exponential smoothing
        alpha = 0.5 # smoothing factor
        forecasts = hist_data.ewm(alpha=alpha).mean().iloc[-1]
        if decimals == 0:
            forecasts = forecasts.astype(int)
        else:
            forecasts = forecasts.round(decimals)

        return forecasts
    else:
        print("not enough data. skipping: ", data.index)
    
    return None

# get interquartile range from previous weeks
def get_inter_quartile_ranges(data):
    if len(data) >= MIN_FORECAST_LOOKBACK_WEEKS:
        # print("data.index", data.index)
        # print("data", data)

        # exclude most recent data point
        hist_data = data.iloc[:-1]

        if np.isnan(hist_data).any():
            print("fixing nan in hist_data")
            hist_data = hist_data.fillna(0)
        

        # iqrs = np.std(hist_data, axis=0)

        # calculate interquartile range (IQR)
        Q1 = hist_data.quantile(0.25)
        Q3 = hist_data.quantile(0.75)
        IQR = Q3 - Q1

        return IQR
    else:
        print("not enough data. skipping: ", data.index)
    
    return None


# most recent data point is the last item
get_actuals = lambda x: x.iloc[[-1]]

# get trailing total
def get_trailing_total(data, window=4):
    if len(data) >= window:
        index_previous_periods = list(range(-1, -(window+1), -1))
        hist_data = data.iloc[index_previous_periods]
        return hist_data.sum()
    else:
        print("not enough data. skipping: ", data.index)
    
    return None

# ### Calculate anomaly score

# calc anomaly score across list of metrics
def calc_anomaly_scores(df, metrics_and_weights):
    df = df.copy()

    deviation_ratio_list = []
    outlier_score_list = []

    for (metric, _, _, _) in metrics_and_weights:
        forecast = df.loc[:, (metric, COL_FORECAST)]
        actual = df.loc[:, (metric, COL_ACTUAL)]
        iqr = df.loc[:, (metric, COL_IQR)]

        if np.isnan(forecast).any():
            print("nan in forecast")
            forecast = np.nan_to_num(forecast)
            
        if np.isnan(actual).any():
            print("nan in actual")
            actual = np.nan_to_num(actual)

        if np.isnan(iqr).any():
            print("nan in iqr")
            iqr = np.nan_to_num(iqr)

        # negative deviation when less than forecasted
        deviation = np.subtract(actual, forecast)
        
        df.loc[:, (metric, COL_DEVIATION)] = deviation

        # when both forecasted and actual values are ZERO, deviation should be ZERO
        # when forecasted is ZERO but actual is not, set to 100% deviation
        deviation_ratio = np.where(forecast > 0, \
                            deviation/forecast, \
                            np.where(actual > 0, 1.0, 0.0))
        deviation_ratio_list.append(deviation_ratio)
        df.loc[:, (metric, COL_DEVIATION_RATIO)] = deviation_ratio
        df.loc[:, (metric, COL_DEVIATION_PCT)] = np.char.add(np.char.mod('%0.0f', deviation_ratio * 100), '%')

        # anomaly score is ratio of deviation with Inter Quartile Range; score of 1.5 would be 97.5% percentile
        # positive score means exceeding forecast
        # if IQR is 0, default anomaly score to 0 so it won't trigger any alerts
        score = np.where(abs(iqr) > 0, deviation / iqr, 0.0)

        outlier_score_list.append(score)
        df.loc[:, (metric, COL_OUTLIER_SCORE)] = score

    # flag scores that exceed the anomaly threshold
    anomaly_thresholds = np.array([threshold for (_, threshold, _, _) in metrics_and_weights])
    scores_stack = np.stack(outlier_score_list, axis=0)
    flagged_outlier_score_list = np.where(np.abs(scores_stack) > anomaly_thresholds[:, None], 1, 0)
    # sum across metrics and save for output
    df[COL_OUTLIER_SCORE_FLAG_COUNT] = np.sum(flagged_outlier_score_list, axis=0)

    # flag deviation ratios that exceed the deviation threshold
    deviation_thresholds = np.array([threshold for (_, _, threshold, _) in metrics_and_weights])
    deviation_ratios_stack = np.stack(deviation_ratio_list, axis=0)
    flagged_deviation_ratio_list = np.where(np.abs(deviation_ratios_stack) > deviation_thresholds[:, None], 1, 0)
    # sum across metrics and save for output
    df[COL_DEVIATION_RATIO_FLAG_COUNT] = np.sum(flagged_deviation_ratio_list, axis=0)

    # flag anomalous deviations by combining both flags above
    # AND flags in flagged_outlier_score_list and flagged_deviation_ratio_list
    # get the count of metrics where both are 1
    combined_flags = np.logical_and(flagged_outlier_score_list, flagged_deviation_ratio_list)
    df[COL_OUTLIER_DEVIATION_FLAG_COUNT] = np.sum(combined_flags, axis=0)

    # scaled score highlights larger spenders with more anomaly or deviation flags
    # use trailing clicks as proxy metrics, since this script runs across many currencies and don't want to currency conversion here
    trailing_clicks = df.loc[:, (RPT_COL_CLICKS, COL_TRAILING)]
    df[COL_TOTAL_FLAG_COUNT_SCALED] = np.round((df[COL_OUTLIER_SCORE_FLAG_COUNT] + df[COL_DEVIATION_RATIO_FLAG_COUNT]) * trailing_clicks, 0)
    
    # another version that highlights larger spenders with anomalous deviation flags
    df[COL_OUTLIER_DEVIATION_FLAG_COUNT_SCALED] = np.round(df[COL_OUTLIER_DEVIATION_FLAG_COUNT] * trailing_clicks, 0)


    # calc best & worst changes; scale change by outlier score (use absolute value to avoid change sign of deviation)
    # Using numpy.nan_to_num to fill in NA
    change_score = np.nan_to_num(np.multiply(deviation_ratio_list, np.abs(outlier_score_list)), copy=False)
    scores_stack = np.stack(change_score, axis=0)
    max_scores = np.maximum.reduce(scores_stack, axis=0)
    min_scores = np.minimum.reduce(scores_stack, axis=0)
    max_score_indices = np.argmax(scores_stack, axis=0)
    min_score_indices = np.argmin(scores_stack, axis=0)

    # fill in corresponding metric names
    metric_names = [metric for (metric, weight, _, _) in metrics_and_weights]
    df[COL_MOST_UPWARD_OUTLIER_METRIC] = [metric_names[idx] if score > 0 else np.nan for (score, idx) in zip(max_scores, max_score_indices)]
    df[COL_MOST_DOWNWARD_OUTLIER_METRIC] = [metric_names[idx] if score < 0 else np.nan for (score, idx) in zip(min_scores, min_score_indices)]

    # resort columns
    df.columns = df.columns.swaplevel(0, 1)
    df.sort_index(axis=1, inplace=True)
    df.columns = df.columns.swaplevel(1, 0)
    df.sort_index(axis=1, inplace=True)

    # return everything for debugging
    # return (df.sort_values(by=COL_OUTLIER_SCORE_FLAG_COUNT, axis=0, ascending=False), max_scores, min_scores, max_score_indices, min_score_indices)
    return df.sort_values(by=COL_OUTLIER_SCORE_FLAG_COUNT, axis=0, ascending=False)


# convert to percentage units with 2 decimal places
def safe_percentage(numerator, denominator):
    return np.where(denominator > 0, \
                    round(numerator / denominator * 100, 2), \
                    0)


########### END Functions ###########


#### User Starts Here

print('inputDf.info\n',inputDf.info())

min_input_date = min(inputDf[RPT_COL_DATE])
max_input_date = max(inputDf[RPT_COL_DATE])
print(f"Input date range: {min_input_date.date()} to {max_input_date.date()}")

# calculate report coverage date

report_date_start = (pd.to_datetime(max_input_date))
report_date_end = report_date_start + pd.Timedelta(days=6)

print(f"Most recent input date is {max_input_date.date()}. Report Week set to: {report_date_start.date()} to {report_date_end.date()}.")

inputDf_reduced = inputDf \
                         .reset_index() \
                         .set_index([RPT_COL_CLIENT, RPT_COL_ACCOUNT, RPT_COL_CAMPAIGN])

# set blank campaign category to Untagged
inputDf_reduced[RPT_COL_CAMPAIGN_CATEGORY] = inputDf_reduced[RPT_COL_CAMPAIGN_CATEGORY].fillna('Untagged')



### Keep only Top Campaigns via configured metric TOP_CAMPAIGN_METRIC (30-day lookback)

# get trailing 30-day total by campaign

agg_func = {
		TOP_CAMPAIGN_METRIC: ['sum'],
        MIN_THRESHOLD_METRIC: ['sum'],
}

thirty_days_ago = pd.to_datetime(max_input_date - datetime.timedelta(days=30))

df_camp_agg = inputDf_reduced.loc[inputDf_reduced[RPT_COL_DATE] >= thirty_days_ago] \
                              .groupby([RPT_COL_CLIENT, RPT_COL_ACCOUNT, RPT_COL_CAMPAIGN]) \
                              .agg(agg_func) \
                              .droplevel(1, axis=1) \
                              .sort_values([TOP_CAMPAIGN_METRIC], ascending=False)

metric_subtotal = df_camp_agg[TOP_CAMPAIGN_METRIC].sum()

COL_TOP_CAMPAIGN_METRIC_CUMULATIVE = TOP_CAMPAIGN_METRIC+'_cumulative'
COL_TOP_CAMPAIGN_METRIC_CUMULATIVE_PCT = TOP_CAMPAIGN_METRIC+'_cumulative_pct'

df_camp_agg[COL_TOP_CAMPAIGN_METRIC_CUMULATIVE] = df_camp_agg[TOP_CAMPAIGN_METRIC].cumsum()
df_camp_agg[COL_TOP_CAMPAIGN_METRIC_CUMULATIVE_PCT] = df_camp_agg[COL_TOP_CAMPAIGN_METRIC_CUMULATIVE] / metric_subtotal

top_campaign_metric_cutoff = metric_subtotal * FRACTION_OF_TOP_CAMPAIGNS_TO_INCLUDE
df_top_campaigns = df_camp_agg.loc[ df_camp_agg[COL_TOP_CAMPAIGN_METRIC_CUMULATIVE] <= top_campaign_metric_cutoff ] \
                                       .sort_values([TOP_CAMPAIGN_METRIC], ascending=False)

print(f"For metric '{TOP_CAMPAIGN_METRIC}', trailing 30-day sub-total across all {df_camp_agg.shape[0]:,} campaigns is {round(metric_subtotal):,}. {FRACTION_OF_TOP_CAMPAIGNS_TO_INCLUDE*100}% of it ({round(top_campaign_metric_cutoff):,}) comes from just {df_top_campaigns.shape[0]} campaigns.")

# apply minimum threshold
before_count = df_top_campaigns.shape[0]
df_top_campaigns = df_top_campaigns.loc[df_top_campaigns[MIN_THRESHOLD_METRIC] > MIN_THRESHOLD]
after_count = df_top_campaigns.shape[0]
print(f"Applying min thres of {MIN_THRESHOLD} to {MIN_THRESHOLD_METRIC} trimmed off {before_count-after_count} campaigns")


# actually filter by top campaigns
before_count = inputDf_reduced.shape[0]
inputDf_reduced = inputDf_reduced.loc[df_top_campaigns.index]
after_count = inputDf_reduced.shape[0]
print(f"Applying top campaign critera reduced input row count from {before_count} to {after_count}")


# compute ratio metrics for each date

inputDf_reduced[COL_CONV_RATE] = safe_percentage(inputDf_reduced[RPT_COL_CONV], inputDf_reduced[RPT_COL_CLICKS])
inputDf_reduced[COL_COST_PER_CLICK] = inputDf_reduced[RPT_COL_PUB_COST] / inputDf_reduced[RPT_COL_CLICKS]
inputDf_reduced[COL_CTR] = safe_percentage(inputDf_reduced[RPT_COL_CLICKS], inputDf_reduced[RPT_COL_IMPR])

### Aggregate across Dates for Campaigns; calculate Forecast & Actual values
agg_spec = {
    metric: [
        (COL_FORECAST, lambda x: get_forecasts(x, decimals=decimals)),
        (COL_IQR, get_inter_quartile_ranges),
        (COL_ACTUAL, get_actuals)
    ] 
    for metric, _, _, decimals in REPORT_METRICS
}

# Now, add the additional keys for RPT_COL_PUB_COST and RPT_COL_CLICKS
agg_spec[RPT_COL_PUB_COST] = agg_spec.get(RPT_COL_PUB_COST, []) + [(COL_TRAILING, get_trailing_total)]
agg_spec[RPT_COL_CLICKS] = agg_spec.get(RPT_COL_CLICKS, []) + [(COL_TRAILING, get_trailing_total)]


df_campaign = inputDf_reduced \
                        .fillna(0) \
                        .replace([np.inf, -np.inf], 0) \
                        .groupby([RPT_COL_CLIENT, RPT_COL_CURRENCY, RPT_COL_CAMPAIGN_CATEGORY, RPT_COL_PUBLISHER, RPT_COL_ACCOUNT, RPT_COL_CAMPAIGN]) \
                        .agg(agg_spec)

### Compute Anomaly Scores

df_campaign_anomaly = calc_anomaly_scores(df_campaign, REPORT_METRICS)


### Find Outlier Trafficking Accounts and Campaigns

highlight_campaigns = df_campaign_anomaly.loc[ \
                                              (df_campaign_anomaly[COL_OUTLIER_DEVIATION_FLAG_COUNT] > 0)
                                             ] \
                                         .sort_values(by=[COL_OUTLIER_DEVIATION_FLAG_COUNT_SCALED], ascending=[False]) \
                                         .reset_index()
print("highlight_campaigns: ", highlight_campaigns.shape[0])

### Add some metadata

# add row key for Deep Links
highlight_campaigns[COL_ROW_KEY] = ['row_' + str(i) + '_key' for i in range(1, len(highlight_campaigns) + 1)]

# parse Country from campaign name
if not highlight_campaigns.empty:
    highlight_campaigns[COL_COUNTRY] = highlight_campaigns[RPT_COL_CAMPAIGN].str.split('_', expand=True)[0]
else:
    highlight_campaigns[COL_COUNTRY] = pd.Series(dtype='object')

# get list of Countries and Categories for building prompt
categories = highlight_campaigns[RPT_COL_CAMPAIGN_CATEGORY].value_counts().index.tolist()
countries = highlight_campaigns[COL_COUNTRY].value_counts().index.tolist()

### Deterine positive/negative trend count for each section

def get_trend(row):
    conv_change = row[(RPT_COL_CONV, COL_DEVIATION)]
    cvr_change = row[(COL_CONV_RATE, COL_DEVIATION)]

    if conv_change > 0 and cvr_change > 0:
        return 'Positive'
    elif conv_change < 0 and cvr_change < 0:
        return 'Negative'
    return 'Mixed'

# set Trend for each campaign
highlight_campaigns[COL_TREND] = highlight_campaigns.apply(get_trend, axis=1)

# get aggregate count
campaign_category_trend_count = highlight_campaigns[[RPT_COL_CAMPAIGN_CATEGORY, COL_TREND]]
campaign_category_trend_count.columns = campaign_category_trend_count.columns.get_level_values(0)
campaign_category_trend_count = campaign_category_trend_count.value_counts().reset_index(name='count')
campaign_category_trend_count = campaign_category_trend_count.sort_values(by=[RPT_COL_CAMPAIGN_CATEGORY, COL_TREND], ascending=False)

country_trend_count = highlight_campaigns[[COL_COUNTRY, COL_TREND]]
country_trend_count.columns = country_trend_count.columns.get_level_values(0)
country_trend_count = country_trend_count.value_counts().reset_index(name='count')
country_trend_count = country_trend_count.sort_values(by=[COL_COUNTRY, COL_TREND], ascending=False)


### Determine the Most Promenent Metric Trend for each section

def determine_trend(highlight_campaigns, section, most_upward_or_downward_metric, trend_name):
    trend_data = highlight_campaigns.loc[highlight_campaigns[most_upward_or_downward_metric].notna(), [section, most_upward_or_downward_metric]] \
                                    .value_counts() \
                                    .reset_index()
    trend_data[COL_TREND] = trend_name
    trend_data.columns = [section, 'metric', 'count', 'trend']
    return trend_data

category_trend_up = determine_trend(highlight_campaigns, RPT_COL_CAMPAIGN_CATEGORY, COL_MOST_UPWARD_OUTLIER_METRIC, 'Positive')
category_trend_down = determine_trend(highlight_campaigns, RPT_COL_CAMPAIGN_CATEGORY, COL_MOST_DOWNWARD_OUTLIER_METRIC, 'Negative')
country_trend_up = determine_trend(highlight_campaigns, COL_COUNTRY, COL_MOST_UPWARD_OUTLIER_METRIC, 'Positive')
country_trend_down = determine_trend(highlight_campaigns, COL_COUNTRY, COL_MOST_DOWNWARD_OUTLIER_METRIC, 'Negative')

highlight_category_trend = pd.concat([category_trend_up, category_trend_down], axis=0).sort_values(by=[RPT_COL_CAMPAIGN_CATEGORY, 'metric'])
highlight_country_trend = pd.concat([country_trend_up, country_trend_down], axis=0).sort_values(by=[COL_COUNTRY, 'metric'])
print("highlight_category_trend", highlight_category_trend)
print("highlight_country_trend", highlight_country_trend)

top_count_value_category = highlight_category_trend['count'].max()
most_prominent_metric_trend_for_category = highlight_category_trend[highlight_category_trend['count'] == top_count_value_category]
print("most prominent category trend", most_prominent_metric_trend_for_category)

top_count_value_country = highlight_country_trend['count'].max()
most_prominent_metric_trend_for_country = highlight_country_trend[highlight_country_trend['count'] == top_count_value_country]
print("most prominent country trend", most_prominent_metric_trend_for_country)



### Construct Complete Prompt

def get_anomaly_results_for_prompt_string(df, min_count=1):
    if df.empty or len(df) < min_count:
        return pd.DataFrame()
    else:
        df.set_index([RPT_COL_CLIENT, RPT_COL_CAMPAIGN_CATEGORY, RPT_COL_PUBLISHER, RPT_COL_ACCOUNT, RPT_COL_CAMPAIGN, RPT_COL_CURRENCY], inplace=True)

        trailing = df.xs(key=COL_TRAILING, level=1, axis=1, drop_level=False).round(0).astype(int)
        forecast = df.xs(key=COL_FORECAST, level=1, axis=1, drop_level=False).round(2)
        deviation_pct = df.xs(key=COL_DEVIATION_PCT, level=1, axis=1, drop_level=False)
        anomaly_score = df.xs(key=COL_OUTLIER_SCORE, level=1, axis=1, drop_level=False).round(2)
        other_cols = df[[COL_MOST_UPWARD_OUTLIER_METRIC, COL_MOST_DOWNWARD_OUTLIER_METRIC, COL_TREND, COL_ROW_KEY]]
        table = pd.concat([trailing, deviation_pct, forecast, anomaly_score, other_cols], axis=1)
        # sorting removes level 1 col name for each column and confuses GPT
        # table = table.sort_index(axis=1, level=0)
        table = table.reset_index()

        # flatten column names
        table.columns = ['{}_{}'.format(col[0], col[1]) if col[1] else col[0] for col in table.columns]
        # add quotes around column names and values
        quoted_column_names = ['"{}"'.format(col) for col in table.columns]
        table_str = table.to_string(index=False, header=quoted_column_names, formatters={col: lambda x: f'"{x}"' for col in table.columns})

        return table_str

def get_anomaly_results_for_human_dataframe(df):
    if df.empty:
        return pd.DataFrame()
    else:
        df.set_index([RPT_COL_CLIENT, RPT_COL_PUBLISHER, RPT_COL_ACCOUNT, RPT_COL_CAMPAIGN, RPT_COL_CAMPAIGN_CATEGORY, COL_COUNTRY, COL_TREND, RPT_COL_CURRENCY], inplace=True)

        deviation_pct = df.xs(key=COL_DEVIATION_PCT, level=1, axis=1, drop_level=False)
        forecast = df.xs(key=COL_FORECAST, level=1, axis=1, drop_level=False).round(2)
        other_cols = df[[COL_ROW_KEY]]
        table = pd.concat([deviation_pct, forecast, other_cols], axis=1)
        table = table.sort_index(axis=1, level=0)
        trailing = df.xs(key=COL_TRAILING, level=1, axis=1, drop_level=False).round(0).astype(int)
        table = pd.concat([trailing, table], axis=1)
        table = table.reset_index()
        table.columns = ['{}_{}'.format(col[0], col[1]) if col[1] else col[0] for col in table.columns]
        return table
    
## Build Prompt

prompt_categories = "\n\n".join(
f'''

using campaigns listed in {category} dataframe, and campaign counts grouped by campaign category and trend, and write a short summary for each trend.
for each trend, provide the count given in the dataframe, and choose up to 3 campaigns with the largest positive or negative deviation in '{RPT_COL_CONV}' and/or '{COL_CONV_RATE}' from baseline.
each bullet point should start with the campaign name ('{RPT_COL_CAMPAIGN}') as link to row key ('{COL_ROW_KEY}') and bolded trailing cost ('{COL_TRAILING_COST}') with currency symbol).
include actual values when explaining percentage deviation ('METRIC_{COL_DEVIATION_PCT}') from baseline value ('METRIC_{COL_FORECAST}') for both '{RPT_COL_CONV}' and '{COL_CONV_RATE}'.

EXAMPLE:

__{category}__

3 campaigns doing better. Examples:
  * [My Campaign 1](COL_ROW_KEY) (__£3,439__): __Sign-Ups__ experienced a substantial increase of __+60%__ from baseline of 1,345, and __Sign-Up Conv Rate__ lifted __+17%__ from projection of 3.1%.
  * [My Campaign 2](COL_ROW_KEY) (__£2,439__): __Sign-Ups__ experienced a substantial lift of __+30%__ from baseline of 2,345, and __Sign-Up Conv Rate__ improved __+7%__ from projection of 5.1%.

1 campaign facing declines. Examples:
  * [My Campaign 3](COL_ROW_KEY) (__£1,439__): __Sign-Ups__ experienced a substantial drop of __-30%__ from baseline of 3,345, and __Sign-Up Conv Rate__ tanked __-27%__ from projection of 4.1%.

2 campaign with mixed trend. Examples:
  * [My Campaign 4](COL_ROW_KEY) (__£4,439__): __Sign-Ups__ experienced a substantial increase of __+20%__ from baseline of 2,000, but __Sign-Up Conv Rate__ dropped __-15%__ from projection of 3.7%.
'''
    for category in categories
)


dataframe_categories = "\n\n".join(
f'''
DataFrame of "{category}" Campaigns with at least one large anomalous deviation:
[[[
{get_anomaly_results_for_prompt_string(highlight_campaigns.loc[highlight_campaigns[RPT_COL_CAMPAIGN_CATEGORY] == category].head(20))}
]]]
'''
    for category in categories
)

prompt_countries = "\n\n".join(
f'''

using campaigns listed in {country} dataframe, and campaign counts grouped by country and trend, and write a short summary for each trend.
use the full country name instead of ISO country code. example: use Japan instead of JP, France instead of FR, Germany instead of DE.
for each trend, provide the count given in the dataframe, and choose up to 3 campaigns with the largest positive or negative deviation in '{RPT_COL_CONV}' and/or '{COL_CONV_RATE}' from baseline.
each bullet point should start with the campaign name ('{RPT_COL_CAMPAIGN}') as link to row key ('{COL_ROW_KEY}') and bolded trailing cost ('{COL_TRAILING_COST}') with currency symbol).
include actual values when explaining percentage deviation ('METRIC_{COL_DEVIATION_PCT}') from baseline value ('METRIC_{COL_FORECAST}') for both '{RPT_COL_CONV}' and '{COL_CONV_RATE}'.

EXAMPLE:

__{country}__

3 campaigns doing really well. Examples:
  * [My Campaign 1](COL_ROW_KEY) (__£3,439__): __Sign-Ups__ experienced a substantial increase of __+60%__ from baseline of 1,345, and __Sign-Up Conv Rate__ lifted __+17%__ from projection of 3.1%.
  * [My Campaign 2](COL_ROW_KEY) (__£2,439__): __Sign-Ups__ experienced a substantial lift of __+30%__ from baseline of 2,345, and __Sign-Up Conv Rate__ improved __+7%__ from projection of 5.1%.

5 campaign worst than before. Examples:
  * [My Campaign 3](COL_ROW_KEY) (__£1,439__): __Sign-Ups__ experienced a substantial drop of __-30%__ from baseline of 3,345, and __Sign-Up Conv Rate__ tanked __-27%__ from projection of 4.1%.

2 campaign with mixed trend. Examples:
  * [My Campaign 4](COL_ROW_KEY) (__£4,439__): __Sign-Ups__ experienced a substantial increase of __+20%__ from baseline of 2,000, but __Sign-Up Conv Rate__ dropped __-15%__ from projection of 3.7%.
'''
    for country in countries
)

dataframe_countries = "\n\n".join(
f'''
DataFrame of "{country}" Campaigns with at least one large anomalous deviation:
[[[
{get_anomaly_results_for_prompt_string(highlight_campaigns.loc[highlight_campaigns[COL_COUNTRY] == country].head(20))}
]]]
'''
    for country in countries
)

emailSummaryPrompt = f'''
You are a helpful pay-per-click marketing data analyst with deep understanding of common performance issues.

{dataframe_categories}

{dataframe_countries}


DataFrame of the Most Prominent Metric Trend in Campaign Category:
[[[
{most_prominent_metric_trend_for_category.head(1).to_string(index=False)}
]]]


DataFrame of the Most Prominent Metric Trend in Country:
[[[
{most_prominent_metric_trend_for_country.head(1).to_string(index=False)}
]]]


DataFrame of Campaign Count Grouped by Campaign Category and Trend:
[[[
{campaign_category_trend_count.to_string(index=False)}
]]]

DataFrame of Campaign Count Grouped by Country and Trend:
[[[
{country_trend_count.to_string(index=False)}
]]]


Interpret above data using these guidelines:
* Refer to '{RPT_COL_CONV}' as 'Sign-Ups'
* Refer to '{COL_CONV_RATE}' as 'Sign-Up Conv Rate'
* '{RPT_COL_CONV + '_' + COL_FORECAST}' is the baseline value for '{RPT_COL_CONV}'.
* '{COL_CONV_RATE + '_' + COL_FORECAST}' is the baseline value for '{COL_CONV_RATE}'.
* '{RPT_COL_CONV + '_' + COL_DEVIATION_PCT}' is the percent deviation from baseline value for '{RPT_COL_CONV}'
* '{COL_CONV_RATE + '_' + COL_DEVIATION_PCT}' is the percent deviation from baseline value for '{COL_CONV_RATE}'
* Positive '{RPT_COL_CONV + '_' + COL_DEVIATION_PCT}' along with positive '{COL_CONV_RATE + '_' + COL_DEVIATION_PCT}' indicates an overall positive trend
* Negative '{RPT_COL_CONV + '_' + COL_DEVIATION_PCT}' along with negative '{COL_CONV_RATE + '_' + COL_DEVIATION_PCT}' indicates an overall negative trend
* Positive '{RPT_COL_CONV + '_' + COL_DEVIATION_PCT}' along with negative '{COL_CONV_RATE + '_' + COL_DEVIATION_PCT}' indicates an overall mixed trend
* Negative '{RPT_COL_CONV + '_' + COL_DEVIATION_PCT}' along with positive '{COL_CONV_RATE + '_' + COL_DEVIATION_PCT}' also indicates an overall mixed trend
* Campaigns are devided into groups via '{RPT_COL_CAMPAIGN_CATEGORY}'. The groups are: {', '.join(categories)}.


Please summarize the weekly Total AMU Signups using a professional tone, using these rules:
* include Campaign Name from column '{RPT_COL_CAMPAIGN}'
* if column '{COL_MOST_UPWARD_OUTLIER_METRIC}' contains name of a metric, summarize the deviation percentage and baseline value for that metric
* if column '{COL_MOST_DOWNWARD_OUTLIER_METRIC}' contains name of a metric, summarize the deviation percentage and baseline value for that metric

Generate output in Markdown, using this format:

# Change Summary for Last Week: {RPT_COL_CONV.replace(' Conv.', '')}

Note: Weekly metrics are from {report_date_start.strftime('%b %d, %Y')} to {report_date_end.strftime('%b %d, %Y')}, inclusive. Change metrics are compared to a recent 4-week baseline.

# write a succint headline ending with an action verb that highlights Most Prominent Metric Trend for Campaign Category ('{RPT_COL_CAMPAIGN_CATEGORY}').
EXAMPLE: # Brand Conv Rate Dives

{prompt_categories}

# write a succint headline ending with an action verb that highlights Most Prominent Metric Trend for for Country ('{COL_COUNTRY}').
EXAMPLE: # US Sign-Ups Retreat

{prompt_countries}

'''




# blank out prompt if there is no actual output
if highlight_campaigns.empty:
    emailSummaryPrompt = ''

print(f"Prompt has ({len(emailSummaryPrompt)} chars)")

#### email output

outputDf = get_anomaly_results_for_human_dataframe(highlight_campaigns)

debugDf = highlight_campaigns.reset_index().round(2)
debugDf.columns = ['{}_{}'.format(col[0], col[1]) if col[1] else col[0] for col in debugDf.columns]

print(f"OutputDf has {outputDf.shape[0]} rows")

#### deep link
# viewId: 14788881 

config = {
    "email_marinone_links": [
        {
            "index": index + 1,
            "view_id": 14968079,
            "filters": [f"campaign_name:{campaign_name}"]
        }
        for index, campaign_name in enumerate(outputDf[RPT_COL_CAMPAIGN])
    ]
} if not outputDf.empty else {}

# Create the final JSON string by combining all rows
mscripts_output_config  = str(config).replace("'", '"')

print(f"mscript_output_config: {mscripts_output_config}")

## local debug
if local_dev:
    with open('prompt.txt', 'w') as file:
        file.write(emailSummaryPrompt)
        print(f"Local Dev: Prompt written to: {file.name}")

    output_filename = 'outputDf.csv'
    outputDf.to_csv(output_filename, index=False)
    print(f"Local Dev: Output written to: {output_filename}")

    debug_filename = 'debugDf.csv'
    debugDf.to_csv(debug_filename, index=False)
    print(f"Local Dev: Debug written to: {debug_filename}")

else:
    print("====== Prompt =====")
    print(emailSummaryPrompt)
    print("===========")
Post generated on 2025-03-11 01:25:51 GMT
15 Mar 2024
« Script 817: SBA Budget Staging via GSheets Script 821: Total Pub Cost & Pub Cost Percentage Column »
MarinOne Scripts Creator's Corner