Script 1089: latency

Purpose

Python script to apply a latency curve adjustment to GEM data.

To Elaborate

This Python script is designed to ingest GEM data and apply a latency curve adjustment to the data. The latency curve adjustment is based on a set of parameters defined in a Google document. The script then outputs the adjusted data.

Walking Through the Code

  1. The script starts by defining constants for device types and match types.
  2. The current date is obtained using the datetime.datetime.now() function.
  3. The primary data source is assigned to the variable inputDf.
  4. An optional filter is applied to inputDf to only include rows related to a specific conversion type.
  5. The ‘Creative ID’ column in inputDf_filtered is converted to a string and any trailing “.0” is removed from each value.
  6. The ‘Group ID’ column in inputDf_filtered is converted to a string and any trailing “.0” is removed from each value.
  7. The ‘Campaign ID’ column in inputDf_filtered is converted to a string and any trailing “.0” is removed from each value.
  8. The ‘Product ID’ column in inputDf_filtered is converted to a string and any trailing “.0” is removed from each value.
  9. The ‘Date’ column in inputDf_filtered is converted to a datetime series with timezone information.
  10. The today variable is converted to a pandas Timestamp with the same timezone as the ‘Date’ column in inputDf_filtered.
  11. The difference in days between today_timestamp and inputDf_filtered['DateStamp'] is calculated and assigned to Date_diff.
  12. Parameters for the latency curve adjustment are defined.
  13. The latency factor is calculated based on the parameters and the Date_diff.
  14. The latency factor is limited to a maximum of 1 if the Date_diff is greater than 29.
  15. The latency factor is limited to a maximum of 1 if it is greater than 1.
  16. Output columns and initial values are defined.
  17. The output dataframe is populated with the adjusted data.
  18. Duplicates are removed from the output dataframe.
  19. Custom output columns are allowed if skip_output_validations is set to True.
  20. The tableize function is used to print the first few rows of the inputDf_filtered dataframe.

Vitals

  • Script ID : 1089
  • Client ID / Customer ID: 1306927637 / 64302
  • Action Type: Revenue Upload
  • Item Changed: None
  • Output Columns: Date, Keyword ID, Creative ID, Group ID, Campaign ID, Product ID, Target IDs, Currency, Device, Comments, Google Conv.Code New (Click Date) Conv, Google Conv.Code New (Click Date) Rev, Google Conv.Code New - Latency Adjusted (Click Date) Conv, Google Conv.Code New - Latency Adjusted (Click Date) Rev
  • Linked Datasource: FTP/Email Feed
  • Reference Datasource: None
  • Owner: Tom McCaughey (tmccaughey@marinsoftware.com)
  • Created by Tom McCaughey on 2024-05-14 12:42
  • Last Updated by Tom McCaughey on 2024-05-14 12:42
> See it in Action

Python Code

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
##
## name: latency test
## description: ingest GEM data and apply a latency curve adjustment as per https://docs.google.com/document/d/1MactW0CdcDjQ7-vDbaex2KOGStiB_vnypaFsXI2dg-o/edit?usp=sharing
##  
## 
## author: Tom McCaughey
## created: 2024-04-30
## 

DEVICE = {
  'MOBILE': 'm',
  'DESKTOP': 'c',
  'TABLET': 't',
}
MATCH_TYPE = {
  'EXACT': 'exact',
  'PHRASE': 'phrase',
  'BROAD': 'broad',
}
today = datetime.datetime.now(CLIENT_TIMEZONE).date()

# primary data source and columns
inputDf = dataSourceDict["1"]

#optional filter to only include the source rows relating to the conversion type we want
inputDf_filtered = inputDf[inputDf['Conversion Type'] == 'GoogleConvCodeNew'].copy()

# Convert 'Creative ID' column to string
inputDf_filtered['Creative ID'] = inputDf_filtered['Creative ID'].astype(str)

# Remove ".0" from the end of each value
inputDf_filtered['Creative ID'] = inputDf_filtered['Creative ID'].str.replace('\.0$', '', regex=True)

# Convert 'Group ID' column to string
inputDf_filtered['Group ID'] = inputDf_filtered['Group ID'].astype(str)
# Remove ".0" from the end of each value
inputDf_filtered['Group ID'] = inputDf_filtered['Group ID'].str.replace('\.0$', '', regex=True)

# Convert 'Campaign ID' column to string
inputDf_filtered['Campaign ID'] = inputDf_filtered['Campaign ID'].astype(str)
# Remove ".0" from the end of each value
inputDf_filtered['Campaign ID'] = inputDf_filtered['Campaign ID'].str.replace('\.0$', '', regex=True)

# Convert 'Product ID' column to string
inputDf_filtered['Product ID'] = inputDf_filtered['Product ID'].astype(str)
# Remove ".0" from the end of each value
inputDf_filtered['Product ID'] = inputDf_filtered['Product ID'].str.replace('\.0$', '', regex=True)

# Ensure 'inputDf_filtered['Date']' is a datetime Series with timezone information
inputDf_filtered['DateStamp'] = pd.to_datetime(inputDf_filtered['Date']).dt.tz_localize(CLIENT_TIMEZONE)


# Convert 'today' to a pandas Timestamp with the same timezone as 'inputDf_filtered['Date']'
today_timestamp = pd.Timestamp(today, tz=CLIENT_TIMEZONE)
Date_diff = (today_timestamp - inputDf_filtered['DateStamp']).dt.days

#params for latency curve as per - https://docs.google.com/document/d/1MactW0CdcDjQ7-vDbaex2KOGStiB_vnypaFsXI2dg-o/edit

latency_D = 1
latency_C = 1
latency_B = 1

calc_latency_factor = latency_D/(1+(Date_diff/latency_C)**(-latency_B))

#limit calc_latency_factor to a max of 1 based on days difference
calc_latency_factor[Date_diff > 29] = 1

calc_latency_factor[calc_latency_factor > 1] = 1

# output columns and initial values
REVENUE_COL_DATE = 'Date'
REVENUE_COL_CURRENCY = 'Currency'
REVENUE_COL_DEVICE = 'Device'
REVENUE_COL_KEYWORD_ID = 'Keyword ID'
REVENUE_COL_CREATIVE_ID = 'Creative ID'
##REVENUE_COL_KEYWORD = 'Keyword'
##REVENUE_COL_MATCH_TYPE = 'Match Type'
REVENUE_COL_GROUP_ID = 'Group ID'
REVENUE_COL_CAMPAIGN_ID = 'Campaign ID'
REVENUE_COL_PRODUCT_ID = 'Product ID'
##REVENUE_COL_SL_ID = 'SL ID'
##REVENUE_COL_ADEXTENSION_ID = 'AdExtension ID'
REVENUE_COL_TARGET_IDS = 'Target IDs'
REVENUE_COL_COMMENTS = 'Comments'
REVENUE_COL_GOOGLE_CONVCODE_NEW_CLICK_DATE_CONV = 'Google Conv.Code New Conv'
REVENUE_COL_GOOGLE_CONVCODE_NEW_CLICK_DATE_REV = 'Google Conv.Code New Rev'
REVENUE_COL_GOOGLE_CONVCODE_NEW__LATENCY_ADJUSTED_CLICK_DATE_CONV = 'Google Conv.Code New - Latency Adjusted (Click Date) Conv'
REVENUE_COL_GOOGLE_CONVCODE_NEW__LATENCY_ADJUSTED_CLICK_DATE_REV = 'Google Conv.Code New - Latency Adjusted (Click Date) Rev'
outputDf[REVENUE_COL_DATE] = inputDf_filtered['Date']
outputDf[REVENUE_COL_CURRENCY] = inputDf_filtered['Currency']
outputDf[REVENUE_COL_DEVICE] = inputDf_filtered['Device']
outputDf[REVENUE_COL_KEYWORD_ID] = ""
outputDf[REVENUE_COL_CREATIVE_ID] = inputDf_filtered['Creative ID']
##outputDf[REVENUE_COL_KEYWORD] = inputDf_filtered['Keyword']
##outputDf[REVENUE_COL_MATCH_TYPE] = inputDf['Match Type']
outputDf[REVENUE_COL_GROUP_ID] = inputDf_filtered['Group ID']
outputDf[REVENUE_COL_CAMPAIGN_ID] = inputDf_filtered['Campaign ID']
outputDf[REVENUE_COL_PRODUCT_ID] = inputDf_filtered['Product ID']
##outputDf[REVENUE_COL_SL_ID] = inputDf_filtered['SL ID']
##outputDf[REVENUE_COL_ADEXTENSION_ID] = inputDf_filtered['AdExtension ID']
outputDf[REVENUE_COL_TARGET_IDS] = inputDf_filtered['Target IDs']
outputDf[REVENUE_COL_COMMENTS] = "Date difference = " + Date_diff.astype(str) +", calc_latency_factor is " + calc_latency_factor.astype(str)
outputDf[REVENUE_COL_GOOGLE_CONVCODE_NEW_CLICK_DATE_CONV] = inputDf_filtered['Conversions']
outputDf[REVENUE_COL_GOOGLE_CONVCODE_NEW_CLICK_DATE_REV] = inputDf_filtered['Revenue']
outputDf[REVENUE_COL_GOOGLE_CONVCODE_NEW__LATENCY_ADJUSTED_CLICK_DATE_CONV] = (inputDf_filtered['Conversions'] /calc_latency_factor)
outputDf[REVENUE_COL_GOOGLE_CONVCODE_NEW__LATENCY_ADJUSTED_CLICK_DATE_REV] = (inputDf_filtered['Revenue'] /calc_latency_factor)

#remove duplicates
outputDf = outputDf.drop_duplicates()

#allow for custom output columns
skip_output_validations = True

# user code start here
print(tableize(inputDf_filtered.head()))

Post generated on 2024-05-15 07:44:05 GMT

comments powered by Disqus