Script 1293: Extract Title from Group name
Purpose
The Python script extracts a title from the “Group” column using a regex pattern and inserts it into the “Title” column.
To Elaborate
The script copies the input DataFrame, extracts part of the text in the “Group” column, and writes it to the “Title” column. The extraction uses a regular expression that captures everything before the first underscore or before the phrase “ - Season”, whichever comes first. Group names that contain neither delimiter produce an empty title. The script then standardizes the capitalization of possessive forms by converting every occurrence of “'S” to “'s” in the extracted titles. Finally, it sets the ‘g-check’ column to “YES” on every row, whether or not a title was found, to indicate that the row has been processed. This is useful for data cleaning and preparation where a structured value has to be pulled out of a longer naming convention.
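The matching rules described above can be tried on a few strings outside of a DataFrame. The following is a minimal sketch: the pattern is copied from the script, while the helper name `extract_title` and the sample group names are hypothetical and exist only for illustration.

```python
import re

# Pattern copied from the script: capture everything up to the first "_"
# or up to the point where " - Season" begins.
TITLE_PATTERN = re.compile(r'^(.*?)(?:[_]|(?= - Season))')


def extract_title(group_name: str) -> str:
    """Return the title part of a group name, or "" when no delimiter is found."""
    match = TITLE_PATTERN.match(group_name)
    title = match.group(1) if match else ""
    # Normalize the possessive: "Nina'S" becomes "Nina's"
    return title.replace("'S", "'s")


# Hypothetical group names, invented for illustration only
samples = [
    "Harbor Lights_US_Brand",
    "Nina'S Kitchen - Season 3",
    "Standalone Group",
]
for name in samples:
    print(repr(name), "->", repr(extract_title(name)))
```

The third sample contains neither an underscore nor “ - Season”, so the pattern does not match and the helper returns an empty string.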
Walking Through the Code
- Data Preparation
  - The script begins by defining the primary data source and relevant column names, such as ‘Group’, ‘Campaign’, ‘Account’, and ‘Title’.
  - It then copies the input DataFrame to an output DataFrame to preserve the original data.
- Title Extraction
  - A regular expression pattern, r'^(.*?)(?:[_]|(?= - Season))', is defined to extract the title from the “Group” column.
  - The script applies this regex pattern to each entry in the “Group” column, extracting the title and assigning it to the “Title” column.
- Text Standardization
  - The script replaces any occurrence of “'S” with “'s” in the “Title” column to ensure consistent capitalization of possessive forms.
- Processing Indicator
  - It sets the ‘g-check’ column to “YES” for all rows, indicating that the processing has been completed.
- Output
  - The processed DataFrame is returned, containing the newly extracted titles and updated ‘g-check’ status.
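The same steps can also be written with pandas' vectorized string methods instead of a row-by-row `apply`. This is a sketch of an alternative, not the script's own approach: the function name `add_titles` and the sample rows are hypothetical, and it assumes the column names listed above. One practical difference is that missing “Group” values come out as empty titles rather than raising an error.

```python
import pandas as pd

# Pattern copied from the script
TITLE_PATTERN = r'^(.*?)(?:[_]|(?= - Season))'


def add_titles(input_df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of input_df with 'Title' and 'g-check' filled in."""
    # Work on a copy so the caller's DataFrame is left unchanged
    output_df = input_df.copy()
    # str.extract returns the first capture group, or NaN where nothing matches
    titles = output_df["Group"].str.extract(TITLE_PATTERN, expand=False)
    # Empty string for non-matches, then a literal (non-regex) possessive fix
    output_df["Title"] = titles.fillna("").str.replace("'S", "'s", regex=False)
    # Mark every row as processed
    output_df["g-check"] = "YES"
    return output_df


# Hypothetical rows, invented for illustration only
sample_df = pd.DataFrame(
    {"Group": ["Harbor Lights_US_Brand", "Nina'S Kitchen - Season 3", "Standalone Group", None]}
)
result = add_titles(sample_df)
print(result)
```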
Vitals
- Script ID : 1293
- Client ID / Customer ID: 1306912147 / 69058
- Action Type: Bulk Upload
- Item Changed: AdGroup
- Output Columns: Account, Campaign, Group, Title
- Linked Datasource: M1 Report
- Reference Datasource: None
- Owner: Jeremy Brown (jbrown@marinsoftware.com)
- Created by Jeremy Brown on 2024-07-26 10:55
- Last Updated by Jeremy Brown on 2024-07-31 11:49
Python Code
##
## Name: Extract Title from Group name
## description: Use the regex pattern = r'^(.*?)(?:[_]|(?= - Season))' to extract the title from the "Group" column and insert into `Title` dimension column.
##
## author: Jeremy Brown
## created: 2024-07-26
##
import datetime
import re

# CLIENT_TIMEZONE and dataSourceDict are not defined in this listing;
# they are expected to be provided by the environment that runs the script.
today = datetime.datetime.now(CLIENT_TIMEZONE).date()

# primary data source and columns
inputDf = dataSourceDict["1"]
RPT_COL_GROUP = 'Group'
RPT_COL_CAMPAIGN = 'Campaign'
RPT_COL_ACCOUNT = 'Account'
RPT_COL_TITLE = 'Title'
RPT_COL_IMPR = 'Impr.'
# Distinct names so the second status constant no longer overwrites the first
RPT_COL_CAMPAIGN_STATUS = 'Campaign Status'
RPT_COL_GROUP_STATUS = 'Group Status'
RPT_COL_CHECK = 'g-check'

# Regex pattern to extract the title from the "Group" column
REGEX_PATTERN = r'^(.*?)(?:[_]|(?= - Season))'


# Return the title portion of a single group name, or "" if there is none
def extract_title(group):
    # Non-string values (for example NaN) cannot be matched
    if not isinstance(group, str):
        return ""
    match = re.match(REGEX_PATTERN, group)
    return match.group(1) if match else ""


# Function to process the input DataFrame and return the output DataFrame
def process(inputDf):
    # Copy the input DataFrame to the output DataFrame
    outputDf = inputDf.copy()

    # Extract the title using the regex pattern and assign it to the "Title" column
    outputDf[RPT_COL_TITLE] = outputDf[RPT_COL_GROUP].apply(extract_title)

    # Convert any occurrence of "'S" to "'s" in the 'Title' column (literal match, not a regex)
    outputDf[RPT_COL_TITLE] = outputDf[RPT_COL_TITLE].str.replace("'S", "'s", regex=False)

    # Set the 'g-check' column to "YES"
    outputDf[RPT_COL_CHECK] = "YES"

    # Print the changed data for debugging
    print(outputDf)

    return outputDf


# Trigger the main process
outputDf = process(inputDf)
Post generated on 2024-11-27 06:58:46 GMT