Script 1293: Extract Title from Group name

Purpose:

The Python script extracts a title from a “Group” column using a regex pattern and updates the “Title” column accordingly.

To Elaborate

The Python script processes a DataFrame by extracting the leading part of each “Group” value and assigning it to the “Title” column. A regular expression captures everything before the first underscore or before the text “ - Season”, whichever comes first; rows whose “Group” value contains neither get an empty title. The script then normalizes possessives by replacing "'S" with "'s" in the “Title” column, and sets the “g-check” column to “YES” to mark each row as processed. The result is a cleaned “Title” value for each ad group, ready for bulk upload.
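
To make the pattern's behavior concrete, here is a minimal sketch that applies the script's regex to a few plain strings. The group names and the helper name extract_title are invented for illustration and do not come from the client's data or the script itself.

```python
import re

# Pattern copied from the script: capture lazily from the start of the string
# up to the first "_" or up to the position just before " - Season".
REGEX_PATTERN = r'^(.*?)(?:[_]|(?= - Season))'

def extract_title(group_name):
    """Return the captured title, or "" when the pattern does not match."""
    match = re.match(REGEX_PATTERN, group_name)
    return match.group(1) if match else ""

# Hypothetical group names, invented for illustration
examples = [
    "Show A_Exact",              # stops at the underscore
    "Show B - Season 4",         # stops before " - Season"
    "Show C_Promo - Season 2",   # the underscore comes first, so it wins
    "Generic Keywords",          # neither delimiter: no match
]
for name in examples:
    print(repr(name), "->", repr(extract_title(name)))
```

Because the capture group is lazy, the earliest delimiter in the string decides where the title ends. A name with neither delimiter produces an empty title rather than the full group name.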

Walking Through the Code

  1. Data Source Initialization:
    • The script begins by reading the primary data source into inputDf and defining constants for the column names it refers to: “Group”, “Campaign”, “Account”, “Title”, “Impr.”, “Campaign Status”, “Group Status”, and “g-check”.
  2. Function Definition:
    • A function named process is defined to handle the transformation of the input DataFrame.
    • Within this function, a regex pattern r'^(.*?)(?:[_]|(?= - Season))' is specified to extract the title from the “Group” column.
  3. DataFrame Processing:
    • The input DataFrame is copied to outputDf to preserve the original data.
    • The regex pattern is applied to the “Group” column to extract the title, which is then assigned to the “Title” column.
    • The script replaces any occurrence of "'S" with "'s" in the “Title” column to ensure consistency in text formatting.
  4. Final Adjustments:
    • The “g-check” column is set to “YES” for all rows, indicating that the processing step has been completed.
    • The processed DataFrame is printed for debugging purposes.
  5. Execution:
    • The process function is called with inputDf as an argument, and the resulting DataFrame is stored in outputDf.
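
The steps above can be sketched end to end on a small in-memory DataFrame. This is an illustration under stated assumptions rather than the script itself: the sample rows are invented, pandas is assumed to be installed, and Series.str.extract stands in for the row-by-row matching.

```python
import pandas as pd

# Pattern copied from the script
REGEX_PATTERN = r'^(.*?)(?:[_]|(?= - Season))'

# Hypothetical rows standing in for the platform-provided inputDf
sample_df = pd.DataFrame({
    "Group": ["The Host'S Cut_Exact", "Show B - Season 4", "Generic Keywords"],
    "Title": ["", "", ""],
    "g-check": ["", "", ""],
})

# Work on a copy so the input rows stay untouched
result_df = sample_df.copy()

# Extract capture group 1; rows with no match come back as NaN, so fill with ""
result_df["Title"] = (
    result_df["Group"].str.extract(REGEX_PATTERN, expand=False).fillna("")
)

# Normalize possessives: "'S" becomes "'s" (literal replacement, not a regex)
result_df["Title"] = result_df["Title"].str.replace("'S", "'s", regex=False)

# Mark every row as processed
result_df["g-check"] = "YES"

print(result_df)
```

Copying first means the original rows can still be compared against the result when debugging.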

Vitals

  • Script ID: 1293
  • Client ID / Customer ID: 1306912147 / 69058
  • Action Type: Bulk Upload
  • Item Changed: AdGroup
  • Output Columns: Account, Campaign, Group, Title
  • Linked Datasource: M1 Report
  • Reference Datasource: None
  • Owner: Jeremy Brown (jbrown@marinsoftware.com)
  • Created by Jeremy Brown on 2024-07-26 10:55
  • Last Updated by Jeremy Brown on 2024-07-31 11:49

Python Code

##
## Name: Extract Title from Group name
## description: Use the regex pattern = r'^(.*?)(?:[_]|(?= - Season))' to extract the title from the "Group" column and insert into `Title` dimension column.
## 
## author: Jeremy Brown 
## created: 2024-07-26
## 

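import datetime
import re

# CLIENT_TIMEZONE and dataSourceDict are not defined in this script;
# they are expected to be supplied by the script environment.
today = datetime.datetime.now(CLIENT_TIMEZONE).date()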

# primary data source and columns
inputDf = dataSourceDict["1"]
RPT_COL_GROUP = 'Group'
RPT_COL_CAMPAIGN = 'Campaign'
RPT_COL_ACCOUNT = 'Account'
RPT_COL_TITLE = 'Title'
RPT_COL_IMPR = 'Impr.'
RPT_COL_CAMPAIGN_STATUS = 'Campaign Status'
RPT_COL_GROUP_STATUS = 'Group Status'
RPT_COL_CHECK = 'g-check'

# Function to process the input DataFrame and return the output DataFrame
def process(inputDf):
    # Define the regex pattern to extract the title from the "Group" column
    regex_pattern = r'^(.*?)(?:[_]|(?= - Season))'
    
    # Copy the input DataFrame to the output DataFrame
    outputDf = inputDf.copy()

    # Extract the title (capture group 1) and assign it to the "Title" column.
    # Rows with no match or a missing "Group" value come back as NaN, so fill them with "".
    outputDf[RPT_COL_TITLE] = (
        outputDf[RPT_COL_GROUP].str.extract(regex_pattern, expand=False).fillna("")
    )
    
    # Convert any occurrence of "'S" to "'s" in the 'Title' column (literal match, not a regex)
    outputDf[RPT_COL_TITLE] = outputDf[RPT_COL_TITLE].str.replace("'S", "'s", regex=False)

    # Set the 'g-check' column to "YES"
    outputDf[RPT_COL_CHECK] = "YES"
    
    # Print the changed data for debugging
    print(outputDf)

    return outputDf

# Trigger the main process
outputDf = process(inputDf)

Post generated on 2025-03-11 01:25:51 GMT
