Script 1293: Extract Title from Group name

Purpose

The Python script extracts a title from the “Group” column using a regex pattern and inserts it into the “Title” column.

To Elaborate

The script processes a DataFrame by extracting the leading part of each value in the “Group” column and writing it to the “Title” column. The regular expression captures everything before the first underscore or before the phrase “ - Season”, whichever comes first; if a group name contains neither, the title is left empty. The script also normalizes possessives by replacing “'S” with “'s” in the extracted titles. Finally, it sets the ‘g-check’ column to “YES” on every row, whether or not the pattern matched, to mark the row as processed. This makes the script useful for data cleaning, where a structured value has to be pulled out of a longer name field.
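
To make the pattern's behavior concrete, here is a minimal sketch that applies the script's regex to three group names. The sample names and the `extract_title` helper are made up for illustration and do not come from the client's data.

```python
import re

# Pattern quoted from the script
TITLE_PATTERN = r'^(.*?)(?:[_]|(?= - Season))'


def extract_title(group_name):
    """Return the text before the first "_" or " - Season", or "" if neither occurs."""
    match = re.match(TITLE_PATTERN, group_name)
    return match.group(1) if match else ""


# Hypothetical group names, chosen to cover each branch of the pattern
samples = [
    "Grey'S Anatomy_US_Brand",   # cut at the first underscore
    "The Bear - Season 3",       # cut before " - Season"
    "Standalone Title",          # neither delimiter: no match
]

# Same possessive fix as the script: "'S" becomes "'s"
titles = [extract_title(name).replace("'S", "'s") for name in samples]
print(titles)  # ["Grey's Anatomy", 'The Bear', '']
```

The third sample shows why an empty “Title” can appear in the output: a name without an underscore or “ - Season” does not match at all.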

Walking Through the Code

  1. Data Preparation
    • The script begins by defining the primary data source and relevant column names, such as ‘Group’, ‘Campaign’, ‘Account’, and ‘Title’.
    • It then copies the input DataFrame to an output DataFrame to preserve the original data.
  2. Title Extraction
    • A regular expression pattern r'^(.*?)(?:[_]|(?= - Season))' is defined to extract the title from the “Group” column.
    • The script applies this pattern to each entry in the “Group” column and assigns the captured text to the “Title” column; entries that do not match receive an empty string.
  3. Text Standardization
    • The script replaces any occurrence of “'S” with “'s” in the “Title” column to ensure consistent capitalization of possessive forms.
  4. Processing Indicator
    • It sets the ‘g-check’ column to “YES” for all rows, indicating that the processing has been completed.
  5. Output
    • The processed DataFrame is returned, containing the newly extracted titles and updated ‘g-check’ status.
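
The five steps above can be sketched end to end on a small DataFrame. This is an illustration only: the sample rows, the `input_df` name, and the `extract_title` helper are assumptions, and the platform-provided `dataSourceDict` is replaced by a hand-built frame.

```python
import re

import pandas as pd

TITLE_PATTERN = r'^(.*?)(?:[_]|(?= - Season))'  # pattern quoted from the script

# Hypothetical rows standing in for the linked report
input_df = pd.DataFrame({
    "Account": ["Acct A", "Acct A", "Acct B"],
    "Campaign": ["Camp 1", "Camp 2", "Camp 3"],
    "Group": ["Grey'S Anatomy_US_Brand", "The Bear - Season 3", "Standalone Title"],
})


def extract_title(group_name):
    """Return the captured title, or "" when the pattern does not match."""
    match = re.match(TITLE_PATTERN, group_name)
    return match.group(1) if match else ""


def process(df):
    out = df.copy()                                   # 1. keep the input untouched
    out["Title"] = out["Group"].apply(extract_title)  # 2. extract the title
    out["Title"] = out["Title"].str.replace("'S", "'s", regex=False)  # 3. fix possessives
    out["g-check"] = "YES"                            # 4. mark every row as processed
    return out                                        # 5. return the result


output_df = process(input_df)
print(output_df[["Group", "Title", "g-check"]])
```

Because the function works on a copy, `input_df` still has no “Title” or ‘g-check’ column after the call.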

Vitals

  • Script ID : 1293
  • Client ID / Customer ID: 1306912147 / 69058
  • Action Type: Bulk Upload
  • Item Changed: AdGroup
  • Output Columns: Account, Campaign, Group, Title
  • Linked Datasource: M1 Report
  • Reference Datasource: None
  • Owner: Jeremy Brown (jbrown@marinsoftware.com)
  • Created by Jeremy Brown on 2024-07-26 10:55
  • Last Updated by Jeremy Brown on 2024-07-31 11:49

Python Code

##
## Name: Extract Title from Group name
## description: Use the regex pattern = r'^(.*?)(?:[_]|(?= - Season))' to extract the title from the "Group" column and insert into `Title` dimension column.
## 
## author: Jeremy Brown 
## created: 2024-07-26
## 

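# Modules the script relies on; harmless if the platform already provides them
import datetime
import re

today = datetime.datetime.now(CLIENT_TIMEZONE).date()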

# primary data source and columns
inputDf = dataSourceDict["1"]
RPT_COL_GROUP = 'Group'
RPT_COL_CAMPAIGN = 'Campaign'
RPT_COL_ACCOUNT = 'Account'
RPT_COL_TITLE = 'Title'
RPT_COL_IMPR = 'Impr.'
# Distinct names so the second status column does not overwrite the first
RPT_COL_CAMPAIGN_STATUS = 'Campaign Status'
RPT_COL_GROUP_STATUS = 'Group Status'
RPT_COL_CHECK = 'g-check'

# Function to process the input DataFrame and return the output DataFrame
def process(inputDf):
    # Define the regex pattern to extract the title from the "Group" column
    regex_pattern = r'^(.*?)(?:[_]|(?= - Season))'
    
    # Copy the input DataFrame to the output DataFrame
    outputDf = inputDf.copy()

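    # Extract the title using the regex pattern and assign it to the "Title" column.
    # Non-string values (e.g. NaN) and names that do not match yield an empty string.
    def extract_title(value):
        match = re.match(regex_pattern, value) if isinstance(value, str) else None
        return match.group(1) if match else ""

    outputDf[RPT_COL_TITLE] = outputDf[RPT_COL_GROUP].apply(extract_title)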
    
    # Convert any occurrence of "'S" to "'s" in the 'Title' column (literal match, not a regex)
    outputDf[RPT_COL_TITLE] = outputDf[RPT_COL_TITLE].str.replace("'S", "'s", regex=False)

    # Set the 'g-check' column to "YES"
    outputDf[RPT_COL_CHECK] = "YES"
    
    # Print the changed data for debugging
    print(outputDf)

    return outputDf

# Trigger the main process
outputDf = process(inputDf)
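
For comparison, the same extraction can be written without a per-row function. The sketch below is an alternative, not the script's method: it uses pandas' `Series.str.extract` with the same pattern and fills non-matching or missing entries with an empty string. The sample values are hypothetical.

```python
import pandas as pd

TITLE_PATTERN = r'^(.*?)(?:[_]|(?= - Season))'  # pattern quoted from the script

# Hypothetical group names, including a missing value
groups = pd.Series(
    ["Grey'S Anatomy_US_Brand", "The Bear - Season 3", "Standalone Title", None]
)

titles = (
    groups.str.extract(TITLE_PATTERN, expand=False)  # NaN where there is no match
    .fillna("")                                      # empty title instead of NaN
    .str.replace("'S", "'s", regex=False)            # same possessive fix
)
print(titles.tolist())
```

With `expand=False` and a single capture group, `str.extract` returns a Series, so the result can be assigned straight to a column.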

Post generated on 2024-11-27 06:58:46 GMT
