Script 1295: Extract Search Phrase from Keyword

Purpose

The Python script processes a DataFrame to extract and replace search phrases from keywords based on a given title, creating a new column with the modified keyword.

To Elaborate

The script is designed to process a DataFrame containing marketing data, specifically focusing on keywords and titles. It aims to identify and extract search phrases from the ‘Keyword’ column by matching them with the ‘Title’ column. The script cleans both the keyword and title by removing special characters and converting them to lowercase for accurate comparison. If a match is found, the script replaces the matching part of the keyword with a placeholder ‘[TITLE]’. This processed keyword is stored in a new column called ‘Verb’. The script ensures that the original keyword’s special characters and casing are preserved in the final output. The processed data is then returned as a new DataFrame, which includes the original data along with the newly created ‘Verb’ column.

Walking Through the Code

  1. Initialization: The script begins by defining a function clean_string to remove special characters and convert strings to lowercase. This function is used to prepare the ‘Keyword’ and ‘Title’ for comparison.

  2. Processing Function: The process function is defined to handle the main logic. It initializes an empty list to store processed rows and iterates over each row in the input DataFrame.

  3. Data Cleaning and Matching:
    • For each row, the ‘Keyword’ and ‘Title’ are cleaned using the clean_string function.
    • The script checks if the cleaned title (with and without spaces) is present in the cleaned keyword.
    • If a match is found, the matching part of the keyword is replaced with ‘[TITLE]’ using regular expressions.
  4. Row Construction: A new dictionary is created for each row, containing the original data and the newly created ‘Verb’ column. The ‘k-check’ column is set to “YES”.

  5. DataFrame Creation: The list of dictionaries is converted into a new DataFrame, which includes the processed data.

  6. Output: The processed DataFrame is returned, containing the original columns and the newly added ‘Verb’ column.

Vitals

  • Script ID : 1295
  • Client ID / Customer ID: 1306912147 / 69058
  • Action Type: Bulk Upload
  • Item Changed: Keyword
  • Output Columns: Account, Campaign, Group, Keyword, Match Type, Verb
  • Linked Datasource: M1 Report
  • Reference Datasource: None
  • Owner: Jeremy Brown (jbrown@marinsoftware.com)
  • Created by Jeremy Brown on 2024-07-26 13:04
  • Last Updated by Jeremy Brown on 2024-07-31 12:07
> See it in Action

Python Code

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
##
## Name: Extract Search Phrase from Keyword
## Description:
    ## 1. Initialize: Sets up an empty list to collect processed rows.
    ## 2. Special Character Removal: Uses a regular expression to remove specified special characters from the Title.
    ## 3. Iterate Rows: Loops through each row in the input DataFrame.
    ## 4. Title Processing: Converts the Title to lowercase and removes spaces and special characters for comparison.
    ## 5. Keyword Matching: Checks if the cleaned Title (with and without spaces) is present in the Keyword. Replaces the matching part with [TITLE].
    ## 6. Construct Rows: Creates a dictionary for each row with the processed Verb and appends it to the list.
    ## 7. Create DataFrame: Converts the list of dictionaries into a new DataFrame.
    ## 8. Print for Debugging: Outputs the changed DataFrame for verification.
    ## 9. Return: Returns the processed DataFrame.
    ## FYI This code also ensures that special characters in the Title are removed before matching and processing the Keyword.

## author: Jeremy Brown
## created: 2024-07-26 

today = datetime.datetime.now(CLIENT_TIMEZONE).date()

# primary data source and columns
inputDf = dataSourceDict["1"]
RPT_COL_KEYWORD = 'Keyword'
RPT_COL_MATCH_TYPE = 'Match Type'
RPT_COL_GROUP = 'Group'
RPT_COL_CAMPAIGN = 'Campaign'
RPT_COL_ACCOUNT = 'Account'
RPT_COL_TITLE = 'Title'
RPT_COL_VERB = 'Verb'
RPT_COL_STATUS = 'Status'
RPT_COL_IMPR = 'Impr.'
RPT_COL_CHECK = 'k-check'

def clean_string(s):
    """
    Clean the input string by removing special characters and converting to lowercase.
    """
    special_chars = r"[\\(\)\*,\/\\|]"
    return re.sub(special_chars, "", s).lower()

def process(inputDf):
    """
    Process the input DataFrame to populate the output DataFrame with specific columns
    and create a new 'Verb' column based on the 'Keyword' and 'Title'.
    """
    # Initialize a list to collect new rows
    rows = []
    
    # Iterate through each row in the input DataFrame
    for idx, row in inputDf.iterrows():
        original_keyword = row[RPT_COL_KEYWORD]
        keyword_lower = original_keyword.lower()
        title = str(row[RPT_COL_TITLE])  # Ensure Title is a string
        
        # Clean both keyword and title for comparison
        cleaned_keyword = clean_string(keyword_lower)
        cleaned_title = clean_string(title)
        cleaned_title_no_spaces = cleaned_title.replace(" ", "")
        
        # Check if the cleaned title or title without spaces is in the cleaned keyword
        if cleaned_title in cleaned_keyword:
            # Use regex to replace the cleaned title in the original keyword, ignoring case
            verb = re.sub(re.escape(cleaned_title), "[TITLE]", keyword_lower, flags=re.IGNORECASE).strip()
        elif cleaned_title_no_spaces in cleaned_keyword:
            # Use regex to replace the cleaned title without spaces in the original keyword, ignoring case
            verb = re.sub(re.escape(cleaned_title_no_spaces), "[TITLE]", keyword_lower, flags=re.IGNORECASE).strip()
        else:
            verb = original_keyword  # No match found; use the original keyword

        # Replace with the original keyword's special characters and casing preserved
        verb = re.sub(re.escape(clean_string(original_keyword)), verb, original_keyword, flags=re.IGNORECASE)

        # Capitalize the first letter of the verb if the original keyword started with a capital letter
        if original_keyword[0].isupper():
            verb = verb.capitalize()

        # Construct the new row as a dictionary
        new_row = {
            RPT_COL_KEYWORD: row[RPT_COL_KEYWORD],
            RPT_COL_MATCH_TYPE: row[RPT_COL_MATCH_TYPE],
            RPT_COL_GROUP: row[RPT_COL_GROUP],
            RPT_COL_CAMPAIGN: row[RPT_COL_CAMPAIGN],
            RPT_COL_ACCOUNT: row[RPT_COL_ACCOUNT],
            RPT_COL_TITLE: row[RPT_COL_TITLE],
            RPT_COL_VERB: verb,
            RPT_COL_CHECK: "YES"  # Set 'k-check' to "YES"
        }

        # Add the new row to the list
        rows.append(new_row)
    
    # Create a DataFrame from the list of rows
    df_out = pd.DataFrame(rows, columns=[
        RPT_COL_KEYWORD,
        RPT_COL_MATCH_TYPE,
        RPT_COL_GROUP,
        RPT_COL_CAMPAIGN,
        RPT_COL_ACCOUNT,
        RPT_COL_TITLE,
        RPT_COL_VERB,
        RPT_COL_CHECK
    ])
    
    # Print the data changed for debugging purposes
    print("Data changed:")
    print(df_out)
    
    return df_out

# Trigger the main process
outputDf = process(inputDf)

Post generated on 2024-11-27 06:58:46 GMT

comments powered by Disqus