Regexp 8211 How To Make Preparing Data Easier
Management Summary
We explain the 3 most important and useful RegExp formulas and how you can use them::
- REGEXP_EXTRACT
- REGEXP_MATCH
- REGEXP_CONTAINS
Prepare data with RegExp formulas
Basically, a RegExp formula is made up of the data source, the dimension and the formula itself. For example, Campaign Manager, DV360, and Google Analytics are commonly used data sources for dashboards.
The dimensions come from the data source. “Placement” can be taken from a Campaign Manager data source and “Insertion Order” from a DV360 data source – but mixing does not work. This requires a data blend specifically for the dashboard, or a BigQuery data source that can be accessed again and again.
There are two different ways to create RegExp formulas in Data Studio. Only for a specific table or as another dimension for the entire dashboard.
Create a RegExp formula in Data Studio:1.: For the selected table / 2.: For the dashboard
To use RegExp formulas, we still need a basic understanding of the different characters that can be used to create the formula. There is stillsome more, but here the most important ones are briefly explained:
The dot and the asterisk
To replace words and numbers in a RegExp formula without having to specify every possible combination, dot-asterisk ( .* ) can be used. The dot ( . ) is a symbol for a character in RegExp. So it stands for letters, numbers and special characters that can be used in the name.
The asterisk ( * ) indicates that there is a longer number of characters – thus covering an entire word. Dot-Asterisk replaces the characters in a formula until it encounters the next character specified in the RegExp formula. Therefore, it is common to use hyphae ( – ) or underscores ( _ ) in terms to simplify working with RegExp formulas.
For example, for
Awareness_Subject
Engagement_Subject
not two formulas can be created, but only one
.*_Subject
In this case, the dot asterisk replaces both “awareness” and “engagement”.
The pipe
Pipes ( | ) are used in RegExp formulas to specify two words at the same time in a formula. For example, several formats can be taken together:
(HPA|MediumRectangle)
The brackets and quotation marks
Brackets are used in two different ways in RegExp formulas. On the one hand, every RegExp formula needs a round bracket ( ( & ) ) at the beginning and end of the function to delimit it.
The second function of the parentheses is to highlight part of the dimension name. More about this in the examples for REGEXP_EXTRACT and REGEXP_MATCH
The quotation marks (“ “) are placed at the beginning and end of the text used in the RegExp formula. Both for the name of the dimension and for the self-selected text that is specified in the formula.
The REGEXP data set
The following data set is used for the following examples:
Display_Awareness_HPA_Cat
Display_Awareness_HPA_Dog
Display_Awareness_MediumRectangle_Cat
Display_Awareness_MediumRectangle_Dog
Display_Engagement_HPA_Cat
Display_Engagement_HPA_Dog
Display_Engagement_MediumRectangle_Cat
Display_Engagement_MediumRectangle_Dog
From this, information can be drawn about the channel (display or video), the funnel level (awareness or engagement), the format (HPA, medium rectangle or preroll) and the subject (cat or dog).
In order to use RegExp formulas, it is important that the naming of the dimensions is consistent and correct.
REGEXP_EXTRACT
The EXTRACT formula is used to pull individual words from the dimension. For example, the funnel stage – awareness or engagement – can be taken from the data set:
REGEX_EXTRACT (Placement,(Awareness|Engagement))
Result:
Commitment
Awareness
Since there is both awareness and engagement, these are summarized with a pipe in the formula. The example data set of 12 lines becomes 2 – one line per funnel level.
However, the formula can also be applied more flexibly. If there are further funnel levels (e.g. performance or loyalty), these can either be added or read out using the dot asterisk ( .* ).
The advantage of the dot asterisk ( .* ) is that not all possible funnel stages have to be listed. This flexibility is particularly helpful with varying names, such as subject names.
REGEX_EXTRACT (Placement,”Display_(.*)_.*”)
Display_Engagement_MediumRectangle_Cat
Display_Awareness_HPA_Dog
Result:
Commitment
Awareness
This formula is a little more complex in structure because the exact words are no longer specified. The formula must go to the place where the words are located.
The structure of the formula is as follows:
REGEX_EXTRACT ( )
Placement
,
= thefunctionthat is used
= thedimensionthat is used
= the decimal place after the dimension indicates that now theRulethe functionfollows.
”Display_
(.*)
_.*”)
= the beginning of the placement name
= with theBracketsis signaled that this word – replaced by dot-asterisk –taken outwill
= through the last dot asterisk therestthe placement namereplaced.
With thequotation marksbecomes the dimension nameenclosedand with thatround brackettheRegExp formula.
In the example data set, all lines start the same, with “Display_”, which makes use of the formula. However, if the lines have different beginnings, such as “Display_” and “Video_”, the beginning can also be replaced with a dot asterisk ( .* ).
REGEX_EXTRACT(Placement,”.*_(.*)_.*”)
Display_Engagement_MediumRectangle_Cat
Display_Awareness_HPA_Dog
Result:
Commitment
Awareness
Using the same logic, words that later appear in the placement name can also be removed. For example, this formula takes out the format:
REGEX_EXTRACT(Placement,”.*_.*_(.*)_.*”)
Display_Engagement_MediumRectangle_Cat
Display_Awareness_HPA_Dog
Result:
MediumRectangle
HPA
EXTRACT is suitable for quickly summarizing and analyzing data.
REGEXP_MATCH
The MATCH function is used in conjunction with another function. Data Studio can be used to assign all matched rows to a category.
CASE
WHEN (REGEXP_MATCH Formula 1) THEN “Category 1”
WHEN (REGEXP_MATCH Formula 2) THEN “Category 2”
ELSE “missing value”
END
Since this is a self-programmed field, it is important to use CASE at the beginning and END at the end to narrow down the function.
WHEN describes the condition under which THEN occurs. ELSE must be used with WHEN formulas – this makes it possible to combine everything that does not fall into the previously determined categories, or to display an error message.
The structure of the pure MATCH RegExp formula is similar to the EXTRACT formula:
REGEXP_MATCH(Placement,”.*_(Cat|Dog)”)
Display_Engagement_Medium Rectangle_Cat
Display_Awareness_HPA_Dog
Together with the WHEN – THEN function, for example, a category can be defined for the subject:
CASE
WHEN REGEXP_MATCH(Placement,”.*_(Cat|Dog)”) THEN “Animal”
ELSE (“other”)
END
Result:
animal
The MATCH formula is also suitable for reprocessing abbreviations that are in the placement name:
CASE
WHEN REGEXP_MATCH(Placement,”.*_(HPA)_.*”) THEN “Half Page Ad”
ELSE (“other”)
END
Result:
Half Page Ad
MATCH in conjunction with WHEN – THEN is one of the most flexible functions in Data Studio.
REGEXP_CONTAINS
Unlike the functions discussed so far, CONTAINS does not return a word or category, but rather whether a statement is TRUE or FALSE.
It should therefore be the formula:
REGEX_CONTAINS (Placement,”Display”)
Display_Engagement_MediumRectangle_Cat
Display_Awareness_HPA_Dog
result in TRUE for all lines of the example data set.
CONTAINS is not used in data analysis, but rather helps troubleshooting to find typos or incorrect naming orders.
Conclusion:
With RegExp formulas, data sets can be summarized, analyzed and controlled efficiently, and they also make working with large data sets easier. We promise, after the initial shock is over, it’ll be great fun!