Mastering Regular Expressions in C# (Part 2)
Matching and Extracting Substrings with RegEx in C#
In the first part of this series, we explored how to use Regular Expressions (RegEx) for pattern matching in C# and covered examples of enforcing naming conventions and password requirements. Now, let’s continue our RegEx discussion by focusing on two more uses: matching and extracting substrings from text.
Matching Patterns
To find the first occurrence of a pattern in a given text, we use the Regex.Match() method. It returns a Match object containing information about the match, such as the matched value, index, and length. Let’s look at an example:
using System;
using System.Text.RegularExpressions;
string text = "Susy sells seashells by the seashore.";
string pattern = "sea";
// Regex.Match() returns only the first occurrence of the pattern.
Match match = Regex.Match(text, pattern);
if (match.Success)
{
    Console.WriteLine($"Match found. Value: {match.Value}, index: {match.Index}, length: {match.Length}");
}
Output: Match found. Value: sea, index: 11, length: 3
In this example, the pattern “sea” is matched against the text “Susy sells seashells by the seashore.” We use a Match object to store the information about the matched pattern, which is then printed to the console. Remember, Regex.Match() only returns the first match in the given text. To return all instances of a pattern, use Regex.Matches().
string text = "Julia ate a green grape.";
string pattern = @"\b\w{5}\b";
MatchCollection matches = Regex.Matches(text, pattern);
if (matches.Count > 0)
{
    Console.WriteLine($"{matches.Count} matches found.");
    for (int i = 0; i < matches.Count; i++)
    {
        Console.WriteLine($"Match {i + 1}: {matches[i].Value}");
    }
}
Output: 3 matches found.
Match 1: Julia
Match 2: green
Match 3: grape
In this example, we used MatchCollection to capture all instances of five-letter words in the sentence “Julia ate a green grape.” MatchCollection represents a collection of Match objects. The pattern @"\b\w{5}\b" ensures we match only standalone five-character words. Let’s break down this pattern in more detail:
In C#, the ‘@’ symbol before a string creates a verbatim string literal. In a verbatim string, backslash sequences such as ‘\n’ (normally a new line) are treated as literal characters instead of escape sequences. This is important when declaring a RegEx pattern since backslashes are commonly used to define different parts of the pattern, and without ‘@’ each one would have to be doubled (‘\\b’ instead of ‘\b’).
The ‘\b’ metacharacter in RegEx represents a word boundary anchor. It matches the position between a word character (a letter, digit, or underscore) and a non-word character (such as a space or punctuation mark), or between a word character and the start or end of the string. We use this anchor to ensure that the word we are searching for is not part of a longer word. For example, ‘\bsea\b’ matches “sea” only as a stand-alone word and not the “sea” inside “seashore.”
Using ‘\w{5}’, we instruct RegEx to match exactly five consecutive word characters (letters, digits, underscores), which in this sentence finds the five-letter words.
Finally, we end the pattern with another word boundary anchor ‘\b’ to ensure that the five-character word is bounded on each side by a non-word character or by the start or end of the string.
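The pieces of this pattern can be checked with a short experiment. The following is a minimal sketch rather than part of the original examples: the class name BoundaryDemo and the helper CountMatches are made up for illustration. It compares a verbatim pattern with its escaped equivalent and shows what the word boundary anchors change.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    // Hypothetical helper: counts how many times a pattern matches in a text.
    public static int CountMatches(string text, string pattern)
    {
        return Regex.Matches(text, pattern).Count;
    }

    public static void Main()
    {
        // A verbatim literal and a regular literal with doubled backslashes
        // produce the same pattern string.
        Console.WriteLine(@"\b\w{5}\b" == "\\b\\w{5}\\b");   // True

        string text = "Susy sells seashells by the seashore.";

        // Without anchors, "sea" is found inside "seashells" and "seashore".
        Console.WriteLine(CountMatches(text, "sea"));         // 2

        // With anchors, "sea" must be a standalone word, so nothing matches.
        Console.WriteLine(CountMatches(text, @"\bsea\b"));    // 0

        // Without anchors, \w{5} also matches five-character runs
        // inside longer words ("seash" in "seashells" and "seashore").
        Console.WriteLine(CountMatches(text, @"\w{5}"));      // 3

        // With anchors, only the standalone word "sells" matches.
        Console.WriteLine(CountMatches(text, @"\b\w{5}\b"));  // 1
    }
}
```

The comparison on the first line of Main is the reason most C# RegEx patterns are written with ‘@’: the pattern reads exactly as the RegEx engine will see it.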
Extracting Substrings
To extract substrings using RegEx, we can use capturing groups, which are parts of the regular expression enclosed in parentheses. They allow us to extract specific portions of the matched text. Let’s see an example:
string text = "My name is John Smith. I can be reached at john@example.com.";
string pattern = @"(\w+@\w+\.\w+)";
Match match = Regex.Match(text, pattern);
if (match.Success)
{
    string email = match.Groups[1].Value;
    Console.WriteLine($"Extracted email: {email}");
}
Output: Extracted email: john@example.com
In this example, the pattern @"(\w+@\w+\.\w+)" matches and extracts an email address from the given text. The captured email address is then printed to the console. Let’s take a look at this RegEx pattern:
After declaring a verbatim string literal with the “@” symbol and quotation marks, we use opening “(” and closing “)” parentheses to enclose the pattern in a capturing group. Here the group wraps the entire pattern, so match.Groups[1] holds the same text as the full match (match.Groups[0] always refers to the full match).
The “\w” matches word characters. The “+” means we want to match “one or more occurrences” of word characters.
The second “@” symbol is matched literally. In this RegEx pattern to match an email address, we look for one or more word characters (\w+) followed by the “@” symbol.
After matching the first group of word characters and the “@” symbol, we use “\w+” again to find the next group of word characters, representing the domain part of the email address.
The period (dot) character is matched using “\.”. In RegEx, an unescaped dot is a special character that matches any character (except a newline, by default), so we must use a backslash to match it literally.
Finally, we use “\w+” again to match the top-level domain (TLD) portion of the email address.
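To see why the backslash before the dot matters, here is a small sketch. The class name EmailDemo, the helper ExtractFirstGroup, and the sample inputs are illustrative assumptions. The first pattern is the one from the example above, and the second drops the backslash for comparison.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailDemo
{
    // Hypothetical helper: returns the text of the first capturing group,
    // or an empty string if the pattern does not match.
    public static string ExtractFirstGroup(string text, string pattern)
    {
        Match match = Regex.Match(text, pattern);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        string escaped = @"(\w+@\w+\.\w+)";   // dot is matched literally
        string unescaped = @"(\w+@\w+.\w+)";  // dot matches any character

        string valid = "Reach me at john@example.com.";
        string odd = "Reach me at john@example-com today";

        // The escaped pattern finds the address and leaves out the final period.
        Console.WriteLine(ExtractFirstGroup(valid, escaped));    // john@example.com

        // There is no literal dot after "example", so nothing matches.
        Console.WriteLine(ExtractFirstGroup(odd, escaped) == string.Empty);  // True

        // The unescaped dot accepts the hyphen as "any character".
        Console.WriteLine(ExtractFirstGroup(odd, unescaped));    // john@example-com
    }
}
```

Keep in mind that this \w-based pattern is a teaching simplification. Addresses with dots or hyphens in the name or domain would be matched only partially, so real-world validation usually needs a more careful pattern.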
With capturing groups, we can extract desired substrings from the text and store them in separate variables. In this example, we only have one capturing group, but we can use multiple capturing groups together for more complex extraction scenarios.
To demonstrate using multiple capturing groups, let’s extract both the name and email address from the text:
string text = "Contact John Smith at john@example.com or Jane Doe at jane@example.com.";
string pattern = @"(\w+\s\w+)\s+at\s+(\w+@\w+\.\w+)";
MatchCollection matches = Regex.Matches(text, pattern);
if (matches.Count > 0)
{
    Console.WriteLine($"{matches.Count} matches found.");
    foreach (Match match in matches)
    {
        string name = match.Groups[1].Value;
        string email = match.Groups[2].Value;
        Console.WriteLine($"Name: {name}, Email: {email}");
    }
}
Output: 2 matches found.
Name: John Smith, Email: john@example.com
Name: Jane Doe, Email: jane@example.com
In this example, we use two capturing groups in the pattern (\w+\s\w+)\s+at\s+(\w+@\w+\.\w+). The first capturing group (\w+\s\w+) captures the full name, and the second capturing group (\w+@\w+\.\w+) captures the email address. The ‘\s’ metacharacter in the first group matches a single whitespace character such as a space or tab. In this case, we use it to match the space between the first and last name.
Between the two capturing groups, we use ‘\s+at\s+’ to tie the person’s name to their email address within the string. Each ‘\s+’ matches one or more whitespace characters, and the ‘at’ is a literal match for the word “at.” We don’t encapsulate this part within its own capturing group since we don’t need to extract this information. Instead, we include it in the RegEx pattern to ensure that each person’s name is paired with the correct email address.
To access the captured data, we use match.Groups[1].Value for the first capturing group and match.Groups[2].Value for the second capturing group. By combining multiple capturing groups, we can effectively extract different pieces of information from the text.
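As one way to put this together, the sketch below collects each name and email pair into a list instead of printing it right away. The class name ContactDemo and the helper ExtractContacts are hypothetical names chosen for this illustration, and the pattern is the same one used above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ContactDemo
{
    // Hypothetical helper: collects (name, email) pairs from every match in the text.
    public static List<(string Name, string Email)> ExtractContacts(string text)
    {
        string pattern = @"(\w+\s\w+)\s+at\s+(\w+@\w+\.\w+)";
        var contacts = new List<(string Name, string Email)>();

        foreach (Match match in Regex.Matches(text, pattern))
        {
            // Groups[0] is the whole match; Groups[1] and Groups[2] are the captures.
            contacts.Add((match.Groups[1].Value, match.Groups[2].Value));
        }

        return contacts;
    }

    public static void Main()
    {
        string text = "Contact John Smith at john@example.com or Jane Doe at jane@example.com.";

        foreach (var (name, email) in ExtractContacts(text))
        {
            Console.WriteLine($"Name: {name}, Email: {email}");
        }
    }
}
```

Returning the pairs from a helper keeps the extraction logic separate from how the results are displayed, which can make the pattern easier to test on its own.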
Conclusion
Regular Expressions are a powerful tool for pattern matching and text extraction in C#. In this two-part series, we covered the basics of using RegEx for pattern matching and explored how to extract substrings using capturing groups. With these techniques, you can efficiently process and extract data from complex text patterns, making it a valuable skill in various real-world applications. There is a lot more to RegEx that we haven’t covered in this article, but this should give you the knowledge to get started using Regular Expressions in your own applications.
Remember to practice and experiment with different patterns to gain confidence in using Regular Expressions effectively. Happy coding!