Mastering Regular Expressions in C#
By: Skip Townsend (Part 1 of a 2 Part Series)
Regular expressions, or RegEx, are a powerful tool for pattern matching and manipulating strings. However, if you’re not familiar with their syntax, implementing regular expressions can be challenging. In the first part of this two-part series, we’ll explore several examples of regular expressions, discuss their significance, and break down each component for better understanding. By the end, you’ll be ready to effectively implement regular expressions for pattern matching in your own projects!
Let’s begin with an example of a RegEx used in Product Desk, Strabo Partners’ Product Information Management (PIM) tool. In this case, we employ the regular expression “^[a-zA-Z0-9\-_.~ ]+$” within a FluentValidation rule to ensure that the Product Group ID, used in the URL for displaying the Product Group Details page, contains only valid URL characters. Let’s dig in:
- This regular expression starts and ends with anchors. The “^” anchor matches the start of the string, while the “$” anchor matches its end. Anchors are optional in general, but here they ensure that the entire ProductGroupID, not just part of it, matches the RegEx pattern.
- The content within the brackets “[ ]” forms a character class, defining the set of characters accepted at that position in the evaluated string. In this case, the character class includes alphanumeric characters (a-z, A-Z, 0-9), as well as hyphen “-”, underscore “_”, period “.”, tilde “~”, and space “ ”. To have the hyphen “-” treated as a literal character and not as part of a range, we precede it with a backslash “\” escape character.
- The “+” after the character class is a quantifier meaning “one or more”. Without it, the character class would match exactly one character, so only single-character IDs would pass.
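To see what the anchors contribute, here is a minimal sketch that runs the same character class with and without them. The class name AnchorDemo and its methods are hypothetical, and the “+” quantifier after the character class is an assumption made so that strings longer than one character can match.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class AnchorDemo
{
    // Same character class, with and without the ^ and $ anchors.
    // The + quantifier is an assumption: "one or more" allowed characters.
    private const string Anchored = @"^[a-zA-Z0-9\-_.~ ]+$";
    private const string Unanchored = @"[a-zA-Z0-9\-_.~ ]+";

    // True only when the input, from start to end, is made of allowed characters.
    // (In .NET, $ also accepts a single trailing newline.)
    public static bool MatchesWholeString(string input) => Regex.IsMatch(input, Anchored);

    // True as soon as any run of allowed characters appears anywhere in the input.
    public static bool MatchesAnywhere(string input) => Regex.IsMatch(input, Unanchored);
}
```

For the input Wheels&Tires, MatchesAnywhere returns true because the substring “Wheels” satisfies the pattern on its own, while MatchesWholeString returns false because the “&” breaks the match between the two anchors.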
This regular expression ensures that the ProductGroupID consists solely of alphanumeric characters, hyphens, underscores, periods, tildes, and spaces. It effectively validates whether the Product Group ID can be used in a URL. Let’s consider a few examples of Product Group IDs that would undergo analysis using this RegEx:
- Wheels&Tires – This example fails the test due to the presence of the excluded special character “&”.
- Wheel Accessories – This example passes the test as it includes uppercase and lowercase letters, along with an included special character, a space.
- Rear/Wheels – Our final example fails because it contains an excluded special character, “/”.
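The three examples above can be checked in C# with a sketch like the following. ProductGroupIdValidator is a hypothetical name, not the actual Product Desk code, and the pattern assumes a “+” quantifier after the character class so that IDs longer than one character can match. It uses only System.Text.RegularExpressions; wiring the pattern into a validation library is left out.

```csharp
using System.Text.RegularExpressions;

// Hypothetical validator for illustration; not the Product Desk implementation.
public static class ProductGroupIdValidator
{
    // Assumption: the character class is followed by + ("one or more").
    private static readonly Regex Pattern = new Regex(@"^[a-zA-Z0-9\-_.~ ]+$");

    // Returns true when the ID is non-null and contains only allowed characters.
    public static bool IsValid(string productGroupId) =>
        productGroupId != null && Pattern.IsMatch(productGroupId);
}
```

Calling IsValid returns false for Wheels&Tires and Rear/Wheels, and true for Wheel Accessories. The Regex instance is created once and reused, which avoids re-parsing the pattern on every call.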
Next, let’s examine another RegEx borrowed from our partner, BigCommerce. Product Desk integrates seamlessly with BigCommerce, allowing clients to manage their products and instantaneously keep their BigCommerce storefronts up to date. In its Shipping API documentation, BigCommerce uses this RegEx to define a valid Harmonized System Code for its platform: “/^[0-9A-Za-z]{6,14}$/”. While it may appear familiar, there are a few differences that make it a useful example:
- In addition to the beginning and ending anchors “^” and “$”, this regular expression is wrapped in forward slashes “/”. These delimiters are how patterns are conventionally written in languages like Perl, PHP, Ruby, and JavaScript. C# and Java take the pattern as an ordinary string, so the slashes are not part of the pattern there and must be dropped.
- Following the character class, we observe a new addition: “{6,14}”. In RegEx, this part of the expression is known as a quantifier. It requires the preceding character class to repeat at least 6 and at most 14 times, which enforces a minimum and maximum length for Harmonized System Codes on the BigCommerce platform.
Let’s explore some examples of strings that can be analyzed using this RegEx:
- AbCdE12345 – This passes as its length is within the allowed range, and it includes uppercase letters, lowercase letters, and digits, all of which are accounted for in the pattern’s character class.
- Ab.12 – This string fails on two counts: it includes an excluded special character “.”, and its length of 5 characters falls short of the required minimum of 6.
By using this regular expression, BigCommerce can validate whether a given string meets its criteria for a valid Harmonized System Code.
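A rough C# equivalent of this check might look like the sketch below. HarmonizedSystemCode is a hypothetical class name. The surrounding “/” delimiters are dropped because .NET takes the pattern as a plain string.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class HarmonizedSystemCode
{
    // The pattern from the text without the "/" delimiters:
    // 6 to 14 characters, each a digit or an ASCII letter.
    private static readonly Regex Pattern = new Regex(@"^[0-9A-Za-z]{6,14}$");

    // Returns true when the code is non-null and satisfies both the
    // character class and the length quantifier.
    public static bool IsValid(string code) => code != null && Pattern.IsMatch(code);
}
```

IsValid returns true for AbCdE12345 and false for Ab.12. If the slashes were left in the C# pattern string, they would be treated as literal characters that must appear in the input, and because “^” can never match after a consumed “/”, no input would ever pass.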
Now that we’ve gotten a basic understanding of how we can use regular expressions to validate strings, let’s take a look at an example that is a little more complicated. This one builds on our first two examples and I promise, it’s not as scary as it looks.
Regular Expression: “^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[\-_.~])[A-Za-z0-9\-_.~ ]{8,20}$”
In this regular expression, we have multiple positive lookahead assertions represented by (?=.*[A-Z]), (?=.*[a-z]), (?=.*\d), and (?=.*[\-_.~]). Each assertion enforces a specific requirement within the expression.
- The first two assertions, (?=.*[A-Z]) and (?=.*[a-z]), ensure that there is at least one uppercase and one lowercase letter somewhere in the string.
- The third assertion, (?=.*\d), ensures that there is at least one digit (0-9) in the string.
- The final assertion, (?=.*[\-_.~]), ensures that at least one of the following characters appears somewhere in the string: hyphen “-”, underscore “_”, period “.”, or tilde “~”. A space is allowed by the character class that follows, but it does not satisfy this assertion.
The positive lookahead assertions delineate the required characters for the string being analyzed. After that, the character class [A-Za-z0-9\-_.~ ] specifies the allowed characters, and finally, the {8,20} quantifier enforces a length requirement between 8 and 20 characters.
It’s not hard to imagine this regular expression being used to enforce password requirements. It ensures that the string, or password, contains at least one uppercase letter, one lowercase letter, one digit, and one of the specified special characters, while also enforcing a minimum and maximum length.
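As a minimal sketch of that password scenario, the pattern can be wrapped in a small helper. PasswordRules is a hypothetical name, and the pattern is written with the underscore in both the final lookahead and the character class, matching the breakdown above.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class PasswordRules
{
    // Four lookaheads state what must be present; the character class and
    // the {8,20} quantifier state what is allowed and how long it may be.
    private static readonly Regex Pattern = new Regex(
        @"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[\-_.~])[A-Za-z0-9\-_.~ ]{8,20}$");

    // Returns true when the candidate is non-null and meets every requirement.
    public static bool IsValid(string candidate) =>
        candidate != null && Pattern.IsMatch(candidate);
}
```

With this helper, Passw0rd-ok passes. password-1 fails (no uppercase letter), Passw0rd fails (no special character), Passw0rd ok fails (a space is allowed but is not one of the required special characters), and Pw0- fails (shorter than 8 characters).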
Before we move on, let’s consider one more possibility. Wouldn’t it be more concise to consolidate the positive lookahead assertions into one shorter assertion, (?=.*[A-Za-z\d\-_.~])? While it takes up less space and is a bit easier to read, the answer is no. This assertion is not equivalent to our original, much longer sequence. Here’s why:
- “^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[\-_.~])[A-Za-z0-9\-_.~ ]{8,20}$” requires one uppercase letter to be present in the string, AND one lowercase letter, AND one digit, AND one of the characters “-”, “_”, “.”, or “~”.
- “^(?=.*[A-Za-z\d\-_.~])[A-Za-z0-9\-_.~ ]{8,20}$” requires only that one uppercase letter exists in the string, OR one lowercase letter, OR one digit, OR one of the characters “-”, “_”, “.”, or “~”. This expression would not ensure the presence of all of the different character types; a string of eight lowercase letters, for example, would pass.
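The difference is easy to confirm by running both patterns against the same input. The sketch below is illustrative only: LookaheadComparison is a hypothetical name, and both patterns are written with the escaped hyphen and underscore restored and without stray spaces.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class LookaheadComparison
{
    // Four separate lookaheads: every category must be present (AND).
    private const string Separate =
        @"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[\-_.~])[A-Za-z0-9\-_.~ ]{8,20}$";

    // One consolidated lookahead: any single category is enough (OR).
    private const string Consolidated =
        @"^(?=.*[A-Za-z\d\-_.~])[A-Za-z0-9\-_.~ ]{8,20}$";

    public static bool PassesSeparate(string input) => Regex.IsMatch(input, Separate);

    public static bool PassesConsolidated(string input) => Regex.IsMatch(input, Consolidated);
}
```

The all-lowercase input lowercase passes the consolidated pattern but fails the separate one, while Passw0rd-ok passes both.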
These three examples demonstrate how regular expressions can be used for pattern matching in C#. By understanding the syntax and components of regular expressions, you can create powerful string validation rules for various use cases in your own projects.
Remember to test your regular expressions thoroughly with different input scenarios to ensure they behave as expected. Regular expressions can be complex, so it’s important to consider edge cases and potential pitfalls.
Additionally, there are various online resources and tools available that can assist in building and testing regular expressions, such as regex101.com and RegEx libraries for specific programming languages.
I hope this explanation helps you gain a better understanding of how you can use regular expressions for pattern matching in C#. In the next post, we’ll dive into how you can use RegEx for matching and extracting substrings from blocks of text. If you have any further questions, feel free to ask!
Citations:
- Microsoft. “Regular Expressions – .NET Documentation.” Microsoft Documentation. Accessed June 10, 2023.
- Microsoft. “Regular Expression Language – Quick Reference – .NET Documentation.” Microsoft Documentation. Accessed June 10, 2023.
- Stack Overflow. “To Use or Not to Use Regular Expressions?” Stack Overflow. Accessed June 10, 2023.
- BigCommerce. “Customs Information – REST API – ShipperHQ.” BigCommerce Developer Documentation. Accessed June 10, 2023.
- Regex101. “Regular Expression Tester and Debugger.” Regex101.com. Accessed June 15, 2023.
- OpenAI. (2021). ChatGPT [Computer software]. Retrieved from https://openai.com/