Tags: ruby grep regexp regex
Publish Date: 2016-08-27
More often than not, I see that many developers and sys-admins find it a rather daunting task to match patterns using Regular Expressions (regex or regexp). And to be honest, it wasn't my favorite topic either.
However, at some point I realised that the main reason people struggle with Regular Expressions is that many of us don't have a systematic approach to them.
Instead of breaking down the thought process into different components, some people just try to 'match the pattern'. And instead of understanding the building blocks of regex, they just learn the syntax.
So in this series of articles, I will break down the different building blocks of Regular Expressions.
I hope this series will help people become more confident with regular expressions and enable them to approach pattern matching in a more systematic way.
This article will solely focus on the concept of a Character Class.
Concepts such as Quantifiers, Anchors, Groups, Lookarounds and the various methods that Ruby provides for pattern matching (such as scan) will be discussed in future articles of this series.
In short, Character Classes are what you would put into square brackets [ ] or an equivalent shorthand notation.
A Character Class allows you to define which characters to match or not to match. You can define it in a variety of ways: whitelisting, range, negation or with a shorthand.
To demonstrate how a Character Class can be applied, we will be using Ruby's String#scan method, which returns all matches. The differences between scan and Ruby's other pattern-matching methods will be discussed in a future article.
We will also be using grep (with the -E option for extended regular expressions), a command-line tool that can be used to match patterns within a text file or STDIN.
List of characters
To whitelist a bunch of characters you would like to match, list them between square brackets:
[aeiou]
The example above will match the vowels.
In Ruby, you could do the following:
"foobar".scan(/[aeiou]/) #=> ["o", "o", "a"]
Or in grep:
$ echo 'abc cnn fox' | grep --color -E '[aeiou]'
abc cnn fox
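Two properties of a whitelist are worth seeing in action: each match consumes exactly one character, and the order of the characters inside the brackets does not matter. The sketch below is only an illustration; count_vowels is a hypothetical helper made up for this example, not something Ruby provides.

```ruby
# Hypothetical helper (not part of Ruby): counts vowels with a character class.
def count_vowels(text)
  # scan returns one array element per matched character
  text.scan(/[aeiou]/).length
end

sample = "abc cnn fox"

vowels    = sample.scan(/[aeiou]/)
reordered = sample.scan(/[uoiea]/) # same set of characters, different order

puts vowels.inspect        # the individual vowels that were found
puts reordered == vowels   # true: order inside the brackets is irrelevant
puts count_vowels(sample)  # number of vowels in the sample
```

Because the class describes a set of characters, listing a character twice or shuffling the list changes nothing about what is matched.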
Range of characters
To specify a range of alphanumerical characters you would like to match, use a hyphen - between the start and the end of the range:
[a-c]
This example will match the characters from a to c (uppercase not included).
"abc cnn fox".scan /[a-c]/ #=> ["a", "b", "c", "c"]
$ echo 'abc cnn fox' | grep --color -E '[a-c]'
abc cnn fox
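Ranges can be combined within a single pair of square brackets, and a hyphen that should be matched literally can be placed first (or last) in the class. A minimal sketch, using a made-up sample string:

```ruby
sample = "Room B-12, floor 3"

# Several ranges side by side: lowercase letters, uppercase letters and digits.
alnum = sample.scan(/[a-zA-Z0-9]/)

# A hyphen at the very start of the class is a literal hyphen, not a range.
digits_and_hyphen = sample.scan(/[-0-9]/)

puts alnum.join                 # letters and digits only
puts digits_and_hyphen.inspect  # the hyphen plus every digit
```

Note that [a-zA-Z0-9] is still one Character Class: it matches a single character that falls in any of the three ranges.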
Negation of characters
Instead of whitelisting, you could also perform blacklisting. Let's take the previous example of whitelisting vowels.
We could invert it by placing a caret ^ as the first character within the square brackets. In that case, we will match anything that is NOT a vowel.
"foobar!".scan /[^aeiou]/ #=> ["f", "b", "r", "!"]
Please note that [^aeiou] does not translate to "consonants only"! [^aeiou] matches anything except a, e, i, o and u, meaning that it will also match digits and other characters. In our example, it also matches the !.
$ echo 'foobar!' | grep --color -E '[^aeiou]'
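If you really do want "consonants only", the negation has to be narrowed down. Here are two possible ways to do that in Ruby, sketched for lowercase ASCII letters only: spelling out the consonant ranges, or using the character class intersection operator && that Ruby's regex engine supports.

```ruby
sample = "foobar!"

# Option 1: spell out the consonant ranges explicitly (lowercase only).
explicit = sample.scan(/[b-df-hj-np-tv-z]/)

# Option 2: character class intersection (&&), read as
# "a lowercase letter AND not a vowel".
intersected = sample.scan(/[a-z&&[^aeiou]]/)

puts explicit.inspect
puts intersected.inspect
puts explicit == intersected # both approaches select the same characters
```

Unlike the plain [^aeiou], neither of these matches the !, because both are restricted to letters first.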
As mentioned before, you would normally define Character Classes with square brackets, but in some cases you can substitute the square-brackets notation with a shorthand:
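|Description|Square brackets notation|Shorthand notation|
|Word characters|[a-zA-Z0-9_]|\w|
|Digits|[0-9]|\d|
|Whitespace characters (including tabs and newlines)|[ \t\r\n\f\v]|\s|
|Anything except word characters|[^a-zA-Z0-9_]|\W|
|Anything except digits|[^0-9]|\D|
|Anything except whitespace characters|[^ \t\r\n\f\v]|\S|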
Examples in Ruby:
"foobar!".scan /\w/ #=> ["f", "o", "o", "b", "a", "r"]
"foobar!".scan /\W/ #=> ["!"]
Examples in grep:
$ echo 'foobar' | grep --color -E '\w'
$ echo 'foobar!' | grep --color -E '\W'
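To convince yourself that a shorthand and its square-brackets notation behave the same, you can compare their results directly. The sketch below does this in Ruby for a made-up sample string, and also shows that a shorthand may be used inside square brackets alongside other characters.

```ruby
sample = "Order 66 shipped\tat 10:45"

# Each shorthand should return the same matches as its bracket equivalent.
same_digits     = sample.scan(/\d/) == sample.scan(/[0-9]/)
same_word_chars = sample.scan(/\w/) == sample.scan(/[a-zA-Z0-9_]/)
same_whitespace = sample.scan(/\s/) == sample.scan(/[ \t\r\n\f\v]/)
same_non_digits = sample.scan(/\D/) == sample.scan(/[^0-9]/)

puts same_digits, same_word_chars, same_whitespace, same_non_digits

# A shorthand can be combined with other characters inside square brackets:
# here, "a digit or a colon".
time_chars = sample.scan(/[\d:]/).join
puts time_chars
```

The comparisons hold for this ASCII sample in Ruby; other tools may define the shorthands slightly differently, so it is worth checking their documentation.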