Breaking down Regular Expressions in Ruby Part 1 - Character Classes

Tags: ruby grep regexp regex
Publish Date: 2016-08-27

I often see developers and sysadmins who find matching patterns with Regular Expressions (regex or regexp) a rather daunting task. And to be honest, it wasn't my favorite topic either.

However, at some point I realised that the main reason people struggle with Regular Expressions is that many of us don't have a systematic approach to them.
Instead of breaking the thought process down into components, some people just try to 'match the pattern'. And instead of understanding the building blocks of regex, they just learn the syntax.

So in this series of articles, I will break down the different building blocks of Regular Expressions.
I hope this series will help people become more confident with regular expressions and enable them to approach pattern matching in a more systematic way.

This article will solely focus on the concept of a Character Class.
Concepts such as Quantifiers, Anchors, Groups, Lookarounds and the various methods that Ruby provides for pattern matching (match, =~ and scan) will be discussed in future articles of this series.

 

Character Classes

In short, a Character Class is what you put between square brackets [ ], or an equivalent shorthand notation.

A Character Class allows you to define which characters to match or not to match. You can define it in a variety of ways: whitelisting, range, negation or with a shorthand.

To demonstrate how a Character Class can be applied, we will be using Ruby's String#scan method, which returns all matches as an array. The differences between match, =~ and scan will be discussed in a future article.

We will also be using grep (with the -E option for extended regular expressions), a command line tool that can be used to match patterns within a text file or STDIN.

 

List of characters

To whitelist a bunch of characters you would like to match:

[aeiou]

The example above matches any single lowercase vowel.

In Ruby, you could do the following:

"foobar".scan(/[aeiou]/)
#=> ["o", "o", "a"] 

Or in grep:

$ echo 'abc cnn fox' | grep --color -E '[aeiou]'

abc cnn fox

 

 

Range of characters

To specify a range of alphanumerical characters you would like to match, use a hyphen - between the start and the end of the range:

[a-c]

This example will match the lowercase characters a, b and c (uppercase not included).

In Ruby:

"abc cnn fox".scan(/[a-c]/)
#=> ["a", "b", "c", "c"]

In grep:

$ echo 'abc cnn fox' | grep --color -E '[a-c]'

abc cnn fox
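
A single Character Class can also hold more than one range, or mix ranges with listed characters. The snippet below is a minimal sketch of that idea; the sample string and variable names are made up for illustration.

```ruby
# Sample input, chosen only for illustration.
text = "Room 42b, Floor X9"

# One range: lowercase letters only.
lowercase = text.scan(/[a-z]/)
#=> ["o", "o", "m", "b", "l", "o", "o", "r"]

# Two ranges in one class: lowercase letters or digits.
lower_or_digit = text.scan(/[a-z0-9]/)
#=> ["o", "o", "m", "4", "2", "b", "l", "o", "o", "r", "9"]

# A range mixed with listed characters: digits, or the letters X and b.
digits_or_listed = text.scan(/[0-9Xb]/)
#=> ["4", "2", "b", "X", "9"]

p lowercase
p lower_or_digit
p digits_or_listed
```

Each class still matches exactly one character at a time; adding ranges only widens the set of characters that are allowed at that position.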

 

 

Negation of characters

Instead of whitelisting, you could also perform blacklisting. Let's take the previous example of whitelisting vowels.

We could invert it by providing a caret ^ as the first character within the square brackets. In that case, we will match anything that is NOT a vowel.

In Ruby:

"foobar!".scan(/[^aeiou]/)
#=> ["f", "b", "r", "!"] 

Please note that [^aeiou] does not translate to "consonants only"!
In fact, [^aeiou] matches anything except a, e, i, o and u, meaning that it will also match digits, punctuation, whitespace and uppercase letters. In our example, it also matches the !.

In grep:

$ echo 'foobar!' | grep --color -E '[^aeiou]'

foobar!
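
Since [^aeiou] is not the same as "consonants only", here is a hedged sketch of two ways you could match only lowercase consonants in Ruby. The input string is made up for illustration. The second variant relies on the && intersection operator that Ruby's regex engine supports inside square brackets; it is Ruby-specific and should not be expected to work in grep.

```ruby
# Sample input, chosen only for illustration.
input = "foobar!42"

# Negation alone: everything that is not a lowercase vowel.
not_vowels = input.scan(/[^aeiou]/)
#=> ["f", "b", "r", "!", "4", "2"]

# Option 1: spell out the consonants as ranges that skip a, e, i, o and u.
consonants = input.scan(/[b-df-hj-np-tv-z]/)
#=> ["f", "b", "r"]

# Option 2 (Ruby-specific): intersect a-z with "not a vowel" using &&.
consonants_via_intersection = input.scan(/[a-z&&[^aeiou]]/)
#=> ["f", "b", "r"]

p not_vowels
p consonants
p consonants_via_intersection
```

Both options only cover lowercase ASCII letters; uppercase consonants would need their own ranges.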

 

 

Shorthands

As mentioned before, you would normally define Character Classes with square brackets [], but in some cases you can substitute the square-bracket notation with a shorthand:

Description                                           Square brackets notation   Shorthand notation
Word characters                                       [0-9a-zA-Z_]               \w
Digits                                                [0-9]                      \d
Whitespace characters (including tabs and newlines)   [ \t\r\n\f\v]              \s
Anything except word characters                       [^0-9a-zA-Z_]              \W
Anything except digits                                [^0-9]                     \D
Anything except whitespace characters                 [^ \t\r\n\f\v]             \S
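
To check that each shorthand behaves like its square-bracket counterpart, you could compare the results of scan for both notations. This is only a sketch: the sample string and the pairs hash are made up for illustration, and the comparison is limited to ASCII input. Note that in Ruby, \s also covers form feed (\f) and vertical tab (\v), so the bracket version below includes them.

```ruby
# Sample ASCII input, chosen only for illustration.
sample = "Ruby_2 rocks!\t(2016)\n"

# Each shorthand paired with its square-bracket equivalent.
pairs = {
  /\w/ => /[0-9a-zA-Z_]/,
  /\d/ => /[0-9]/,
  /\s/ => /[ \t\r\n\f\v]/,
  /\W/ => /[^0-9a-zA-Z_]/,
  /\D/ => /[^0-9]/,
  /\S/ => /[^ \t\r\n\f\v]/
}

# For every pair, check that both notations return the same matches.
results = pairs.map do |shorthand, brackets|
  [shorthand.source, sample.scan(shorthand) == sample.scan(brackets)]
end

results.each { |source, same| puts "#{source}: #{same}" }
```

For this sample every line should print true, which suggests the two notations are interchangeable for plain ASCII text.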

 

Examples in Ruby:

"foobar!".scan(/\w/)
#=> ["f", "o", "o", "b", "a", "r"]
"foobar!".scan(/\W/)
#=> ["!"]

Examples in grep:

$ echo 'foobar' | grep --color -E '\w'

foobar

$ echo 'foobar!' | grep --color -E '\W'

foobar!

 
 

Next up:

So that sums up the basics of character classes. In the next article, we will look at Quantifiers.