Breaking down Regular Expressions in Ruby Part 1 - Character Classes

Tags: ruby grep regexp regex
Publish Date: 2016-08-27

More often than not, I see that many developers and sys-admins find it a rather daunting task to match patterns using Regular Expressions (regex or regexp). And to be honest, it wasn't my favorite topic either.

However, at some point I realised that the main reason people struggle with Regular Expressions is that many of us don't have a systematic approach to them.
Instead of breaking down the thought process into different components, some people just try to 'match the pattern'. And instead of understanding the building blocks of regex, they just learn the syntax.

So in this series of articles, I will break down the different building blocks of Regular Expressions.
I hope this series will help people become more confident with regular expressions and enable them to approach pattern matching in a more systematic way.

This article will solely focus on the concept of a Character Class.
Concepts such as Quantifiers, Anchors, Groups, Lookarounds and the various methods that Ruby provides for pattern matching (match, =~ and scan) will be discussed in future articles of this series.

 

Character Classes

In short, Character Classes are what you would put into square brackets [ ] or an equivalent shorthand notation.

A Character Class allows you to define which characters to match or not to match. You can define it in a variety of ways: whitelisting, range, negation or with a shorthand.

To demonstrate how a Character Class can be applied, we will be using Ruby's String#scan method, which returns all matches. The differences between match, =~ and scan will be discussed in a future article.

We will also be using grep (with the -E option for extended regular expressions), a command-line tool that can be used to match patterns within a text file or STDIN.

 

List of characters

To whitelist a bunch of characters you would like to match:

[aeiou]

The example above will match the vowels. 

In Ruby, you could do the following:

"foobar".scan(/[aeiou]/)
#=> ["o", "o", "a"] 

Or in grep:

$ echo 'abc cnn fox' | grep --color -E '[aeiou]'

abc cnn fox

 

 

Range of characters

To specify a range of alphanumeric characters you would like to match, use a hyphen - between the start and the end of the range:

[a-c]

This example will match the characters a to c (uppercase not included).

In Ruby:

"abc cnn fox".scan /[a-c]/
#=> ["a", "b", "c", "c"]

In grep:

$ echo 'abc cnn fox' | grep --color -E '[a-c]'

abc cnn fox
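Back in Ruby, several ranges can sit side by side in the same pair of square brackets. The snippet below is a small sketch of that idea (the sample string is made up for illustration): it matches anything that looks like a hexadecimal digit by combining three ranges.

```ruby
# Three ranges in one character class: digits, lowercase a-f, uppercase A-F.
hex_digits = "cafe BABE 2016 xyz".scan(/[0-9a-fA-F]/)

puts hex_digits.inspect
# x, y and z fall outside every range, so they are not matched.
```

Each character is still matched on its own; the class simply accepts a character if it falls into any of the listed ranges.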

 

 

Negation of characters

Instead of whitelisting, you could also perform blacklisting. Let's take the previous example of whitelisting vowels.

We could invert it by providing a caret ^ as the first character within the square brackets. In that case, we will match anything that is NOT a vowel.

In Ruby:

"foobar!".scan /[^aeiou]/
#=> ["f", "b", "r", "!"] 

Please note that [^aeiou] does not translate to "consonants only"!
In fact, [^aeiou] matches anything except a, e, i, o and u, meaning that it will also match digits, uppercase letters and other characters. In our example, it also matches the !.
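If "consonants only" is really what you are after, here are two possible ways to express it in Ruby. This is a sketch limited to lowercase ASCII letters: the first spells out the consonant ranges by hand, the second uses the && intersection operator that Ruby's regex engine supports inside a character class (grep does not have this operator).

```ruby
# Option 1: list the lowercase consonant ranges explicitly.
by_ranges = "foobar!".scan(/[b-df-hj-np-tv-z]/)

# Option 2: intersect "lowercase letter" with "not a vowel".
by_intersection = "foobar!".scan(/[a-z&&[^aeiou]]/)

puts by_ranges.inspect        # ["f", "b", "r"]
puts by_intersection.inspect  # ["f", "b", "r"]
```

Neither version matches the ! anymore, because both start from the set of lowercase letters instead of "everything".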

In grep:

$ echo 'foobar!' | grep --color -E '[^aeiou]'

foobar!

 

 

Shorthands

As mentioned before, you would normally define Character Classes with square brackets [], but in some cases you can substitute the square brackets notation with a shorthand:

Description                                          Square brackets notation  Shorthand notation
Word characters                                      [0-9a-zA-Z_]              \w
Digits                                               [0-9]                     \d
Whitespace characters (including tabs and newlines)  [ \t\r\n\f\v]             \s
Anything except word characters                      [^0-9a-zA-Z_]             \W
Anything except digits                               [^0-9]                    \D
Anything except whitespace characters                [^ \t\r\n\f\v]            \S

 

Examples in Ruby:

"foobar!".scan /\w/
#=> ["f", "o", "o", "b", "a", "r"] 
"foobar!".scan /\W/
#=> ["!"]
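To convince yourself that a shorthand and its square brackets notation are interchangeable in Ruby, you can compare their results on the same input. The sample string below is made up for illustration, and the whitespace class follows the Ruby documentation, which lists \s as [ \t\r\n\f\v].

```ruby
sample = "Ruby_1.9 rocks!\t42\n"

# Each pair should produce exactly the same list of matches.
same_word       = sample.scan(/\w/) == sample.scan(/[0-9a-zA-Z_]/)
same_digit      = sample.scan(/\d/) == sample.scan(/[0-9]/)
same_space      = sample.scan(/\s/) == sample.scan(/[ \t\r\n\f\v]/)
same_non_digit  = sample.scan(/\D/) == sample.scan(/[^0-9]/)

puts [same_word, same_digit, same_space, same_non_digit].inspect
# => [true, true, true, true]
```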

Examples in grep:

$ echo 'foobar' | grep --color -E '\w'

foobar

$ echo 'foobar!' | grep --color -E '\W'

foobar!

 
 

Next up:

So that sums up the basics of character classes. In the next article, we will look at Quantifiers.
 

Lambda, procs and ActiveRecord scopes - Part 2

Tags: ActiveRecord, ruby, rails, lambda
Publish Date: 2016-06-19

As mentioned in my previous post, there are many "Rails Programmers" who simply follow the examples without truly understanding the Rails API. So as a follow-up to my previous post, here is a look at what the scope method does within ActiveRecord.

 

scope method within ActiveRecord(::Scoping::Named::ClassMethods)

Very often, when we use the scope method within ActiveRecord, we only provide 2 arguments: The name of the scope and a lambda.

Let's dig into the Rails API and see what the scope method does with the arguments you pass into it.

# File activerecord/lib/active_record/scoping/named.rb, line 141
def scope(name, body, &block)
  unless body.respond_to?(:call)
    raise ArgumentError, 'The scope body needs to be callable.'
  end

  if dangerous_class_method?(name)
    raise ArgumentError, "You tried to define a scope named \"#{name}\" " "on the model \"#{self.name}\", but Active Record already defined " "a class method with the same name."
  end

  extension = Module.new(&block) if block

  singleton_class.send(:define_method, name) do |*args|
    scope = all.scoping { body.call(*args) }
    scope = scope.extending(extension) if extension

    scope || all
  end
end

First, it performs some checks: in lines 3-5 above, it checks whether the second argument responds to call. So technically, we could also use a Proc.new for our second argument... but it's better to stick with a lambda (explained in the next post). Then, in lines 7-9, it checks whether the name of your scope clashes with a class method that is already defined by ActiveRecord.
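The first check is easy to try out in plain Ruby, without Rails. The snippet below is only an illustration of what "responds to call" means; the callables hash and its labels are made up for this example.

```ruby
# Anything that responds to :call would pass the first check in scope.
callables = {
  lambda_method: lambda { |x| x * 2 },
  stabby_lambda: ->(x) { x * 2 },
  proc_new:      Proc.new { |x| x * 2 },
  method_object: 2.method(:*)
}

results = callables.transform_values { |body| body.respond_to?(:call) }
puts results.inspect

# A plain String is not callable, so scope would raise an ArgumentError for it.
string_is_callable = "where(published: true)".respond_to?(:call)
puts string_is_callable
```

So the check is about duck typing (does the object have a call method?), not about the object being a lambda specifically.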

Lines 13-15 are where the scope definition happens: it defines a class method named after the first argument; its parameters are the lambda's block parameters and the lambda's body becomes the method body. The returned value (captured by the local variable scope) is an instance of ActiveRecord::Relation.

If you provide a block to the scope method, the block's body will be used to define an anonymous module (line 11), and the ActiveRecord::Relation object is then extended with that module in line 15.
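To see these steps in isolation, here is a heavily simplified sketch in plain Ruby. ToyModel, RECORDS and titles are hypothetical names invented for this example, and the "relation" is just an Array. It is not how ActiveRecord is implemented (there is no scoping and no name-clash check), it only mimics the callable check, the anonymous module and the generated class method.

```ruby
# ToyModel is a made-up stand-in for an ActiveRecord model.
class ToyModel
  RECORDS = [
    { title: "Ruby regex", published: true },
    { title: "Draft",      published: false }
  ].freeze

  # Stand-in for ActiveRecord's all: returns a fresh Array of records.
  def self.all
    RECORDS.dup
  end

  def self.scope(name, body, &block)
    unless body.respond_to?(:call)
      raise ArgumentError, "The scope body needs to be callable."
    end

    # The optional block becomes the body of an anonymous module.
    extension = Module.new(&block) if block

    # Define a class method named after the scope.
    define_singleton_method(name) do |*args|
      result = body.call(*args)
      result = result.extend(extension) if extension
      result || all
    end
  end

  scope :published, lambda { all.select { |record| record[:published] } } do
    def titles
      map { |record| record[:title] }
    end
  end
end

puts ToyModel.published.titles.inspect  # ["Ruby regex"]
```

Note that titles is only available on the object returned by ToyModel.published, not on the result of ToyModel.all.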

In case the above was confusing, let's take the classic BlogPost class as an example.

 

BlogPost < ActiveRecord::Base example

We could define some scopes to filter out posts, so we might have the following:

class BlogPost < ActiveRecord::Base
  scope :title_matches, lambda{ |str| where("title LIKE ?", "%#{str}%") }
  scope :published, lambda{ where(published: true) }
end

Behind the scenes, what the scope invocations above do is roughly the following:

class BlogPost < ActiveRecord::Base
  def self.title_matches(str)
    where("title LIKE ?", "%#{str}%")
  end
  
  def self.published
    where(published: true)
  end
end
 
Extending scope by adding a block

For the BlogPosts returned from :published, we might want to apply further scoping, which wouldn't make sense to the BlogPosts returned from other scopes (e.g. :title_matches). In that case, we could provide a block to the scope method as a third argument.

Here is a scenario where an additional block to the scope method would come in handy: Let's say we've applied tagging to our BlogPost class and for setting/getting its tags, we use the ActsAsTaggableOn gem. By doing so, each instance of BlogPost responds to the :tag_list method, which returns an array of strings: the tags of that BlogPost.

Now, among the published posts, we would like to count how many times each tag has been used. For that, we define a tags_count method within the scope of :published. See below:

class BlogPost < ActiveRecord::Base
  acts_as_taggable # Specific to the ActsAsTaggableOn gem

  scope :published, lambda{ where(published: true) } do
    def tags_count
      map(&:tag_list).inject({}) do |hash, arr|
        arr.each do |elem|  
          hash[elem] ? hash[elem] += 1 : hash[elem] = 1
        end
        hash
      end
    end
  end
end

BlogPost.published.tags_count
# => {"ruby"=>1, "javascript"=>2}

The code above is roughly equivalent to the following:

class BlogPost < ActiveRecord::Base
  # Define an anonymous Module:
  extensions = Module.new do
    def tags_count
      map(&:tag_list).inject({}) do |hash, arr|
        arr.each do |elem|  
          hash[elem] ? hash[elem] += 1 : hash[elem] = 1
        end
        hash
      end      
    end
  end

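  # Then define a class method and extend the ActiveRecord::Relation object with the anonymous Module from above.
  # define_singleton_method takes a block, so (unlike def) it can still see the local variable extensions.
  define_singleton_method(:published) do
    scope = where(published: true)
    scope.extend extensions
    scope
  end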
end

So in a nutshell, by extending the ActiveRecord::Relation object with the newly created anonymous module, all methods defined within this module become available to the ActiveRecord::Relation object that was returned by the original scope.
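The same mechanism can be observed on any Ruby object. In the sketch below, a plain Array plays the role of the relation; shouting and shout are made-up names for this illustration.

```ruby
# An anonymous module, just like the one scope builds from its block.
shouting = Module.new do
  def shout
    map { |item| item.to_s.upcase }
  end
end

extended = ["ruby", "rails"].extend(shouting)
plain    = ["ruby", "rails"]

puts extended.shout.inspect        # ["RUBY", "RAILS"]
puts plain.respond_to?(:shout)     # false
```

Only the one object that was extended gains the new method; other instances of the same class are left untouched, which is why tags_count is available on the result of :published but not on other scopes.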

In case the example with inject({}) was confusing, I would recommend checking my earlier post for an example.
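For a quick look at the counting logic on its own, here it is applied to plain arrays instead of BlogPost records. The tag_lists data is made up to mirror the output shown earlier, and the second variant with each_with_object is just one alternative way to write the same count.

```ruby
# Each inner array stands for the tag_list of one published post.
tag_lists = [["ruby", "javascript"], ["javascript"]]

# Same approach as tags_count: carry a hash through inject and return it each time.
with_inject = tag_lists.inject({}) do |hash, tags|
  tags.each do |tag|
    hash[tag] = hash.fetch(tag, 0) + 1
  end
  hash
end

# Alternative: a hash with a default value of 0 removes the need for the fetch.
with_default = tag_lists.flatten.each_with_object(Hash.new(0)) do |tag, counts|
  counts[tag] += 1
end

puts with_inject.inspect   # {"ruby"=>1, "javascript"=>2}
```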

As a continuation of this topic, in the next post, I will explain what the differences are between a Proc and a Lambda and why we should invoke the scope method with a lambda rather than a Proc.

Lambda, procs and ActiveRecord scopes - Part 1

Tags: ruby, lambda
Publish Date: 2016-06-19
Contents:
  1. The lambda method vs "stabby lambda" constructor ->(){}
  2. scope method within ActiveRecord(::Scoping::Named::ClassMethods)
  3. Differences between lambda and Proc

 

During a pair-programming session some time ago, while we were writing a rather long ActiveRecord scope, my coding partner asked me whether "it would work if we replace the curly braces {} with do / end".

scope :scope_name, ->(block_param1, block_param2) { some_invocation(block_param1).some_other_invocation(block_param2) }

I was a bit surprised when he asked this. As an experienced programmer with a strong academic background in CS, it should be clear to him that the curly braces were wrapping a code block, right?
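For the record, here is a small sketch in plain Ruby (no Rails involved) showing the two block styles side by side. The variable names are made up for illustration; each lambda is assigned to a variable first, so there is no surrounding method call competing for the block.

```ruby
# Stabby lambda with curly braces.
with_braces = ->(a, b) { a + b }

# Stabby lambda with do / end.
with_do_end = ->(a, b) do
  a + b
end

# The lambda method with do / end.
with_lambda_method = lambda do |a, b|
  a + b
end

sums = [with_braces, with_do_end, with_lambda_method].map { |l| l.call(1, 2) }
puts sums.inspect  # [3, 3, 3]
```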

However, I can also see why my coding-partner was asking this:

  1. The stabby lambda's lack of explicitness can create confusion.
  2. Many "Rails developers" simply follow code examples without looking seriously into the API. As a result, they don't know what they are doing.

 

The lambda method and "stabby lambda" constructor -> (){}

Both the lambda method and the "stabby lambda" produce the same result, so why are there two notations?

The reason is that, in Ruby versions older than 1.9, the interpreter had problems parsing lambdas that have block parameters with default values. The old interpreters could not figure out whether the second pipe (|) was a delimiter for block parameters or a bitwise OR operator:

# Ruby older than 1.9 cannot parse lambdas whose block parameters have default values
lambda { |a, b=1| puts a * b }

# No problems when block parameters have no default values
lambda { |a, b| puts a * b }
# => #<Proc:0x00007fd48490a7f8@(irb):5>

As a result of this problem, the stabby lambda constructor was added:

-> (a, b=0) { puts a * b }

However, this parsing problem has been solved since 1.9, so the stabby lambda is no longer strictly necessary. In the meantime, though, it has gained a cult following among Rails developers.
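On a current Ruby, both notations accept default values and behave the same way. The following sketch (with made-up variable names) compares them:

```ruby
# Both notations with a default value for the second parameter.
with_method = lambda { |a, b = 1| a * b }
with_stabby = ->(a, b = 1) { a * b }

puts with_method.call(3)      # 3 (b falls back to its default of 1)
puts with_stabby.call(3, 2)   # 6

# Both are Proc objects that report themselves as lambdas.
same_class   = with_method.class == with_stabby.class
both_lambdas = with_method.lambda? && with_stabby.lambda?
puts same_class, both_lambdas
```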

If you look at the Rails Guide for version 3.2 and before, you can see that they used to demonstrate scope with the lambda method. Then from version 4.0, for whatever reason, the Rails team decided to demonstrate the scope method with the stabby lambda constructor.

 

Pros and Cons of both notations

One reason why some prefer the stabby lambda, is that it is shorter than typing the word lambda.

However, in contrast to the lambda method, the "stabby lambda" constructor does not explicitly tell you that it is creating a lambda. As a result, when people follow the examples in the Rails Guide (or elsewhere) with a stabby constructor, there is a chance that they are not aware that they are actually creating a lambda.

In my opinion, even though the lambda method takes more keystrokes, it still has the advantage of providing more clarity and explicitness. Just imagine being a new Ruby programmer who has to perform a Google search for "->(){} ruby" instead of "lambda ruby"!

In the next post, we will explore how the scope method within ActiveRecord uses lambda and blocks to define class methods and extends ActiveRecord::Relation objects.