Dissecting a Regular Expression

Posted by Curtis Miller

I am a beginner at regular expressions, but recently found the need to create one. My goal was to take in a file name, split it at the dot ('.') just before the file extension and then hash everything to the left to get a unique name to store on the server.

I thought I had an okay solution until a user uploaded a file with a name similar to Oct. 24 2006 040.jpg. Notice that there is white space as well as multiple dots in the string. So how do you accomplish the goal under those conditions?

I eventually came up with the following solution:

def sanitize_filename(file_name)

Clearly, it is not the worst regular expression out there, but it can look complicated to a n00b like me. So, let's dissect it.

The regular expression is given by


Everything between the enclosing slash characters ('/') is part of the regular expression. The caret character ('^') anchors the match to the beginning and the dollar sign character ('$') anchors it to the end. Strictly speaking, in Ruby these two match the start and end of a line; \A and \z match the start and end of the whole string, which only matters if the input contains newlines. In our case the caret is enclosed in parentheses. Anything enclosed in parentheses in a regular expression is called a capture group. Anything captured can be referred to later, as we will see.
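Here is a small sketch of anchors and capture groups in action. The pattern and the sample strings are made up for illustration; they are not the expression from this post.

```ruby
# Illustrative pattern only (an assumption, not the post's regex).
name = "report.txt"

if name =~ /^(report)(\.txt)$/
  first  = $1  # text matched by the first pair of parentheses
  second = $2  # text matched by the second pair of parentheses
end

# ^ and $ anchor to line boundaries, \A and \z to the whole string.
multi = "junk\nreport.txt"
line_anchored   = !!(multi =~ /^report\.txt$/)    # true: the second line matches
string_anchored = !!(multi =~ /\Areport\.txt\z/)  # false: the string starts with "junk"
```

For a single-line file name the two kinds of anchors behave the same, which is probably why the difference rarely bites.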

To accomplish our goal we are looking for

  • any number of characters => (.*?)
  • followed by a dot and a file extension => (\.[^\.]{3,5})

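In (.*?), the dot means any character (except a newline), the asterisk ('*') means zero or more, and the question mark ('?') makes the asterisk lazy (non-greedy): it matches as few characters as possible while still letting the rest of the pattern match. Without the question mark, the asterisk is greedy and grabs as many characters as it can. So we end up with an English expression something like 'zero or more of any character, but no more than needed'.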
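To see the difference between greedy and lazy matching, here is a minimal sketch using the file name from earlier. The variable names are my own, and the last pattern is an assumed combination of the two fragments discussed in this post.

```ruby
name = "Oct. 24 2006 040.jpg"

# Without an end anchor, greedy runs to the last dot and lazy stops at the first.
greedy = name[/(.*)\./, 1]   # "Oct. 24 2006 040"
lazy   = name[/(.*?)\./, 1]  # "Oct"

# With the extension pattern and the end anchor, the lazy group has to keep
# growing until what follows it is a dot plus 3 to 5 non-dot characters.
anchored = name[/(.*?)(\.[^.]{3,5})$/, 1]  # "Oct. 24 2006 040"
```

So the lazy group on its own would stop too early, and it is the rest of the pattern that forces it to extend up to the final dot.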

In (\.[^\.]{3,5}), backslash-dot ('\.') is just an escaped dot character. This is done because, as we saw earlier, a plain dot means any character. If we want to match a literal dot in our string, then we must escape the dot character. The '[^\.]' means any single character except a dot. In this context, as the first character inside square brackets, the caret is used for negation. (Inside square brackets a dot is already literal, so '[^.]' would work just as well.) Finally, the '{3,5}' means that the preceding item is repeated between 3 and 5 times. So we end with an English expression similar to 'a dot followed by 3 to 5 characters that are not dots'.
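A quick sketch of the extension group on its own, anchored to the end with '$'. The sample file names other than the first are hypothetical.

```ruby
ext = /(\.[^\.]{3,5})$/

jpg  = "Oct. 24 2006 040.jpg"[ext, 1]  # ".jpg"
jpeg = "photo.jpeg"[ext, 1]            # ".jpeg"
html = "a.b.html"[ext, 1]              # ".html"
gz   = "archive.gz"[ext, 1]            # nil: only two characters after the dot

# The backslash inside the brackets is optional; this pattern behaves the same.
same = "photo.jpeg"[/(\.[^.]{3,5})$/, 1]  # ".jpeg"
```

Note the limitation this exposes: extensions shorter than 3 or longer than 5 characters are not recognized at all.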

The second part of the function above shows the actual usage of the captured parts of the expression. After a successful match, Ruby automatically gives you special variables to access the captured elements of a regular expression. They follow the naming convention of $1, $2, $3, ..., $n, numbered by the order of the opening parentheses. Because the caret sits in the first pair of parentheses, $1 is just an empty string. In our regular expression $2 is everything to the left of the dot character and $3 is the dot and file extension.
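Putting the pieces together, the function could look something like the sketch below. This is a reconstruction for illustration only: the regular expression is assembled from the fragments discussed above, and both the SHA-1 hash and the fallback branch are assumptions of mine rather than the original code.

```ruby
require "digest/sha1"

# Hypothetical reconstruction; the regex is pieced together from the fragments
# in this post, and SHA-1 is an assumed choice of hash.
def sanitize_filename(file_name)
  if file_name =~ /(^)(.*?)(\.[^\.]{3,5})$/
    base      = $2  # everything to the left of the final dot
    extension = $3  # the dot and the file extension
    Digest::SHA1.hexdigest(base) + extension
  else
    # Assumed fallback: no recognizable extension, so hash the whole name.
    Digest::SHA1.hexdigest(file_name)
  end
end

stored = sanitize_filename("Oct. 24 2006 040.jpg")
# stored is a 40-character hex digest followed by ".jpg"
```

Hashing the name alone means two uploads with the same file name get the same stored name, so a real version would likely mix in something unique such as a timestamp or user id.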

Please let me know if you have a better solution.
