If you’ve spent any time writing code you’ve no doubt abused regular expressions until they were an inscrutable character jumble that could give a real parser a run for its money. Even so, I was still surprised when I learned that there are 3 different kinds of parentheses in regular expressions, not just 2.

And no, the 2 aren’t left and right, wise guy.

The 3 types of parentheses are Literal, Capturing, and Non-Capturing. You probably know about capturing parentheses. You’ll recognize literal parentheses too. It’s the non-capturing parentheses that’ll throw most folks, along with the semantics around multiple and nested capturing parentheses. (True RegEx masters, please hold the “But wait, there’s more!” for the conclusion.)

Literal Parentheses

Literal Parentheses are just that, literal text that you want to match. Suppose you want to match U.S. phone numbers of the form (xxx)yyy-zzzz. You could write the regular expression as /\(\d{3}\)\d{3}-\d{4}/. Notice that we had to type \( and \) instead of just a naked ( and ). That’s because a raw parenthesis starts or ends a capturing or non-capturing group. If we want to match a literal parenthesis in the text, we have to escape it with \.
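
As a quick sanity check, here is a minimal sketch of the escaped pattern in JavaScript. The helper name isPhoneNumber and the ^ and $ anchors are assumptions added for illustration, so that the whole string has to match:

```javascript
// Hypothetical helper (not part of any library): true when the whole
// string looks like (xxx)yyy-zzzz.
function isPhoneNumber(text) {
  // \( and \) match literal parentheses; ^ and $ anchor the match to the
  // entire string (an assumption added for this sketch).
  return /^\(\d{3}\)\d{3}-\d{4}$/.test(text);
}

console.log(isPhoneNumber("(303)555-1212")); // true
console.log(isPhoneNumber("303-555-1212"));  // false
```

Leaving out either backslash changes the meaning: the engine then treats the parenthesis as group syntax instead of text to match.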

Capturing Parentheses

You’ve probably written some capturing parentheses too, whether you meant to capture or not. These parentheses aren’t used to match literal () in the text, but instead they are used to group characters together in a regular expression so that we can apply other operators like +, *, ?, or {n}.

For example, if we want to match just the strings can or can’t we can write /can('t)?/. We need the parentheses here because /can't?/ would match only the strings can’ and can’t, which is not quite what we had in mind.
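
The difference is easy to check by hand. The sketch below is only an illustration: the variable names and the ^ and $ anchors are added here so that each test covers the entire string.

```javascript
// Two patterns that differ only in grouping. The ^ and $ anchors are added
// for this sketch so that each test checks the entire string.
const grouped = /^can('t)?$/;  // ? applies to the whole group 't
const ungrouped = /^can't?$/;  // ? applies only to the final t

for (const word of ["can", "can'", "can't"]) {
  console.log(word, grouped.test(word), ungrouped.test(word));
}
// can true false
// can' false true
// can't true true
```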

However, there’s something else going on here. These are called capturing parentheses for a reason — namely they capture anything that matches the expression they contain for later use by your program. Continuing the can/can’t example, in JavaScript we get:

const match = /can('t)?/.exec("We can't do it!");
console.log(match[0]); // prints the match "can't"
console.log(match[1]); // prints captured "'t"

Here, match[1] contains the item captured by the parentheses. Now this is somewhat uninteresting because we really don’t care about the ’t separately from the word can’t.

The phone number example gets more interesting. In JavaScript, we can extract the area code of a U.S. style phone number as follows:

const match = /\((\d{3})\)\d{3}-\d{4}/.exec("(303)555-1212");
console.log(match[0]); // (303)555-1212
console.log(match[1]); // 303

Let’s take a closer look at what is going on in that regular expression, /\((\d{3})\)\d{3}-\d{4}/. It is almost identical to the expression we used in the literal parentheses example, but this time I added a set of capturing parentheses inside the pair of literal parentheses. This tells the regular expression engine to remember the part of the match that is inside the capturing parentheses. This captured match is what we find in match[1]. Notice that the entire phone number match is in match[0]. This little example shows the power of capturing parentheses. Above, we used it to extract an area code from a phone number. We can use it to extract all kinds of text — a poor man’s parser.

XKCD “Exploits of a Mom”

As another quick example, we can use capturing parentheses to extract first name and last name via /(\D+) (\D+)/. match[1] will have the first name and match[2] will have the last name, assuming you’re not matching Bobby Tables’ given name (see comic), or have extra spaces to deal with.
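
A minimal sketch of that extraction follows. The helper name splitName and the sample input are made up for illustration:

```javascript
// Hypothetical helper: splits "First Last" into its two parts using the
// pattern from the text. Returns null when the pattern does not match.
function splitName(fullName) {
  const match = /(\D+) (\D+)/.exec(fullName);
  if (match === null) return null;
  return { first: match[1], last: match[2] };
}

console.log(splitName("Ada Lovelace")); // { first: 'Ada', last: 'Lovelace' }
console.log(splitName("12345"));        // null
```

Because \D+ is greedy and also matches spaces, a name with more than one space splits at the last space, which is one reason this pattern is only a rough tool.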

Non-capturing Parentheses

Now, we get to the third kind of parenthesis: non-capturing parentheses. There are times when you need to group things together in a regular expression, but you don’t want to capture the match, like in the can/can’t example above. To avoid capturing the ’t, we write /can(?:'t)?/. Here, all we get is the full match, with no sub-matches.
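
To see the difference, compare the two result arrays side by side. This is a small sketch, and the variable names are chosen here for illustration:

```javascript
const capturing = /can('t)?/.exec("We can't do it!");
const nonCapturing = /can(?:'t)?/.exec("We can't do it!");

console.log(capturing.length);    // 2: the full match plus one captured group
console.log(nonCapturing.length); // 1: the full match only
console.log(nonCapturing[0]);     // can't
```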

The (?: is a special sequence that starts a parenthesized group, just like (, but it tells the regular expression engine not to bother capturing the match in the group and to use it only for grouping. Let’s look at a more complex example where ignoring a parenthesized group is useful.

Let’s extend that phone number regular expression to allow a prefix of mobile or office. With only capturing parentheses, this looks like match = /((mobile|office) )?\((\d{3})\)\d{3}-\d{4}/.exec(...). (Is this inscrutable yet?) The problem is that the area code we want to extract is now in match[3]. (I’ll leave it as an exercise to the reader to work out why.) This is confusing and unnecessary, since we don’t care about the annotation or anything other than the area code in this example. To capture only the area code, we can do:

const re = /(?:(?:mobile|office) )?\((\d{3})\)\d{3}-\d{4}/;
const match = re.exec('mobile (303)555-1212');
console.log(match[0]); // mobile (303)555-1212
console.log(match[1]); // 303

Notice the two sets of non-capturing parentheses, each opened with (?:, around the annotation, and the regular capturing parentheses around the area code.
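
For comparison, and for anyone who would rather skip the exercise, here is a sketch of what the capturing-only version puts in each slot. Groups are numbered by the position of their opening parenthesis, counting from the left; the variable names are chosen here for illustration.

```javascript
const capturingOnly = /((mobile|office) )?\((\d{3})\)\d{3}-\d{4}/;
const groups = capturingOnly.exec("mobile (303)555-1212");

// JSON.stringify is used only to make the trailing space visible.
console.log(JSON.stringify(groups[1])); // "mobile " (outer group, with the space)
console.log(JSON.stringify(groups[2])); // "mobile"  (inner group)
console.log(JSON.stringify(groups[3])); // "303"     (the area code)
```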

Some people, when confronted with a problem, think
“I know, I’ll use regular expressions.”
Now they have two problems.

 –Jamie Zawinski

But Wait, There’s More!

And there you have it, 3 kinds of parentheses, literal, capturing, and non-capturing — \(, (, (?:. We should probably use (?: more than we do, but I find it hard to read, so as long as ( doesn’t cause any performance issues or semantic changes to an existing regular expression (by changing the index needed to find relevant group matches), I’ll skip the extra ?:. I’m not sure if this is the best practice, but let’s face it, regular expressions are hard enough to read as it is.

True RegEx masters know that there are other types of parentheses that use the (? syntax as well, such as lookahead and named groups. Alas, I’m not actually a RegEx master, so I’ll leave you to search other sources to learn about those. Support for them varies between regular expression libraries; JavaScript, for one, only gained named groups with ES2018. Named regular expression groups are among the most useful of these.
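
For what it’s worth, the JavaScript specification has included named groups since ES2018, written (?<name>...). A minimal sketch, assuming an engine that implements ES2018 or later (the group name area and the variable names are chosen here for illustration):

```javascript
// Assumes an ES2018+ engine; older engines reject the (?<name>...) syntax.
const namedRe = /\((?<area>\d{3})\)\d{3}-\d{4}/;
const named = namedRe.exec("(303)555-1212");

console.log(named.groups.area); // 303
console.log(named[1]);          // 303 (named groups are still numbered too)
```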

10 Comments
  1. David Awerbuch
    November 6, 2018

    Parentheses can be a little bewildering; this was a great article that clearly explained capturing and non-capturing parentheses, and I was not aware of non-capturing ones.

    So I should be able to easily use the above information to solve a problem: I need to convert a name/value pair that looks like "name(three word value)" into "name('three word value')" so it can be resubmitted to a command processor.

    I am able to capture the text out of the parenthesis wrapper, but the capture seems to be extending past the closing parenthesis to the end of the data.

    echo "DEFTYPE(PREDEFINED) DESCR(Administration Command Queue) DISTL(NO) GET(ENABLED)" \
    | sed -r "s/DESCR\((.*?)\)/DESCR('\1')/"

    is producing

    DEFTYPE(PREDEFINED) DESCR('Administration Command Queue) DISTL(NO) GET(ENABLED')

    I can't seem to get the isolated text; the closing ' is being placed at the end before the last close parenthesis, even though I am searching for the shortest occurrence using the '?'. Can you help me out here?

    Thanks!

    • Manish Vachharajani
      November 6, 2018

      The issue is that sed is matching greedily (finding the largest match) and thus matches all the way out to the closing paren on GET(ENABLED). You can see this by deleting the last paren in the echo'ed string and you'll see the quote gets inserted right after DISTL(NO.

      echo "DEFTYPE(PREDEFINED) DESCR(Administration Command Queue) DISTL(NO) GET(ENABLED" | sed -r "s/DESCR\((.*?)\)/DESCR('\1')/"

      produces

      DEFTYPE(PREDEFINED) DESCR('Administration Command Queue) DISTL(NO') GET(ENABLED

      You want non greedy matching to match the closest paren. I don't see an option to sed to enable non-greedy matching, but you could change the regex to look for ) DISTL if that is always there. If not, you can use non-greedy (reluctant) matching in Perl. See https://stackoverflow.com/questions/1103149/non-greedy-reluctant-regex-matching-in-sed

      • Jack Zinn
        November 27, 2018

        Use “[^)]*” instead of “.*?” to capture until the next parenthesis.

  2. Jack Zinn
    November 27, 2018

    Change the “.*?” to “[^)]*”. This will capture everything until you hit the closing parentheses.

  3. Roman
    February 18, 2019

    Is there a way to limit how many times you can enter "("? Like in this example: (((((098)-098-0987. You can enter as many parentheses as you want and the program would say that this is a valid form for the phone number.

    • Manish Vachharajani
      February 19, 2019

      You can match this with \(* for zero or more, \(+ for one or more, and you can usually use \({2,4} to match 2 to 4 parentheses.

    • Manish Vachharajani
      February 19, 2019

      I forgot to note, though, that regular expressions cannot generally match a variable number of opening and closing parentheses. In other words, you cannot say that there should be some number of opening parentheses and then a matching number of closing parentheses (short of spelling out each allowed count as its own alternative). That kind of constraint falls outside the scope of what is known as regular languages, which regular expressions implement.

  4. Dexter Jones
    March 4, 2019

    can you use () as a match in regex i.e. string pattern = @"{|}\\[|]"

    • Manish Vachharajani
      March 4, 2019

      If you want to match a literal parenthesis you can escape it with a \. So, \(+ will match one or more left parentheses. You can thus match any fixed number of parens this way. \(\(x+\)\) will match ((xxx)). What you can't do is say I have an arbitrary number of parens but only match if the left and right ones are balanced.

  5. Zeki
    April 27, 2019

    Wow! This rids me of all the parenthesis confusion I had. Thx!
