Saturday 24 October 2020

Reading and writing regular expressions for sane people

Your regular expressions need love. Reviewers and future maintainers of your regular expressions need even more.

No matter how well you've mastered regex, regex is regex and is not designed with human-readability in mind. No matter how clear and obvious you think your regex is, in most cases it will be maintained by a developer who a) is not you and b) lacks context. Many years ago I developed a simple method for sanity checking regex with comments, and I'm constantly finding myself demonstrating its utility to new people.

There are some great guides out there, like this one, but what I'm proposing takes things a step or two further. It may take a minute or two of your time, but it almost invariably saves a lot more than it costs. I'm not even discussing flagrant abuse or performance considerations.


Traditional regex: the do-it-yourself pattern


The condescending regex. Here you're left to your own devices. Thoughts and prayers.

var parse_url = /^(?:([A-Za-z]+):)?(\/{0,3})(0-9.\-A-Za-z]+)(?::(\d+))?(?:\/([^?#]*))?(?:\?([^#]*))?(?:#(.*))?$/;

(example taken from this question)


Kind regex: intention explained


It's the least you can do! A short line explaining what you're matching with an example or two (or three).

// parse a url, only capture the host part
// eg. protocol://host
//     protocol://host:port/path?querystring#anchor
//     host
//     host/path
//     host:port/path
var parse_url = /^(?:([A-Za-z]+):)?(\/{0,3})(0-9.\-A-Za-z]+)(?::(\d+))?(?:\/([^?#]*))?(?:\?([^#]*))?(?:#(.*))?$/;


Careful regex: a human-readable breakdown


Here we ensure that each element of the regex pattern, no matter how simple, is explained in a way that makes it easy to verify that it's doing what we think it's doing, and that lets us modify it safely. If you're not an expert with regex, I recommend using one of the many available tools such as regexr.com.

// parse a url, only capture the host part
//     (?:([A-Za-z]+):)?
//         protocol - an optional alphabetic protocol followed by a colon
//     (\/{0,3})(0-9.\-A-Za-z]+)
//         host - 0-3 forward slashes followed by one or more
//         alphanumeric characters, dots or hyphens
//     (?::(\d+))?
//         port - an optional colon and a sequence of digits
//     (?:\/([^?#]*))?
//         path - an optional forward slash followed by any number of
//         characters not including ? or #
//     (?:\?([^#]*))?
//         query string - an optional ? followed by any number of
//         characters not including #
//     (?:#(.*))?
//         anchor - an optional # followed by any number of characters
var parse_url = /^(?:([A-Za-z]+):)?(\/{0,3})(0-9.\-A-Za-z]+)(?::(\d+))?(?:\/([^?#]*))?(?:\?([^#]*))?(?:#(.*))?$/;

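Now that we've taken the time to break this down, we can identify the intention behind the patterns and ask better questions: why is the host the only matched group? Was this tested? (Because (0-9.\-A-Za-z] is clearly an error: the character class opens with a parenthesis instead of a square bracket, so the host group only matches literal text beginning with 0-9. Beyond that, there are almost no restrictions on invalid characters.)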
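To answer the "was this tested?" question, here is a minimal sketch that runs the pattern, exactly as written above, against concrete versions of the commented examples. The example.com URLs and the variable names are my own stand-ins for illustration, not part of the original question.

```javascript
// The pattern exactly as it appears above, typo included.
const parse_url_as_written = /^(?:([A-Za-z]+):)?(\/{0,3})(0-9.\-A-Za-z]+)(?::(\d+))?(?:\/([^?#]*))?(?:\?([^#]*))?(?:#(.*))?$/;

// Hypothetical concrete versions of the examples from the "kind" comment.
const documented_examples = [
    "http://example.com",
    "http://example.com:8080/path?query=1#anchor",
    "example.com",
    "example.com/path",
    "example.com:8080/path",
];

// Collect every example the pattern fails to match.
const unmatched = documented_examples.filter(
    (url) => !parse_url_as_written.test(url)
);
console.log(unmatched.length + " of " + documented_examples.length + " examples do not match");
// prints "5 of 5 examples do not match"

// What the mistyped host group does accept: the literal characters
// "0-9", any one character, "-A-Za-z", then one or more "]".
console.log(parse_url_as_written.test("0-9.-A-Za-z]")); // prints true
```

A handful of examples like these, kept next to the pattern, would have caught the error before a reviewer had to.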

Unless you're a sadist (or a masochist), this is definitely a better way to operate: be careful, and if you can't be careful then at least be kind.
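If you want the breakdown to be impossible to separate from the pattern, one option in JavaScript is to assemble the regex from small, individually commented pieces. This is only a sketch of that idea: the variable names are hypothetical, and the host piece assumes the intended character class was [0-9.\-A-Za-z].

```javascript
// Each piece carries its own explanation, so comment and pattern stay together.
const protocol = /(?:([A-Za-z]+):)?/;  // optional alphabetic protocol followed by a colon
const slashes = /(\/{0,3})/;           // 0-3 forward slashes
const host = /([0-9.\-A-Za-z]+)/;      // assumed fix: alphanumerics, dots and hyphens
const port = /(?::(\d+))?/;            // optional colon and a sequence of digits
const path = /(?:\/([^?#]*))?/;        // optional slash, then anything except ? or #
const query = /(?:\?([^#]*))?/;        // optional ?, then anything except #
const anchor = /(?:#(.*))?/;           // optional #, then anything

// Join the pieces' sources and anchor the whole thing to the full string.
const parts = [protocol, slashes, host, port, path, query, anchor];
const parse_url_composed = new RegExp(
    "^" + parts.map((part) => part.source).join("") + "$"
);

const match = parse_url_composed.exec("http://example.com:8080/path?query=1#anchor");
console.log(match.slice(1));
// prints [ 'http', '//', 'example.com', '8080', 'path', 'query=1', 'anchor' ]
```

The trade-off is a few more lines in exchange for pieces that can be read, tested and changed one at a time.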
