Sunday, 16 November 2008

5. Two types of symbols (but they're the same symbols)

So, in the last entry, I introduced the idea of metacharacters - symbols that carry a special meaning inside a regex. There's a list of them on the right there ('know your metacharacters').

Text is always made up of symbols. Letters are symbols. Punctuation marks are symbols. When using regexes, we break symbols down into two types: characters and metacharacters, which you can think of as 'normal' and 'special'.

Here's the tricky bit: When we use regexes, we are invariably using characters and metacharacters in the same piece of text.

So there are a bunch of symbols that can mean either what they normally mean in English (or any other language) or a special pattern matching instruction. (btw - yes, there is an easy way to tell the difference. I'll get to that.)

Here's an example: The symbols . and ?

AS CHARACTERS

. means 'this is the end of the sentence'
? means 'this is the end of the sentence, and this sentence is a question'

AS METACHARACTERS

. means 'any character'
? means 'zero or one of the character before the ?'
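
To see the metacharacter meanings in action, here's a minimal sketch. It assumes Python's built-in re module (regex syntax differs a little between tools, and this blog isn't tied to Python), and the sample words are made up purely for illustration. One detail: in Python, . matches any character except a newline by default.

```python
import re

# '.' as a metacharacter: stands in for any single character
# (in Python, any character except a newline by default).
# Here 'c' and 't' are ordinary characters; only '.' is special.
dot_matches = re.findall(r"c.t", "cat cot cut c.t ct")

# '?' as a metacharacter: the 'u' just before it may appear zero or one time.
optional_matches = re.findall(r"colou?r", "color colour colouur")

print(dot_matches)       # ['cat', 'cot', 'cut', 'c.t']
print(optional_matches)  # ['color', 'colour']
```

Notice that 'ct' is not matched, because . insists on exactly one character being there, and 'colouur' is not matched, because ? allows at most one 'u'. Notice also that each pattern mixes normal characters with a metacharacter in the same piece of text.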

Totally different. The best thing to do is ignore your life-long understanding of what . and ? mean as characters - the relationship between the character meaning and the metacharacter meaning is completely arbitrary.

This still leaves us with the question: If they're the same, how do you tell them apart?
