Emacs treats different characters as belonging to different
syntax categories. For example, the regular expression
`\\w+' is a pattern specifying one or more word
constituent characters. Word constituent characters are members of
one syntax category. Other syntax categories include the class of
punctuation characters, such as the period and the comma, and the
class of whitespace characters, such as the blank space and the tab
character. (For more information, see Syntax, and Syntax Tables.)
Syntax tables specify which characters belong to which categories.
Usually, a hyphen is not specified as a `word constituent character'.
Instead, it is specified as being in the `class of characters that are
part of symbol names but not words.' This means that the
count-words-region function treats it in the same way it treats
an interword white space, which is why count-words-region counts
`multiply-by-seven' as three words.
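The syntax class of a character can be inspected with char-syntax. What follows is a minimal sketch, assuming the syntax table of Emacs Lisp mode (emacs-lisp-mode-syntax-table); the helper my-matches-in-string is a hypothetical name used only for illustration and is not part of the count-words-region definition.

```elisp
;; Sketch: why `\\w+' sees three words in `multiply-by-seven'.
;; `my-matches-in-string' is a hypothetical helper, not from the text.
(defun my-matches-in-string (regexp string)
  "Count matches of REGEXP in STRING under the Emacs Lisp syntax table."
  (with-temp-buffer
    (set-syntax-table emacs-lisp-mode-syntax-table)
    (insert string)
    (how-many regexp (point-min) (point-max))))

;; Syntax classes of three sample characters.
(with-syntax-table emacs-lisp-mode-syntax-table
  (list (char-syntax ?a)      ; ?w, a word constituent
        (char-syntax ?-)      ; ?_, a symbol constituent
        (char-syntax ?\s)))   ; ?\s, whitespace

;; Each hyphen ends a run of word constituents.
(my-matches-in-string "\\w+" "multiply-by-seven")   ; 3
```

Because the hyphen is outside the word class, each run of letters is matched separately.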
There are two ways to cause Emacs to count `multiply-by-seven' as
one symbol: modify the syntax table or modify the regular expression.
We could redefine a hyphen as a word constituent character by modifying the syntax table that Emacs keeps for each mode. This action would serve our purpose, except that a hyphen is merely the most common character within symbols that is not typically a word constituent character; there are others, too.
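As a sketch of this first approach, the code below gives the hyphen word syntax in a copy of the Emacs Lisp syntax table, so that nothing global is changed. The function name my-count-words-hyphen-as-word is hypothetical, and the choice of emacs-lisp-mode-syntax-table as the starting point is an assumption.

```elisp
;; Sketch: treat the hyphen as a word constituent in a private table.
;; `my-count-words-hyphen-as-word' is a hypothetical name.
(defun my-count-words-hyphen-as-word (string)
  "Count runs of word constituents in STRING, with hyphen as a word character."
  (let ((table (copy-syntax-table emacs-lisp-mode-syntax-table)))
    (modify-syntax-entry ?- "w" table)   ; hyphen now has word syntax
    (with-temp-buffer
      (set-syntax-table table)
      (insert string)
      (how-many "\\w+" (point-min) (point-max)))))

(my-count-words-hyphen-as-word "multiply-by-seven")   ; 1
(my-count-words-hyphen-as-word "multiply*by*seven")   ; 3
```

The second call shows the limitation: the asterisk is another symbol constituent that is not a word constituent, so it would need its own entry.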
Alternatively, we can redefine the regular expression used in the
count-words definition so as to include symbols. This
procedure has the merit of clarity, but the task is a little tricky.
The first part is simple enough: the pattern must match "at least one character that is a word or symbol constituent". Thus:
"\\(\\w\\|\\s_\\)+"
The `\\(' is the first part of the grouping construct that includes
the `\\w' and the `\\s_' as alternatives, separated by the `\\|'.
The `\\w' matches any word-constituent character and the `\\s_'
matches any character that is part of a symbol name but not a
word-constituent character. The `+' following the group indicates
that the word or symbol constituent characters must be matched at
least once.
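To check the first part on its own, here is a small sketch that applies "\\(\\w\\|\\s_\\)+" to a string. It assumes the Emacs Lisp mode syntax table; my-first-part-match is a hypothetical name.

```elisp
;; Sketch: the first part of the pattern spans a whole symbol name.
;; `my-first-part-match' is a hypothetical helper.
(defun my-first-part-match (string)
  "Return the first run of word or symbol constituents in STRING, or nil."
  (with-syntax-table emacs-lisp-mode-syntax-table
    (when (string-match "\\(\\w\\|\\s_\\)+" string)
      (match-string 0 string))))

(my-first-part-match "multiply-by-seven")     ; "multiply-by-seven"
(my-first-part-match "(multiply-by-seven 3)") ; "multiply-by-seven"
```

The parenthesis and the space are in neither class, so the match stops at them.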
However, the second part of the regexp is more difficult to design. What we want is to follow the first part with "optionally one or more characters that are not constituents of a word or symbol". At first, I thought I could define this with the following:
"\\(\\W\\|\\S_\\)*"
The upper case `W' and `S' match characters that are
not word or symbol constituents. Unfortunately, this
expression matches any character that is either not a word constituent
or not a symbol constituent. Since no character belongs to both
classes at once, this matches any character!
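The problem can be seen by trying the rejected pattern on a sample string. In this sketch the pattern is written as "\\(\\W\\|\\S_\\)*", which is an assumption based on the description above; the `\\`' at the front only anchors the match to the start of the string. The function name my-rejected-tail-match is hypothetical.

```elisp
;; Sketch: the rejected second part swallows every character.
;; `my-rejected-tail-match' is a hypothetical helper.
(defun my-rejected-tail-match (string)
  "Return the text at the start of STRING matched by the rejected pattern."
  (with-syntax-table emacs-lisp-mode-syntax-table
    (when (string-match "\\`\\(\\W\\|\\S_\\)*" string)
      (match-string 0 string))))

;; A letter is not a symbol-class character, so `\\S_' matches it;
;; everything else is not a word constituent, so `\\W' matches it.
(my-rejected-tail-match "multiply-by-seven (number)")
;; returns the entire string
```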
I then noticed that every word or symbol in my test region was followed by white space (blank space, tab, or newline). So I tried placing a pattern to match one or more blank spaces after the pattern for one or more word or symbol constituents. This failed, too. Words and symbols are often separated by whitespace, but in actual code parentheses may follow symbols and punctuation may follow words. So finally, I designed a pattern in which the word or symbol constituents are followed optionally by characters that are not white space and then followed optionally by white space.
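The failure of the whitespace-only attempt can be reproduced with a sketch like the one below. The exact pattern that was tried is not shown in the text, so "\\(\\w\\|\\s_\\)+[ \t\n]+" is an assumed reconstruction; my-count-with-whitespace-only is a hypothetical name, and the Emacs Lisp mode syntax table is assumed.

```elisp
;; Sketch: requiring whitespace after each word or symbol undercounts.
;; The pattern is an assumed reconstruction of the failed attempt.
(defun my-count-with-whitespace-only (string)
  "Count words and symbols in STRING that are directly followed by whitespace."
  (with-temp-buffer
    (set-syntax-table emacs-lisp-mode-syntax-table)
    (insert string)
    (how-many "\\(\\w\\|\\s_\\)+[ \t\n]+" (point-min) (point-max))))

(my-count-with-whitespace-only
 "(defun multiply-by-seven (number)\n  (* 7 number))")
;; 4: both occurrences of `number' are followed by a parenthesis
;; rather than whitespace, so neither is counted
```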
Here is the full regular expression:
"\\(\\w\\|\\s_\\)+[^ \t\n]*[ \t\n]*"
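As a quick check of the full expression, the sketch below counts its matches in a string. It assumes the Emacs Lisp mode syntax table and uses a temporary buffer; my-count-words-and-symbols is a hypothetical name and is not the definition developed in this chapter.

```elisp
;; Sketch: count words and symbols with the full regular expression.
;; `my-count-words-and-symbols' is a hypothetical helper.
(defun my-count-words-and-symbols (string)
  "Count words and symbols in STRING using the full regular expression."
  (with-temp-buffer
    (set-syntax-table emacs-lisp-mode-syntax-table)
    (insert string)
    (how-many "\\(\\w\\|\\s_\\)+[^ \t\n]*[ \t\n]*"
              (point-min) (point-max))))

(my-count-words-and-symbols "multiply-by-seven")   ; 1

(my-count-words-and-symbols
 "(defun multiply-by-seven (number)\n  (* 7 number))")
;; 6: defun, multiply-by-seven, number, *, 7, number
```

Note that `*' is counted, since it is a symbol constituent in this syntax table.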