This file documents `awk', a program that you can use to select particular records in a file and perform operations upon them.

This is Edition 1.0.3 of `Effective AWK Programming', for the 3.0.3 version of the GNU implementation of AWK.

Copyright (C) 1989, 1991, 92, 93, 96, 97 Free Software Foundation, Inc.

Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies.

Permission is granted to copy and distribute modified versions of this manual under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual into another language, under the above conditions for modified versions, except that this permission notice may be stated in a translation approved by the Foundation.

General Introduction
********************

To Miriam, for making me complete.

To Chana, for the joy you bring us.

To Rivka, for the exponential increase.

To Nachum, for the added dimension.

Preface
*******

This Info file teaches you about the `awk' language and how you can use it effectively. You should already be familiar with basic system commands, such as `cat' and `ls',(1) and basic shell facilities, such as Input/Output (I/O) redirection and pipes.

Implementations of the `awk' language are available for many different computing environments. This Info file, while describing the `awk' language in general, also describes a particular implementation of `awk' called `gawk' (which stands for "GNU Awk").
`gawk' runs on a broad range of Unix systems, from 80386 PC-based computers up through large scale systems, such as Crays. `gawk' has also been ported to MS-DOS and OS/2 PCs, Atari and Amiga micro-computers, and VMS.

---------- Footnotes ----------

(1) These commands are available on POSIX compliant systems, as well as on traditional Unix based systems. If you are using some other operating system, you still need to be familiar with the ideas of I/O redirection and pipes.

History of `awk' and `gawk'
===========================

The name `awk' comes from the initials of its designers: Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan. The original version of `awk' was written in 1977 at AT&T Bell Laboratories. In 1985 a new version made the programming language more powerful, introducing user-defined functions, multiple input streams, and computed regular expressions. This new version became generally available with Unix System V Release 3.1. The version in System V Release 4 added some new features and also cleaned up the behavior in some of the "dark corners" of the language. The specification for `awk' in the POSIX Command Language and Utilities standard further clarified the language based on feedback from both the `gawk' designers and the original Bell Labs `awk' designers.

The GNU implementation, `gawk', was written in 1986 by Paul Rubin and Jay Fenlason, with advice from Richard Stallman. John Woods contributed parts of the code as well. In 1988 and 1989, David Trueman, with help from Arnold Robbins, thoroughly reworked `gawk' for compatibility with the newer `awk'. Current development focuses on bug fixes, performance improvements, standards compliance, and occasionally, new features.

The GNU Project and This Book
=============================

The Free Software Foundation (FSF) is a non-profit organization dedicated to the production and distribution of freely distributable software. It was founded by Richard M.
Stallman, the author of the original Emacs editor. GNU Emacs is the most widely used version of Emacs today.

The GNU project is an on-going effort on the part of the Free Software Foundation to create a complete, freely distributable, POSIX compliant computing environment. (GNU stands for "GNU's not Unix".) The FSF uses the "GNU General Public License" (or GPL) to ensure that source code for their software is always available to the end user. A copy of the GPL is included for your reference (*note GNU GENERAL PUBLIC LICENSE: Copying.). The GPL applies to the C language source code for `gawk'.

A shell, an editor (Emacs), highly portable optimizing C, C++, and Objective-C compilers, a symbolic debugger, and dozens of large and small utilities (such as `gawk'), have all been completed and are freely available. As of this writing (early 1997), the GNU operating system kernel (the HURD) has been released, but is still in an early stage of development.

Until the GNU operating system is more fully developed, you should consider using Linux, a freely distributable, Unix-like operating system for 80386, DEC Alpha, Sun SPARC and other systems. There are many books on Linux. One that is freely available is `Linux Installation and Getting Started', by Matt Welsh. Many Linux distributions are available, often in computer stores or bundled on CD-ROM with books about Linux. (There are three other freely available, Unix-like operating systems for 80386 and other systems: NetBSD, FreeBSD, and OpenBSD. All are based on the 4.4-Lite Berkeley Software Distribution, and they use recent versions of `gawk' for their versions of `awk'.)

This Info file itself has gone through several previous, preliminary editions. I started working on a preliminary draft of `The GAWK Manual', by Diane Close, Paul Rubin, and Richard Stallman in the fall of 1988. It was around 90 pages long, and barely described the original, "old" version of `awk'.
After substantial revision, the first version of `The GAWK Manual' to be released was Edition 0.11 Beta in October of 1989. The manual then underwent more substantial revision for Edition 0.13 of December 1991. David Trueman, Pat Rankin, and Michal Jaegermann contributed sections of the manual for Edition 0.13. That edition was published by the FSF as a bound book early in 1992. Since then there have been several minor revisions, notably Edition 0.14 of November 1992 that was published by the FSF in January of 1993, and Edition 0.16 of August 1993.

Edition 1.0 of `Effective AWK Programming' represents a significant re-working of `The GAWK Manual', with much additional material. The FSF and I agree that I am now the primary author. I also felt that it needed a more descriptive title.

`Effective AWK Programming' will undoubtedly continue to evolve. An electronic version comes with the `gawk' distribution from the FSF. If you find an error in this Info file, please report it! *Note Reporting Problems and Bugs: Bugs, for information on submitting problem reports electronically, or write to me in care of the FSF.

Acknowledgements
================

I would like to acknowledge Richard M. Stallman, for his vision of a better world, and for his courage in founding the FSF and starting the GNU project.

The initial draft of `The GAWK Manual' had the following acknowledgements:

     Many people need to be thanked for their assistance in producing this manual. Jay Fenlason contributed many ideas and sample programs. Richard Mlynarik and Robert Chassell gave helpful comments on drafts of this manual. The paper `A Supplemental Document for `awk'' by John W. Pierce of the Chemistry Department at UC San Diego, pinpointed several issues relevant both to `awk' implementation and to this manual, that would otherwise have escaped us.

The following people provided many helpful comments on Edition 0.13 of `The GAWK Manual': Rick Adams, Michael Brennan, Rich Burridge, Diane Close, Christopher ("Topher") Eliot, Michael Lijewski, Pat Rankin, Miriam Robbins, and Michal Jaegermann.

The following people provided many helpful comments for Edition 1.0 of `Effective AWK Programming': Karl Berry, Michael Brennan, Darrel Hankerson, Michal Jaegermann, Michael Lijewski, and Miriam Robbins. Pat Rankin, Michal Jaegermann, Darrel Hankerson and Scott Deifik updated their respective sections for Edition 1.0.

Robert J. Chassell provided much valuable advice on the use of Texinfo. He also deserves special thanks for convincing me *not* to title this Info file `How To Gawk Politely'. Karl Berry helped significantly with the TeX part of Texinfo.

David Trueman deserves special credit; he has done a yeoman job of evolving `gawk' so that it performs well, and without bugs. Although he is no longer involved with `gawk', working with him on this project was a significant pleasure.

Scott Deifik, Darrel Hankerson, Kai Uwe Rommel, Pat Rankin, and Michal Jaegermann (in no particular order) are long time members of the `gawk' "crack portability team." Without their hard work and help, `gawk' would not be nearly the fine program it is today. It has been and continues to be a pleasure working with this team of fine people.

Jeffrey Friedl provided invaluable help in tracking down a number of last minute problems with regular expressions in `gawk' 3.0.

David and I would like to thank Brian Kernighan of Bell Labs for invaluable assistance during the testing and debugging of `gawk', and for help in clarifying numerous points about the language. We could not have done nearly as good a job on either `gawk' or its documentation without his help.

I would like to thank Marshall and Elaine Hartholz of Seattle, and Dr.
Bert and Rita Schreiber of Detroit for large amounts of quiet vacation time in their homes, which allowed me to make significant progress on this Info file and on `gawk' itself. Phil Hughes of SSC contributed in a very important way by loaning me his laptop Linux system, not once, but twice, allowing me to do a lot of work while away from home.

Finally, I must thank my wonderful wife, Miriam, for her patience through the many versions of this project, for her proof-reading, and for sharing me with the computer. I would like to thank my parents for their love, and for the grace with which they raised and educated me. I also must acknowledge my gratitude to G-d, for the many opportunities He has sent my way, as well as for the gifts He has given me with which to take advantage of those opportunities.

Arnold Robbins
Atlanta, Georgia
February, 1997

Introduction
************

If you are like many computer users, you would frequently like to make changes in various text files wherever certain patterns appear, or extract data from parts of certain lines while discarding the rest. To write a program to do this in a language such as C or Pascal is a time-consuming inconvenience that may take many lines of code. The job may be easier with `awk'.

The `awk' utility interprets a special-purpose programming language that makes it possible to handle simple data-reformatting jobs with just a few lines of code.

The GNU implementation of `awk' is called `gawk'; it is fully upward compatible with the System V Release 4 version of `awk'. `gawk' is also upward compatible with the POSIX specification of the `awk' language. This means that all properly written `awk' programs should work with `gawk'. Thus, we usually don't distinguish between `gawk' and other `awk' implementations.

Using `awk' you can:

   * manage small, personal databases

   * generate reports

   * validate data

   * produce indexes, and perform other document preparation tasks

   * even experiment with algorithms that can be adapted later to other computer languages

Using This Book
===============

The term `awk' refers to a particular program, and to the language you use to tell this program what to do. When we need to be careful, we call the program "the `awk' utility" and the language "the `awk' language." The term `gawk' refers to a version of `awk' developed as part of the GNU project. The purpose of this Info file is to explain both the `awk' language and how to run the `awk' utility.

The main purpose of the Info file is to explain the features of `awk', as defined in the POSIX standard. It does so in the context of one particular implementation, `gawk'. While doing so, it will also attempt to describe important differences between `gawk' and other `awk' implementations. Finally, any `gawk' features that are not in the POSIX standard for `awk' will be noted.

The term "`awk' program" refers to a program written by you in the `awk' programming language.

*Note Getting Started with `awk': Getting Started, for the bare essentials you need to know to start using `awk'.

Some useful "one-liners" are included to give you a feel for the `awk' language (*note Useful One Line Programs: One-liners.).

Many sample `awk' programs have been provided for you (*note A Library of `awk' Functions: Library Functions.; also *note Practical `awk' Programs: Sample Programs.).

The entire `awk' language is summarized for quick reference in *Note `gawk' Summary: Gawk Summary. Look there if you just need to refresh your memory about a particular feature.

If you find terms that you aren't familiar with, try looking them up in the glossary (*note Glossary::.).

Most of the time complete `awk' programs are used as examples, but in some of the more advanced sections, only the part of the `awk' program that illustrates the concept being described is shown.

While this Info file is aimed principally at people who have not been exposed to `awk', there is a lot of information here that even the `awk' expert should find useful. In particular, the description of POSIX `awk', and the example programs in *Note A Library of `awk' Functions: Library Functions, and *Note Practical `awk' Programs: Sample Programs, should be of interest.

Dark Corners
------------

     Who opened that window shade?!?
     Count Dracula

Until the POSIX standard (and `The Gawk Manual'), many features of `awk' were either poorly documented, or not documented at all. Descriptions of such features (often called "dark corners") are noted in this Info file with "(d.c.)". They also appear in the index under the heading "dark corner."

Typographical Conventions
=========================

This Info file is written using Texinfo, the GNU documentation formatting language. A single Texinfo source file is used to produce both the printed and on-line versions of the documentation. This section briefly documents the typographical conventions used in Texinfo.

Examples you would type at the command line are preceded by the common shell primary and secondary prompts, `$' and `>'. Output from the command is preceded by the glyph "-|". This typically represents the command's standard output. Error messages, and other output on the command's standard error, are preceded by the glyph "error-->". For example:

     $ echo hi on stdout
     -| hi on stdout
     $ echo hello on stderr 1>&2
     error--> hello on stderr

Characters that you type at the keyboard look `like this'. In particular, there are special characters called "control characters." These are characters that you type by holding down both the `CONTROL' key and another key, at the same time.
For example, a `Control-d' is typed by first pressing and holding the `CONTROL' key, next pressing the `d' key, and finally releasing both keys.

Data Files for the Examples
===========================

Many of the examples in this Info file take their input from two sample data files. The first, called `BBS-list', represents a list of computer bulletin board systems together with information about those systems. The second data file, called `inventory-shipped', contains information about shipments on a monthly basis. In both files, each line is considered to be one "record".

In the file `BBS-list', each record contains the name of a computer bulletin board, its phone number, the board's baud rate(s), and a code for the number of hours it is operational. An `A' in the last column means the board operates 24 hours a day. A `B' in the last column means the board operates evening and weekend hours, only. A `C' means the board operates only on weekends.

     aardvark     555-5553     1200/300          B
     alpo-net     555-3412     2400/1200/300     A
     barfly       555-7685     1200/300          A
     bites        555-1675     2400/1200/300     A
     camelot      555-0542     300               C
     core         555-2912     1200/300          C
     fooey        555-1234     2400/1200/300     B
     foot         555-6699     1200/300          B
     macfoo       555-6480     1200/300          A
     sdace        555-3430     2400/1200/300     A
     sabafoo      555-2127     1200/300          C

The second data file, called `inventory-shipped', represents information about shipments during the year. Each record contains the month of the year, the number of green crates shipped, the number of red boxes shipped, the number of orange bags shipped, and the number of blue packages shipped, respectively. There are 16 entries, covering the 12 months of one year and four months of the next year.

     Jan  13  25  15 115
     Feb  15  32  24 226
     Mar  15  24  34 228
     Apr  31  52  63 420
     May  16  34  29 208
     Jun  31  42  75 492
     Jul  24  34  67 436
     Aug  15  34  47 316
     Sep  13  55  37 277
     Oct  29  54  68 525
     Nov  20  87  82 577
     Dec  17  35  61 401
     Jan  21  36  64 620
     Feb  26  58  80 652
     Mar  24  75  70 495
     Apr  21  70  74 514

If you are reading this in GNU Emacs using Info, you can copy the regions of text showing these sample files into your own test files. This way you can try out the examples shown in the remainder of this document. You do this by using the command `M-x write-region' to copy text from the Info file into a file for use with `awk' (*Note Miscellaneous File Operations: (emacs)Misc File Ops, for more information). Using this information, create your own `BBS-list' and `inventory-shipped' files, and practice what you learn in this Info file.

If you are using the stand-alone version of Info, see *Note Extracting Programs from Texinfo Source Files: Extract Program, for an `awk' program that will extract these data files from `gawk.texi', the Texinfo source file for this Info file.

Getting Started with `awk'
**************************

The basic function of `awk' is to search files for lines (or other units of text) that contain certain patterns. When a line matches one of the patterns, `awk' performs specified actions on that line. `awk' keeps processing input lines in this way until the end of the input files is reached.

Programs in `awk' are different from programs in most other languages, because `awk' programs are "data-driven"; that is, you describe the data you wish to work with, and then what to do when you find it. Most other languages are "procedural"; you have to describe, in great detail, every step the program is to take. When working with procedural languages, it is usually much harder to clearly describe the data your program will process. For this reason, `awk' programs are often refreshingly easy to both write and read.
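
As a small taste of the data-driven style, here is a minimal sketch. The three records are copied from `BBS-list'; the shell variable names `records' and `matches' are assumptions made only for this illustration. The program names the data of interest (lines containing `2400') and what to do with it (print the line); opening the input and looping over its lines are left to `awk'.

```shell
# Three records taken from the BBS-list sample file.
records='aardvark     555-5553     1200/300          B
alpo-net     555-3412     2400/1200/300     A
camelot      555-0542     300               C'

# One rule: the pattern /2400/ selects the data, the action prints it.
matches=$(printf '%s\n' "$records" | awk '/2400/ { print }')

echo "$matches"
```

Only the `alpo-net' record contains `2400', so it is the only line printed. The pieces of the rule are explained in the sections that follow.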

When you run `awk', you specify an `awk' "program" that tells `awk' what to do. The program consists of a series of "rules". (It may also contain "function definitions", an advanced feature which we will ignore for now. *Note User-defined Functions: User-defined.) Each rule specifies one pattern to search for, and one action to perform when that pattern is found.

Syntactically, a rule consists of a pattern followed by an action. The action is enclosed in curly braces to separate it from the pattern. Rules are usually separated by newlines. Therefore, an `awk' program looks like this:

     PATTERN { ACTION }
     PATTERN { ACTION }
     ...

A Rose By Any Other Name
========================

The `awk' language has evolved over the years. Full details are provided in *Note The Evolution of the `awk' Language: Language History. The language described in this Info file is often referred to as "new `awk'."

Because of this, many systems have multiple versions of `awk'. Some systems have an `awk' utility that implements the original version of the `awk' language, and a `nawk' utility for the new version. Others have an `oawk' for the "old `awk'" language, and plain `awk' for the new one. Still others only have one version, usually the new one.(1)

All in all, this makes it difficult for you to know which version of `awk' you should run when writing your programs. The best advice we can give here is to check your local documentation. Look for `awk', `oawk', and `nawk', as well as for `gawk'. Chances are, you will have some version of new `awk' on your system, and that is what you should use when running your programs. (Of course, if you're reading this Info file, chances are good that you have `gawk'!)

Throughout this Info file, whenever we refer to a language feature that should be available in any complete implementation of POSIX `awk', we simply use the term `awk'. When referring to a feature that is specific to the GNU implementation, we use the term `gawk'.

---------- Footnotes ----------

(1) Often, these systems use `gawk' for their `awk' implementation!

How to Run `awk' Programs
=========================

There are several ways to run an `awk' program. If the program is short, it is easiest to include it in the command that runs `awk', like this:

     awk 'PROGRAM' INPUT-FILE1 INPUT-FILE2 ...

where PROGRAM consists of a series of patterns and actions, as described earlier. (The reason for the single quotes is described below, in *Note One-shot Throw-away `awk' Programs: One-shot.)

When the program is long, it is usually more convenient to put it in a file and run it with a command like this:

     awk -f PROGRAM-FILE INPUT-FILE1 INPUT-FILE2 ...

One-shot Throw-away `awk' Programs
----------------------------------

Once you are familiar with `awk', you will often type in simple programs the moment you want to use them. Then you can write the program as the first argument of the `awk' command, like this:

     awk 'PROGRAM' INPUT-FILE1 INPUT-FILE2 ...

where PROGRAM consists of a series of PATTERNS and ACTIONS, as described earlier.

This command format instructs the "shell", or command interpreter, to start `awk' and use the PROGRAM to process records in the input file(s). There are single quotes around PROGRAM so that the shell doesn't interpret any `awk' characters as special shell characters. They also cause the shell to treat all of PROGRAM as a single argument for `awk' and allow PROGRAM to be more than one line long.

This format is also useful for running short or medium-sized `awk' programs from shell scripts, because it avoids the need for a separate file for the `awk' program. A self-contained shell script is more reliable since there are no other files to misplace.

*Note Useful One Line Programs: One-liners, presents several short, self-contained programs.

As an interesting side point, the command

     awk '/foo/' FILES ...

is essentially the same as

     egrep foo FILES ...
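
The equivalence noted above can be checked with a short sketch. The sample names come from `BBS-list'; the variable names `names', `awk_out', and `grep_out' are assumptions for this illustration only. The sketch uses `grep -E', the POSIX spelling of `egrep', since some current systems no longer provide `egrep' under that name.

```shell
# A few board names from BBS-list, one per line.
names='fooey
barfly
macfoo'

# Single quotes keep the shell from interpreting $0 or the braces.
awk_out=$(printf '%s\n' "$names" | awk '/foo/ { print $0 }')

# grep -E takes the same extended regular expression.
grep_out=$(printf '%s\n' "$names" | grep -E foo)

echo "$awk_out"
[ "$awk_out" = "$grep_out" ] && echo "same output"
```

Both commands select `fooey' and `macfoo', so the final line of output should be `same output'.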

Running `awk' without Input Files
---------------------------------

You can also run `awk' without any input files. If you type the command line:

     awk 'PROGRAM'

then `awk' applies the PROGRAM to the "standard input", which usually means whatever you type on the terminal. This continues until you indicate end-of-file by typing `Control-d'. (On other operating systems, the end-of-file character may be different. For example, on OS/2 and MS-DOS, it is `Control-z'.)

For example, the following program prints a friendly piece of advice (from Douglas Adams' `The Hitchhiker's Guide to the Galaxy'), to keep you from worrying about the complexities of computer programming (`BEGIN' is a feature we haven't discussed yet).

     $ awk "BEGIN { print \"Don't Panic!\" }"
     -| Don't Panic!

This program does not read any input. The `\' before each of the inner double quotes is necessary because of the shell's quoting rules, in particular because it mixes both single quotes and double quotes.

This next simple `awk' program emulates the `cat' utility; it copies whatever you type at the keyboard to its standard output. (Why this works is explained shortly.)

     $ awk '{ print }'
     Now is the time for all good men
     -| Now is the time for all good men
     to come to the aid of their country.
     -| to come to the aid of their country.
     Four score and seven years ago, ...
     -| Four score and seven years ago, ...
     What, me worry?
     -| What, me worry?
     Control-d

Running Long Programs
---------------------

Sometimes your `awk' programs can be very long. In this case it is more convenient to put the program into a separate file. To tell `awk' to use that file for its program, you type:

     awk -f SOURCE-FILE INPUT-FILE1 INPUT-FILE2 ...

The `-f' instructs the `awk' utility to get the `awk' program from the file SOURCE-FILE. Any file name can be used for SOURCE-FILE. For example, you could put the program:

     BEGIN { print "Don't Panic!" }

into the file `advice'.
Then this command:

     awk -f advice

does the same thing as this one:

     awk "BEGIN { print \"Don't Panic!\" }"

which was explained earlier (*note Running `awk' without Input Files: Read Terminal.). Note that you don't usually need single quotes around the file name that you specify with `-f', because most file names don't contain any of the shell's special characters. Notice that in `advice', the `awk' program did not have single quotes around it. The quotes are only needed for programs that are provided on the `awk' command line.

If you want to identify your `awk' program files clearly as such, you can add the extension `.awk' to the file name. This doesn't affect the execution of the `awk' program, but it does make "housekeeping" easier.

Executable `awk' Programs
-------------------------

Once you have learned `awk', you may want to write self-contained `awk' scripts, using the `#!' script mechanism. You can do this on many Unix systems(1) (and someday on the GNU system).

For example, you could update the file `advice' to look like this:

     #! /bin/awk -f

     BEGIN { print "Don't Panic!" }

After making this file executable (with the `chmod' utility), you can simply type `advice' at the shell, and the system will arrange to run `awk'(2) as if you had typed `awk -f advice'.

     $ advice
     -| Don't Panic!

Self-contained `awk' scripts are useful when you want to write a program which users can invoke without their having to know that the program is written in `awk'.

Some older systems do not support the `#!' mechanism. You can get a similar effect using a regular shell script. It would look something like this:

     : The colon ensures execution by the standard shell.
     awk 'PROGRAM' "$@"

Using this technique, it is *vital* to enclose the PROGRAM in single quotes to protect it from interpretation by the shell. If you omit the quotes, only a shell wizard can predict the results.

The `"$@"' causes the shell to forward all the command line arguments to the `awk' program, without interpretation.
The first line, which starts with a colon, is used so that this shell script will work even if invoked by a user who uses the C shell. (Not all older systems obey this convention, but many do.)

---------- Footnotes ----------

(1) The `#!' mechanism works on Linux systems, Unix systems derived from Berkeley Unix, System V Release 4, and some System V Release 3 systems.

(2) The line beginning with `#!' lists the full file name of an interpreter to be run, and an optional initial command line argument to pass to that interpreter. The operating system then runs the interpreter with the given argument and the full argument list of the executed program. The first argument in the list is the full file name of the `awk' program. The rest of the argument list will either be options to `awk', or data files, or both.

Comments in `awk' Programs
--------------------------

A "comment" is some text that is included in a program for the sake of human readers; it is not really part of the program. Comments can explain what the program does, and how it works. Nearly all programming languages have provisions for comments, because programs are typically hard to understand without their extra help.

In the `awk' language, a comment starts with the sharp sign character, `#', and continues to the end of the line. The `#' does not have to be the first character on the line. The `awk' language ignores the rest of a line following a sharp sign. For example, we could have put the following into `advice':

     # This program prints a nice friendly message.  It helps
     # keep novice users from being afraid of the computer.
     BEGIN    { print "Don't Panic!" }

You can put comment lines into keyboard-composed throw-away `awk' programs also, but this usually isn't very useful; the purpose of a comment is to help you or another person understand the program at a later time.
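
The following sketch puts the two comment placements side by side: a comment on a line of its own, and a comment after a statement on the same line. The file name `advice.awk', the scratch directory, and the variable name `advice' are assumptions for this illustration, and `mktemp -d' is assumed to be available. The program is stored in a file and run with `-f', so the apostrophe in the message needs no shell quoting.

```shell
# Create a scratch directory for the program file (hypothetical location).
dir=$(mktemp -d)

# The quoted 'EOF' stops the shell from altering the program text.
cat > "$dir/advice.awk" <<'EOF'
# This whole line is a comment.
BEGIN { print "Don't Panic!" }   # so is the rest of this line
EOF

# Run the program from the file; the comments have no effect on the output.
advice=$(awk -f "$dir/advice.awk")
echo "$advice"

rm -rf "$dir"
```

The output is the same single line of advice that the uncommented program prints.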

A Very Simple Example
=====================

The following command runs a simple `awk' program that searches the input file `BBS-list' for the string of characters: `foo'. (A string of characters is usually called a "string". The term "string" is perhaps based on similar usage in English, such as "a string of pearls," or, "a string of cars in a train.")

     awk '/foo/ { print $0 }' BBS-list

When lines containing `foo' are found, they are printed, because `print $0' means print the current line. (Just `print' by itself means the same thing, so we could have written that instead.)

You will notice that slashes, `/', surround the string `foo' in the `awk' program. The slashes indicate that `foo' is a pattern to search for. This type of pattern is called a "regular expression", and is covered in more detail later (*note Regular Expressions: Regexp.). The pattern is allowed to match parts of words. There are single-quotes around the `awk' program so that the shell won't interpret any of it as special shell characters.

Here is what this program prints:

     $ awk '/foo/ { print $0 }' BBS-list
     -| fooey        555-1234     2400/1200/300     B
     -| foot         555-6699     1200/300          B
     -| macfoo       555-6480     1200/300          A
     -| sabafoo      555-2127     1200/300          C

In an `awk' rule, either the pattern or the action can be omitted, but not both. If the pattern is omitted, then the action is performed for *every* input line. If the action is omitted, the default action is to print all lines that match the pattern.

Thus, we could leave out the action (the `print' statement and the curly braces) in the above example, and the result would be the same: all lines matching the pattern `foo' would be printed. By comparison, omitting the `print' statement but retaining the curly braces makes an empty action that does nothing; then no lines would be printed.

An Example with Two Rules
=========================

The `awk' utility reads the input files one line at a time. For each line, `awk' tries the patterns of each of the rules.
If several patterns match then several actions are run, in the order in which they appear in the `awk' program. If no patterns match, then no actions are run.

After processing all the rules (perhaps none) that match the line, `awk' reads the next line (however, *note The `next' Statement: Next Statement., and also *note The `nextfile' Statement: Nextfile Statement.). This continues until the end of the file is reached.

For example, the `awk' program:

     /12/  { print $0 }
     /21/  { print $0 }

contains two rules. The first rule has the string `12' as the pattern and `print $0' as the action. The second rule has the string `21' as the pattern and also has `print $0' as the action. Each rule's action is enclosed in its own pair of braces.

This `awk' program prints every line that contains the string `12' *or* the string `21'. If a line contains both strings, it is printed twice, once by each rule.

This is what happens if we run this program on our two sample data files, `BBS-list' and `inventory-shipped', as shown here:

     $ awk '/12/ { print $0 }
     >      /21/ { print $0 }' BBS-list inventory-shipped
     -| aardvark     555-5553     1200/300          B
     -| alpo-net     555-3412     2400/1200/300     A
     -| barfly       555-7685     1200/300          A
     -| bites        555-1675     2400/1200/300     A
     -| core         555-2912     1200/300          C
     -| fooey        555-1234     2400/1200/300     B
     -| foot         555-6699     1200/300          B
     -| macfoo       555-6480     1200/300          A
     -| sdace        555-3430     2400/1200/300     A
     -| sabafoo      555-2127     1200/300          C
     -| sabafoo      555-2127     1200/300          C
     -| Jan  21  36  64 620
     -| Apr  21  70  74 514

Note how the line in `BBS-list' beginning with `sabafoo' was printed twice, once for each rule.

A More Complex Example
======================

Here is an example to give you an idea of what typical `awk' programs do. This example shows how `awk' can be used to summarize, select, and rearrange the output of another utility. It uses features that haven't been covered yet, so don't worry if you don't understand all the details.

     ls -lg | awk '$6 == "Nov" { sum += $5 }
                   END { print sum }'

This command prints the total number of bytes in all the files in the current directory that were last modified in November (of any year). (In the C shell you would need to type a semicolon and then a backslash at the end of the first line; in a POSIX-compliant shell, such as the Bourne shell or Bash, the GNU Bourne-Again shell, you can type the example as shown.)

The `ls -lg' part of this example is a system command that gives you a listing of the files in a directory, including file size and the date the file was last modified. Its output looks like this:

     -rw-r--r--  1 arnold   user   1933 Nov  7 13:05 Makefile
     -rw-r--r--  1 arnold   user  10809 Nov  7 13:03 gawk.h
     -rw-r--r--  1 arnold   user    983 Apr 13 12:14 gawk.tab.h
     -rw-r--r--  1 arnold   user  31869 Jun 15 12:20 gawk.y
     -rw-r--r--  1 arnold   user  22414 Nov  7 13:03 gawk1.c
     -rw-r--r--  1 arnold   user  37455 Nov  7 13:03 gawk2.c
     -rw-r--r--  1 arnold   user  27511 Dec  9 13:07 gawk3.c
     -rw-r--r--  1 arnold   user   7989 Nov  7 13:03 gawk4.c

The first field contains read-write permissions, the second field contains the number of links to the file, and the third field identifies the owner of the file. The fourth field identifies the group of the file. The fifth field contains the size of the file in bytes. The sixth, seventh and eighth fields contain the month, day, and time, respectively, that the file was last modified. Finally, the ninth field contains the name of the file.

The `$6 == "Nov"' in our `awk' program is an expression that tests whether the sixth field of the output from `ls -lg' matches the string `Nov'. Each time a line has the string `Nov' for its sixth field, the action `sum += $5' is performed. This adds the fifth field (the file size) to the variable `sum'. As a result, when `awk' has finished reading all the input lines, `sum' is the sum of the sizes of files whose lines matched the pattern. (This works because `awk' variables are automatically initialized to zero.)
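
To try the summing rule without depending on what `ls -lg' prints on your own system, you can feed it the sample listing directly. The following is only a sketch: the here-document stands in for the output of `ls', and the field numbers assume the nine-field layout described above.

```shell
# Sum the fifth field (size in bytes) of every line whose sixth
# field is "Nov".  The here-document reproduces the sample listing.
awk '$6 == "Nov" { sum += $5 }
     END { print sum }' <<'EOF'
-rw-r--r--  1 arnold   user   1933 Nov  7 13:05 Makefile
-rw-r--r--  1 arnold   user  10809 Nov  7 13:03 gawk.h
-rw-r--r--  1 arnold   user    983 Apr 13 12:14 gawk.tab.h
-rw-r--r--  1 arnold   user  31869 Jun 15 12:20 gawk.y
-rw-r--r--  1 arnold   user  22414 Nov  7 13:03 gawk1.c
-rw-r--r--  1 arnold   user  37455 Nov  7 13:03 gawk2.c
-rw-r--r--  1 arnold   user  27511 Dec  9 13:07 gawk3.c
-rw-r--r--  1 arnold   user   7989 Nov  7 13:03 gawk4.c
EOF
```

The five `Nov' lines contribute 1933 + 10809 + 22414 + 37455 + 7989 bytes, so the command prints `80600'.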
After the last line of output from `ls' has been processed, the `END' rule is executed, and the value of `sum' is printed. In this example, the value of `sum' would be 80600.

These more advanced `awk' techniques are covered in later sections (*note Overview of Actions: Action Overview.). Before you can move on to more advanced `awk' programming, you have to know how `awk' interprets your input and displays your output. By manipulating fields and using `print' statements, you can produce some very useful and impressive looking reports.

`awk' Statements Versus Lines
=============================

Most often, each line in an `awk' program is a separate statement or separate rule, like this:

     awk '/12/  { print $0 }
          /21/  { print $0 }' BBS-list inventory-shipped

However, `gawk' will ignore newlines after any of the following:

     ,    {    ?    :    ||    &&    do    else

A newline at any other point is considered the end of the statement. (Splitting lines after `?' and `:' is a minor `gawk' extension. The `?' and `:' referred to here are the two parts of the three-operand conditional expression described in *Note Conditional Expressions: Conditional Exp.)

If you would like to split a single statement into two lines at a point where a newline would terminate it, you can "continue" it by ending the first line with a backslash character, `\'. The backslash must be the final character on the line to be recognized as a continuation character. This is allowed absolutely anywhere in the statement, even in the middle of a string or regular expression. For example:

     awk '/This regular expression is too long, so continue it\
      on the next line/ { print $1 }'

We have generally not used backslash continuation in the sample programs in this Info file. Since in `gawk' there is no limit on the length of a line, it is never strictly necessary; it just makes programs more readable. For this same reason, as well as for clarity, we have kept most statements short in the sample programs presented throughout the Info file.
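
As a small illustration of these rules, the following sketch breaks one rule across several lines: twice at points where a newline is ignored (after `&&' and after a comma), and once with a backslash where a bare newline would otherwise end the statement. The two-column input and the variable name `total' are made up for the example, and a POSIX-compliant shell is assumed.

```shell
# Newlines after "&&" and "," are ignored by awk.  The backslash
# joins the two halves of the addition, where a plain newline
# would terminate the assignment.
printf '3 4\n10 2\n' | awk '$1 > 2 &&
$2 > 3 { total = $1 + \
                 $2
         print $1,
               total }'
```

Only the first input line satisfies both tests, so the program prints `3 7'.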
Backslash continuation is most useful when your `awk' program is in a separate source file, instead of typed in on the command line. You should also note that many `awk' implementations are more particular about where you may use backslash continuation. For example, they may not allow you to split a string constant using backslash continuation. Thus, for maximal portability of your `awk' programs, it is best not to split your lines in the middle of a regular expression or a string. *Caution: backslash continuation does not work as described above with the C shell.* Continuation with backslash works for `awk' programs in files, and also for one-shot programs *provided* you are using a POSIX-compliant shell, such as the Bourne shell or Bash, the GNU Bourne-Again shell. But the C shell (`csh') behaves differently! There, you must use two backslashes in a row, followed by a newline. Note also that when using the C shell, *every* newline in your awk program must be escaped with a backslash. To illustrate: % awk 'BEGIN { \ ? print \\ ? "hello, world" \ ? }' -| hello, world Here, the `%' and `?' are the C shell's primary and secondary prompts, analogous to the standard shell's `$' and `>'. `awk' is a line-oriented language. Each rule's action has to begin on the same line as the pattern. To have the pattern and action on separate lines, you *must* use backslash continuation--there is no other way. Note that backslash continuation and comments do not mix. As soon as `awk' sees the `#' that starts a comment, it ignores *everything* on the rest of the line. For example: $ gawk 'BEGIN { print "dont panic" # a friendly \ > BEGIN rule > }' error--> gawk: cmd. line:2: BEGIN rule error--> gawk: cmd. line:2: ^ parse error Here, it looks like the backslash would continue the comment onto the next line. However, the backslash-newline combination is never even noticed, since it is "hidden" inside the comment. Thus, the `BEGIN' is noted as a syntax error. 
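
One way around this, sketched here for a POSIX-compliant shell, is to drop the backslash altogether and let the comment end with its own line. The action is still open until its closing brace, so the rule simply carries on from the next line.

```shell
# The comment runs to the end of its line; no backslash is needed
# because the action stays open until the closing brace.
awk 'BEGIN { print "dont panic"   # a friendly BEGIN rule
}'
```

This prints `dont panic' without any parse error.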
When `awk' statements within one rule are short, you might want to put more than one of them on a line. You do this by separating the statements with a semicolon, `;'. This also applies to the rules themselves. Thus, the previous program could have been written: /12/ { print $0 } ; /21/ { print $0 } *Note:* the requirement that rules on the same line must be separated with a semicolon was not in the original `awk' language; it was added for consistency with the treatment of statements within an action. Other Features of `awk' ======================= The `awk' language provides a number of predefined, or built-in variables, which your programs can use to get information from `awk'. There are other variables your program can set to control how `awk' processes your data. In addition, `awk' provides a number of built-in functions for doing common computational and string related operations. As we develop our presentation of the `awk' language, we introduce most of the variables and many of the functions. They are defined systematically in *Note Built-in Variables::, and *Note Built-in Functions: Built-in. When to Use `awk' ================= You might wonder how `awk' might be useful for you. Using utility programs, advanced patterns, field separators, arithmetic statements, and other selection criteria, you can produce much more complex output. The `awk' language is very useful for producing reports from large amounts of raw data, such as summarizing information from the output of other utility programs like `ls'. (*Note A More Complex Example: More Complex.) Programs written with `awk' are usually much smaller than they would be in other languages. This makes `awk' programs easy to compose and use. Often, `awk' programs can be quickly composed at your terminal, used once, and thrown away. Since `awk' programs are interpreted, you can avoid the (usually lengthy) compilation part of the typical edit-compile-test-debug cycle of software development. 
Complex programs have been written in `awk', including a complete retargetable assembler for eight-bit microprocessors (*note Glossary::., for more information) and a microcode assembler for a special purpose Prolog computer. However, `awk''s capabilities are strained by tasks of such complexity. If you find yourself writing `awk' scripts of more than, say, a few hundred lines, you might consider using a different programming language. Emacs Lisp is a good choice if you need sophisticated string or pattern matching capabilities. The shell is also good at string and pattern matching; in addition, it allows powerful use of the system utilities. More conventional languages, such as C, C++, and Lisp, offer better facilities for system programming and for managing the complexity of large programs. Programs in these languages may require more lines of source code than the equivalent `awk' programs, but they are easier to maintain and usually run more efficiently. Useful One Line Programs ************************ Many useful `awk' programs are short, just a line or two. Here is a collection of useful, short programs to get you started. Some of these programs contain constructs that haven't been covered yet. The description of the program will give you a good idea of what is going on, but please read the rest of the Info file to become an `awk' expert! Most of the examples use a data file named `data'. This is just a placeholder; if you were to use these programs yourself, you would substitute your own file names for `data'. Since you are reading this in Info, each line of the example code is enclosed in quotes, to represent text that you would type literally. The examples themselves represent shell commands that use single quotes to keep the shell from interpreting the contents of the program. When reading the examples, focus on the text between the open and close quotes. 
`awk '{ if (length($0) > max) max = length($0) }'
`     END { print max }' data'
     This program prints the length of the longest input line.

`awk 'length($0) > 80' data'
     This program prints every line that is longer than 80 characters. The sole rule has a relational expression as its pattern, and has no action (so the default action, printing the record, is used).

`expand data | awk '{ if (x < length()) x = length() }'
`              END { print "maximum line length is " x }''
     This program prints the length of the longest line in `data'. The input is processed by the `expand' program to change tabs into spaces, so the widths compared are actually the right-margin columns.

`awk 'NF > 0' data'
     This program prints every line that has at least one field. This is an easy way to delete blank lines from a file (or rather, to create a new file similar to the old file but from which the blank lines have been deleted).

`awk 'BEGIN { for (i = 1; i <= 7; i++)'
`               print int(101 * rand()) }''
     This program prints seven random numbers from zero to 100, inclusive.

`ls -lg FILES | awk '{ x += $5 } ; END { print "total bytes: " x }''
     This program prints the total number of bytes used by FILES.

`ls -lg FILES | awk '{ x += $5 }'
`                END { print "total K-bytes: " (x + 1023)/1024 }''
     This program prints the total number of kilobytes used by FILES.

`awk -F: '{ print $1 }' /etc/passwd | sort'
     This program prints a sorted list of the login names of all users.

`awk 'END { print NR }' data'
     This program counts lines in a file.

`awk 'NR % 2 == 0' data'
     This program prints the even numbered lines in the data file. If you were to use the expression `NR % 2 == 1' instead, it would print the odd numbered lines.

Regular Expressions
*******************

A "regular expression", or "regexp", is a way of describing a set of strings. Because regular expressions are such a fundamental part of `awk' programming, their format and use deserve a separate chapter.
A regular expression enclosed in slashes (`/') is an `awk' pattern that matches every input record whose text belongs to that set. The simplest regular expression is a sequence of letters, numbers, or both. Such a regexp matches any string that contains that sequence. Thus, the regexp `foo' matches any string containing `foo'. Therefore, the pattern `/foo/' matches any input record containing the three characters `foo', *anywhere* in the record. Other kinds of regexps let you specify more complicated classes of strings. How to Use Regular Expressions ============================== A regular expression can be used as a pattern by enclosing it in slashes. Then the regular expression is tested against the entire text of each record. (Normally, it only needs to match some part of the text in order to succeed.) For example, this prints the second field of each record that contains the three characters `foo' anywhere in it: $ awk '/foo/ { print $2 }' BBS-list -| 555-1234 -| 555-6699 -| 555-6480 -| 555-2127 Regular expressions can also be used in matching expressions. These expressions allow you to specify the string to match against; it need not be the entire current input record. The two operators, `~' and `!~', perform regular expression comparisons. Expressions using these operators can be used as patterns or in `if', `while', `for', and `do' statements. (*Note Control Statements in Actions: Statements.) `EXP ~ /REGEXP/' This is true if the expression EXP (taken as a string) is matched by REGEXP. The following example matches, or selects, all input records with the upper-case letter `J' somewhere in the first field: $ awk '$1 ~ /J/' inventory-shipped -| Jan 13 25 15 115 -| Jun 31 42 75 492 -| Jul 24 34 67 436 -| Jan 21 36 64 620 So does this: awk '{ if ($1 ~ /J/) print }' inventory-shipped `EXP !~ /REGEXP/' This is true if the expression EXP (taken as a character string) is *not* matched by REGEXP. 
The following example matches, or selects, all input records whose first field *does not* contain the upper-case letter `J': $ awk '$1 !~ /J/' inventory-shipped -| Feb 15 32 24 226 -| Mar 15 24 34 228 -| Apr 31 52 63 420 -| May 16 34 29 208 ... When a regexp is written enclosed in slashes, like `/foo/', we call it a "regexp constant", much like `5.27' is a numeric constant, and `"foo"' is a string constant. Escape Sequences ================ Some characters cannot be included literally in string constants (`"foo"') or regexp constants (`/foo/'). You represent them instead with "escape sequences", which are character sequences beginning with a backslash (`\'). One use of an escape sequence is to include a double-quote character in a string constant. Since a plain double-quote would end the string, you must use `\"' to represent an actual double-quote character as a part of the string. For example: $ awk 'BEGIN { print "He said \"hi!\" to her." }' -| He said "hi!" to her. The backslash character itself is another character that cannot be included normally; you write `\\' to put one backslash in the string or regexp. Thus, the string whose contents are the two characters `"' and `\' must be written `"\"\\"'. Another use of backslash is to represent unprintable characters such as tab or newline. While there is nothing to stop you from entering most unprintable characters directly in a string constant or regexp constant, they may look ugly. Here is a table of all the escape sequences used in `awk', and what they represent. Unless noted otherwise, all of these escape sequences apply to both string constants and regexp constants. `\\' A literal backslash, `\'. `\a' The "alert" character, `Control-g', ASCII code 7 (BEL). `\b' Backspace, `Control-h', ASCII code 8 (BS). `\f' Formfeed, `Control-l', ASCII code 12 (FF). `\n' Newline, `Control-j', ASCII code 10 (LF). `\r' Carriage return, `Control-m', ASCII code 13 (CR). `\t' Horizontal tab, `Control-i', ASCII code 9 (HT). 
`\v'
     Vertical tab, `Control-k', ASCII code 11 (VT).

`\NNN'
     The octal value NNN, where NNN are one to three digits between `0' and `7'. For example, the code for the ASCII ESC (escape) character is `\033'.

`\xHH...'
     The hexadecimal value HH, where HH are hexadecimal digits (`0' through `9' and either `A' through `F' or `a' through `f'). Like the same construct in ANSI C, the escape sequence continues until the first non-hexadecimal digit is seen. However, using more than two hexadecimal digits produces undefined results. (The `\x' escape sequence is not allowed in POSIX `awk'.)

`\/'
     A literal slash (necessary for regexp constants only). You use this when you wish to write a regexp constant that contains a slash. Since the regexp is delimited by slashes, you need to escape the slash that is part of the pattern, in order to tell `awk' to keep processing the rest of the regexp.

`\"'
     A literal double-quote (necessary for string constants only). You use this when you wish to write a string constant that contains a double-quote. Since the string is delimited by double-quotes, you need to escape the quote that is part of the string, in order to tell `awk' to keep processing the rest of the string.

In `gawk', there are additional two-character sequences that begin with backslash and have special meaning in regexps. *Note Additional Regexp Operators Only in `gawk': GNU Regexp Operators.

In a string constant, what happens if you place a backslash before something that is not one of the characters listed above? POSIX `awk' purposely leaves this case undefined. There are two choices.

   * Strip the backslash out. This is what Unix `awk' and `gawk' both do. For example, `"a\qc"' is the same as `"aqc"'.

   * Leave the backslash alone. Some other `awk' implementations do this. In such implementations, `"a\qc"' is the same as if you had typed `"a\\qc"'.
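
The escape sequences in the table above can be tried directly from the command line. The following sketch uses only sequences listed in the table; it deliberately avoids `\x' and backslashes before unlisted characters, since their treatment varies between implementations. It assumes an ASCII system, where octal 101 is the letter `A'.

```shell
# \t is a tab, \" a double-quote, \\ a single backslash,
# and \101 is the octal code for "A" in ASCII.
awk 'BEGIN { print "tab:[\t] quote:[\"] backslash:[\\] octal:[\101]" }'
```

The brackets in the output enclose, in order, a tab character, a double-quote, one backslash, and the letter `A'.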
In a regexp, a backslash before any character that is not in the above table, and not listed in *Note Additional Regexp Operators Only in `gawk': GNU Regexp Operators, means that the next character should be taken literally, even if it would normally be a regexp operator. E.g., `/a\+b/' matches the three characters `a+b'.

For complete portability, do not use a backslash before any character not listed in the table above.

Another interesting question arises. Suppose you use an octal or hexadecimal escape to represent a regexp metacharacter (*note Regular Expression Operators: Regexp Operators.). Does `awk' treat the character as a literal character, or as a regexp operator?

It turns out that historically, such characters were taken literally (d.c.). However, the POSIX standard indicates that they should be treated as real metacharacters, and this is what `gawk' does. However, in compatibility mode (*note Command Line Options: Options.), `gawk' treats the characters represented by octal and hexadecimal escape sequences literally when used in regexp constants. Thus, `/a\52b/' is equivalent to `/a\*b/'.

To summarize:

  1. The escape sequences in the table above are always processed first, for both string constants and regexp constants. This happens very early, as soon as `awk' reads your program.

  2. `gawk' processes both regexp constants and dynamic regexps (*note Using Dynamic Regexps: Computed Regexps.), for the special operators listed in *Note Additional Regexp Operators Only in `gawk': GNU Regexp Operators.

  3. A backslash before any other character means to treat that character literally.

Regular Expression Operators
============================

You can combine regular expressions with the following characters, called "regular expression operators", or "metacharacters", to increase the power and versatility of regular expressions.

The escape sequences described in *Note Escape Sequences::, are valid inside a regexp. They are introduced by a `\'.
They are recognized and converted into the corresponding real characters as the very first step in processing regexps. Here is a table of metacharacters. All characters that are not escape sequences and that are not listed in the table stand for themselves. `\' This is used to suppress the special meaning of a character when matching. For example: \$ matches the character `$'. `^' This matches the beginning of a string. For example: ^@chapter matches the `@chapter' at the beginning of a string, and can be used to identify chapter beginnings in Texinfo source files. The `^' is known as an "anchor", since it anchors the pattern to matching only at the beginning of the string. It is important to realize that `^' does not match the beginning of a line embedded in a string. In this example the condition is not true: if ("line1\nLINE 2" ~ /^L/) ... `$' This is similar to `^', but it matches only at the end of a string. For example: p$ matches a record that ends with a `p'. The `$' is also an anchor, and also does not match the end of a line embedded in a string. In this example the condition is not true: if ("line1\nLINE 2" ~ /1$/) ... `.' The period, or dot, matches any single character, *including* the newline character. For example: .P matches any single character followed by a `P' in a string. Using concatenation we can make a regular expression like `U.A', which matches any three-character sequence that begins with `U' and ends with `A'. In strict POSIX mode (*note Command Line Options: Options.), `.' does not match the NUL character, which is a character with all bits equal to zero. Otherwise, NUL is just another character. Other versions of `awk' may not be able to match the NUL character. `[...]' This is called a "character list". It matches any *one* of the characters that are enclosed in the square brackets. For example: [MVX] matches any one of the characters `M', `V', or `X' in a string. 
     Ranges of characters are indicated by using a hyphen between the beginning and ending characters, and enclosing the whole thing in brackets. For example:

          [0-9]

     matches any digit. Multiple ranges are allowed. E.g., the list `[A-Za-z0-9]' is a common way to express the idea of "all alphanumeric characters."

     To include one of the characters `\', `]', `-' or `^' in a character list, put a `\' in front of it. For example:

          [d\]]

     matches either `d', or `]'.

     This treatment of `\' in character lists is compatible with other `awk' implementations, and is also mandated by POSIX. The regular expressions in `awk' are a superset of the POSIX specification for Extended Regular Expressions (EREs). POSIX EREs are based on the regular expressions accepted by the traditional `egrep' utility.

     "Character classes" are a new feature introduced in the POSIX standard. A character class is a special notation for describing lists of characters that have a specific attribute, but where the actual characters themselves can vary from country to country and/or from character set to character set. For example, the notion of what is an alphabetic character differs in the USA and in France.

     A character class is only valid in a regexp *inside* the brackets of a character list. Character classes consist of `[:', a keyword denoting the class, and `:]'. Here are the character classes defined by the POSIX standard.

     `[:alnum:]'
          Alphanumeric characters.

     `[:alpha:]'
          Alphabetic characters.

     `[:blank:]'
          Space and tab characters.

     `[:cntrl:]'
          Control characters.

     `[:digit:]'
          Numeric characters.

     `[:graph:]'
          Characters that are printable and are also visible. (A space is printable, but not visible, while an `a' is both.)

     `[:lower:]'
          Lower-case alphabetic characters.

     `[:print:]'
          Printable characters (characters that are not control characters.)

     `[:punct:]'
          Punctuation characters (characters that are not letters, digits, control characters, or space characters).
     `[:space:]'
          Space characters (such as space, tab, and formfeed, to name a few).

     `[:upper:]'
          Upper-case alphabetic characters.

     `[:xdigit:]'
          Characters that are hexadecimal digits.

     For example, before the POSIX standard, to match alphanumeric characters, you had to write `/[A-Za-z0-9]/'. If your character set had other alphabetic characters in it, this would not match them. With the POSIX character classes, you can write `/[[:alnum:]]/', and this will match *all* the alphabetic and numeric characters in your character set.

     Two additional special sequences can appear in character lists. These apply to non-ASCII character sets, which can have single symbols (called "collating elements") that are represented with more than one character, as well as several characters that are equivalent for "collating", or sorting, purposes. (E.g., in French, a plain "e" and a grave-accented "è" are equivalent.)

     Collating Symbols
          A "collating symbol" is a multi-character collating element enclosed in `[.' and `.]'. For example, if `ch' is a collating element, then `[[.ch.]]' is a regexp that matches this collating element, while `[ch]' is a regexp that matches either `c' or `h'.

     Equivalence Classes
          An "equivalence class" is a locale-specific name for a list of characters that are equivalent. The name is enclosed in `[=' and `=]'. For example, the name `e' might be used to represent all of "e," "è," and "é." In this case, `[[=e=]]' is a regexp that matches any of `e', `é', or `è'.

     These features are very valuable in non-English speaking locales.

     *Caution:* The library functions that `gawk' uses for regular expression matching currently only recognize POSIX character classes; they do not recognize collating symbols or equivalence classes.

`[^ ...]'
     This is a "complemented character list". The first character after the `[' *must* be a `^'. It matches any characters *except* those in the square brackets. For example:

          [^0-9]

     matches any character that is not a digit.
`|' This is the "alternation operator", and it is used to specify alternatives. For example: ^P|[0-9] matches any string that matches either `^P' or `[0-9]'. This means it matches any string that starts with `P' or contains a digit. The alternation applies to the largest possible regexps on either side. In other words, `|' has the lowest precedence of all the regular expression operators. `(...)' Parentheses are used for grouping in regular expressions as in arithmetic. They can be used to concatenate regular expressions containing the alternation operator, `|'. For example, `@(samp|code)\{[^}]+\}' matches both `@code{foo}' and `@samp{bar}'. (These are Texinfo formatting control sequences.) `*' This symbol means that the preceding regular expression is to be repeated as many times as necessary to find a match. For example: ph* applies the `*' symbol to the preceding `h' and looks for matches of one `p' followed by any number of `h's. This will also match just `p' if no `h's are present. The `*' repeats the *smallest* possible preceding expression. (Use parentheses if you wish to repeat a larger expression.) It finds as many repetitions as possible. For example: awk '/\(c[ad][ad]*r x\)/ { print }' sample prints every record in `sample' containing a string of the form `(car x)', `(cdr x)', `(cadr x)', and so on. Notice the escaping of the parentheses by preceding them with backslashes. `+' This symbol is similar to `*', but the preceding expression must be matched at least once. This means that: wh+y would match `why' and `whhy' but not `wy', whereas `wh*y' would match all three of these strings. This is a simpler way of writing the last `*' example: awk '/\(c[ad]+r x\)/ { print }' sample `?' This symbol is similar to `*', but the preceding expression can be matched either once or not at all. For example: fe?d will match `fed' and `fd', but nothing else. `{N}' `{N,}' `{N,M}' One or two numbers inside braces denote an "interval expression". 
     If there is one number in the braces, the preceding regexp is repeated N times. If there are two numbers separated by a comma, the preceding regexp is repeated N to M times. If there is one number followed by a comma, then the preceding regexp is repeated at least N times.

     `wh{3}y'
          matches `whhhy' but not `why' or `whhhhy'.

     `wh{3,5}y'
          matches `whhhy' or `whhhhy' or `whhhhhy', only.

     `wh{2,}y'
          matches `whhy' or `whhhy', and so on.

     Interval expressions were not traditionally available in `awk'. As part of the POSIX standard they were added, to make `awk' and `egrep' consistent with each other.

     However, since old programs may use `{' and `}' in regexp constants, by default `gawk' does *not* match interval expressions in regexps. If either `--posix' or `--re-interval' is specified (*note Command Line Options: Options.), then interval expressions are allowed in regexps.

In regular expressions, the `*', `+', and `?' operators, as well as the braces `{' and `}', have the highest precedence, followed by concatenation, and finally by `|'. As in arithmetic, parentheses can change how operators are grouped.

If `gawk' is in compatibility mode (*note Command Line Options: Options.), character classes and interval expressions are not available in regular expressions.

The next node discusses the GNU-specific regexp operators, and provides more detail concerning how command line options affect the way `gawk' interprets the characters in regular expressions.

Additional Regexp Operators Only in `gawk'
==========================================

GNU software that deals with regular expressions provides a number of additional regexp operators. These operators are described in this section, and are specific to `gawk'; they are not available in other `awk' implementations.

Most of the additional operators are for dealing with word matching. For our purposes, a "word" is a sequence of one or more letters, digits, or underscores (`_').
`\w'
     This operator matches any word-constituent character, i.e. any letter, digit, or underscore. Think of it as a short-hand for `[[:alnum:]_]'.

`\W'
     This operator matches any character that is not word-constituent. Think of it as a short-hand for `[^[:alnum:]_]'.

`\<'
     This operator matches the empty string at the beginning of a word. For example, `/\<away/' matches `away', but not `stowaway'.

`\>'
     This operator matches the empty string at the end of a word. For example, `/stow\>/' matches `stow', but not `stowaway'.

`\y'
     This operator matches the empty string at either the beginning or the end of a word (the word boundar*y*). For example, `\yballs?\y' matches either `ball' or `balls' as a separate word.

`\B'
     This operator matches the empty string within a word. In other words, `\B' matches the empty string that occurs between two word-constituent characters. For example, `/\Brat\B/' matches `crate', but it does not match `dirty rat'. `\B' is essentially the opposite of `\y'.

There are two other operators that work on buffers. In Emacs, a "buffer" is, naturally, an Emacs buffer. For other programs, the regexp library routines that `gawk' uses consider the entire string to be matched as the buffer.

For `awk', since `^' and `$' always work in terms of the beginning and end of strings, these operators don't add any new capabilities. They are provided for compatibility with other GNU software.

`\`'
     This operator matches the empty string at the beginning of the buffer.

`\''
     This operator matches the empty string at the end of the buffer.

In other GNU software, the word boundary operator is `\b'. However, that conflicts with the `awk' language's definition of `\b' as backspace, so `gawk' uses a different letter.

An alternative method would have been to require two backslashes in the GNU operators, but this was deemed to be too confusing, and the current method of using `\y' for the GNU `\b' appears to be the lesser of two evils.

The various command line options (*note Command Line Options: Options.)
control how `gawk' interprets characters in regexps.

No options
     In the default case, `gawk' provides all the facilities of POSIX regexps and the GNU regexp operators described in *Note Regular Expression Operators: Regexp Operators. However, interval expressions are not supported.

`--posix'
     Only POSIX regexps are supported; the GNU operators are not special (e.g., `\w' matches a literal `w'). Interval expressions are allowed.

`--traditional'
     Traditional Unix `awk' regexps are matched. The GNU operators are not special, interval expressions are not available, and neither are the POSIX character classes (`[[:alnum:]]' and so on). Characters described by octal and hexadecimal escape sequences are treated literally, even if they represent regexp metacharacters.

`--re-interval'
     Allow interval expressions in regexps, even if `--traditional' has been provided.

Case-sensitivity in Matching
============================

Case is normally significant in regular expressions, both when matching ordinary characters (i.e. not metacharacters), and inside character sets. Thus a `w' in a regular expression matches only a lower-case `w' and not an upper-case `W'.

The simplest way to do a case-independent match is to use a character list: `[Ww]'. However, this can be cumbersome if you need to use it often; and it can make the regular expressions harder to read. There are two alternatives that you might prefer.

One way to do a case-insensitive match at a particular point in the program is to convert the data to a single case, using the `tolower' or `toupper' built-in string functions (which we haven't discussed yet; *note Built-in Functions for String Manipulation: String Functions.). For example:

     tolower($1) ~ /foo/  { ... }

converts the first field to lower-case before matching against it. This will work in any POSIX-compliant implementation of `awk'.

Another method, specific to `gawk', is to set the variable `IGNORECASE' to a non-zero value (*note Built-in Variables::.).
When `IGNORECASE' is not zero, *all* regexp and string operations ignore case. Changing the value of `IGNORECASE' dynamically controls the case sensitivity of your program as it runs. Case is significant by default because `IGNORECASE' (like most variables) is initialized to zero.

     x = "aB"
     if (x ~ /ab/) ...   # this test will fail

     IGNORECASE = 1
     if (x ~ /ab/) ...   # now it will succeed

In general, you cannot use `IGNORECASE' to make certain rules case-insensitive and other rules case-sensitive, because there is no way to set `IGNORECASE' just for the pattern of a particular rule. To do this, you must use character lists or `tolower'. However, one thing you can do only with `IGNORECASE' is turn case-sensitivity on or off dynamically for all the rules at once.

`IGNORECASE' can be set on the command line, or in a `BEGIN' rule (*note Other Command Line Arguments: Other Arguments.; also *note Startup and Cleanup Actions: Using BEGIN/END.). Setting `IGNORECASE' from the command line is a way to make a program case-insensitive without having to edit it.

Prior to version 3.0 of `gawk', the value of `IGNORECASE' only affected regexp operations. It did not affect string comparison with `==', `!=', and so on. Beginning with version 3.0, both regexp and string comparison operations are affected by `IGNORECASE'.

Beginning with version 3.0 of `gawk', the equivalences between upper-case and lower-case characters are based on the ISO-8859-1 (ISO Latin-1) character set. This character set is a superset of the traditional 128 ASCII characters, that also provides a number of characters suitable for use with European languages.

The value of `IGNORECASE' has no effect if `gawk' is in compatibility mode (*note Command Line Options: Options.). Case is always significant in compatibility mode.

How Much Text Matches?
======================

Consider the following example:

     echo aaaabcd | awk '{ sub(/a+/, ""); print }'

This example uses the `sub' function (which we haven't discussed yet, *note Built-in Functions for String Manipulation: String Functions.) to make a change to the input record. Here, the regexp `/a+/' indicates "one or more `a' characters," and the replacement text is `""' (the empty string).

The input contains four `a' characters. What will the output be? In other words, how many is "one or more"--will `awk' match two, three, or all four `a' characters?

The answer is, `awk' (and POSIX) regular expressions always match the leftmost, *longest* sequence of input characters that can match. Thus, in this example, all four `a' characters are replaced with the empty string, i.e. they are removed.

     $ echo aaaabcd | awk '{ sub(/a+/, ""); print }'
     -| bcd

For simple match/no-match tests, this is not so important. But when doing regexp-based field and record splitting, and text matching and substitutions with the `match', `sub', `gsub', and `gensub' functions, it is very important. *Note Built-in Functions for String Manipulation: String Functions, for more information on these functions.

Understanding this principle is also important for regexp-based record and field splitting (*note How Input is Split into Records: Records., and also *note Specifying How Fields are Separated: Field Separators.).

Using Dynamic Regexps
=====================

The right hand side of a `~' or `!~' operator need not be a regexp constant (i.e. a string of characters between slashes). It may be any expression. The expression is evaluated, and converted if necessary to a string; the contents of the string are used as the regexp. A regexp that is computed in this way is called a "dynamic regexp". For example:

     BEGIN { identifier_regexp = "[A-Za-z_][A-Za-z_0-9]*" }
     $0 ~ identifier_regexp { print }

sets `identifier_regexp' to a regexp that describes `awk' variable names, and tests if the input record matches this regexp.
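
As a small sketch of a dynamic regexp that is built at run time, the pipeline below concatenates a fixed prefix with a string passed in through the `-v' option and uses the result on the right hand side of `~'. The variable name `pat' and the sample input are made up for illustration; they do not come from this manual.

```shell
# Sketch: the regexp "^b.*an" is assembled from a constant and a variable.
# `pat' and the three input lines are hypothetical.
printf 'apple\nbanana\ncherry\n' |
awk -v pat='an' '$0 ~ ("^b.*" pat) { print }'
```

Only the line `banana' should be printed, since it is the only record that starts with `b' and later contains `an'. The parentheses around the concatenation are there to make the grouping explicit.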
*Caution:* When using the `~' and `!~' operators, there is a difference between a regexp constant enclosed in slashes, and a string constant enclosed in double quotes. If you are going to use a string constant, you have to understand that the string is in essence scanned *twice*; the first time when `awk' reads your program, and the second time when it goes to match the string on the left-hand side of the operator with the pattern on the right. This is true of any string-valued expression (such as `identifier_regexp' above), not just string constants.

What difference does it make if the string is scanned twice? The answer has to do with escape sequences, and particularly with backslashes. To get a backslash into a regular expression inside a string, you have to type two backslashes.

For example, `/\*/' is a regexp constant for a literal `*'. Only one backslash is needed. To do the same thing with a string, you would have to type `"\\*"'. The first backslash escapes the second one, so that the string actually contains the two characters `\' and `*'.

Given that you can use both regexp and string constants to describe regular expressions, which should you use? The answer is "regexp constants," for several reasons.

  1. String constants are more complicated to write, and more difficult to read. Using regexp constants makes your programs less error-prone. Not understanding the difference between the two kinds of constants is a common source of errors.

  2. It is also more efficient to use regexp constants: `awk' can note that you have supplied a regexp and store it internally in a form that makes pattern matching more efficient. When using a string constant, `awk' must first convert the string into this internal form, and then perform the pattern matching.

  3. Using regexp constants is better style; it shows clearly that you intend a regexp match.
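
The difference between the two kinds of constants can be seen in a short sketch. Both commands below look for a literal `*'; the first uses a regexp constant with one backslash, the second a string constant with the backslash doubled. The sample input and the labels in the output are assumptions chosen for illustration.

```shell
# Regexp constant: one backslash is enough to quote the `*'.
printf 'a*b\nab\n' | awk '$0 ~ /\*/  { print "constant:", $0 }'

# String constant: the string is scanned twice, so the backslash is doubled.
printf 'a*b\nab\n' | awk '$0 ~ "\\*" { print "string:", $0 }'
```

In both cases only the record `a*b' should match; the record `ab' contains no `*' and is skipped.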
Reading Input Files
*******************

In the typical `awk' program, all input is read either from the standard input (by default the keyboard, but often a pipe from another command) or from files whose names you specify on the `awk' command line. If you specify input files, `awk' reads them in order, reading all the data from one before going on to the next. The name of the current input file can be found in the built-in variable `FILENAME' (*note Built-in Variables::.).

The input is read in units called "records", and processed by the rules of your program one record at a time. By default, each record is one line. Each record is automatically split into chunks called "fields". This makes it more convenient for programs to work on the parts of a record.

On rare occasions you will need to use the `getline' command. The `getline' command is valuable, both because it can do explicit input from any number of files, and because the files used with it do not have to be named on the `awk' command line (*note Explicit Input with `getline': Getline.).

How Input is Split into Records
===============================

The `awk' utility divides the input for your `awk' program into records and fields. Records are separated by a character called the "record separator". By default, the record separator is the newline character. This is why records are, by default, single lines. You can use a different character for the record separator by assigning the character to the built-in variable `RS'.

You can change the value of `RS' in the `awk' program, like any other variable, with the assignment operator, `=' (*note Assignment Expressions: Assignment Ops.). The new record-separator character should be enclosed in quotation marks, which indicate a string constant. Often the right time to do this is at the beginning of execution, before any input has been processed, so that the very first record will be read with the proper separator.
To do this, use the special `BEGIN' pattern (*note The `BEGIN' and `END' Special Patterns: BEGIN/END.). For example:

     awk 'BEGIN { RS = "/" } ; { print $0 }' BBS-list

changes the value of `RS' to `"/"', before reading any input. This is a string whose first character is a slash; as a result, records are separated by slashes. Then the input file is read, and the second rule in the `awk' program (the action with no pattern) prints each record. Since each `print' statement adds a newline at the end of its output, the effect of this `awk' program is to copy the input with each slash changed to a newline. Here are the results of running the program on `BBS-list':

     $ awk 'BEGIN { RS = "/" } ; { print $0 }' BBS-list
     -| aardvark 555-5553 1200
     -| 300 B
     -| alpo-net 555-3412 2400
     -| 1200
     -| 300 A
     -| barfly 555-7685 1200
     -| 300 A
     -| bites 555-1675 2400
     -| 1200
     -| 300 A
     -| camelot 555-0542 300 C
     -| core 555-2912 1200
     -| 300 C
     -| fooey 555-1234 2400
     -| 1200
     -| 300 B
     -| foot 555-6699 1200
     -| 300 B
     -| macfoo 555-6480 1200
     -| 300 A
     -| sdace 555-3430 2400
     -| 1200
     -| 300 A
     -| sabafoo 555-2127 1200
     -| 300 C
     -|

Note that the entry for the `camelot' BBS is not split. In the original data file (*note Data Files for the Examples: Sample Data Files.), the line looks like this:

     camelot 555-0542 300 C

It only has one baud rate; there are no slashes in the record.

Another way to change the record separator is on the command line, using the variable-assignment feature (*note Other Command Line Arguments: Other Arguments.).

     awk '{ print $0 }' RS="/" BBS-list

This sets `RS' to `/' before processing `BBS-list'.

Using an unusual character such as `/' for the record separator produces correct behavior in the vast majority of cases. However, the following (extreme) pipeline prints a surprising `1'. There is one field, consisting of a newline. The value of the built-in variable `NF' is the number of fields in the current record.
     $ echo | awk 'BEGIN { RS = "a" } ; { print NF }'
     -| 1

Reaching the end of an input file terminates the current input record, even if the last character in the file is not the character in `RS' (d.c.).

The empty string, `""' (a string of no characters), has a special meaning as the value of `RS': it means that records are separated by one or more blank lines, and nothing else. *Note Multiple-Line Records: Multiple Line, for more details.

If you change the value of `RS' in the middle of an `awk' run, the new value is used to delimit subsequent records, but the record currently being processed (and records already processed) are not affected.

After the end of the record has been determined, `gawk' sets the variable `RT' to the text in the input that matched `RS'.

The value of `RS' is in fact not limited to a one-character string. It can be any regular expression (*note Regular Expressions: Regexp.). In general, each record ends at the next string that matches the regular expression; the next record starts at the end of the matching string. This general rule is actually at work in the usual case, where `RS' contains just a newline: a record ends at the beginning of the next matching string (the next newline in the input) and the following record starts just after the end of this string (at the first character of the following line). The newline, since it matches `RS', is not part of either record.

When `RS' is a single character, `RT' will contain the same single character. However, when `RS' is a regular expression, then `RT' becomes more useful; it contains the actual input text that matched the regular expression.

The following example illustrates both of these features. It sets `RS' equal to a regular expression that matches either a newline, or a series of one or more upper-case letters with optional leading and/or trailing white space (*note Regular Expressions: Regexp.).
     $ echo record 1 AAAA record 2 BBBB record 3 |
     > gawk 'BEGIN { RS = "\n|( *[[:upper:]]+ *)" }
     >             { print "Record =", $0, "and RT =", RT }'
     -| Record = record 1 and RT =  AAAA
     -| Record = record 2 and RT =  BBBB
     -| Record = record 3 and RT =
     -|

The final line of output has an extra blank line. This is because the value of `RT' is a newline, and then the `print' statement supplies its own terminating newline.

*Note A Simple Stream Editor: Simple Sed, for a more useful example of `RS' as a regexp and `RT'.

The use of `RS' as a regular expression and the `RT' variable are `gawk' extensions; they are not available in compatibility mode (*note Command Line Options: Options.). In compatibility mode, only the first character of the value of `RS' is used to determine the end of the record.

The `awk' utility keeps track of the number of records that have been read so far from the current input file. This value is stored in a built-in variable called `FNR'. It is reset to zero when a new file is started. Another built-in variable, `NR', is the total number of input records read so far from all data files. It starts at zero but is never automatically reset to zero.

Examining Fields
================

When `awk' reads an input record, the record is automatically separated or "parsed" by the interpreter into chunks called "fields". By default, fields are separated by whitespace, like words in a line. Whitespace in `awk' means any string of one or more spaces, tabs or newlines;(1) other characters such as formfeed, and so on, that are considered whitespace by other languages are *not* considered whitespace by `awk'.

The purpose of fields is to make it more convenient for you to refer to these pieces of the record. You don't have to use them--you can operate on the whole record if you wish--but fields are what make simple `awk' programs so powerful.

To refer to a field in an `awk' program, you use a dollar-sign, `$', followed by the number of the field you want.
Thus, `$1' refers to the first field, `$2' to the second, and so on. For example, suppose the following is a line of input:

     This seems like a pretty nice example.

Here the first field, or `$1', is `This'; the second field, or `$2', is `seems'; and so on. Note that the last field, `$7', is `example.'. Because there is no space between the `e' and the `.', the period is considered part of the seventh field.

`NF' is a built-in variable whose value is the number of fields in the current record. `awk' updates the value of `NF' automatically, each time a record is read.

No matter how many fields there are, the last field in a record can be represented by `$NF'. So, in the example above, `$NF' would be the same as `$7', which is `example.'. Why this works is explained below (*note Non-constant Field Numbers: Non-Constant Fields.). If you try to reference a field beyond the last one, such as `$8' when the record has only seven fields, you get the empty string.

`$0', which looks like a reference to the "zeroth" field, is a special case: it represents the whole input record. `$0' is used when you are not interested in fields.

Here are some more examples:

     $ awk '$1 ~ /foo/ { print $0 }' BBS-list
     -| fooey 555-1234 2400/1200/300 B
     -| foot 555-6699 1200/300 B
     -| macfoo 555-6480 1200/300 A
     -| sabafoo 555-2127 1200/300 C

This example prints each record in the file `BBS-list' whose first field contains the string `foo'. The operator `~' is called a "matching operator" (*note How to Use Regular Expressions: Regexp Usage.); it tests whether a string (here, the field `$1') matches a given regular expression.

By contrast, the following example looks for `foo' in *the entire record* and prints the first field and the last field for each input record containing a match.

     $ awk '/foo/ { print $1, $NF }' BBS-list
     -| fooey B
     -| foot B
     -| macfoo A
     -| sabafoo C

---------- Footnotes ----------

(1) In POSIX `awk', newlines are not considered whitespace for separating fields.
Non-constant Field Numbers
==========================

The number of a field does not need to be a constant. Any expression in the `awk' language can be used after a `$' to refer to a field. The value of the expression specifies the field number. If the value is a string, rather than a number, it is converted to a number. Consider this example:

     awk '{ print $NR }'

Recall that `NR' is the number of records read so far: one in the first record, two in the second, etc. So this example prints the first field of the first record, the second field of the second record, and so on. For the twentieth record, field number 20 is printed; most likely, the record has fewer than 20 fields, so this prints a blank line.

Here is another example of using expressions as field numbers:

     awk '{ print $(2*2) }' BBS-list

`awk' must evaluate the expression `(2*2)' and use its value as the number of the field to print. The `*' sign represents multiplication, so the expression `2*2' evaluates to four. The parentheses are used so that the multiplication is done before the `$' operation; they are necessary whenever there is a binary operator in the field-number expression. This example, then, prints the hours of operation (the fourth field) for every line of the file `BBS-list'. (All of the `awk' operators are listed, in order of decreasing precedence, in *Note Operator Precedence (How Operators Nest): Precedence.)

If the field number you compute is zero, you get the entire record. Thus, `$(2-2)' has the same value as `$0'. Negative field numbers are not allowed; trying to reference one will usually terminate your running `awk' program. (The POSIX standard does not define what happens when you reference a negative field number. `gawk' will notice this and terminate your program. Other `awk' implementations may behave differently.)

As mentioned in *Note Examining Fields: Fields, the number of fields in the current record is stored in the built-in variable `NF' (also *note Built-in Variables::.).
The expression `$NF' is not a special feature: it is the direct consequence of evaluating `NF' and using its value as a field number.

Changing the Contents of a Field
================================

You can change the contents of a field as seen by `awk' within an `awk' program; this changes what `awk' perceives as the current input record. (The actual input is untouched; `awk' *never* modifies the input file.)

Consider this example and its output:

     $ awk '{ $3 = $2 - 10; print $2, $3 }' inventory-shipped
     -| 13 3
     -| 15 5
     -| 15 5
     ...

The `-' sign represents subtraction, so this program reassigns field three, `$3', to be the value of field two minus ten, `$2 - 10'. (*Note Arithmetic Operators: Arithmetic Ops.) Then field two, and the new value for field three, are printed.

In order for this to work, the text in field `$2' must make sense as a number; the string of characters must be converted to a number in order for the computer to do arithmetic on it. The number resulting from the subtraction is converted back to a string of characters which then becomes field three. *Note Conversion of Strings and Numbers: Conversion.

When you change the value of a field (as perceived by `awk'), the text of the input record is recalculated to contain the new field where the old one was. Therefore, `$0' changes to reflect the altered field. Thus, this program prints a copy of the input file, with 10 subtracted from the second field of each line.

     $ awk '{ $2 = $2 - 10; print $0 }' inventory-shipped
     -| Jan 3 25 15 115
     -| Feb 5 32 24 226
     -| Mar 5 24 34 228
     ...

You can also assign contents to fields that are out of range. For example:

     $ awk '{ $6 = ($5 + $4 + $3 + $2)
     >        print $6 }' inventory-shipped
     -| 168
     -| 297
     -| 301
     ...

We've just created `$6', whose value is the sum of fields `$2', `$3', `$4', and `$5'. The `+' sign represents addition. For the file `inventory-shipped', `$6' represents the total number of parcels shipped for a particular month.
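
To try the out-of-range assignment without the `inventory-shipped' file, the following sketch feeds a single record by hand. The sample line is an assumption: it is reconstructed from the outputs shown above (a second field of 13, followed by 25, 15 and 115), not copied from the data file itself.

```shell
# Sketch: create field six as the sum of fields two through five.
# The input record is a hand-made stand-in for a line of inventory-shipped.
echo 'Jan 13 25 15 115' |
awk '{ $6 = $2 + $3 + $4 + $5
       print $6 }'
```

With this input the program should print 168, which agrees with the first line of output in the example above.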
Creating a new field changes `awk''s internal copy of the current input record--the value of `$0'. Thus, if you do `print $0' after adding a field, the record printed includes the new field, with the appropriate number of field separators between it and the previously existing fields.

This recomputation affects and is affected by `NF' (the number of fields; *note Examining Fields: Fields.), and by a feature that has not been discussed yet, the "output field separator", `OFS', which is used to separate the fields (*note Output Separators::.). For example, the value of `NF' is set to the number of the highest field you create.

Note, however, that merely *referencing* an out-of-range field does *not* change the value of either `$0' or `NF'. Referencing an out-of-range field only produces an empty string. For example:

     if ($(NF+1) != "")
         print "can't happen"
     else
         print "everything is normal"

should print `everything is normal', because `NF+1' is certain to be out of range. (*Note The `if'-`else' Statement: If Statement, for more information about `awk''s `if-else' statements. *Note Variable Typing and Comparison Expressions: Typing and Comparison, for more information about the `!=' operator.)

It is important to note that making an assignment to an existing field will change the value of `$0', but will not change the value of `NF', even when you assign the empty string to a field. For example:

     $ echo a b c d | awk '{ OFS = ":"; $2 = ""
     >                       print $0; print NF }'
     -| a::c:d
     -| 4

The field is still there; it just has an empty value. You can tell because there are two colons in a row.

This example shows what happens if you create a new field.

     $ echo a b c d | awk '{ OFS = ":"; $2 = ""; $6 = "new"
     >                       print $0; print NF }'
     -| a::c:d::new
     -| 6

The intervening field, `$5', is created with an empty value (indicated by the second pair of adjacent colons), and `NF' is updated with the value six.
Finally, decrementing `NF' will lose the values of the fields after the new value of `NF', and `$0' will be recomputed. Here is an example:

     $ echo a b c d e f | gawk '{ print "NF =", NF;
     >                            NF = 3; print $0 }'
     -| NF = 6
     -| a b c

Specifying How Fields are Separated
===================================

This section is rather long; it describes one of the most fundamental operations in `awk'.

The Basics of Field Separating
------------------------------

The "field separator", which is either a single character or a regular expression, controls the way `awk' splits an input record into fields. `awk' scans the input record for character sequences that match the separator; the fields themselves are the text between the matches.

In the examples below, we use the bullet symbol "*" to represent spaces in the output.

If the field separator is `oo', then the following line:

     moo goo gai pan

would be split into three fields: `m', `*g' and `*gai*pan'. Note the leading spaces in the values of the second and third fields.

The field separator is represented by the built-in variable `FS'. Shell programmers take note! `awk' does *not* use the name `IFS' which is used by the POSIX compatible shells (such as the Bourne shell, `sh', or the GNU Bourne-Again Shell, Bash).

You can change the value of `FS' in the `awk' program with the assignment operator, `=' (*note Assignment Expressions: Assignment Ops.). Often the right time to do this is at the beginning of execution, before any input has been processed, so that the very first record will be read with the proper separator. To do this, use the special `BEGIN' pattern (*note The `BEGIN' and `END' Special Patterns: BEGIN/END.). For example, here we set the value of `FS' to the string `","':

     awk 'BEGIN { FS = "," } ; { print $2 }'

Given the input line,

     John Q. Smith, 29 Oak St., Walamazoo, MI 42139

this `awk' program extracts and prints the string `*29*Oak*St.'.
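
The comma-separated example can be tried directly from the shell. In the sketch below, the square brackets around the field are an addition made for illustration, so that the leading space in the second field is visible in the output.

```shell
# Sketch: split the sample address line on commas and show field two.
# The brackets are only there to make the leading space visible.
echo 'John Q. Smith, 29 Oak St., Walamazoo, MI 42139' |
awk 'BEGIN { FS = "," } ; { print "[" $2 "]" }'
```

The output should be `[ 29 Oak St.]', with the space that followed the first comma kept as part of the field.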
Sometimes your input data will contain separator characters that don't separate fields the way you thought they would. For instance, the person's name in the example we just used might have a title or suffix attached, such as `John Q. Smith, LXIX'. From input containing such a name:

     John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139

the above program would extract `*LXIX', instead of `*29*Oak*St.'. If you were expecting the program to print the address, you would be surprised. The moral is: choose your data layout and separator characters carefully to prevent such problems.

Normally, fields are separated by whitespace sequences (spaces, tabs and newlines), not by single spaces: two spaces in a row do not delimit an empty field. The default value of the field separator `FS' is a string containing a single space, `" "'. If this value were interpreted in the usual way, each space character would separate fields, so two spaces in a row would make an empty field between them. The reason this does not happen is that a single space as the value of `FS' is a special case: it is taken to specify the default manner of delimiting fields.

If `FS' is any other single character, such as `","', then each occurrence of that character separates two fields. Two consecutive occurrences delimit an empty field. If the character occurs at the beginning or the end of the line, that too delimits an empty field. The space character is the only single character which does not follow these rules.

Using Regular Expressions to Separate Fields
--------------------------------------------

The previous node discussed the use of single characters or simple strings as the value of `FS'. More generally, the value of `FS' may be a string containing any regular expression. In this case, each match in the record for the regular expression separates fields.
For example, the assignment:

     FS = ", \t"

makes every area of an input line that consists of a comma followed by a space and a tab, into a field separator. (`\t' is an "escape sequence" that stands for a tab; *note Escape Sequences::., for the complete list of similar escape sequences.)

For a less trivial example of a regular expression, suppose you want single spaces to separate fields the way single commas were used above. You can set `FS' to `"[ ]"' (left bracket, space, right bracket). This regular expression matches a single space and nothing else (*note Regular Expressions: Regexp.).

There is an important difference between the two cases of `FS = " "' (a single space) and `FS = "[ \t\n]+"' (left bracket, space, backslash, "t", backslash, "n", right bracket, which is a regular expression matching one or more spaces, tabs, or newlines). For both values of `FS', fields are separated by runs of spaces, tabs and/or newlines. However, when the value of `FS' is `" "', `awk' will first strip leading and trailing whitespace from the record, and then decide where the fields are.

For example, the following pipeline prints `b':

     $ echo ' a b c d ' | awk '{ print $2 }'
     -| b

However, this pipeline prints `a' (note the extra spaces around each letter):

     $ echo ' a  b  c  d ' | awk 'BEGIN { FS = "[ \t]+" }
     >                                  { print $2 }'
     -| a

In this case, the first field is "null", or empty.

The stripping of leading and trailing whitespace also comes into play whenever `$0' is recomputed. For instance, study this pipeline:

     $ echo '   a b c d' | awk '{ print; $2 = $2; print }'
     -|    a b c d
     -| a b c d

The first `print' statement prints the record as it was read, with leading whitespace intact. The assignment to `$2' rebuilds `$0' by concatenating `$1' through `$NF' together, separated by the value of `OFS'. Since the leading whitespace was ignored when finding `$1', it is not part of the new `$0'. Finally, the last `print' statement prints the new `$0'.
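
One practical use of this rebuilding of `$0' is to normalize the spacing of a record. The sketch below assigns `$1' to itself, which changes nothing in the field but forces `$0' to be recomputed with `OFS' between the fields. The input line and the choice of `-' for `OFS' are assumptions made for illustration.

```shell
# Sketch: force $0 to be rebuilt so that runs of spaces collapse
# and the fields are joined with the output field separator.
echo '   a   b  c d ' |
awk 'BEGIN { OFS = "-" } { $1 = $1; print }'
```

With the default `FS', leading and trailing whitespace is stripped before the fields are found, so the expected output is `a-b-c-d'.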
Making Each Character a Separate Field
--------------------------------------

There are times when you may want to examine each character of a record separately. In `gawk', this is easy to do: you simply assign the null string (`""') to `FS'. In this case, each individual character in the record will become a separate field. Here is an example:

     $ echo a b | gawk 'BEGIN { FS = "" }
     >                  {
     >                      for (i = 1; i <= NF; i = i + 1)
     >                          print "Field", i, "is", $i
     >                  }'
     -| Field 1 is a
     -| Field 2 is
     -| Field 3 is b

Traditionally, the behavior for `FS' equal to `""' was not defined. In this case, Unix `awk' would simply treat the entire record as only having one field (d.c.). In compatibility mode (*note Command Line Options: Options.), if `FS' is the null string, then `gawk' will also behave this way.

Setting `FS' from the Command Line
----------------------------------

`FS' can be set on the command line. You use the `-F' option to do so. For example:

     awk -F, 'PROGRAM' INPUT-FILES

sets `FS' to be the `,' character. Notice that the option uses a capital `F'. Contrast this with `-f', which specifies a file containing an `awk' program. Case is significant in command line options: the `-F' and `-f' options have nothing to do with each other. You can use both options at the same time to set the `FS' variable *and* get an `awk' program from a file.

The value used for the argument to `-F' is processed in exactly the same way as assignments to the built-in variable `FS'. This means that if the field separator contains special characters, they must be escaped appropriately. For example, to use a `\' as the field separator, you would have to type:

     # same as FS = "\\"
     awk -F\\\\ '...' files ...

Since `\' is used for quoting in the shell, `awk' will see `-F\\'. Then `awk' processes the `\\' for escape characters (*note Escape Sequences::.), finally yielding a single `\' to be used for the field separator.
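
For a separator with no special characters, the equivalence between `-F' and an assignment to `FS' can be checked with a short sketch. The sample input is made up for illustration.

```shell
# Sketch: two ways of setting the field separator to a comma.
printf 'a,b,c\n' | awk -F, '{ print $2 }'                   # via the -F option
printf 'a,b,c\n' | awk 'BEGIN { FS = "," } { print $2 }'    # via a BEGIN rule
```

Both commands should print `b', the second of the three comma-separated fields.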
As a special case, in compatibility mode (*note Command Line Options: Options.), if the argument to `-F' is `t', then `FS' is set to the tab character. This is because if you type `-F\t' at the shell, without any quotes, the `\' gets deleted, so `awk' figures that you really want your fields to be separated with tabs, and not `t's. Use `-v FS="t"' on the command line if you really do want to separate your fields with `t's (*note Command Line Options: Options.).

For example, let's use an `awk' program file called `baud.awk' that contains the pattern `/300/', and the action `print $1'. Here is the program:

     /300/   { print $1 }

Let's also set `FS' to be the `-' character, and run the program on the file `BBS-list'. The following command prints a list of the names of the bulletin boards that operate at 300 baud and the first three digits of their phone numbers:

     $ awk -F- -f baud.awk BBS-list
     -| aardvark 555
     -| alpo
     -| barfly 555
     ...

Note the second line of output. In the original file (*note Data Files for the Examples: Sample Data Files.), the second line looked like this:

     alpo-net 555-3412 2400/1200/300 A

The `-' as part of the system's name was used as the field separator, instead of the `-' in the phone number that was originally intended. This demonstrates why you have to be careful in choosing your field and record separators.

On many Unix systems, each user has a separate entry in the system password file, one line per user. The information in these lines is separated by colons. The first field is the user's logon name, and the second is the user's encrypted password.
A password file entry might look like this:

     arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh

The following program searches the system password file, and prints the entries for users who have no password:

     awk -F: '$2 == ""' /etc/passwd

Field Splitting Summary
-----------------------

According to the POSIX standard, `awk' is supposed to behave as if each record is split into fields at the time that it is read. In particular, this means that you can change the value of `FS' after a record is read, and the value of the fields (i.e. how they were split) should reflect the old value of `FS', not the new one.

However, many implementations of `awk' do not work this way. Instead, they defer splitting the fields until a field is actually referenced. The fields will be split using the *current* value of `FS'! (d.c.) This behavior can be difficult to diagnose. The following example illustrates the difference between the two methods. (The `sed'(1) command prints just the first line of `/etc/passwd'.)

     sed 1q /etc/passwd | awk '{ FS = ":" ; print $1 }'

will usually print

     root

on an incorrect implementation of `awk', while `gawk' will print something like

     root:nSijPlPhZZwgE:0:0:Root:/:

The following table summarizes how fields are split, based on the value of `FS'. (`==' means "is equal to.")

`FS == " "'
     Fields are separated by runs of whitespace. Leading and trailing whitespace are ignored. This is the default.

`FS == ANY OTHER SINGLE CHARACTER'
     Fields are separated by each occurrence of the character. Multiple successive occurrences delimit empty fields, as do leading and trailing occurrences. The character can even be a regexp metacharacter; it does not need to be escaped.

`FS == REGEXP'
     Fields are separated by occurrences of characters that match REGEXP. Leading and trailing matches of REGEXP delimit empty fields.

`FS == ""'
     Each individual character in the record becomes a separate field.

---------- Footnotes ----------

(1) The `sed' utility is a "stream editor."
Its behavior is also defined by the POSIX standard.

Reading Fixed-width Data
========================

(This section discusses an advanced, experimental feature. If you are a novice `awk' user, you may wish to skip it on the first reading.)

`gawk' version 2.13 introduced a new facility for dealing with fixed-width fields with no distinctive field separator. Data of this nature arises, for example, in the input for old FORTRAN programs where numbers are run together; or in the output of programs that did not anticipate the use of their output as input for other programs.

An example of the latter is a table where all the columns are lined up by the use of a variable number of spaces and *empty fields are just spaces*. Clearly, `awk''s normal field splitting based on `FS' will not work well in this case. Although a portable `awk' program can use a series of `substr' calls on `$0' (*note Built-in Functions for String Manipulation: String Functions.), this is awkward and inefficient for a large number of fields.

The splitting of an input record into fixed-width fields is specified by assigning a string containing space-separated numbers to the built-in variable `FIELDWIDTHS'. Each number specifies the width of the field *including* columns between fields. If you want to ignore the columns between fields, you can specify the width as a separate field that is subsequently ignored.

The following data is the output of the Unix `w' utility. It is useful to illustrate the use of `FIELDWIDTHS'.

      10:06pm  up 21 days, 14:04,  23 users
     User     tty        login  idle   JCPU   PCPU  what
     hzuo     ttyV0     8:58pm            9      5  vi p24.tex
     hzang    ttyV3     6:37pm    50                -csh
     eklye    ttyV5     9:53pm            7      1  em thes.tex
     dportein ttyV6     8:17pm  1:47                -csh
     gierd    ttyD3    10:00pm     1                elm
     dave     ttyD4     9:47pm            4      4  w
     brent    ttyp0    26Jun91  4:46  26:46   4:41  bash
     dave     ttyq4    26Jun9115days     46     46  wnewmail

The following program takes the above input, converts the idle time to number of seconds and prints out the first two fields and the calculated idle time.
(This program uses a number of `awk' features that haven't been introduced yet.)

     BEGIN  { FIELDWIDTHS = "9 6 10 6 7 7 35" }
     NR > 2 {
         idle = $4
         sub(/^ */, "", idle)   # strip leading spaces
         if (idle == "")
             idle = 0
         if (idle ~ /:/) {
             split(idle, t, ":")
             idle = t[1] * 60 + t[2]
         }
         if (idle ~ /days/)
             idle *= 24 * 60 * 60

         print $1, $2, idle
     }

Here is the result of running the program on the data:

     hzuo      ttyV0  0
     hzang     ttyV3  50
     eklye     ttyV5  0
     dportein  ttyV6  107
     gierd     ttyD3  1
     dave      ttyD4  0
     brent     ttyp0  286
     dave      ttyq4  1296000

Another (possibly more practical) example of fixed-width input data would be the input from a deck of balloting cards. In some parts of the United States, voters mark their choices by punching holes in computer cards. These cards are then processed to count the votes for any particular candidate or on any particular issue. Since a voter may choose not to vote on some issue, any column on the card may be empty. An `awk' program for processing such data could use the `FIELDWIDTHS' feature to simplify reading the data. (Of course, getting `gawk' to run on a system with card readers is another story!)

Assigning a value to `FS' causes `gawk' to return to using `FS' for field splitting. Use `FS = FS' to make this happen, without having to know the current value of `FS'.

This feature is still experimental, and may evolve over time. Note that in particular, `gawk' does not attempt to verify the sanity of the values used in the value of `FIELDWIDTHS'.

Multiple-Line Records
=====================

In some data bases, a single line cannot conveniently hold all the information in one entry. In such cases, you can use multi-line records.

The first step in doing this is to choose your data format: when records are not defined as single lines, how do you want to define them? What should separate records?

One technique is to use an unusual character or string to separate records.
For example, you could use the formfeed character (written `\f' in `awk', as in C) to separate them, making each record a page of the file. To do this, just set the variable `RS' to `"\f"' (a string containing the formfeed character). Any other character could equally well be used, as long as it won't be part of the data in a record.

Another technique is to have blank lines separate records. By a special dispensation, an empty string as the value of `RS' indicates that records are separated by one or more blank lines. If you set `RS' to the empty string, a record always ends at the first blank line encountered. And the next record doesn't start until the first non-blank line that follows--no matter how many blank lines appear in a row, they are considered one record-separator.

You can achieve the same effect as `RS = ""' by assigning the string `"\n\n+"' to `RS'. This regexp matches the newline at the end of the record, and one or more blank lines after the record. In addition, a regular expression always matches the longest possible sequence when there is a choice (*note How Much Text Matches?: Leftmost Longest.). So the next record doesn't start until the first non-blank line that follows--no matter how many blank lines appear in a row, they are considered one record-separator.

There is an important difference between `RS = ""' and `RS = "\n\n+"'. In the first case, leading newlines in the input data file are ignored, and if a file ends without extra blank lines after the last record, the final newline is removed from the record. In the second case, this special processing is not done (d.c.).

Now that the input is separated into records, the second step is to separate the fields in the record. One way to do this is to divide each of the lines into fields in the normal manner. This happens by default as the result of a special feature: when `RS' is set to the empty string, the newline character *always* acts as a field separator.
This is in addition to whatever field separations result from `FS'.

The original motivation for this special exception was probably to provide useful behavior in the default case (i.e. `FS' is equal to `" "'). This feature can be a problem if you really don't want the newline character to separate fields, since there is no way to prevent it. However, you can work around this by using the `split' function to break up the record manually (*note Built-in Functions for String Manipulation: String Functions.).

Another way to separate fields is to put each field on a separate line: to do this, just set the variable `FS' to the string `"\n"'. (This simple regular expression matches a single newline.)

A practical example of a data file organized this way might be a mailing list, where each entry is separated by blank lines. If we have a mailing list in a file named `addresses', that looks like this:

     Jane Doe
     123 Main Street
     Anywhere, SE 12345-6789

     John Smith
     456 Tree-lined Avenue
     Smallville, MW 98765-4321

     ...

A simple program to process this file would look like this:

     # addrs.awk --- simple mailing list program

     # Records are separated by blank lines.
     # Each line is one field.
     BEGIN { RS = "" ; FS = "\n" }

     {
           print "Name is:", $1
           print "Address is:", $2
           print "City and State are:", $3
           print ""
     }

Running the program produces the following output:

     $ awk -f addrs.awk addresses
     -| Name is: Jane Doe
     -| Address is: 123 Main Street
     -| City and State are: Anywhere, SE 12345-6789
     -|
     -| Name is: John Smith
     -| Address is: 456 Tree-lined Avenue
     -| City and State are: Smallville, MW 98765-4321
     -|
     ...

*Note Printing Mailing Labels: Labels Program, for a more realistic program that deals with address lists.

The following table summarizes how records are split, based on the value of `RS'. (`==' means "is equal to.")

`RS == "\n"'
     Records are separated by the newline character (`\n'). In effect, every line in the data file is a separate record, including blank lines. This is the default.
`RS == ANY SINGLE CHARACTER'
     Records are separated by each occurrence of the character. Multiple successive occurrences delimit empty records.

`RS == ""'
     Records are separated by runs of blank lines. The newline character always serves as a field separator, in addition to whatever value `FS' may have. Leading and trailing newlines in a file are ignored.

`RS == REGEXP'
     Records are separated by occurrences of characters that match REGEXP. Leading and trailing matches of REGEXP delimit empty records.

In all cases, `gawk' sets `RT' to the input text that matched the value specified by `RS'.

Explicit Input with `getline'
=============================

So far we have been getting our input data from `awk''s main input stream--either the standard input (usually your terminal, sometimes the output from another program) or from the files specified on the command line. The `awk' language has a special built-in command called `getline' that can be used to read input under your explicit control.

Introduction to `getline'
-------------------------

This command is used in several different ways, and should *not* be used by beginners. It is covered here because this is the chapter on input. The examples that follow the explanation of the `getline' command include material that has not been covered yet. Therefore, come back and study the `getline' command *after* you have reviewed the rest of this Info file and have a good knowledge of how `awk' works.

`getline' returns one if it finds a record, and zero if the end of the file is encountered. If there is some error in getting a record, such as a file that cannot be opened, then `getline' returns -1. In this case, `gawk' sets the variable `ERRNO' to a string describing the error that occurred.

In the following examples, COMMAND stands for a string value that represents a shell command.
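As a minimal sketch of checking the return values described above (this is not one of the manual's own examples), the following shell function wraps an `awk' program that counts the records in a file with an explicit `getline' loop. The function name `count_records' and the variables `file', `line', `n', and `status' are made up for illustration. Only `gawk' sets `ERRNO'; other implementations would print an empty error description.

```shell
# count_records FILE: hypothetical helper, not part of awk or gawk.
count_records() {
    awk -v file="$1" 'BEGIN {
        n = 0
        # getline returns 1 per record, 0 at end of file, -1 on error
        while ((status = (getline line < file)) > 0)
            n++
        if (status < 0) {
            # ERRNO is a gawk extension; it is empty in other awks
            print "cannot read " file ": " ERRNO > "/dev/stderr"
            exit 1
        }
        print n
    }'
}
```

Calling `count_records /etc/passwd' prints the number of lines in that file. Testing the result with `> 0' rather than treating it as a plain truth value matters here: -1 also counts as true, so a bare `while (getline line < file)' would loop forever on an unreadable file.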
Using `getline' with No Arguments
---------------------------------

The `getline' command can be used without arguments to read input from the current input file. All it does in this case is read the next input record and split it up into fields. This is useful if you've finished processing the current record, but you want to do some special processing *right now* on the next record. Here's an example:

     awk '{
          if ((t = index($0, "/*")) != 0) {
               # value will be "" if t is 1
               tmp = substr($0, 1, t - 1)
               u = index(substr($0, t + 2), "*/")
               while (u == 0) {
                    if (getline <= 0) {
                         m = "unexpected EOF or error"
                         m = (m ": " ERRNO)
                         print m > "/dev/stderr"
                         exit
                    }
                    t = -1
                    u = index($0, "*/")
               }
               # substr expression will be "" if */
               # occurred at end of line
               $0 = tmp substr($0, t + u + 3)
          }
          print $0
     }'

This `awk' program deletes all C-style comments, `/* ... */', from the input. By replacing the `print $0' with other statements, you could perform more complicated processing on the decommented input, like searching for matches of a regular expression. This program has a subtle problem--it does not work if one comment ends and another begins on the same line.

This form of the `getline' command sets `NF' (the number of fields; *note Examining Fields: Fields.), `NR' (the number of records read so far; *note How Input is Split into Records: Records.), `FNR' (the number of records read from this input file), and the value of `$0'.

*Note:* the new value of `$0' is used in testing the patterns of any subsequent rules. The original value of `$0' that triggered the rule which executed `getline' is lost (d.c.). By contrast, the `next' statement reads a new record but immediately begins processing it normally, starting with the first rule in the program. *Note The `next' Statement: Next Statement.

Using `getline' Into a Variable
-------------------------------

You can use `getline VAR' to read the next record from `awk''s input into the variable VAR. No other processing is done.
For example, suppose the next line is a comment, or a special string, and you want to read it, without triggering any rules. This form of `getline' allows you to read that line and store it in a variable so that the main read-a-line-and-check-each-rule loop of `awk' never sees it.

The following example swaps every two lines of input. For example, given:

     wan
     tew
     free
     phore

it outputs:

     tew
     wan
     phore
     free

Here's the program:

     awk '{
          if ((getline tmp) > 0) {
               print tmp
               print $0
          } else
               print $0
     }'

The `getline' command used in this way sets only the variables `NR' and `FNR' (and of course, VAR). The record is not split into fields, so the values of the fields (including `$0') and the value of `NF' do not change.

Using `getline' from a File
---------------------------

Use `getline < FILE' to read the next record from the file FILE. Here FILE is a string-valued expression that specifies the file name. `< FILE' is called a "redirection" since it directs input to come from a different place.

For example, the following program reads its input record from the file `secondary.input' when it encounters a first field with a value equal to 10 in the current input file.

     awk '{
          if ($1 == 10) {
               getline < "secondary.input"
               print
          } else
               print
     }'

Since the main input stream is not used, the values of `NR' and `FNR' are not changed. But the record read is split into fields in the normal manner, so the values of `$0' and other fields are changed. So is the value of `NF'.

According to POSIX, `getline < EXPRESSION' is ambiguous if EXPRESSION contains unparenthesized operators other than `$'; for example, `getline < dir "/" file' is ambiguous because the concatenation operator is not parenthesized, and you should write it as `getline < (dir "/" file)' if you want your program to be portable to other `awk' implementations.

Using `getline' Into a Variable from a File
-------------------------------------------

Use `getline VAR < FILE' to read input from the file FILE and put it in the variable VAR.
As above, FILE is a string-valued expression that specifies the file from which to read.

In this version of `getline', none of the built-in variables are changed, and the record is not split into fields. The only variable changed is VAR.

According to POSIX, `getline VAR < EXPRESSION' is ambiguous if EXPRESSION contains unparenthesized operators other than `$'; for example, `getline VAR < dir "/" file' is ambiguous because the concatenation operator is not parenthesized, and you should write it as `getline VAR < (dir "/" file)' if you want your program to be portable to other `awk' implementations.

For example, the following program copies all the input files to the output, except for records that say `@include FILENAME'. Such a record is replaced by the contents of the file FILENAME.

     awk '{
          if (NF == 2 && $1 == "@include") {
               while ((getline line < $2) > 0)
                    print line
               close($2)
          } else
               print
     }'

Note here how the name of the extra input file is not built into the program; it is taken directly from the data, from the second field on the `@include' line.

The `close' function is called to ensure that if two identical `@include' lines appear in the input, the entire specified file is included twice. *Note Closing Input and Output Files and Pipes: Close Files And Pipes.

One deficiency of this program is that it does not process nested `@include' statements (`@include' statements in included files) the way a true macro preprocessor would. *Note An Easy Way to Use Library Functions: Igawk Program, for a program that does handle nested `@include' statements.

Using `getline' from a Pipe
---------------------------

You can pipe the output of a command into `getline', using `COMMAND | getline'. In this case, the string COMMAND is run as a shell command and its output is piped into `awk' to be used as input. This form of `getline' reads one record at a time from the pipe.
For example, the following program copies its input to its output, except for lines that begin with `@execute', which are replaced by the output produced by running the rest of the line as a shell command:

     awk '{
          if ($1 == "@execute") {
               tmp = substr($0, 10)
               while ((tmp | getline) > 0)
                    print
               close(tmp)
          } else
               print
     }'

The `close' function is called to ensure that if two identical `@execute' lines appear in the input, the command is run for each one. *Note Closing Input and Output Files and Pipes: Close Files And Pipes.

Given the input:

     foo
     bar
     baz
     @execute who
     bletch

the program might produce:

     foo
     bar
     baz
     arnold     ttyv0   Jul 13 14:22
     miriam     ttyp0   Jul 13 14:23     (murphy:0)
     bill       ttyp1   Jul 13 14:23     (murphy:0)
     bletch

Notice that this program ran the command `who' and printed the result. (If you try this program yourself, you will of course get different results, showing you who is logged in on your system.)

This variation of `getline' splits the record into fields, sets the value of `NF' and recomputes the value of `$0'. The values of `NR' and `FNR' are not changed.

According to POSIX, `EXPRESSION | getline' is ambiguous if EXPRESSION contains unparenthesized operators other than `$'; for example, `"echo " "date" | getline' is ambiguous because the concatenation operator is not parenthesized, and you should write it as `("echo " "date") | getline' if you want your program to be portable to other `awk' implementations.

Using `getline' Into a Variable from a Pipe
-------------------------------------------

When you use `COMMAND | getline VAR', the output of the command COMMAND is sent through a pipe to `getline' and into the variable VAR. For example, the following program reads the current date and time into the variable `current_time', using the `date' utility, and then prints it.

     awk 'BEGIN {
          "date" | getline current_time
          close("date")
          print "Report printed on " current_time
     }'

In this version of `getline', none of the built-in variables are changed, and the record is not split into fields.

According to POSIX, `EXPRESSION | getline VAR' is ambiguous if EXPRESSION contains unparenthesized operators other than `$'; for example, `"echo " "date" | getline VAR' is ambiguous because the concatenation operator is not parenthesized, and you should write it as `("echo " "date") | getline VAR' if you want your program to be portable to other `awk' implementations.

Summary of `getline' Variants
-----------------------------

With all the forms of `getline', even though `$0' and `NF' may be updated, the record will not be tested against all the patterns in the `awk' program, in the way that would happen if the record were read normally by the main processing loop of `awk'. However, the new record is tested against any subsequent rules.

Many `awk' implementations limit the number of pipelines an `awk' program may have open to just one! In `gawk', there is no such limit. You can open as many pipelines as the underlying operating system will permit.

An interesting side-effect occurs if you use `getline' (without a redirection) inside a `BEGIN' rule. Since an unredirected `getline' reads from the command line data files, the first `getline' command causes `awk' to set the value of `FILENAME'. Normally, `FILENAME' does not have a value inside `BEGIN' rules, since you have not yet started to process the command line data files (d.c.). (*Note The `BEGIN' and `END' Special Patterns: BEGIN/END, also *note Built-in Variables that Convey Information: Auto-set..)

The following table summarizes the six variants of `getline', listing which built-in variables are set by each one.

`getline'
     sets `$0', `NF', `FNR', and `NR'.

`getline VAR'
     sets VAR, `FNR', and `NR'.

`getline < FILE'
     sets `$0', and `NF'.

`getline VAR < FILE'
     sets VAR.

`COMMAND | getline'
     sets `$0', and `NF'.

`COMMAND | getline VAR'
     sets VAR.

Printing Output
***************

One of the most common actions is to "print", or output, some or all of the input. You use the `print' statement for simple output. You use the `printf' statement for fancier formatting. Both are described in this chapter.

The `print' Statement
=====================

The `print' statement does output with simple, standardized formatting. You specify only the strings or numbers to be printed, in a list separated by commas. They are output, separated by single spaces, followed by a newline. The statement looks like this:

     print ITEM1, ITEM2, ...

The entire list of items may optionally be enclosed in parentheses. The parentheses are necessary if any of the item expressions uses the `>' relational operator; otherwise it could be confused with a redirection (*note Redirecting Output of `print' and `printf': Redirection.).

The items to be printed can be constant strings or numbers, fields of the current record (such as `$1'), variables, or any `awk' expressions. Numeric values are converted to strings, and then printed.

The `print' statement is completely general for computing *what* values to print. However, with two exceptions, you cannot specify *how* to print them--how many columns, whether to use exponential notation or not, and so on. (For the exceptions, *note Output Separators::, and *note Controlling Numeric Output with `print': OFMT.) For that, you need the `printf' statement (*note Using `printf' Statements for Fancier Printing: Printf.).

The simple statement `print' with no items is equivalent to `print $0': it prints the entire current record. To print a blank line, use `print ""', where `""' is the empty string.

To print a fixed piece of text, use a string constant such as `"Don't Panic"' as one item. If you forget to use the double-quote characters, your text will be taken as an `awk' expression, and you will probably get an error.
Keep in mind that a space is printed between any two items.

Each `print' statement makes at least one line of output. But it isn't limited to one line. If an item value is a string that contains a newline, the newline is output along with the rest of the string. A single `print' can make any number of lines this way.

Examples of `print' Statements
==============================

Here is an example of printing a string that contains embedded newlines (the `\n' is an escape sequence, used to represent the newline character; see *Note Escape Sequences::):

     $ awk 'BEGIN { print "line one\nline two\nline three" }'
     -| line one
     -| line two
     -| line three

Here is an example that prints the first two fields of each input record, with a space between them:

     $ awk '{ print $1, $2 }' inventory-shipped
     -| Jan 13
     -| Feb 15
     -| Mar 15
     ...

A common mistake in using the `print' statement is to omit the comma between two items. This often has the effect of making the items run together in the output, with no space. The reason for this is that juxtaposing two string expressions in `awk' means to concatenate them. Here is the same program, without the comma:

     $ awk '{ print $1 $2 }' inventory-shipped
     -| Jan13
     -| Feb15
     -| Mar15
     ...

To someone unfamiliar with the file `inventory-shipped', neither example's output makes much sense. A heading line at the beginning would make it clearer. Let's add some headings to our table of months (`$1') and green crates shipped (`$2'). We do this using the `BEGIN' pattern (*note The `BEGIN' and `END' Special Patterns: BEGIN/END.) to force the headings to be printed only once:

     awk 'BEGIN {  print "Month Crates"
                   print "----- ------" }
                {  print $1, $2 }' inventory-shipped

Did you already guess what happens? When run, the program prints the following:

     Month Crates
     ----- ------
     Jan 13
     Feb 15
     Mar 15
     ...

The headings and the table data don't line up!
We can fix this by printing some spaces between the two fields:

     awk 'BEGIN { print "Month Crates"
                  print "----- ------" }
                { print $1, " ", $2 }' inventory-shipped

You can imagine that this way of lining up columns can get pretty complicated when you have many columns to fix. Counting spaces for two or three columns can be simple, but more than this and you can get lost quite easily. This is why the `printf' statement was created (*note Using `printf' Statements for Fancier Printing: Printf.); one of its specialties is lining up columns of data.

As a side point, you can continue either a `print' or `printf' statement simply by putting a newline after any comma (*note `awk' Statements Versus Lines: Statements/Lines.).

Output Separators
=================

As mentioned previously, a `print' statement contains a list of items, separated by commas. In the output, the items are normally separated by single spaces. This need not be the case; a single space is only the default. You can specify any string of characters to use as the "output field sepa