This file documents awk, a program that you can use to select
particular records in a file and perform operations upon them.
Copyright © 1989, 1991, 1992, 1993, 1996, 1997, 1998, 1999, 2000, 2001, 2002 Free Software Foundation, Inc.
This is Edition 3 of GAWK: Effective AWK Programming: A User's Guide for GNU Awk, for the 3.1.1 (or later) version of the GNU implementation of AWK.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with the Invariant Sections being "GNU General Public License", the Front-Cover texts being (a) (see below), and with the Back-Cover Texts being (b) (see below). A copy of the license is included in the section entitled "GNU Free Documentation License".
Arnold Robbins and I are good friends. We were introduced 11 years ago
by circumstances--and our favorite programming language, AWK.
The circumstances started a couple of years
earlier. I was working at a new job and noticed an unplugged
Unix computer sitting in the corner. No one knew how to use it,
and neither did I. However,
a couple of days later it was running, and
I was root and the one-and-only user.
That day, I began the transition from statistician to Unix programmer.
On one of many trips to the library or bookstore in search of books on Unix, I found the gray AWK book, a.k.a. Aho, Kernighan and Weinberger, The AWK Programming Language, Addison-Wesley, 1988. AWK's simple programming paradigm--find a pattern in the input and then perform an action--often reduced complex or tedious data manipulations to a few lines of code. I was excited to try my hand at programming in AWK.
Alas, the awk on my computer was a limited version of the
language described in the AWK book. I discovered that my computer
had "old awk" and the AWK book described "new awk."
I learned that this was typical; the old version refused to step
aside or relinquish its name. If a system had a new awk, it was
invariably called nawk, and few systems had it.
The best way to get a new awk was to ftp the source code for
gawk from prep.ai.mit.edu. gawk was a version of
new awk written by David Trueman and Arnold, and available under
the GNU General Public License.
(Incidentally,
it's no longer difficult to find a new awk. gawk ships with
Linux, and you can download binaries or source code for almost
any system; my wife uses gawk on her VMS box.)
My Unix system started out unplugged from the wall; it certainly was not
plugged into a network. So, oblivious to the existence of gawk
and the Unix community in general, and desiring a new awk, I wrote
my own, called mawk.
Before I was finished I knew about gawk,
but it was too late to stop, so I eventually posted
to a comp.sources newsgroup.
A few days after my posting, I got a friendly email
from Arnold introducing
himself. He suggested we share design and algorithms and
attached a draft of the POSIX standard so
that I could update mawk to support language extensions added
after publication of the AWK book.
Frankly, if our roles had been reversed, I would not have been so open and we probably would have never met. I'm glad we did meet. He is an AWK expert's AWK expert and a genuinely nice person. Arnold contributes significant amounts of his expertise and time to the Free Software Foundation.
This book is the gawk reference manual, but at its core it
is a book about AWK programming that
will appeal to a wide audience.
It is a definitive reference to the AWK language as defined by the
1987 Bell Labs release and codified in the 1992 POSIX Utilities
standard.
On the other hand, the novice AWK programmer can study
a wealth of practical programs that emphasize
the power of AWK's basic idioms:
data driven control-flow, pattern matching with regular expressions,
and associative arrays.
Those looking for something new can try out gawk's
interface to network protocols via special /inet files.
The programs in this book make clear that an AWK program is typically much smaller and faster to develop than a counterpart written in C. Consequently, there is often a payoff to prototype an algorithm or design in AWK to get it running quickly and expose problems early. Often, the interpreted performance is adequate and the AWK prototype becomes the product.
The new pgawk (profiling gawk) produces
program execution counts.
I recently experimented with an algorithm that, for
n lines of input, exhibited ~ C n^2 performance, while
theory predicted ~ C n log n behavior. A few minutes poring
over the awkprof.out profile pinpointed the problem to
a single line of code. pgawk is a welcome addition to
my programmer's toolbox.
Arnold has distilled over a decade of experience writing and
using AWK programs, and developing gawk, into this book. If you use
AWK or want to learn how, then read this book.
Michael Brennan
Author of mawk
Several kinds of tasks occur repeatedly
when working with text files.
You might want to extract certain lines and discard the rest.
Or you may need to make changes wherever certain patterns appear,
but leave the rest of the file alone.
Writing single-use programs for these tasks in languages such as C, C++, or Pascal
is time-consuming and inconvenient.
Such jobs are often easier with awk.
The awk utility interprets a special-purpose programming language
that makes it easy to handle simple data-reformatting jobs.
The GNU implementation of awk is called gawk; it is fully
compatible with the System V Release 4 version of
awk. gawk is also compatible with the POSIX
specification of the awk language. This means that all
properly written awk programs should work with gawk.
Thus, we usually don't distinguish between gawk and other
awk implementations.
Using awk allows you to:
In addition,
gawk
provides facilities that make it easy to:
This Web page teaches you about the awk language and
how you can use it effectively. You should already be familiar with basic
system commands, such as cat and ls,1 as well as basic shell
facilities, such as input/output (I/O) redirection and pipes.
Implementations of the awk language are available for many
different computing environments. This Web page, while describing
the awk language in general, also describes the particular
implementation of awk called gawk (which stands for
"GNU awk"). gawk runs on a broad range of Unix systems,
ranging from 80386 PC-based computers up through large-scale systems,
such as Crays. gawk has also been ported to Mac OS X,
MS-DOS, Microsoft Windows (all versions) and OS/2 PCs, Atari and Amiga
microcomputers, BeOS, Tandem D20, and VMS.
History of awk and gawk
1 part egrep  |  1 part snobol
2 parts ed    |  3 parts C
Blend all parts well using lex and yacc. Document minimally and release.
After eight years, add another part egrep and two more parts C. Document very well and release.
The name awk comes from the initials of its designers: Alfred V.
Aho, Peter J. Weinberger and Brian W. Kernighan. The original version of
awk was written in 1977 at AT&T Bell Laboratories.
In 1985, a new version made the programming
language more powerful, introducing user-defined functions, multiple input
streams, and computed regular expressions.
This new version became widely available with Unix System V
Release 3.1 (SVR3.1).
The version in SVR4 added some new features and cleaned
up the behavior in some of the "dark corners" of the language.
The specification for awk in the POSIX Command Language
and Utilities standard further clarified the language.
Both the gawk designers and the original Bell Laboratories awk
designers provided feedback for the POSIX specification.
Paul Rubin wrote the GNU implementation, gawk, in 1986.
Jay Fenlason completed it, with advice from Richard Stallman. John Woods
contributed parts of the code as well. In 1988 and 1989, David Trueman, with
help from me, thoroughly reworked gawk for compatibility
with the newer awk.
Circa 1995, I became the primary maintainer.
Current development focuses on bug fixes,
performance improvements, standards compliance, and occasionally, new features.
In May of 1997, Jürgen Kahrs felt the need for network access
from awk, and with a little help from me, set about adding
features to do this for gawk. At that time, he also
wrote the bulk of
TCP/IP Internetworking with gawk
(a separate document, available as part of the gawk distribution).
His code finally became part of the main gawk distribution
with gawk version 3.1.
See Major Contributors to gawk,
for a complete list of those who made important contributions to gawk.
The awk language has evolved over the years. Full details are
provided in The Evolution of the awk Language.
The language described in this Web page
is often referred to as "new awk" (nawk).
Because of this, many systems have multiple
versions of awk.
Some systems have an awk utility that implements the
original version of the awk language and a nawk utility
for the new
version.
Others have an oawk version for the "old awk"
language and plain awk for the new one. Still others only
have one version, which is usually the new one.2
All in all, this makes it difficult for you to know which version of
awk you should run when writing your programs. The best advice
I can give here is to check your local documentation. Look for awk,
oawk, and nawk, as well as for gawk.
It is likely that you already
have some version of new awk on your system, which is what
you should use when running your programs. (Of course, if you're reading
this Web page, chances are good that you have gawk!)
Throughout this Web page, whenever we refer to a language feature
that should be available in any complete implementation of POSIX awk,
we simply use the term awk. When referring to a feature that is
specific to the GNU implementation, we use the term gawk.
The term awk refers to a particular program as well as to the language you
use to tell this program what to do. When we need to be careful, we call
the language "the awk language,"
and the program "the awk utility."
This Web page explains
both the awk language and how to run the awk utility.
The term awk program refers to a program written by you in
the awk programming language.
Primarily, this Web page explains the features of awk,
as defined in the POSIX standard. It does so in the context of the
gawk implementation. While doing so, it also
attempts to describe important differences between gawk
and other awk implementations.3
Finally, any gawk features that are not in
the POSIX standard for awk are noted.
This Web page has the difficult task of being both a tutorial and a reference. If you are a novice, feel free to skip over details that seem too complex. You should also ignore the many cross-references; they are for the expert user and for the online Info version of the document.
There are subsections labelled as Advanced Notes scattered throughout the Web page. They add a more complete explanation of points that are relevant, but not likely to be of interest on first reading. All appear in the index, under the heading "advanced features."
Most of the time, the examples use complete awk programs.
In some of the more advanced sections, only the part of the awk
program that illustrates the concept currently being described is shown.
While this Web page is aimed principally at people who have not been
exposed
to awk, there is a lot of information here that even the awk
expert should find useful. In particular, the description of POSIX
awk and the example programs in
A Library of awk Functions, and in
Practical awk Programs,
should be of interest.
Getting Started with awk,
provides the essentials you need to know to begin using awk.
Regular Expressions,
introduces regular expressions in general, and in particular the flavors
supported by POSIX awk and gawk.
Reading Input Files,
describes how awk reads your data.
It introduces the concepts of records and fields, as well
as the getline command.
I/O redirection is first described here.
Printing Output,
describes how awk programs can produce output with
print and printf.
Expressions,
describes expressions, which are the basic building blocks
for getting most things done in a program.
Patterns Actions and Variables,
describes how to write patterns for matching records, actions for
doing something when a record is matched, and the built-in variables
awk and gawk use.
Arrays in awk,
covers awk's one-and-only data structure: associative arrays.
Deleting array elements and whole arrays is also described, as well as
sorting arrays in gawk.
Functions,
describes the built-in functions awk and
gawk provide, as well as how to define
your own functions.
Internationalization with gawk,
describes special features in gawk for translating program
messages into different languages at runtime.
Advanced Features of gawk,
describes a number of gawk-specific advanced features.
Of particular note
are the abilities to have two-way communications with another process,
perform TCP/IP networking, and
profile your awk programs.
Running awk and gawk,
describes how to run gawk, the meaning of its
command-line options, and how it finds awk
program source files.
A Library of awk Functions, and
Practical awk Programs,
provide many sample awk programs.
Reading them allows you to see awk
solving real problems.
The Evolution of the awk Language,
describes how the awk language has evolved from its
first release to the present. It also describes how gawk
has acquired features over time.
Installing gawk,
describes how to get gawk, how to compile it
under Unix, and how to compile and use it on different
non-Unix systems. It also describes how to report bugs
in gawk and where to get three other freely
available implementations of awk.
Implementation Notes,
describes how to disable gawk's extensions, as
well as how to contribute new code to gawk,
how to write extension libraries, and some possible
future directions for gawk development.
Basic Programming Concepts,
provides some very cursory background material for those who
are completely unfamiliar with computer programming.
Also centralized there is a discussion of some of the issues
surrounding floating-point numbers.
The
Glossary,
defines most, if not all, the significant terms used
throughout the book.
If you find terms that you aren't familiar with, try looking them up here.
GNU General Public License, and
GNU Free Documentation License,
present the licenses that cover the gawk source code
and this Web page, respectively.
This Web page is written using Texinfo, the GNU documentation formatting language. A single Texinfo source file is used to produce both the printed and online versions of the documentation. This section briefly documents the typographical conventions used in Texinfo.
Examples you would type at the command-line are preceded by the common
shell primary and secondary prompts, $ and >.
Output from the command is preceded by the glyph "-|".
This typically represents the command's standard output.
Error messages, and other output on the command's standard error, are preceded
by the glyph "error-->". For example:
$ echo hi on stdout
-| hi on stdout
$ echo hello on stderr 1>&2
error--> hello on stderr
Characters that you type at the keyboard look like this. In particular, there are special characters called "control characters." These are characters that you type by holding down both the CONTROL key and another key, at the same time. For example, a Ctrl-d is typed by first pressing and holding the CONTROL key, next pressing the d key and finally releasing both keys.
Dark corners are basically fractal -- no matter how much you illuminate, there's always a smaller but darker one.
Brian Kernighan
Until the POSIX standard (and The Gawk Manual),
many features of awk were either poorly documented or not
documented at all. Descriptions of such features
(often called "dark corners") are noted in this Web page with
"(d.c.)".
They also appear in the index under the heading "dark corner."
As noted by the opening quote, though, any coverage of dark corners is, by definition, incomplete.
The Free Software Foundation (FSF) is a nonprofit organization dedicated to the production and distribution of freely distributable software. It was founded by Richard M. Stallman, the author of the original Emacs editor. GNU Emacs is the most widely used version of Emacs today.
The GNU4
Project is an ongoing effort on the part of the Free Software
Foundation to create a complete, freely distributable, POSIX-compliant
computing environment.
The FSF uses the "GNU General Public License" (GPL) to ensure that
their software's
source code is always available to the end user. A
copy of the GPL is included
in this Web page
for your reference
(see GNU General Public License).
The GPL applies to the C language source code for gawk.
To find out more about the FSF and the GNU Project online,
see the GNU Project's home page.
This Web page may also be read from
their web site.
A shell, an editor (Emacs), highly portable optimizing C, C++, and
Objective-C compilers, a symbolic debugger and dozens of large and
small utilities (such as gawk) have all been completed and are
freely available. The GNU operating
system kernel (the HURD) has been released but is still in an early
stage of development.
Until the GNU operating system is more fully developed, you should
consider using GNU/Linux, a freely distributable, Unix-like operating
system for Intel 80386, DEC Alpha, Sun SPARC, IBM S/390, and other
systems.5
There are
many books on GNU/Linux. One that is freely available is Linux
Installation and Getting Started, by Matt Welsh.
GNU/Linux distributions are often available in computer stores or
bundled on CD-ROMs with books about Linux.
(There are three other freely available, Unix-like operating systems for
80386 and other systems: NetBSD, FreeBSD, and OpenBSD. All are based on the
4.4-Lite Berkeley Software Distribution, and they use recent versions
of gawk for their versions of awk.)
The Web page you are reading is actually free--at least, the
information in it is free to anyone. The machine-readable
source code for the Web page comes with gawk; anyone
may take this Web page to a copying machine and make as many
copies as they like. (Take a moment to check the Free Documentation
License in GNU Free Documentation License.)
Although you could just print it out yourself, bound books are much easier to read and use. Furthermore, the proceeds from sales of this book go back to the FSF to help fund development of more free software.
The Web page itself has gone through a number of previous editions.
Paul Rubin wrote the very first draft of The GAWK Manual;
it was around 40 pages in size.
Diane Close and Richard Stallman improved it, yielding a
version that was
around 90 pages long and barely described the original, "old"
version of awk.
I started working with that version in the fall of 1988.
As work on it progressed,
the FSF published several preliminary versions (numbered 0.x).
In 1996, Edition 1.0 was released with gawk 3.0.0.
The FSF published the first two editions under
the title The GNU Awk User's Guide.
This edition maintains the basic structure of Edition 1.0,
but with significant additional material, reflecting the host of new features
in gawk version 3.1.
Of particular note is
Sorting Array Values and Indices with gawk,
as well as
Using gawk's Bit Manipulation Functions,
Internationalization with gawk,
and also
Advanced Features of gawk,
and
Adding New Built-in Functions to gawk.
GAWK: Effective AWK Programming will undoubtedly continue to evolve.
An electronic version
comes with the gawk distribution from the FSF.
If you find an error in this Web page, please report it!
See Reporting Problems and Bugs, for information on submitting
problem reports electronically, or write to me in care of the publisher.
As the maintainer of GNU awk,
I am starting a collection of publicly available awk
programs.
For more information,
see ftp://ftp.freefriends.org/arnold/Awkstuff.
If you have written an interesting awk program, or have written a
gawk extension that you would like to
share with the rest of the world, please contact me (arnold@gnu.org).
Making things available on the Internet helps keep the
gawk distribution down to manageable size.
The initial draft of The GAWK Manual had the following acknowledgments:
Many people need to be thanked for their assistance in producing this manual. Jay Fenlason contributed many ideas and sample programs. Richard Mlynarik and Robert Chassell gave helpful comments on drafts of this manual. The paper A Supplemental Document for awk by John W. Pierce of the Chemistry Department at UC San Diego, pinpointed several issues relevant both to awk implementation and to this manual, that would otherwise have escaped us.
I would like to acknowledge Richard M. Stallman, for his vision of a better world and for his courage in founding the FSF and starting the GNU Project.
The following people (in alphabetical order) provided helpful comments on various versions of this book, up to and including this edition. Rick Adams, Nelson H.F. Beebe, Karl Berry, Dr. Michael Brennan, Rich Burridge, Claire Cloutier, Diane Close, Scott Deifik, Christopher ("Topher") Eliot, Jeffrey Friedl, Dr. Darrel Hankerson, Michal Jaegermann, Dr. Richard J. LeBlanc, Michael Lijewski, Pat Rankin, Miriam Robbins, Mary Sheehan, and Chuck Toporek.
Robert J. Chassell provided much valuable advice on the use of Texinfo. He also deserves special thanks for convincing me not to title this Web page How To Gawk Politely. Karl Berry helped significantly with the TeX part of Texinfo.
I would like to thank Marshall and Elaine Hartholz of Seattle and
Dr. Bert and Rita Schreiber of Detroit for large amounts of quiet vacation
time in their homes, which allowed me to make significant progress on
this Web page and on gawk itself.
Phil Hughes of SSC contributed in a very important way by loaning me his laptop GNU/Linux system, not once, but twice, which allowed me to do a lot of work while away from home.
David Trueman deserves special credit; he has done a yeoman job
of evolving gawk so that it performs well and without bugs.
Although he is no longer involved with gawk,
working with him on this project was a significant pleasure.
The intrepid members of the GNITS mailing list, and most notably Ulrich Drepper, provided invaluable help and feedback for the design of the internationalization features.
Nelson Beebe,
Martin Brown,
Andreas Buening,
Scott Deifik,
Darrel Hankerson,
Isamu Hasegawa,
Michal Jaegermann,
Jürgen Kahrs,
Pat Rankin,
Kai Uwe Rommel,
and Eli Zaretskii
(in alphabetical order)
make up the
gawk "crack portability team." Without their hard work and
help, gawk would not be nearly the fine program it is today. It
has been and continues to be a pleasure working with this team of fine
people.
David and I would like to thank Brian Kernighan of Bell Laboratories for
invaluable assistance during the testing and debugging of gawk, and for
help in clarifying numerous points about the language. We could not have
done nearly as good a job on either gawk or its documentation without
his help.
Chuck Toporek, Mary Sheehan, and Claire Cloutier of O'Reilly & Associates contributed
significant editorial help for this Web page for the
3.1 release of gawk.
I must thank my wonderful wife, Miriam, for her patience through
the many versions of this project, for her proofreading,
and for sharing me with the computer.
I would like to thank my parents for their love, and for the grace with
which they raised and educated me.
Finally, I also must acknowledge my gratitude to G-d, for the many opportunities
He has sent my way, as well as for the gifts He has given me with which to
take advantage of those opportunities.
Arnold Robbins
Nof Ayalon
ISRAEL
March, 2001
Getting Started with awk
The basic function of awk is to search files for lines (or other
units of text) that contain certain patterns. When a line matches one
of the patterns, awk performs specified actions on that line.
awk keeps processing input lines in this way until it reaches
the end of the input files.
Programs in awk are different from programs in most other languages,
because awk programs are data-driven; that is, you describe
the data you want to work with and then what to do when you find it.
Most other languages are procedural; you have to describe, in great
detail, every step the program is to take. When working with procedural
languages, it is usually much
harder to clearly describe the data your program will process.
For this reason, awk programs are often refreshingly easy to
read and write.
When you run awk, you specify an awk program that
tells awk what to do. The program consists of a series of
rules. (It may also contain function definitions,
an advanced feature that we will ignore for now.
See User-Defined Functions.) Each rule specifies one
pattern to search for and one action to perform
upon finding the pattern.
Syntactically, a rule consists of a pattern followed by an action. The
action is enclosed in curly braces to separate it from the pattern.
Newlines usually separate rules. Therefore, an awk
program looks like this:
pattern { action }
pattern { action }
...
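As a minimal sketch of this layout, the following command runs a two-rule program over three made-up input lines. The sample data and the patterns /foo/ and /bar/ are illustrative assumptions, not part of the original text; patterns and actions are explained in detail later.

```shell
# Feed three sample lines to a two-rule awk program.
# Each rule has the form: pattern { action }
printf 'foo one\nbar two\nbaz three\n' |
awk '
/foo/ { print }               # rule 1: print lines containing "foo"
/bar/ { print "saw:", $0 }    # rule 2: tag lines containing "bar"
'
```

The first input line matches only the first rule and is printed unchanged, the second matches only the second rule and is printed as "saw: bar two", and the third matches neither rule, so nothing is printed for it.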
How to Run awk Programs
There are several ways to run an awk program. If the program is
short, it is easiest to include it in the command that runs awk,
like this:
awk 'program' input-file1 input-file2 ...
When the program is long, it is usually more convenient to put it in a file
and run it with a command like this:
awk -f program-file input-file1 input-file2 ...
This section discusses both mechanisms, along with several variations of each.
One-Shot Throwaway awk Programs
Once you are familiar with awk, you will often type in simple
programs the moment you want to use them. Then you can write the
program as the first argument of the awk command, like this:
awk 'program' input-file1 input-file2 ...
where program consists of a series of patterns and actions, as described earlier.
This command format instructs the shell, or command interpreter,
to start awk and use the program to process records in the
input file(s). There are single quotes around program so
the shell won't interpret any awk characters as special shell
characters. The quotes also cause the shell to treat all of program as
a single argument for awk, and allow program to be more
than one line long.
This format is also useful for running short or medium-sized awk
programs from shell scripts, because it avoids the need for a separate
file for the awk program. A self-contained shell script is more
reliable because there are no other files to misplace.
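The following is a small sketch of such a self-contained shell script. The script itself and its sample input are hypothetical; $1 and $2 are awk field references, which are described later. It is meant only to show the effect of the single quotes.

```shell
#!/bin/sh
# Hypothetical wrapper script: swap the first two fields of each input line.
# The single quotes keep the shell from expanding $1 and $2, so awk
# receives them as field references; they also let the program span lines.
printf 'alpha 1\nbeta 2\n' |
awk '{
    print $2, $1
}'
```

With the sample input shown, the script prints "1 alpha" and then "2 beta". Had the program been written in double quotes, the shell would have replaced $1 and $2 with its own positional parameters before awk ever saw them.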
Some Simple Examples,
later in this chapter,
presents several short,
self-contained programs.
Running awk Without Input Files
You can also run awk without any input files. If you type the
following command line:
awk 'program'
awk applies the program to the standard input,
which usually means whatever you type on the terminal. This continues
until you indicate end-of-file by typing Ctrl-d.
(On other operating systems, the end-of-file character may be different.
For example, on OS/2 and MS-DOS, it is Ctrl-z.)
As an example, the following program prints a friendly piece of advice
(from Douglas Adams's The Hitchhiker's Guide to the Galaxy),
to keep you from worrying about the complexities of computer programming
(BEGIN is a feature we haven't discussed yet):
$ awk "BEGIN { print \"Don't Panic!\" }"
-| Don't Panic!
This program does not read any input. The \ before each of the
inner double quotes is necessary because of the shell's quoting
rules--in particular because it mixes both single quotes and
double quotes.6
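One possible way to avoid the backslashes, sketched here on the assumption that your awk accepts octal escape sequences in string constants, is to keep the whole program in single quotes and write the apostrophe as \47, its octal code in ASCII:

```shell
# The program stays inside single quotes, so the shell leaves it alone.
# \47 is the octal escape for the apostrophe character in ASCII.
awk 'BEGIN { print "Don\47t Panic!" }'
```

This prints the same line, Don't Panic!, without mixing the two kinds of quotes on the command line.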
This next simple awk program
emulates the cat utility; it copies whatever you type on the
keyboard to its standard output (why this works is explained shortly).
$ awk '{ print }'
Now is the time for all good men
-| Now is the time for all good men
to come to the aid of their country.
-| to come to the aid of their country.
Four score and seven years ago, ...
-| Four score and seven years ago, ...
What, me worry?
-| What, me worry?
Ctrl-d
Sometimes your awk programs can be very long. In this case, it is
more convenient to put the program into a separate file. In order to tell
awk to use that file for its program, you type:
awk -f source-file input-file1 input-file2 ...
The -f instructs the awk utility to get the awk program
from the file source-file. Any file name can be used for
source-file. For example, you could put the program:
BEGIN { print "Don't Panic!" }
into the file advice. Then this command:
awk -f advice
does the same thing as this one:
awk "BEGIN { print \"Don't Panic!\" }"
This was explained earlier
(see Running awk Without Input Files).
Note that you don't usually need single quotes around the file name that you
specify with -f, because most file names don't contain any of the shell's
special characters. Notice that in advice, the awk
program did not have single quotes around it. The quotes are only needed
for programs that are provided on the awk command line.
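The steps above can be tried out with a short sketch like the following. The scratch directory created with mktemp is an assumption made so that the example cleans up after itself; any writable location would do.

```shell
# Create a scratch directory and store the program in a file named advice.
dir=$(mktemp -d)
cat > "$dir/advice" <<'EOF'
BEGIN { print "Don't Panic!" }
EOF

# Run the program from the file; no quoting of the program text is needed.
awk -f "$dir/advice"

# Remove the scratch directory again.
rm -rf "$dir"
```

Because the program text lives in a file, the apostrophe in Don't needs no special treatment, and the command prints Don't Panic! just as the quoted command-line version does.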
If you want to identify your awk program files clearly as such,
you can add the extension .awk to the file name. This doesn't
affect the execution of the awk program but it does make
"housekeeping" easier.
Executable awk Programs
Once you have learned awk, you may want to write self-contained
awk scripts, using the #! script mechanism. You can do
this on many Unix systems[7] as well as on the GNU system.
For example, you could update the file advice to look like this:
#! /bin/awk -f
BEGIN { print "Don't Panic!" }
After making this file executable (with the chmod utility),
simply type advice
at the shell and the system arranges to run awk[8] as if you had
typed awk -f advice:
$ chmod +x advice
$ advice
-| Don't Panic!
Self-contained awk scripts are useful when you want to write a
program that users can invoke without their having to know that the program is
written in awk.
Portability Issues with #!
Some systems limit the length of the interpreter name to 32 characters. Often, this can be dealt with by using a symbolic link.
You should not put more than one argument on the #!
line after the path to awk. It does not work. The operating system
treats the rest of the line as a single argument and passes it to awk.
Doing this leads to confusing behavior--most likely a usage diagnostic
of some sort from awk.
Finally,
the value of ARGV[0]
(see Built-in Variables)
varies depending upon your operating system.
Some systems put awk there, some put the full pathname
of awk (such as /bin/awk), and some put the name
of your script (advice). Don't rely on the value of ARGV[0]
to provide your script name.
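As a rough sketch of the points above, the following session builds a self-contained script whose #! line carries exactly one argument (-f). Because the location of awk differs between systems, the sketch looks it up with command -v instead of assuming /bin/awk. The scratch directory is an assumption made for illustration, and the sketch presumes that awk is an ordinary executable on the search path.

```shell
# Scratch directory for the example script.
dir=$(mktemp -d)

# Find where awk lives on this system; /bin/awk is not universal.
awk_path=$(command -v awk)

# Write the script: interpreter path, then the single argument -f.
cat > "$dir/advice" <<EOF
#!$awk_path -f
BEGIN { print "Don't Panic!" }
EOF

# Make it executable and run it like any other command.
chmod +x "$dir/advice"
script_output=$("$dir/advice")
echo "$script_output"

rm -rf "$dir"
```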
Comments in awk Programs
A comment is some text that is included in a program for the sake of human readers; it is not really an executable part of the program. Comments can explain what the program does and how it works. Nearly all programming languages have provisions for comments, as programs are typically hard to understand without them.
In the awk language, a comment starts with the sharp sign
character (#) and continues to the end of the line.
The # does not have to be the first character on the line. The
awk language ignores the rest of a line following a sharp sign.
For example, we could have put the following into advice:
# This program prints a nice friendly message. It helps
# keep novice users from being afraid of the computer.
BEGIN { print "Don't Panic!" }
You can put comment lines into keyboard-composed throwaway awk
programs, but this usually isn't very useful; the purpose of a
comment is to help you or another person understand the program
when reading it at a later time.
Caution: As mentioned in
One-Shot Throwaway awk Programs,
you can enclose small to medium programs in single quotes, in order to keep
your shell scripts self-contained. When doing so, don't put
an apostrophe (i.e., a single quote) into a comment (or anywhere else
in your program). The shell interprets the quote as the closing
quote for the entire program. As a result, usually the shell
prints a message about mismatched quotes, and if awk actually
runs, it will probably print strange messages about syntax errors.
For example, look at the following:
$ awk '{ print "hello" } # let's be cute'
>
The shell sees that the first two quotes match, and that
a new quoted object begins at the end of the command line.
It therefore prompts with the secondary prompt, waiting for more input.
With Unix awk, closing the quoted string produces this result:
$ awk '{ print "hello" } # let's be cute'
> '
error--> awk: can't open file be
error--> source line number 1
Putting a backslash before the single quote in let's wouldn't help,
since backslashes are not special inside single quotes.
The next subsection describes the shell's quoting rules.
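One way around the problem, sketched here under the assumption of a scratch file named cute.awk, is to keep the program in a file. The shell never scans the contents of the file, so an apostrophe in a comment is harmless there.

```shell
dir=$(mktemp -d)

# The quoted here-document delimiter keeps the shell from touching the text.
cat > "$dir/cute.awk" <<'EOF'
{ print "hello" }   # let's be cute
EOF

# The comment, apostrophe included, is ignored by awk.
comment_output=$(echo "any input line" | awk -f "$dir/cute.awk")
echo "$comment_output"

rm -rf "$dir"
```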
For short to medium length awk programs, it is most convenient
to enter the program on the awk command line.
This is best done by enclosing the entire program in single quotes.
This is true whether you are entering the program interactively at
the shell prompt, or writing it as part of a larger shell script:
awk 'program text' input-file1 input-file2 ...
Once you are working with the shell, it is helpful to have a basic
knowledge of shell quoting rules. The following rules apply only to
POSIX-compliant, Bourne-style shells (such as bash, the GNU Bourne-Again
Shell). If you use csh, you're on your own.
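Preceding any single character with a backslash (\) quotes
that character. The shell removes the backslash and passes the quoted
character on to the command.
Single quotes protect everything between the opening and closing quotes.
The shell does no interpretation of the quoted text, passing it on verbatim
to the command. It is impossible to embed a single quote inside
single-quoted text. Refer back to Comments in awk Programs,
for an example of what happens if you try.
Double quotes protect most things between the opening and closing quotes.
The shell does at least variable and command substitution on the quoted text.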
Since certain characters within double-quoted text are processed by the shell,
they must be escaped within the text. Of note are the characters
$, `, \, and ", all of which must be preceded by
a backslash within double-quoted text if they are to be passed on literally
to the program. (The leading backslash is stripped first.)
Thus, the example seen
previously
in Running awk Without Input Files,
is applicable:
$ awk "BEGIN { print \"Don't Panic!\" }"
-| Don't Panic!
Note that the single quote is not special within double quotes.
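Null strings are removed when they occur as part of a non-null
command-line argument, while explicit null strings are kept as
separate arguments. For example, to specify that the field separator FS should
be set to the null string, use: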
awk -F "" 'program' files # correct
Don't use this:
awk -F"" 'program' files # wrong!
In the second case, awk will attempt to use the text of the program
as the value of FS, and the first file name as the text of the program!
This results in syntax errors at best, and confusing behavior at worst.
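The difference between the two forms can be observed without running awk at all. The sketch below uses a hypothetical helper function, count_args, that merely reports how many arguments the shell passed to it.

```shell
# Hypothetical helper: report how many arguments the shell actually passed.
count_args() {
    echo "$#"
}

# The explicit null string survives as an argument of its own.
separate=$(count_args -F "" 'program' files)

# Here the null string is glued to -F and disappears.
joined=$(count_args -F"" 'program' files)

echo "separate: $separate arguments"
echo "joined: $joined arguments"
```

With one argument fewer in the second form, whatever follows -F slides into the position where awk expects the field separator.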
Mixing single and double quotes is difficult. You have to resort
to shell quoting tricks, like this:
$ awk 'BEGIN { print "Here is a single quote <'"'"'>" }'
-| Here is a single quote <'>
This program consists of three concatenated quoted strings. The first and the third are single-quoted, the second is double-quoted.
This can be "simplified" to:
$ awk 'BEGIN { print "Here is a single quote <'\''>" }'
-| Here is a single quote <'>
Judge for yourself which of these two is the more readable.
Another option is to use double quotes, escaping the embedded, awk-level
double quotes:
$ awk "BEGIN { print \"Here is a single quote <'>\" }"
-| Here is a single quote <'>
This option is also painful, because double quotes, backslashes, and dollar signs
are very common in awk programs.
If you really need both single and double quotes in your awk
program, it is probably best to move it into a separate file, where
the shell won't be part of the picture, and you can say what you mean.
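Two further alternatives, not described in the text above and offered only as a sketch, avoid shell quoting tricks altogether. The first writes the single quote as the octal escape \047 inside an awk string constant; the second passes the quote in with -v, using a variable name (sq) chosen here purely for illustration.

```shell
# Octal 047 is the ASCII code for the single quote character.
with_escape=$(awk 'BEGIN { print "Here is a single quote <\047>" }')

# Alternatively, hand the quote to awk in a variable (sq is a made-up name).
with_variable=$(awk -v sq="'" 'BEGIN { print "Here is a single quote <" sq ">" }')

echo "$with_escape"
echo "$with_variable"
```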
Many of the examples in this Web page take their input from two sample
data files. The first, BBS-list, represents a list of
computer bulletin board systems together with information about those systems.
The second data file, called inventory-shipped, contains
information about monthly shipments. In both files,
each line is considered to be one record.
In the data file BBS-list, each record contains the name of a computer
bulletin board, its phone number, the board's baud rate(s), and a code for
the number of hours it is operational. An A in the last column
means the board operates 24 hours a day. A B in the last
column means the board only operates on evening and weekend hours.
A C means the board operates only on weekends:
aardvark     555-5553     1200/300          B
alpo-net     555-3412     2400/1200/300     A
barfly       555-7685     1200/300          A
bites        555-1675     2400/1200/300     A
camelot      555-0542     300               C
core         555-2912     1200/300          C
fooey        555-1234     2400/1200/300     B
foot         555-6699     1200/300          B
macfoo       555-6480     1200/300          A
sdace        555-3430     2400/1200/300     A
sabafoo      555-2127     1200/300          C
The data file inventory-shipped represents
information about shipments during the year.
Each record contains the month, the number
of green crates shipped, the number of red boxes shipped, the number of
orange bags shipped, and the number of blue packages shipped,
respectively. There are 16 entries, covering the 12 months of last year
and the first four months of the current year.
Jan  13  25  15  115
Feb  15  32  24  226
Mar  15  24  34  228
Apr  31  52  63  420
May  16  34  29  208
Jun  31  42  75  492
Jul  24  34  67  436
Aug  15  34  47  316
Sep  13  55  37  277
Oct  29  54  68  525
Nov  20  87  82  577
Dec  17  35  61  401
Jan  21  36  64  620
Feb  26  58  80  652
Mar  24  75  70  495
Apr  21  70  74  514
The following command runs a simple awk program that searches the
input file BBS-list for the character string foo (a
grouping of characters is usually called a string;
the term string is based on similar usage in English, such
as "a string of pearls," or "a string of cars in a train"):
awk '/foo/ { print $0 }' BBS-list
When lines containing foo are found, they are printed because
print $0 means print the current line. (Just print by
itself means the same thing, so we could have written that
instead.)
You will notice that slashes (/) surround the string foo
in the awk program. The slashes indicate that foo
is the pattern to search for. This type of pattern is called a
regular expression, which is covered in more detail later
(see Regular Expressions).
The pattern is allowed to match parts of words.
There are
single quotes around the awk program so that the shell won't
interpret any of it as special shell characters.
Here is what this program prints:
$ awk '/foo/ { print $0 }' BBS-list
-| fooey 555-1234 2400/1200/300 B
-| foot 555-6699 1200/300 B
-| macfoo 555-6480 1200/300 A
-| sabafoo 555-2127 1200/300 C
In an awk rule, either the pattern or the action can be omitted,
but not both. If the pattern is omitted, then the action is performed
for every input line. If the action is omitted, the default
action is to print all lines that match the pattern.
Thus, we could leave out the action (the print statement and the curly
braces) in the previous example and the result would be the same: all
lines matching the pattern foo are printed. By comparison,
omitting the print statement but retaining the curly braces makes an
empty action that does nothing (i.e., no lines are printed).
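The four cases can be compared side by side. This is a minimal sketch that uses three made-up lines in the style of BBS-list rather than the real file.

```shell
# Three lines standing in for a data file such as BBS-list.
sample='fooey 555-1234
camelot 555-0542
macfoo 555-6480'

# Pattern and explicit action.
explicit=$(printf '%s\n' "$sample" | awk '/foo/ { print $0 }')
# Action omitted: matching lines are printed by default.
no_action=$(printf '%s\n' "$sample" | awk '/foo/')
# Empty action: the braces are present, so nothing is printed.
empty_action=$(printf '%s\n' "$sample" | awk '/foo/ { }')
# Pattern omitted: the action runs for every line.
no_pattern=$(printf '%s\n' "$sample" | awk '{ print $0 }')

echo "$explicit"
echo "$no_pattern"
```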
Many practical awk programs are just a line or two. Following is a
collection of useful, short programs to get you started. Some of these
programs contain constructs that haven't been covered yet. (The description
of the program will give you a good idea of what is going on, but please
read the rest of the Web page to become an awk expert!)
Most of the examples use a data file named data. This is just a
placeholder; if you use these programs yourself, substitute
your own file names for data.
For future reference, note that there is often more than
one way to do things in awk. At some point, you may want
to look back at these examples and see if
you can come up with different ways to do the same things shown here:
Print the length of the longest input line:
awk '{ if (length($0) > max) max = length($0) }
     END { print max }' data
Print every line that is longer than 80 characters:
awk 'length($0) > 80' data
The sole rule has a relational expression as its pattern and it has no action--so the default action, printing the record, is used.
Print the length of the longest line in data:
expand data | awk '{ if (x < length()) x = length() }
              END { print "maximum line length is " x }'
The input is processed by the expand utility to change tabs
into spaces, so the widths compared are actually the right-margin columns.
Print every line that has at least one field:
awk 'NF > 0' data
This is an easy way to delete blank lines from a file (or rather, to create a new file similar to the old file but from which the blank lines have been removed).
Print seven random numbers from 0 to 100, inclusive:
awk 'BEGIN { for (i = 1; i <= 7; i++)
                 print int(101 * rand()) }'
Print the total number of bytes used by files:
ls -l files | awk '{ x += $5 }
                   END { print "total bytes: " x }'
Print the total number of kilobytes used by files:
ls -l files | awk '{ x += $5 }
                   END { print "total K-bytes: " (x + 1023)/1024 }'
Print a sorted list of the login names of all users:
awk -F: '{ print $1 }' /etc/passwd | sort
Count the lines in a file:
awk 'END { print NR }' data
Print the even-numbered lines in the data file:
awk 'NR % 2 == 0' data
If you use the expression NR % 2 == 1 instead,
the program would print the odd-numbered lines.
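As a small illustration of there being more than one way to do things, the sketch below counts lines in two different ways and also exercises two of the other one-liners. The four-line input (one line is empty) is invented for the example and stands in for data.

```shell
# Four input lines; the third one is empty.
data='one
two two

four four four four'

# Count lines with the built-in record counter NR.
count_nr=$(printf '%s\n' "$data" | awk 'END { print NR }')
# Count lines by hand with a variable (n is an arbitrary name).
count_manual=$(printf '%s\n' "$data" | awk '{ n++ } END { print n }')
# Drop blank lines, then count what is left.
nonblank=$(printf '%s\n' "$data" | awk 'NF > 0' | awk 'END { print NR }')
# Keep only the even-numbered lines.
even=$(printf '%s\n' "$data" | awk 'NR % 2 == 0')

echo "$count_nr $count_manual $nonblank"
echo "$even"
```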
The awk utility reads the input files one line at a
time. For each line, awk tries the patterns of each of the rules.
If several patterns match, then several actions are run in the order in
which they appear in the awk program. If no patterns match, then
no actions are run.
After processing all the rules that match the line (and perhaps there are none),
awk reads the next line. (However,
see The next Statement,
and also see Using gawk's nextfile Statement).
This continues until the program reaches the end of the file.
For example, the following awk program contains two rules:
/12/ { print $0 }
/21/ { print $0 }
The first rule has the string 12 as the
pattern and print $0 as the action. The second rule has the
string 21 as the pattern and also has print $0 as the
action. Each rule's action is enclosed in its own pair of braces.
This program prints every line that contains the string
12 or the string 21. If a line contains both
strings, it is printed twice, once by each rule.
This is what happens if we run this program on our two sample data files,
BBS-list and inventory-shipped:
$ awk '/12/ { print $0 }
> /21/ { print $0 }' BBS-list inventory-shipped
-| aardvark 555-5553 1200/300 B
-| alpo-net 555-3412 2400/1200/300 A
-| barfly 555-7685 1200/300 A
-| bites 555-1675 2400/1200/300 A
-| core 555-2912 1200/300 C
-| fooey 555-1234 2400/1200/300 B
-| foot 555-6699 1200/300 B
-| macfoo 555-6480 1200/300 A
-| sdace 555-3430 2400/1200/300 A
-| sabafoo 555-2127 1200/300 C
-| sabafoo 555-2127 1200/300 C
-| Jan 21 36 64 620
-| Apr 21 70 74 514
Note how the line beginning with sabafoo
in BBS-list was printed twice, once for each rule.
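If printing a line twice is not what you want, one possible approach is to combine the two patterns into a single rule with the || ("or") operator, which has not been covered yet. The sketch below counts the output lines produced for the sabafoo record under each arrangement.

```shell
# The one record that matches both patterns.
line='sabafoo 555-2127 1200/300 C'

# Two rules: each matching rule prints the line once.
two_rules=$(echo "$line" | awk '/12/ { print $0 }
/21/ { print $0 }' | awk 'END { print NR }')

# One rule with both patterns joined by ||: the line is printed once.
one_rule=$(echo "$line" | awk '/12/ || /21/ { print $0 }' | awk 'END { print NR }')

echo "two rules: $two_rules line(s), one rule: $one_rule line(s)"
```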
Now that we've mastered some simple tasks, let's look at
what typical awk
programs do. This example shows how awk can be used to
summarize, select, and rearrange the output of another utility. It uses
features that haven't been covered yet, so don't worry if you don't
understand all the details:
ls -l | awk '$6 == "Nov" { sum += $5 }
END { print sum }'
This command prints the total number of bytes in all the files in the
current directory that were last modified in November (of any year).[9]
The ls -l part of this example is a system command that gives
you a listing of the files in a directory, including each file's size and the date
the file was last modified. Its output looks like this:
-rw-r--r--  1 arnold   user   1933 Nov  7 13:05 Makefile
-rw-r--r--  1 arnold   user  10809 Nov  7 13:03 awk.h
-rw-r--r--  1 arnold   user    983 Apr 13 12:14 awk.tab.h
-rw-r--r--  1 arnold   user  31869 Jun 15 12:20 awk.y
-rw-r--r--  1 arnold   user  22414 Nov  7 13:03 awk1.c
-rw-r--r--  1 arnold   user  37455 Nov  7 13:03 awk2.c
-rw-r--r--  1 arnold   user  27511 Dec  9 13:07 awk3.c
-rw-r--r--  1 arnold   user   7989 Nov  7 13:03 awk4.c
The first field contains read-write permissions, the second field contains the number of links to the file, and the third field identifies the owner of the file. The fourth field identifies the group of the file. The fifth field contains the size of the file in bytes. The sixth, seventh, and eighth fields contain the month, day, and time, respectively, that the file was last modified. Finally, the ninth field contains the name of the file.[10]
The $6 == "Nov" in our awk program is an expression that
tests whether the sixth field of the output from ls -l
matches the string Nov. Each time a line has the string
Nov for its sixth field, the action sum += $5 is
performed. This adds the fifth field (the file's size) to the variable
sum. As a result, when awk has finished reading all the
input lines, sum is the total of the sizes of the files whose
lines matched the pattern. (This works because awk variables
are automatically initialized to zero.)
After the last line of output from ls has been processed, the
END rule executes and prints the value of sum.
In this example, the value of sum is 80600 (the sizes of the five
November files: 1933 + 10809 + 22414 + 37455 + 7989).
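To check the arithmetic without depending on the contents of a real directory, the same awk program can be fed the sample listing directly. This is only a sketch: the listing is copied from the example output shown earlier, and the variable names are arbitrary.

```shell
# The sample ls -l output from the text, stored in a shell variable.
listing='-rw-r--r--  1 arnold   user   1933 Nov  7 13:05 Makefile
-rw-r--r--  1 arnold   user  10809 Nov  7 13:03 awk.h
-rw-r--r--  1 arnold   user    983 Apr 13 12:14 awk.tab.h
-rw-r--r--  1 arnold   user  31869 Jun 15 12:20 awk.y
-rw-r--r--  1 arnold   user  22414 Nov  7 13:03 awk1.c
-rw-r--r--  1 arnold   user  37455 Nov  7 13:03 awk2.c
-rw-r--r--  1 arnold   user  27511 Dec  9 13:07 awk3.c
-rw-r--r--  1 arnold   user   7989 Nov  7 13:03 awk4.c'

# Add up field 5 (size) for lines whose field 6 (month) is Nov.
nov_total=$(printf '%s\n' "$listing" |
    awk '$6 == "Nov" { sum += $5 }
         END { print sum }')

echo "$nov_total"
```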
These more advanced awk techniques are covered in later sections
(see Actions). Before you can move on to more
advanced awk programming, you have to know how awk interprets
your input and displays your output. By manipulating fields and using
print statements, you can produce some very useful and
impressive-looking reports.
awk Statements Versus Lines
Most often, each line in an awk program is a separate statement or
separate rule, like this:
awk '/12/ { print $0 }
/21/ { print $0 }' BBS-list inventory-shipped
However, gawk ignores newlines after any of the following
symbols and keywords:
, { ? : || && do else
A newline at any other point is considered the end of the statement.[11]
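A brief sketch of this rule in practice: the program below breaks lines after {, &&, a comma, and else, none of which ends the statement. The input values are made up, and the sketch deliberately avoids breaking after ? or :, since other awk implementations may not accept that.

```shell
result=$(echo "3 4" | awk '{
    if ($1 > 0 &&
        $2 > 0)
        print $1,
              $2
    else
        print "not positive"
}')

echo "$result"
```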
If you would like to split a single statement into two lines at a point
where a newline would terminate it, you can continue it by ending the
first line with a backslash character (\). The backslash must be
the final character on the line in order to be recognized as a continuation
character. A backslash is allowed anywhere in the statement, even
in the middle of a string or regular expression. For example:
awk '/This regular expression is too long, so continue it\
on the next line/ { print $1 }'
We have generally not used backslash continuation in the sample programs
in this Web page. In gawk, there is no limit on the
length of a line, so backslash continuation is never strictly necessary;
it just makes programs more readable. For this same reason, as well as
for clarity, we have kept most statements short in the sample programs
presented throughout the Web page. Backslash continuation is
most useful when your awk program is in a separate source file
instead of entered from the command line. You should also note that
many awk implementations are more particular about where you
may use backslash continuation. For example, they may not allow you to
split a string constant using backslash continuation. Thus, for maximum
portability of your awk programs, it is best not to split your
lines in the middle of a regular expression or a string.
Caution: Backslash continuation does not work as described
with the C shell. It works for awk programs in files and
for one-shot programs, provided you are using a POSIX-compliant
shell, such as the Unix Bourne shell or bash. But the C shell behaves
differently! There, you must use two backslashes in a row, followed by
a newline. Note also that when using the C shell, every newline
in your awk program must be escaped with a backslash. To illustrate:
% awk 'BEGIN { \
? print \\
? "hello, world" \
? }'
-| hello, world
Here, the % and ? are the C shell's primary and secondary
prompts, analogous to the standard shell's $ and >.
Compare the previous example to how it is done with a POSIX-compliant shell:
$ awk 'BEGIN {
> print \
> "hello, world"
> }'
-| hello, world
awk is a line-oriented language. Each rule's action has to
begin on the same line as the pattern. To have the pattern and action
on separate lines, you must use backslash continuation; there
is no other option.
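The effect of this rule can be seen by comparing a program with and without the backslash. In this sketch, which uses a one-line made-up input, the first program is a single rule; the second, lacking the backslash, is read as two independent rules: a pattern with the default action, followed by an action that runs for every line.

```shell
# Pattern continued onto the next line: one rule.
continued=$(echo "12" | awk '/12/ \
{ print "matched" }')

# No backslash: two rules, so the input line itself is printed as well.
split_rules=$(echo "12" | awk '/12/
{ print "matched" }')

echo "$continued"
echo "$split_rules"
```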
Another thing to keep in mind is that backslash continuation and
comments do not mix. As soon as awk sees the # that
starts a comment, it ignores everything on the rest of the
line. For example:
$ gawk 'BEGIN { print "dont panic" # a friendly \
> BEGIN rule
> }'
error--> gawk: cmd. line:2: BEGIN rule
error--> gawk: cmd. line:2: ^ parse error
In this case, it looks like the backslash would continue the comment onto the
next line. However, the backslash-newline combination is never even
noticed because it is "hidden" inside the comment. Thus, the
BEGIN is noted as a syntax error.
When awk statements within one rule are short, you might want to put
more than one of them on a line. This is accomplished by separating the statements
with a semicolon (;).
This also applies to the rules themselves.
Thus, the program shown at the start of this section
could also be written this way:
/12/ { print $0 } ; /21/ { print $0 }
Note: The requirement that states that rules on the same line must be
separated with a semicolon was not in the original awk
language; it was added for consistency with the treatment of statements
within an action.
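Both uses of the semicolon can be sketched on a three-line made-up input: the first command separates two rules on one line, and the second separates two statements inside a single action (n and total are arbitrary variable names).

```shell
data='12
21
33'

# Two rules on one line, separated by a semicolon.
one_line=$(printf '%s\n' "$data" | awk '/12/ { print $0 } ; /21/ { print $0 }')

# Two statements in one action, separated by a semicolon.
stats=$(printf '%s\n' "$data" | awk '{ n++; total += $1 } END { print n, total }')

echo "$one_line"
echo "$stats"
```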
Other Features of awk
The awk language provides a number of predefined, or
built-in, variables that your programs can use to get information
from awk. There are other variables your program can set
as well to control how awk processes your data.
In addition, awk provides a number of built-in functions for doing
common computational and string-related operations.
gawk provides built-in functions for working with timestamps,
performing bit manipulation, and for runtime string translation.
As we develop our presentation of the awk language, we introduce
most of the variables and many of the functions. They are defined
systematically in Built-in Variables, and
Built-in Functions.
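As a small preview, offered only as a sketch, the following command uses two built-in variables (NR, the record number, and NF, the number of fields) together with two built-in functions (length and toupper) on a made-up two-line input.

```shell
# For each line: record number, field count, line length, first field in uppercase.
report=$(printf 'alpha beta\ngamma\n' |
    awk '{ print NR, NF, length($0), toupper($1) }')

echo "$report"
```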
When to Use awk
Now that you've seen some of what awk can do,
you might wonder how awk could be useful for you. By using
utility programs, advanced patterns, field separators, arithmetic
statements, and other selection criteria, you can produce much more
complex output. The awk language is very useful for producing
reports from large amounts of raw data, such as summarizing information
from the output of other utility programs like ls.
(See A More Complex Example.)
Programs written with awk are usually much smaller than they would
be in other languages. This makes awk