C Tutorial: Playing with strings

Parsing tokens with strtok

The strtok function provides a convenient way of splitting a string into a number of substrings (tokens) that are separated from each other by characters from a set that cannot appear in any token. For example, let's say that you want to extract words that are separated by "-" or "+" characters, such as:

Cut-down---a----tree++with-a-herring

We can use the following code to break it into the component tokens:

/* strtok example */
/* Paul Krzyzanowski */

#include <stdio.h>
#include <string.h>    /* needed for strtok */

int
main(int argc, char **argv)
{
    char text[] = "Cut-down---a----tree++with-a-herring";
    char *t;
    int i;

    t = strtok(text, "-+");
    for (i=0; t != NULL; i++) {
        printf("token %d is \"%s\"\n", i, t);
        t = strtok(NULL, "-+");
    }
}

Save this code in a file named strtok1.c.

Compile this program via:

gcc -o strtok1 strtok1.c

If you don't have gcc, you may need to replace the gcc command with cc or whatever your compiler is called.

Run the program:

./strtok1

Note that the first call to strtok takes the string that we want to tokenize as the first parameter and a string containing the set of characters that are considered token separators as the second parameter. Any character that is in the second string will never be part of a token. There is no provision for quoting or escaping a separator character so that it can be part of a token.

Running this program gives us the following output:

token 0 is "Cut"
token 1 is "down"
token 2 is "a"
token 3 is "tree"
token 4 is "with"
token 5 is "a"
token 6 is "herring"

It is important to note that strtok does not allocate memory and create new strings for each of the tokens it finds. All the data still resides in the original string. On the first call, strtok starts at the beginning of the string; on later calls, it continues from where it left off. In either case it skips separators until it reaches a character that is not a separator. This becomes the start of the next token and will be the return value from strtok. It then skips over the non-separator characters until it gets to the next separator character. That location in memory is overwritten with a 0 byte, which is the C convention for the end of a string. The next time you call strtok, if the first parameter is NULL, it knows to continue from where it left off last time. Hence, strtok mangles the original string. If you need the original string afterward, be sure to make a copy of it before tokenizing (see strdup).
