5.4. Character handling
C is widely used for character and string handling applications. This
is odd, in some ways, because the language doesn't really have any
built-in string handling features. If you're used to languages that know
about string handling, you will almost certainly find C tedious to begin
with.
The standard library contains lots of functions to help with string
processing but the fact remains that it still feels like hard work. To
compare two strings you have to call a function instead of using an
equality operator. There is a bright side to this, though. It means that
the language isn't burdened by having to support string processing
directly, which helps to keep it small and less cluttered. What's more,
once you get your string handling programs working in C, they do tend to
run very quickly.
Character handling in C is done by declaring arrays (or allocating them
dynamically) and moving characters in and out of them ‘by hand’.
Here is an example of a program which reads text a line at a time from
its standard input. If the line consists of the string of characters
stop, it stops; otherwise it prints the length of the line.
It uses a technique which is invariably used in C programs: it reads the
characters into an array and indicates the end of them with an extra
character whose value is explicitly 0 (zero). It uses the library
function strcmp to compare two strings.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LINELNG 100     /* max. length of input line */

int main(void){
        char in_line[LINELNG];
        char *cp;
        int c;

        cp = in_line;
        while((c = getc(stdin)) != EOF){
                if(cp == &in_line[LINELNG-1] || c == '\n'){
                        /*
                         * Insert end-of-line marker
                         */
                        *cp = 0;
                        if(strcmp(in_line, "stop") == 0 )
                                exit(EXIT_SUCCESS);
                        else
                                printf("line was %d characters long\n",
                                        (int)(cp-in_line));
                        cp = in_line;
                }
                else
                        *cp++ = c;
        }
        exit(EXIT_SUCCESS);
}

Example 5.6
Once more, the example illustrates some interesting methods used widely
in C programs. By far the most important is the way that strings are
represented and manipulated.
Here is a possible implementation of a simplified strcmp, called
str_eq here, which compares two strings for equality and returns zero
if they are the same.
The library function actually does a bit more than that, but the added
complication can be ignored for the moment. Notice the use of
const in the argument declarations. This shows that the
function will not modify the contents of the strings, but just inspects
them. The definitions of the standard library functions make extensive
use of this technique.
/*
 * Compare two strings for equality.
 * Return 0 ('false') if they are equal, 1 otherwise.
 */
int
str_eq(const char *s1, const char *s2){
        while(*s1 == *s2){
                /*
                 * At end of string return 0.
                 */
                if(*s1 == 0)
                        return(0);
                s1++; s2++;
        }
        /* Difference detected! */
        return(1);
}

Example 5.7
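The ‘bit more’ that the library function does is to report ordering as well as equality: strcmp returns a value less than, equal to, or greater than zero. A minimal sketch of that behaviour might look like the following. The name str_order is invented for this illustration, and real library implementations are usually more elaborate.

```c
/*
 * str_order is a hypothetical name chosen for this sketch; it is
 * not a library function. It returns a negative value, zero, or
 * a positive value according to whether s1 sorts before, equal
 * to, or after s2.
 */
int
str_order(const char *s1, const char *s2){
        while(*s1 == *s2){
                if(*s1 == 0)
                        return(0);      /* reached both terminators */
                s1++; s2++;
        }
        /* compare the two differing characters as unsigned char */
        return((unsigned char)*s1 - (unsigned char)*s2);
}
```

Note that a shorter string sorts before a longer one that starts with the same characters, because the terminating zero compares lower than any other character.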
5.4.1. Strings
Every C programmer ‘knows’ what a string is. It is an array of
char variables, with the last character in the string
followed by a null. ‘But I thought a string was something in double
quote marks’, you cry. You are right, too. In C, a sequence like
this
"a string"
is really a character array. It's the only example in C where you can
declare something at the point of its use.
Be warned: in Old C, strings were stored just like any other
character array, and were modifiable. Now, the Standard states that
although they are arrays of char (not const char), attempting to
modify them results in undefined behaviour.
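As a small illustration of that warning (the function name is made up for the purpose), the sketch below modifies a private array that was initialized from a string, which is safe, and only reads from the string itself. Declaring the pointer as const char * is a common precaution rather than something the Standard insists on.

```c
#include <string.h>

/*
 * modify_copy is a hypothetical name. It returns 1 if the private
 * copy was changed as expected and the string itself was only read.
 */
int
modify_copy(void){
        char buf[] = "a string";        /* our own array: writable */
        const char *lit = "a string";   /* points at the string itself */

        buf[0] = 'A';                   /* fine, buf belongs to us */
        /* lit[0] = 'A'; would be rejected because of the const */

        return(strcmp(buf, "A string") == 0 && lit[0] == 'a');
}
```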
Whenever a string in quotes is seen, it has two effects: it provides
a declaration and a substitute for a name. It makes a hidden declaration
of a char array, whose contents are initialized to the character values
in the string, followed by a character whose integer value is zero. The
array has no name. So, apart from the name being present, we have
a situation like this:
char secret[9];
secret[0] = 'a';
secret[1] = ' ';
secret[2] = 's';
secret[3] = 't';
secret[4] = 'r';
secret[5] = 'i';
secret[6] = 'n';
secret[7] = 'g';
secret[8] = 0;
an array of characters, terminated by zero, with character values in
it. But when it's declared using the string notation, it hasn't got
a name. How can we use it?
Whenever C sees a quoted string, the presence of the string itself
serves as the name of the hidden array—not only is the string an
implicit sort of declaration, it is as if an array name had been given.
Now, we all remember that the name of an array is equivalent to giving
the address of its first element, so what is the type of this?
"a string"
It's a pointer of course: a pointer to the first element of the hidden
unnamed array, which is of type char, so the pointer is of
type ‘pointer to char’. The situation is shown in
Figure 5.7.
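One way to convince yourself that there really is a hidden array behind the quotes is to ask for its size. Strictly speaking the string has array type and is converted to a pointer in most expressions; sizeof is one of the few places where the conversion doesn't happen. The function names in this sketch are invented for the illustration.

```c
#include <stddef.h>

/* size of the hidden array, including the terminating zero */
size_t
literal_size(void){
        return(sizeof "a string");
}

/* elsewhere the string acts as a pointer to its first character */
char
literal_first(void){
        const char *cp = "a string";

        return(*cp);
}
```

The first function reports 9, not 8, because the zero at the end occupies an element of the array too.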
For proof of that, look at the following program:
#include <stdio.h>
#include <stdlib.h>

int main(void){
        int i;
        char *cp;

        cp = "a string";
        while(*cp != 0){
                putchar(*cp);
                cp++;
        }
        putchar('\n');

        for(i = 0; i < 8; i++)
                putchar("a string"[i]);
        putchar('\n');
        exit(EXIT_SUCCESS);
}

Example 5.8
The first loop sets a pointer to the start of the array, then walks
along until it finds the zero at the end. The second one ‘knows’
about the length of the string and is less useful as a result. Notice
how the first one is independent of the length—that is a most
important point to remember. It's the way that strings are handled in
C almost without exception; it's certainly the format that all of the
library string manipulation functions expect. The zero at the end allows
string processing routines to find out that they have reached the end of
the string. Look back now to the example function
str_eq. The function takes two character pointers as
arguments (so a string would be acceptable as one or both arguments). It
compares them for equality by checking that the strings are
character-for-character the same. If they are the same at any point,
then it checks to make sure it hasn't reached the end of them both with
if(*s1 == 0): if it has, then it returns 0 to show that
they were equal. The test could just as easily have been on
*s2; it wouldn't have made any difference. Otherwise
a difference has been detected, so it returns 1 to indicate failure.
In Example 5.6, strcmp is called with two arguments which
look quite different. One is a character array, the other is a string.
In fact they're the same thing: a character array terminated by zero
(the program is careful to put a zero in the first ‘empty’ element
of in_line), and a string in quotes, which is also
a character array terminated by a zero. Their use as arguments to strcmp
results in character pointers being passed, for the reasons explained to
the point of tedium above.
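The same walk-until-zero idea is all that is needed to measure a string. Here is a sketch of a length function in the style of the library's strlen; the name str_len is hypothetical and the library version should be preferred in real programs.

```c
#include <stddef.h>

/* str_len is a hypothetical name; the library equivalent is strlen */
size_t
str_len(const char *s){
        const char *cp = s;

        while(*cp != 0)         /* walk along until the terminating zero */
                cp++;
        return((size_t)(cp - s));
}
```

Like the first loop in Example 5.8, it needs no advance knowledge of how long the string is.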
5.4.2. Pointers and increment operators
We said that we'd eventually revisit expressions like
(*p)++;
and now it's time. Pointers are used so often to walk down arrays that
it just seems natural to use the ++ and --
operators on them. Here we write zeros into an array:
#define ARLEN 10

int ar[ARLEN], *ip;

ip = ar;
while(ip < &ar[ARLEN])
        *(ip++) = 0;

Example 5.9
The pointer ip is set to the start of the array. While it
remains inside the array, the place that it points to has zero written
into it, then the increment takes effect and the pointer is stepped one
element along the array. The postfix form of ++ is
particularly useful here.
This is very common stuff indeed. In most programs you'll find
pointers and increment operators used together like that, not just once
or twice, but on almost every line (or so it seems while you find them
difficult). What is happening, and what combinations can we get? Well,
the * means indirection, and ++ or
-- mean increment or decrement, in either prefix or postfix form. The
combinations can be pre- or post-increment of either the pointer or the
thing it points to, depending on where the brackets are put.
Table 5.1 gives a list.
Expression | Meaning
++(*p)     | pre-increment thing pointed to
(*p)++     | post-increment thing pointed to
*(p++)     | access via pointer, post-increment pointer
*(++p)     | access via pointer which has already been incremented

Table 5.1. Pointer notation
Read it carefully; make sure that you understand the combinations.
The expressions in the list above can usually be understood after
a bit of head-scratching. Now, given that the precedence of
*, ++ and -- is the same in all
three cases and that they associate right to left, can you work out what
happens if the brackets are removed? Nasty, isn't it?
Table 5.2 shows that there's only one case where the brackets have to
be there.
With parentheses | Without, if possible
++(*p)           | ++*p
(*p)++           | (*p)++
*(p++)           | *p++
*(++p)           | *++p

Table 5.2. More pointer notation
The usual reaction to that horrible sight is to decide that you don't
care that the parentheses can be removed; you will always use
them in your code. That's all very well but the problem is that most
C programmers have learnt the important precedence rules (or at least
learnt the table above) and they very rarely put the
parentheses in. Like them, we don't—so if you want to be able to
read the rest of the examples, you had better learn to read those
expressions with or without parentheses. It'll be worth the effort in
the end.
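If you would like to check your reading of the tables, the following sketch applies each form, written without parentheses wherever that is possible, to a small array and records what each expression yields. The function name and the values 10 and 20 are arbitrary choices for the illustration.

```c
/*
 * pointer_forms is a hypothetical name. Each form is applied to
 * a freshly reset two-element array and its value is recorded.
 */
void
pointer_forms(int result[4]){
        int ar[2];
        int *p;

        ar[0] = 10; ar[1] = 20; p = ar;
        result[0] = ++*p;       /* ar[0] becomes 11, yields 11 */

        ar[0] = 10; ar[1] = 20; p = ar;
        result[1] = (*p)++;     /* yields 10, then ar[0] becomes 11 */

        ar[0] = 10; ar[1] = 20; p = ar;
        result[2] = *p++;       /* yields 10, then p points at ar[1] */

        ar[0] = 10; ar[1] = 20; p = ar;
        result[3] = *++p;       /* p moves to ar[1] first, yields 20 */
}
```

After a call, the four results are 11, 10, 10 and 20 in that order.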
5.4.3. Untyped pointers
In certain cases it's essential to be able to convert pointers from
one type to another. This is always done with the aid of casts, in
expressions like the one below:
(type *) expression
The expression is converted into ‘pointer to
type’, regardless of the expression's previous type. This
is only supposed to be done if you're sure that you know what you're
trying to do. It is not a good idea to do much of it until you have got
plenty of experience. Furthermore, do not assume that the cast
simply suppresses diagnostics of the ‘mismatched pointer’ sort
from your compiler. On several architectures it is necessary to
calculate new values when pointer types are changed.
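As a cautious illustration, this sketch converts a pointer to int into a pointer to char and back again, which is one of the conversions that the Standard guarantees will give back the original pointer. The function name is invented for the example; other conversions can depend on alignment requirements and are best left alone until you know the rules.

```c
/*
 * round_trip is a hypothetical name. It returns 1 if converting
 * a pointer to int into a pointer to char and back again gives
 * a pointer that still refers to the original object.
 */
int
round_trip(void){
        int value = 42;
        int *ip = &value;
        char *cp;
        int *back;

        cp = (char *)ip;        /* pointer to int becomes pointer to char */
        back = (int *)cp;       /* and is converted back again */

        return(back == ip && *back == 42);
}
```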
There are also some occasions when you will want to use
a ‘generic’ pointer. The most common example is the
malloc library function, which is used to allocate storage
for objects that haven't been declared. It is used by telling it how
much storage is wanted: enough for a float, or an array
of int, or whatever. It passes back a pointer to enough
storage, which it allocates in its own mysterious way from a pool of
free storage (the way that it does this is its own business). That
pointer is then cast into the right type. For example, if
a float needs 4 bytes of free store, this is the flavour of
what you would write:
float *fp;
fp = (float *)malloc(4);
The call to malloc finds 4 bytes of store, then the address of that
piece of storage is cast into pointer-to-float and assigned to the
pointer.
What type should malloc be declared to have? The type
must be able to represent every known value of every type of pointer;
there is no guarantee that any of the basic types in C can hold such
a value.
The solution is to use the void * type that we've already
talked about. Here is the last example with a declaration of
malloc:
void *malloc();
float *fp;
fp = (float *)malloc(4);
The rules for assignment of pointers show that there is no need to use
a cast on the return value from malloc, but it is often
done in practice.
Obviously there needs to be a way to find out what value the argument
to malloc should be: it will be different on different
machines, so you can't just use a constant like 4. That is what the
sizeof operator is for.
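Putting that together, a hedged sketch of the earlier allocation using sizeof rather than the constant 4 might look like this. The helper name new_float is hypothetical; the header <stdlib.h> supplies the declaration of malloc, which returns a null pointer when it cannot supply the storage.

```c
#include <stdlib.h>

/*
 * new_float is a hypothetical helper, not a library function.
 * It allocates room for one float, stores a value in it and
 * returns the pointer, or a null pointer if no storage was found.
 */
float *
new_float(float value){
        float *fp;

        fp = malloc(sizeof(float));     /* no cast needed from void * */
        if(fp != NULL)
                *fp = value;
        return(fp);
}
```

The caller is expected to check the result for a null pointer and to hand the storage back with free when it has finished with it.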