# Locally Adjusted Robust Regression

## Abstract

Automatic human age estimation has considerable potential applications in human-computer interaction and multimedia communication. However, the age estimation problem is challenging. We design a locally adjusted robust regressor (LARR) for learning and prediction of human ages. The novel approach reduces age estimation errors significantly compared with all previous methods. Experiments on two aging databases show the success of the proposed method for human age estimation.

## 1. Introduction

Human faces convey a significant amount of nonverbal information for human-to-human communication. Humans have the ability to accurately recognize and interpret faces in real time. Given a captured face image, various attributes can be estimated from it, such as identity, age, gender, expression, and ethnic origin [6]. Facial attributes play a crucial role in real applications including multimedia communication and Human Computer Interaction (HCI). For example, if the user’s age is estimated by a computer, an Age Specific Human Computer Interaction (ASHCI) system may be developed. Such a system could be used for secure internet access control to ensure that young children have no access to internet pages with adult material; a vending machine could refuse to sell alcohol or cigarettes to underage people [19] [9]. In image and video retrieval, users could retrieve their photographs or videos by specifying a required age range [19].

However, automatic age estimation from human face images is a challenging problem. The main difficulty is that different persons age quite differently. The aging process is determined not only by the person’s genes but also by many external factors, such as health, lifestyle, living location, and weather conditions. Males and females may also age differently. Figure 1 shows some face images at different ages.

Figure 1. Face aging: each row shows images of the same individual at different ages.

Surprisingly, there are only a few publications on age estimation in the literature. Existing face-image-based age estimation methods can be divided into three main categories [5]: anthropometric models [17] [22], aging pattern subspace [9], and age regression [18, 19, 24, 23]. The anthropometric model uses the cranio-facial development theory and facial skin wrinkle analysis. The growth-related changes of face shape and texture patterns are measured to categorize a face into several age groups. Such methods are suitable for coarse age estimation rather than continuous or refined classification [17], or for modelling ages of young people only [22]. The AGing pattErn Subspace (AGES) method [9] models a sequence of personal aging face images by learning a subspace representation, in order to handle incomplete data such as missing ages in the sequence. The age of a test face is determined by the projection in the subspace that can best reconstruct the face image. For the regression methods, the regression coefficients are estimated from the training data under an assumed form of the regression function, such as a Quadratic Model (QM) [19]. Yan et al. [24, 23] also dealt with age uncertainty by formulating a semi-definite programming problem [24] or an EM-based algorithm [23]. Fu et al. showed the existence of an aging manifold [5] visualized in face image subspaces [8]. They suggested fusing manifold learning with a quadratic regression model to improve age estimation performance.

The traditional quadratic model [19] for age regression is based on a Least Squares Estimation (LSE) criterion, which is not robust to outliers. The outliers could come from incorrectly labelled ages. In addition, optimization based on the LSE criterion minimizes the empirical risk, which usually does not generalize well, especially when only a small number of training data are available.

In this paper, we propose a novel method for automatic age estimation. Our method is called a Locally Adjusted Robust Regressor (LARR) for learning and prediction of the aging patterns. The advantages of the proposed method will be demonstrated with extensive experiments.

The remainder of the paper is organized as follows. Section 2 introduces the quadratic regression function used in previous approaches. Section 3 introduces the support vector regression method, which we adopt as our robust regressor. A local adjustment of the regression results is presented in Section 4. Experimental evaluations of the proposed approach and comparisons with previous methods are given in Section 5, and the conclusion is provided in Section 6.

## 2. Quadratic Regression

Given the extracted features for each face image, previous work typically uses a regression function, $L = f(x)$, to characterize the relationship between the extracted features, $x$, and the age labels, $L$. A typical choice of the regression function, $f$, is the Quadratic Model (QM) [18],

$$\hat{L} = w_0 + w_1^T x + w_2^T x^2,$$

where $\hat{L}$ is the estimate of the age, $w_0$ is the offset, $x$ and $x^2$ are the extracted feature vector and its square, and $w_1$ and $w_2$ are weight vectors.
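
As a minimal sketch of how the quadratic model is evaluated, the snippet below computes the age estimate for one feature vector. It assumes that the square of the feature vector is taken element-wise; the function name `quadratic_model` and all numeric values are illustrative only and do not come from the paper.

```python
def quadratic_model(x, w0, w1, w2):
    """Evaluate w0 + <w1, x> + <w2, x^2>, with x^2 taken element-wise (assumption)."""
    if not (len(x) == len(w1) == len(w2)):
        raise ValueError("x, w1 and w2 must have the same length")
    linear = sum(wi * xi for wi, xi in zip(w1, x))
    quadratic = sum(wi * xi * xi for wi, xi in zip(w2, x))
    return w0 + linear + quadratic


# Toy two-dimensional feature vector and weights, chosen only for illustration.
x = [1.0, 2.0]
age = quadratic_model(x, w0=20.0, w1=[1.0, 0.5], w2=[0.25, 0.5])
print(age)
```

In practice the weights would be fitted to training data; here they are fixed by hand so that the arithmetic is easy to follow.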

The QM method has been used for age regression in previous approaches [18] [19]. The model parameters are optimized by minimizing the difference between the actual ages of the individuals and the estimated ages. The loss function usually corresponds to a Least Squares Estimation (LSE) criterion. However, the QM method has some disadvantages: (1) age estimation is a complex nonlinear regression problem, especially over a large span of years, e.g., 0-90, and a simple quadratic function may not properly model the complex aging process; (2) least squares estimation is sensitive to outliers that come from incorrect labels introduced when collecting a large image database; and (3) the least squares criterion only minimizes the empirical risk, which may not generalize well to unseen examples, especially with a small number of training examples. Therefore, we need to seek a robust model for the aging patterns.
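
The sensitivity of least squares to a mislabelled age can be illustrated with a toy example. The sketch below is a simplification: it fits a one-dimensional straight line in closed form rather than the QM itself, and the data and the helper name `fit_line_least_squares` are made up for illustration.

```python
def fit_line_least_squares(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


feature = [0.0, 1.0, 2.0, 3.0, 4.0]
clean_ages = [10.0, 20.0, 30.0, 40.0, 50.0]
# Same data, but the last label is mistyped as 5 instead of 50.
noisy_ages = [10.0, 20.0, 30.0, 40.0, 5.0]

clean_fit = fit_line_least_squares(feature, clean_ages)
noisy_fit = fit_line_least_squares(feature, noisy_ages)
print("clean fit (slope, intercept):", clean_fit)
print("noisy fit (slope, intercept):", noisy_fit)
```

With these toy numbers, a single wrong label changes the fitted slope from 10 to 1, because the squared loss lets the one large residual dominate the fit.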

For the purpose of robust aging regression, we adopt the Support Vector Regression (SVR) method [25]. SVR has the potential to address the three limitations of the traditional quadratic regression model.

## 3. Support Vector Regression

The basic idea of SVR is to find a function $f(x)$ that has at most ε deviation from the actually obtained targets $y_i$ for the training data $x_i$, and at the same time is as flat as possible. In other words, we do not care about errors as long as they are less than ε. In comparison with the conventional quadratic loss function shown in Figure 2(a), the ε-insensitive loss function of SVR is shown in Figure 2(b).

Figure 2. (a) quadratic regression loss function; (b) ε-insensitive loss function.
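
To make the contrast in Figure 2 concrete, the following sketch evaluates both losses on a few residuals. It uses the standard definition of the ε-insensitive loss (zero inside the ε tube, growing linearly outside it); the function names and the sample residuals are illustrative assumptions, while ε = 0.02 is the value reported in Section 5.

```python
def quadratic_loss(residual):
    """Squared-error loss used by least-squares regression."""
    return residual ** 2


def eps_insensitive_loss(residual, eps):
    """Zero for |residual| <= eps, then grows linearly with |residual|."""
    return max(0.0, abs(residual) - eps)


# Residuals in years; 40.0 mimics a badly mislabelled sample.
residuals = [0.01, 0.5, 5.0, 40.0]
for r in residuals:
    print(r, quadratic_loss(r), eps_insensitive_loss(r, eps=0.02))
```

The large residual contributes 1600 to the quadratic loss but only about 40 to the ε-insensitive loss, which is one way to see why the latter is less affected by outliers.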

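Consider the problem of approximating the set of data $\{(x_1, y_1), \ldots, (x_\ell, y_\ell)\}$ with a linear function,

$$f(x) = \langle w, x \rangle + b. \tag{1}$$

The optimal regression function [25] is given by

$$\min_{w,\, b,\, \xi,\, \xi^*} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{\ell} (\xi_i + \xi_i^*), \tag{2}$$

subject to $y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i$, $\langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^*$, and $\xi_i, \xi_i^* \ge 0$, where the constant $C > 0$ determines the trade-off between the flatness of $f$ and data deviations, and $\xi_i, \xi_i^*$ are slack variables to cope with otherwise infeasible constraints on the optimization problem of (2). The ε-insensitive loss function as shown in Figure 2(b) is

$$|\xi|_\varepsilon = \begin{cases} 0 & \text{if } |\xi| \le \varepsilon, \\ |\xi| - \varepsilon & \text{otherwise.} \end{cases} \tag{3}$$

The primal problem of (2) can be solved more efficiently in its dual formulation [25], resulting in the final solution given by

$$w = \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*)\, x_i, \tag{4}$$

where $\alpha_i$ and $\alpha_i^*$ are Lagrange multipliers. The value of $b$ in Eq. (1) can be determined by plugging Eq. (4) into Eq. (1) [10].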

A nonlinear regression function can be obtained by using kernels, in the same manner as a nonlinear Support Vector Machine (SVM) for classification [25]. Different kernels, such as polynomial, sigmoid, or Gaussian radial basis functions, can be used depending on the task. For our robust age regression, the Gaussian radial basis function kernel was adopted. A radial basis function is

$$k(x, x') = \exp\left(-\gamma \|x - x'\|^2\right),$$

where γ is a constant to adjust the width of the Gaussian function. Given the kernel mapping, the solution of the nonlinear SVR is obtained as [25],

$$f(x) = \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*)\, k(x_i, x) + b,$$

where the sum runs over the $\ell$ training samples and $\alpha_i$, $\alpha_i^*$ are the Lagrange multipliers of the dual problem.
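
The kernel expansion can be sketched in a few lines. The snippet below is an illustration under stated assumptions, not the authors' implementation: it assumes the standard Gaussian RBF kernel exp(-γ‖x − z‖²) and a prediction of the form "bias plus weighted kernel values", where each weight stands for the difference of a pair of Lagrange multipliers. The support vectors, coefficients, bias, and γ are made-up values; in a real system they would come from training.

```python
import math


def rbf_kernel(x, z, gamma):
    """Gaussian radial basis function kernel exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def svr_predict(x, support_vectors, dual_coefs, b, gamma):
    """Kernel SVR prediction; dual_coefs[i] plays the role of (alpha_i - alpha_i^*)."""
    return b + sum(
        c * rbf_kernel(sv, x, gamma) for sv, c in zip(support_vectors, dual_coefs)
    )


# Hypothetical trained quantities, chosen only for illustration.
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
dual_coefs = [2.0, -1.0]
pred = svr_predict([0.0, 0.0], support_vectors, dual_coefs, b=30.0, gamma=0.5)
print(pred)
```

A larger γ narrows the Gaussian, so each support vector influences only nearby inputs; this is the width parameter that Section 5 says is selected on a validation set.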

## 4. Locally Adjusted Robust Regression

Is SVR alone “good” enough for human age prediction? To answer this question, let us look at an estimation result using the SVR. Figure 3 shows the predicted ages (squares) with respect to the ground truth ages (circles). Note that this is not a regression curve. One thousand data points are sorted in ascending order of the ground truth ages, i.e., from 0 to 91 years for females. The predicted ages are obtained from the SVR method. From this figure, we observe that the SVR method can estimate the global age trend, but cannot predict the ages precisely. For example, the SVR predictions are too high for many younger people and too low for some older people. In some cases, the estimated age can be far from the true age, e.g., by more than 40 years.

Figure 3. The true ages (circles) versus the estimated ages (squares) for 1,000 females.

Why does the SVR method not perform as well as expected for age prediction? There are two reasons. First, the problem of age prediction is really challenging because of the diversity of aging variation. Second, the SVR method attempts to find a flat curve within a small ε tube to approximate the data in order to obtain good generalization capability, but the age data may be distributed like the (green) irregular curve shown in Figure 4. One cannot expect the SVR to estimate an irregular curve like this because of the over-fitting problem. Further, one cannot assign a large ε to enclose all true data points inside the ε tube, as demonstrated in motion estimation [12]. So how can we model the aging function while allowing for the irregular distribution of true ages?

### 4.1. Local Adjustment

One feasible solution is to adjust the age regression values locally so that the estimated age values can be “dragged” towards the true ages. We call this a Locally Adjusted Robust Regressor (LARR). The idea of LARR is illustrated in Figure 4. Suppose the age value predicted by SVR is $f(x)$, corresponding to the input data $x$. The point $f(x)$ is displayed as the black dot on the regression curve. The estimated age, $f(x)$, may be far away from the true age value, $L$, shown as the red dot on the true age trajectory curve. The idea of the LARR method is to slide the estimated value, $f(x)$, up and down (corresponding to greater and smaller age values) by checking different age values, $t \in [f(x) - d, f(x) + d]$, to see if it can come up with a better age estimate. The value $d$ indicates the range of ages for local search. Hopefully the true age value, $L$, is also within this range, i.e., $L \in [f(x) - d, f(x) + d]$.

Figure 4. Illustration of the LARR idea.

Therefore the LARR method is a two-step procedure: (1) a robust regression over all ages of the training data using the SVR method, which can be considered a global regression process; and (2) a local adjustment within a limited range of ages centered at the regression result.
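
The two-step procedure can be sketched as follows. This is a minimal illustration rather than the authors' code: `global_regressor` and `local_classifier` are hypothetical callables standing in for the trained SVR and the local SVM-based adjustment, the global estimate is assumed to be rounded to the nearest integer age label, and the candidate window is assumed to be clipped to the 0-93 age range reported for the UIUC-IFP-Y database.

```python
def candidate_ages(global_estimate, d, min_age=0, max_age=93):
    """Integer age labels in [f(x) - d, f(x) + d], clipped to the valid age range."""
    center = int(round(global_estimate))
    low = max(min_age, center - d)
    high = min(max_age, center + d)
    return list(range(low, high + 1))


def larr_predict(x, global_regressor, local_classifier, d):
    """Step 1: global regression. Step 2: local adjustment among nearby age labels."""
    f_x = global_regressor(x)
    return local_classifier(x, candidate_ages(f_x, d))


# Toy stand-ins: the "feature" x[0] directly encodes an age of 27, the global
# regressor always answers 30.4, and the local classifier picks the closest label.
toy_regressor = lambda x: 30.4
toy_classifier = lambda x, labels: min(labels, key=lambda t: abs(t - x[0]))

wide = larr_predict([27.0], toy_regressor, toy_classifier, d=4)
narrow = larr_predict([27.0], toy_regressor, toy_classifier, d=2)
print(wide, narrow)
```

With d = 4 the window 26..34 contains the toy true age, so the adjustment can reach it; with d = 2 the window 28..32 does not, and the best the local step can do is 28. This mirrors the remark that the true age should hopefully fall inside the search range.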

Now the key issue is how to verify different age values within a specified range for the purpose of local adjustment. Remember that our goal is to “drag” the age value initially estimated by the global regressor, $f(x)$, as close as possible to the true age, $L$. We take a classification approach to locally adjust or verify different ages, considering each age label as one class. There are many possible choices of classifier for this classification-based local adjustment, but here we adopt a linear Support Vector Machine (SVM) [25]. The main reason is that the SVM can learn a classifier from a small number of training examples, which has been demonstrated successfully in problems such as facial expression recognition [11], audio classification [14], and image retrieval [13].

### 4.2. Limited Binary Tree Search

The classical SVM deals with the two-class classification problem. There are three typical ways to extend it to multi-class classification: (1) learning classifiers for each pair of classes and performing a binary tree search in testing [20] [15]; (2) training SVMs for each class against all the remaining classes; and (3) training SVMs for all classes simultaneously. However, the last two schemes are not appropriate for our purpose, because only a subset of the age classes is involved in each local adjustment. If the last two schemes were used, the SVMs would have to be re-trained dynamically for each adjustment, which is computationally expensive. In the first scheme, in contrast, there is no need to re-train the SVMs: all pair-wise SVM classifiers can be trained off-line, and only a limited number of classes are involved in the binary tree search in testing.

In general, the number of pair-wise comparisons is $n - 1$ for each test in an $n$-class classification problem [20] [15]. In our age adjustment, the number of pair-wise comparisons is limited to $m - 1$ when only $m$ classes are involved in each local adjustment, with $m < n$.
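
One plausible reading of the limited binary tree search is a bottom-up tournament over the m candidate classes, sketched below. The function `pairwise_winner` is a hypothetical stand-in for a lookup into the table of pair-wise SVM classifiers trained off-line; here it is replaced by a toy rule so that the example runs on its own.

```python
def limited_tree_search(x, candidates, pairwise_winner):
    """Bottom-up tournament over candidate labels.

    Returns (winning label, number of pair-wise comparisons performed).
    """
    if not candidates:
        raise ValueError("need at least one candidate class")
    remaining = list(candidates)
    comparisons = 0
    while len(remaining) > 1:
        next_round = []
        for i in range(0, len(remaining) - 1, 2):
            next_round.append(pairwise_winner(x, remaining[i], remaining[i + 1]))
            comparisons += 1
        if len(remaining) % 2 == 1:
            next_round.append(remaining[-1])  # odd one out advances without a comparison
        remaining = next_round
    return remaining[0], comparisons


# Toy pair-wise rule: prefer the label closer to the toy feature x[0].
toy_pairwise = lambda x, a, b: a if abs(a - x[0]) <= abs(b - x[0]) else b

label8, count8 = limited_tree_search([27.0], list(range(24, 32)), toy_pairwise)
label9, count9 = limited_tree_search([27.0], list(range(26, 35)), toy_pairwise)
print(label8, count8)
print(label9, count9)
```

Each comparison eliminates exactly one class, so m candidates always cost m − 1 comparisons regardless of how the tree is arranged; when m is a power of two the tree is perfectly balanced with log2(m) rounds.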

## 5. Experiments

We perform age estimation experiments on a large database, the UIUC-IFP-Y Aging Database. The database contains 8,000 high-resolution RGB color face images captured from 1,600 different voluntary Asian subjects, 800 females and 800 males, in the age range from 0 to 93 years. Each subject has 5 near-frontal images with provided ground truth ages. To our knowledge, this human age database is the largest one ever reported.

For age estimation, we first use a face detector to find the face area in each image, and label the eye corner locations of each face. Based on the face and eye corner locations, the face images are cropped, scaled, and transformed to 60×60 gray-level patches. The gray-level values of each face image are normalized to reduce the effect of outdoor illumination changes. The face image patches are fed into a manifold learning module. We use the orthogonal locality preserving projections [2] method for age manifold learning, similar to the approach in [5]. Each face image can be projected onto the age manifold to extract a feature vector. We use the first 150 features for each face image. The system then learns a robust regression function using the kernel SVR method for females and males separately. A small ε value was chosen, ε = 0.02, for the loss function in Eq. (3). In SVR learning, the parameters C and γ are determined on a validation set. Different ranges such as 4, 8, 16, 32, and 64 were tried for the local adjustment of the global regression results, and compared to see the effect of local adjustment. Powers of two were chosen to simplify the binary search structure. Pair-wise linear SVM classifiers were used for the local adjustment, centered at the age value (or label) obtained from the global regressor.

To evaluate the accuracy of our algorithms for age estimation on the UIUC-IFP-Y age database, we perform a standard 4-fold cross validation test. The test was executed on the female and male subsets separately, because we found that females and males age quite differently in the database. The performance of age estimation is measured by two different measures: the Mean Absolute Error (MAE) and the Cumulative Score (CS). The MAE is defined as the average of the absolute errors between the estimated ages and the ground truth ages,

$$\mathrm{MAE} = \frac{1}{N} \sum_{k=1}^{N} |\hat{l}_k - l_k|,$$

where $l_k$ is the ground truth age for the test image $k$, $\hat{l}_k$ is the estimated age, and $N$ is the total number of test images. The MAE measure was used previously in [19] [18] [9] [24]. The cumulative score [9] is defined as

$$\mathrm{CS}(j) = \frac{N_{e \le j}}{N} \times 100\%,$$

where $N_{e \le j}$ is the number of test images on which the age estimation makes an absolute error no higher than $j$ years.
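
Both measures are straightforward to compute. The sketch below follows the verbal definitions given here (average absolute error, and percentage of test images with absolute error no higher than j years); the function names and the four toy ages are illustrative assumptions, not data from the experiments.

```python
def mean_absolute_error(true_ages, est_ages):
    """Average absolute difference between estimated and ground truth ages."""
    if len(true_ages) != len(est_ages) or not true_ages:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(e - t) for t, e in zip(true_ages, est_ages)) / len(true_ages)


def cumulative_score(true_ages, est_ages, j):
    """Percentage of test images whose absolute error is no higher than j years."""
    if len(true_ages) != len(est_ages) or not true_ages:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(1 for t, e in zip(true_ages, est_ages) if abs(e - t) <= j)
    return 100.0 * hits / len(true_ages)


# Toy ground truth and estimates; absolute errors are 2, 0, 4 and 10 years.
true_ages = [5, 20, 34, 60]
est_ages = [7, 20, 30, 70]
mae = mean_absolute_error(true_ages, est_ages)
cs5 = cumulative_score(true_ages, est_ages, 5)
print(mae, cs5)
```

The MAE summarizes the errors in a single number, whereas the CS curve over j shows how the errors are distributed, which is why the two measures are reported together.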

Experimental results are shown in Table 1 and Figure 5 (a) and (b). The first and second columns in Table 1 show the MAEs for females and males in the UIUC-IFP-Y age database, respectively. Different ranges, e.g., 4, 8, 16, 32, and 64, were tried for local adjustment of the global regression results. One can see that the local adjustment does reduce the errors of the global regression, and that different ranges of adjustment yield different MAEs. For comparison, we also show the results using purely the SVM classifiers in the last row. The best LARR result in terms of MAE is 5.25 years for females when the local search range is 64 classes, while it is 5.30 years for males when the adjustment range is 32 classes. The best range of local adjustment depends on the data and the global regression results.

The CS measures are shown in Figure 5 (a) and (b) for females and males, respectively. One can observe that the LARR methods (with different ranges for local adjustment) improve the score significantly over the pure regression method, especially at lower error levels, e.g., below 10 years. We do not show the cumulative scores for 4 and 8 classes here to avoid cluttering the figures. Those two CS curves are even lower than the curve for 16 classes.

We also compare our results with all previous methods reported on the UIUC-IFP-Y age database. As shown in Table 2, our LARR method has MAEs of 5.25 and 5.30 years for females and males, respectively, which are clearly smaller than all previous results. Our method brings about a 24% reduction in MAE over the best result of previous approaches. The comparisons of cumulative scores are shown in Figure 6 (a) and (b).

There is a publicly available age database, FG-NET [1]. In this database, the ages range from 0 to 69 years, and each face image has 68 labelled points characterizing shape features. The shape features can be combined with appearance features to form a face representation, called Active Appearance Models (AAMs) [3]. The AAMs use 200 parameters to model each face for the purpose of age estimation [9] [24] [23]. To evaluate our LARR method on FG-NET, we use the same AAM features as in [9] [24] [23].

A test strategy called Leave-One-Person-Out (LOPO) is usually adopted for the FG-NET age database [9] [24] [23]. We follow the same strategy and compare our results with previous ones. The experimental results are shown in the third column of Tables 1 and 2. One can see that our LARR method has an MAE of 5.07 years, which is lower than all previous methods listed in Table 2. Our best MAE was obtained using either 4 or 8 classes for local adjustment, as shown in Table 1. Increasing the local search range for the LARR method makes the errors larger.
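
The Leave-One-Person-Out protocol mentioned above can be sketched as a simple index split: in each fold, all images of one person form the test set and the images of everyone else form the training set. The helper name and the toy identity labels below are assumptions made for illustration.

```python
def leave_one_person_out(person_ids):
    """Yield (held_out_person, train_indices, test_indices) for each distinct person."""
    for person in sorted(set(person_ids)):
        test_idx = [i for i, p in enumerate(person_ids) if p == person]
        train_idx = [i for i, p in enumerate(person_ids) if p != person]
        yield person, train_idx, test_idx


# One identity label per image; person "C" has three images, "B" only one.
person_ids = ["A", "A", "B", "C", "C", "C"]
folds = list(leave_one_person_out(person_ids))
for person, train_idx, test_idx in folds:
    print(person, train_idx, test_idx)
```

Splitting by person rather than by image keeps images of the test subject out of the training set, so the reported error reflects performance on unseen individuals.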

The cumulative scores of our LARR method on the FG-NET database are shown in Figure 5 (c). LARR8 means using 8 classes for local adjustment. The cumulative scores of the pure SVM are much lower than those of the pure SVR at most error levels, which indirectly indicates the significance of constraining the SVM search to a local range. The cumulative score comparisons are shown in Figure 6 (c). Our LARR method performs much better than the QM and MLP methods. The RUN1 method [24] is close to our LARR at low age error levels, but worse than LARR at high levels. In contrast, the RUN2 method [23] is close to our LARR at high age error levels, but worse than LARR at low error levels. Overall, our LARR method has higher accuracy than both RUN1 and RUN2 on the FG-NET database.

From the experimental evaluations, we conclude that (1) the LARR method gives better age estimation than pure robust regression by SVR or pure classification by SVM, and (2) the LARR method has lower errors than all the state-of-the-art approaches to age estimation.

## 6. Conclusion

We have presented a new approach to automatic age estimation. The Support Vector Machine (SVM) and Support Vector Regression (SVR) methods are investigated for age prediction for the first time. A Locally Adjusted Robust Regressor (LARR) was designed to estimate ages with high performance. Experimental evaluations on a large age database and the publicly available FG-NET database show that our LARR method performs better than all the state-of-the-art approaches. We expect to see more applications of the LARR method to other challenging real problems.