
Towards modelling the Semantics of Natural Human Body Movement

Hayley Hung
http://www.dcs.qmul.ac.uk/~hswh/report/

A method of feature extraction using temporal frame differencing and motion moments provides the input to a trajectory-based matching framework. Six different gestures performed by 9 subjects are matched against a model trained on six other subjects. Tests are carried out to ascertain the overall performance of the condensation algorithm by varying parameters within the algorithm. Results are analysed based on the performance of each gesture and subject. From this experiment, extrapolations to further studies in the area of human behaviour analysis are made.

• Human body movement can be categorised into:
– Gait or posture: usually an unconscious form of body movement, which can be observed when a person is walking.
– Actions: body movements that consciously interact with objects.
– Gesture: a subconscious communicative form of movement that aids a person's ability to communicate.
– Sign language: a conscious form of communicative language between people.

Temporal information about a gesture is important since it indicates where a gesture begins and ends. Context can come from preceding and/or succeeding gestures, but also from interaction with objects or other people in an environment.

• This report concentrates on the application of the condensation algorithm to gesture recognition.
• The algorithm performs an exhaustive search of the search space and is therefore more likely to find the best global match between an observed gesture and one in the training set.
• It is able to propagate multiple probable states, hence allowing for ambiguity in the performance of a gesture.
• This method is advantageous in that the training data is not processed before it is used for inference.
• Inference occurs through Dynamic Time Warping (DTW) of a set of training data over a particular time interval. These warped trajectories are matched against an observed gesture for the most likely classification.
• The heuristic of this search algorithm is the conditional density of a particular gesture over a large number of samples.

• A basic feature extraction technique has been used to represent the motion of the gestures: temporal frame differencing.
• Due to the limitations of this technique, all test subjects are required to perform the gestures facing the camera. The algorithm has been implemented using MATLAB.
• Subjects are also asked to minimise movement from the rest of the body as the gestures are performed with either one or both hands.
• In this application, 6 gestures have been taken from 18 different subjects, each sitting facing the camera.

Feature extraction

• The simplest form of feature extraction, through motion moments, was used.
• Motion moments are obtained using two-frame temporal differencing. This involves taking the difference between consecutive frames and thresholding it to create a binary image.
• Features extracted from the binary image (sketched in code below):
– Motion area
– Centroid coordinates
– Displacement of centroid coordinates
– Elongation
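A minimal sketch of this feature extraction step, assuming NumPy greyscale frames and an illustrative difference threshold. The report does not give its exact elongation formula; the version below uses second-order central moments, one common definition:

```python
import numpy as np

def motion_features(prev_frame, curr_frame, thresh=15):
    """Two-frame temporal differencing followed by motion moments.

    prev_frame, curr_frame: 2-D greyscale images as NumPy arrays.
    Returns (area, centroid, elongation); the displacement feature
    follows by differencing the centroid between consecutive frames.
    """
    # Binary motion image: pixels whose intensity changed by more than thresh.
    diff = np.abs(curr_frame.astype(float) - prev_frame.astype(float))
    binary = diff > thresh

    # Motion area: zeroth-order moment (number of moving pixels).
    area = binary.sum()
    if area == 0:
        return 0.0, (0.0, 0.0), 0.0

    # Centroid coordinates: first-order moments normalised by the area.
    ys, xs = np.nonzero(binary)
    cx, cy = xs.mean(), ys.mean()

    # Elongation from second-order central moments: the ratio of the
    # principal axes of the motion blob.
    mu20 = ((xs - cx) ** 2).mean()
    mu02 = ((ys - cy) ** 2).mean()
    mu11 = ((xs - cx) * (ys - cy)).mean()
    common = np.sqrt(4 * mu11 ** 2 + (mu20 - mu02) ** 2)
    elongation = (mu20 + mu02 + common) / max(mu20 + mu02 - common, 1e-9)

    return float(area), (float(cx), float(cy)), float(elongation)
```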

Condensation algorithm

• Since the advantage of the condensation algorithm is that it can propagate the probability densities of many states in the search space, it is particularly useful for multiple-target tracking problems as well as gesture recognition.
• Though the technique used here treats the motion of the gesture as one motion trajectory, it is quite possible for the condensation algorithm to track both hands separately, for example.
• Dynamic Time Warping allows generalisation of a particular gesture by distorting each feature trajectory with three parameters, $\alpha$, $\rho$, and $\phi$, which represent amplitude, rate, and phase adjustments. Each state in the search space contains values for these three variables, as well as a variable $\mu$ to represent the model or gesture; $\mu$ takes a different value for each gesture. A state at time $t$ is defined as $s_t = (\mu_t, \phi_t, \alpha_t, \rho_t)$.
• Essentially, the condensation algorithm consists of four basic steps: initialisation, selection, prediction, and updating, as shown in Figure 3.1.
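As a sketch, the state representation and its uniform initialisation might look like the following; the parameter ranges here are illustrative placeholders, not the report's values:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_states(S=1000, n_models=6, alpha_range=(0.8, 1.2),
                rho_range=(0.8, 1.2), max_phase=30):
    """Initialise S condensation states s = (mu, phi, alpha, rho)
    by sampling uniformly over each parameter's range."""
    mu = rng.integers(0, n_models, size=S)     # model / gesture index
    phi = rng.uniform(0, max_phase, size=S)    # phase into the model
    alpha = rng.uniform(*alpha_range, size=S)  # amplitude scaling
    rho = rng.uniform(*rho_range, size=S)      # rate (time scaling)
    return np.stack([mu, phi, alpha, rho], axis=1)
```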

Figure 3.1: High level block diagram of the condensation algorithm.

Figure 3.2: Flow Diagram of the matching and selection processes of the condensation algorithm.

• Each trajectory described in this figure represents the variation of one particular feature over time.

• The training data was treated as a vector of $N$ values, $\mathbf{m}^{(\mu)} = (m^{(\mu)}_1, \ldots, m^{(\mu)}_N)$, where $\mu$ indexes the model or gesture.
• The search space was firstly initialised by choosing $S$ sample states (typically of the order of 1000). This produced a set of samples $\{s^{(1)}_t, \ldots, s^{(S)}_t\}$.
• The purpose of the algorithm is to find the most likely state that creates the best match for the input or observation data.
• The observation vector for a particular trajectory $i$ (or each variable of the feature set) is $\mathbf{z}_i = (z_{i,1}, \ldots, z_{i,t})$.

• To find likelihoods for each state, DTW, according to the state parameters, must be performed on the model data. The probability of the observation given the state is given by:

$$p(\mathbf{z}_t \mid s_t) = \prod_{i} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\!\left(-\frac{d_i(s_t)}{2\sigma_i^2}\right)$$

• where

$$d_i(s_t) = \frac{1}{w} \sum_{j=0}^{w-1} \left( z_{i,t-j} - \alpha_t\, m^{(\mu_t)}_i(\phi_t - j\rho_t) \right)^2 \qquad (3.8)$$

• and where $w$ is the size of the temporal window over which matching from $t$ backwards to $t-w+1$ occurs, and $\sigma_i$ are estimates of the standard deviation of each of the trajectories $i$ over the whole sequence.

• Equation (3.8) represents the mean distance between the test gesture and a DTW'd model over a $w$-sized window of the trajectories.
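A sketch of this likelihood computation, assuming the models are stored as one 1-D array per feature trajectory and using linear interpolation for the warped model index; all names are illustrative:

```python
import numpy as np

def state_likelihood(state, z, models, sigma, t, w=10):
    """p(z_t | s_t): match a w-frame window of the observation ending at
    frame t (t >= w-1 assumed) against the DTW'd model, per equation (3.8).

    z: observation array of shape (n_features, T).
    models[mu]: list of 1-D model trajectories, one per feature i.
    sigma: per-trajectory standard deviation estimates.
    """
    mu, phi, alpha, rho = int(state[0]), state[1], state[2], state[3]
    p = 1.0
    for i, model_traj in enumerate(models[mu]):
        sq_dist = 0.0
        for j in range(w):
            # Warped model index: shifted by phi, time-scaled by rho.
            idx = np.clip(phi - j * rho, 0, len(model_traj) - 1)
            warped = alpha * np.interp(idx, np.arange(len(model_traj)),
                                       model_traj)
            sq_dist += (z[i, t - j] - warped) ** 2
        d = sq_dist / w  # mean squared distance over the window
        p *= np.exp(-d / (2 * sigma[i] ** 2)) / (np.sqrt(2 * np.pi) * sigma[i])
    return p
```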

• The term $\alpha_t\, m^{(\mu_t)}_i(\phi_t - j\rho_t)$ performs the dynamic time warping of the trajectory $i$ from the training data: the model number is $\mu_t$, the trajectory is shifted by $\phi_t$, interpolated by $\rho_t$, and scaled by $\alpha_t$.
• Using the $S$ values of $p(\mathbf{z}_t \mid s_t)$, it is possible to create a probability distribution over the whole search space at one time instant.
• Each conditional probability acts as a weighting for its corresponding state and, with successive iterations, the distribution of the states in the search space clusters around areas which represent the more likely gestures.
• The weights, or normalised probabilities, are calculated as follows:

$$\pi^{(n)}_t = \frac{p\left(\mathbf{z}_t \mid s^{(n)}_t\right)}{\sum_{k=1}^{S} p\left(\mathbf{z}_t \mid s^{(k)}_t\right)}$$
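The normalisation itself is then a single step; a sketch:

```python
import numpy as np

def normalise_weights(likelihoods):
    """Turn raw likelihoods p(z_t | s_t) into weights pi_t summing to one."""
    total = likelihoods.sum()
    if total == 0:  # degenerate case: fall back to a uniform distribution
        return np.full(len(likelihoods), 1.0 / len(likelihoods))
    return likelihoods / total
```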

• From these weights, it is possible to predict the probability distribution over the search space at the next time instant. Thus, more probable states are more likely to be propagated over the total time of the observed sequence.
• It is emphasised here that more than one probable state can be propagated at each time instant.
• The sample set is first initialised by sampling uniformly over each parameter's range for every sample; $S$ samples are initialised, where $S$ is chosen to be 1000.
• Once the samples have been initialised, the states to be propagated must be chosen. This is done by constructing a cumulative probability distribution using the weights, as shown in Figure 3.2.

• A value $r \in [0, 1]$ is chosen uniformly, and then the smallest index $j$ of the cumulative weight distribution is chosen such that the cumulative weight $c^{(j)}_{t-1} \geq r$,
• where $(t-1)$ represents the current time frame and $t$ indexes the samples and weights of the next time frame that is being predicted.
• The corresponding state is then selected for propagation. With this method of selection, larger weights are more likely to be chosen, so the ordering of the cumulative weight distribution is irrelevant.
• To avoid getting trapped in local minima or maxima, 5% to 10% of the sample set are randomly chosen and reinitialised, as described above.
• After states have been selected for propagation, the parameters for each state at the next time step are predicted using the following equations:

$$\mu_t = \mu_{t-1}, \qquad \phi_t = \phi_{t-1} + \rho_{t-1} + \mathcal{N}(0, \sigma_\phi), \qquad \alpha_t = \alpha_{t-1} + \mathcal{N}(0, \sigma_\alpha), \qquad \rho_t = \rho_{t-1} + \mathcal{N}(0, \sigma_\rho)$$
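A sketch of the selection and prediction steps, reusing the hypothetical init_states from the earlier sketch for the random restarts; the diffusion noise scales are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def select_and_predict(states, weights, frac_random=0.1,
                       noise=(0.5, 0.05, 0.05)):
    """Resample states via the cumulative weights, then diffuse them."""
    S = len(states)
    # Selection: draw r uniformly; searchsorted finds the smallest index
    # whose cumulative weight is >= r (a binary search).
    cum = np.cumsum(weights)
    picks = np.searchsorted(cum, rng.uniform(size=S))
    new = states[np.minimum(picks, S - 1)].copy()

    # Prediction: mu is kept; phi advances by rho; phi, alpha, and rho
    # each receive Gaussian diffusion noise.
    s_phi, s_alpha, s_rho = noise
    new[:, 1] += new[:, 3] + rng.normal(0, s_phi, S)  # phase
    new[:, 2] += rng.normal(0, s_alpha, S)            # amplitude
    new[:, 3] += rng.normal(0, s_rho, S)              # rate

    # Random restarts: reinitialise a fraction to escape local maxima.
    k = int(frac_random * S)
    new[rng.choice(S, k, replace=False)] = init_states(S=k)
    return new
```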

• After this stage, the new state is evaluated by computing its conditional probability $p(\mathbf{z}_t \mid s_t)$.
• If the conditional probability is effectively zero, then the state is predicted again using the above equations and the probability is recalculated.
• If this process needs to be repeated more than a predetermined number of times, the state is deemed unlikely and is reinitialised using the random initialisation described previously. The number of `tries' is a pre-determined amount, and this was chosen to be 100.
• Once all $S$ new states have been generated, the normalised weights $\pi^{(n)}_t$ are recalculated for state selection and propagation at the next time instant.
• The process of selection, prediction, and update is repeated until the end of the observed sequence is reached or the gesture is considered recognised.
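Tying the steps together, one iteration with the retry rule might look like the following sketch. The report's test for a vanishing conditional probability is approximated here by a small threshold min_p, and the helper functions are the hypothetical ones sketched above:

```python
import numpy as np

rng = np.random.default_rng(0)

def condensation_step(states, z, models, sigma, t, w=10, max_tries=100,
                      min_p=1e-12):
    """One iteration: evaluate every state, re-predicting or reinitialising
    states whose likelihood is negligible, then resample for the next frame."""
    S = len(states)
    likelihoods = np.empty(S)
    for n in range(S):
        p = state_likelihood(states[n], z, models, sigma, t, w)
        tries = 0
        while p < min_p and tries < max_tries:
            # Re-predict: jitter phi, alpha, rho and re-evaluate.
            states[n, 1:] += rng.normal(0, [0.5, 0.05, 0.05])
            p = state_likelihood(states[n], z, models, sigma, t, w)
            tries += 1
        if p < min_p:  # still unlikely after max_tries: reinitialise
            states[n] = init_states(S=1)[0]
            p = state_likelihood(states[n], z, models, sigma, t, w)
        likelihoods[n] = p
    weights = normalise_weights(likelihoods)
    return select_and_predict(states, weights), weights
```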

Model Training

• The model was trained using six subjects, chosen at random, to represent the model data or training set for the algorithm.
• The model trajectories were created by interpolating each example of a particular gesture to the mean length of the trajectory for that gesture. Then the mean value at each time step, for each of the four feature trajectories, was calculated. Hence, each gesture was represented by four model trajectories, which were then used to match against the observed gestures (see the sketch below).
• The adjustable parameters were:
1. The window size $w$: the total length of the gestures varied between 30 and 90 frames.
2. The number of samples $S$.
3. The percentage of randomised samples at each iteration: used to minimise the risk of the condensation algorithm getting stuck in local maxima in the search space.
4. The warp ranges of the state parameters: control over the amount of generalisation of the model.
5. The number of tries: the size of the local search.
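A sketch of this training procedure, assuming each gesture's examples are arrays of shape (length, n_features):

```python
import numpy as np

def train_model(examples):
    """Average a gesture's example trajectories into one model trajectory.

    examples: list of arrays of shape (length_k, n_features), one per
    training subject. Each is interpolated to the mean length, then the
    per-frame mean is taken, giving one model trajectory per feature.
    """
    mean_len = int(round(np.mean([len(e) for e in examples])))
    resampled = []
    for e in examples:
        src = np.linspace(0, 1, len(e))
        dst = np.linspace(0, 1, mean_len)
        resampled.append(np.column_stack(
            [np.interp(dst, src, e[:, i]) for i in range(e.shape[1])]))
    return np.mean(resampled, axis=0)  # shape (mean_len, n_features)
```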

The Recognition Process: Model Fitting

• The video sequences used were manually trimmed so that each sequence contained one gesture, performed once.
• It would have been artificial to find the most likely gesture only at the end of the sequence, since the likelihood of each gesture evolved in time and depended on which part of the gesture was being matched.
• Also, in reality, such a system would have to deal with a whole sequence of many gestures, so the start and end of each gesture would be unknown.
• Therefore, it was important to find a suitable method of measuring when a gesture had been recognised by the algorithm.
• The method used involved taking the ratio of the second highest likelihood to the highest likelihood. If this fell below a predefined threshold for a certain number of consecutive frames (i.e. one gesture clearly dominated), then the states would be reset and the gesture considered recognised.

• As well as this, a maximum value for the phase parameter $\phi$ was used, so that when $\phi$ exceeded this value the gesture would also be considered recognised. This was used to emulate a situation where the end of a gesture might be unknown.
• The algorithm stops when either one of these two criteria is satisfied.
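A sketch of this stopping rule, interpreting the proportion as second-best over best likelihood; the threshold, run length, and maximum phase values are illustrative:

```python
import numpy as np

def gesture_recognised(history, phases, ratio_thresh=0.5, run=5,
                       max_phase=90):
    """Decide whether a gesture has been recognised.

    history: list of per-frame arrays holding the summed weights per gesture.
    phases: current phase values of the sample set.
    Recognised when the second-best/best ratio stays below ratio_thresh for
    `run` consecutive frames, or when any state's phase exceeds max_phase.
    """
    if np.any(phases > max_phase):  # phase ran past the model's end
        return True
    if len(history) < run:
        return False
    for frame in history[-run:]:
        best, second = np.sort(frame)[::-1][:2]
        if best <= 0 or second / best >= ratio_thresh:
            return False  # no clear winner in this frame
    return True
```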

Experiment

• The condensation algorithm was implemented using MATLAB to recognise six possible gestures, as shown in Figure 4.1.
• Video images of these six gestures were taken from sixteen different subjects, four times each. The gestures were: `come', `go', a high wave, a low wave, point left, and point right.
• All subjects were asked to perform the `come' and `go' gestures with both hands, and the waving and pointing gestures with their right hand.
• Training data was produced by taking six subjects, chosen at random: the mean values, after interpolation to the mean length of each gesture, were calculated over these six subjects to create the training set.
• Nine subjects were chosen for input observation sequences; the rest of the subjects were used to test the algorithm.
• Confusion matrices were generated for all results.

• Figure 4.1: The six gestures used for recognition, clockwise from top left: come, go, point right, point left, a high wave, and a low wave.

The parameter values were chosen to be:
1. The window size $w$: 10;
2. The number of samples $S$: 1000;
3. The percentage of randomised samples at each iteration: 10%;
4. The warp ranges of the state parameters $\rho$ and $\alpha$: $1 \pm 0.2$;
5. The number of tries: 100.
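Collected as a configuration (the representation is illustrative; the values are those listed above):

```python
params = {
    "window_size": 10,         # w: frames matched per likelihood evaluation
    "num_samples": 1000,       # S
    "random_fraction": 0.10,   # share of samples reinitialised per iteration
    "warp_range": (0.8, 1.2),  # 1 +/- 0.2, applied to rho and alpha
    "max_tries": 100,          # local re-prediction attempts per state
}
```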

Gesture Characteristics of Individual Subjects

• The overall performance of the algorithm was not good. Most subjects had only 2 or 3 gestures that were recognised at all, and only one subject had 5 out of 6 gestures recognised once or more. The TPRs, FPRs, and accuracies for each gesture for every subject are shown in Tables 4.1, 4.2, and 4.3.

• Figure 4.2: Feature vectors of the `come' gesture from the subject, `Chris'.

• Figure 4.3: Feature vectors of the `come' gesture from the training set.

• Figure 4.4: Feature vectors of the low waving gesture from the training set.

• Figure 4.5: Feature vectors of the left pointing gesture from the training set.

• Figure 4.6: Feature vectors of an example of the left pointing gesture from the subject, `Cth'.

• Figure 4.7: Feature vectors of an example of the left pointing gesture from the subject, `Kate'.

The Recognition of Individual Gestures

• The `come' and `go' gestures were the most distinguishable from other gestures, since they had comparatively high TPRs and among the highest accuracy values.
• However, when tests were run with just `come' and `go' as possible classifications, the two gestures were completely indistinguishable.
• Observing the actual training trajectories of `come' and `go' in Figures 4.3 and 4.8, we can see that they are also virtually indistinguishable.
• Hence, the feature extraction technique was not able to capture the hand pose or the orientation of the hands and wrist, which was where the fundamental difference between the two gestures lay.

• Figure 4.8: Feature vectors of the go gesture from the training set.

• Figure 4.9: Comparison of the motion area trajectories of the `come' and `go' gesture for subject, `Kate'.

• Figure 4.10: Comparison of the motion area trajectories of the `come' and `go' gesture for subject, `Chris'.

• Figure 4.11: Feature vectors of the high waving gesture from the training set.

The Effect of Classifying Fewer Gestures

The Effect of Altering Parameters in the Algorithm