spatial sound localization model using neural network


Audio Engineering Society

Convention Paper Presented at the 120th Convention 2006 May 20–23 Paris, France This convention paper has been reproduced from the author's advance manuscript, without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.

SPATIAL SOUND LOCALIZATION MODEL USING NEURAL NETWORK

R. Venegas1, M. Lara2, R. Correa3, and S. Floody4

1 Departamento de Acústica, Universidad Tecnológica de Chile, Santiago, Ñuñoa, 779-0569, Chile [email protected]

2 Departamento de Acústica, Universidad Tecnológica de Chile, Santiago, Ñuñoa, 779-0569, Chile [email protected]

3 Departamento de Física, Universidad Tecnológica Metropolitana, Santiago, Macul, Chile [email protected]

4 Departamento de Acústica, Universidad Tecnológica de Chile, Santiago, Ñuñoa, 779-0569, Chile [email protected]

ABSTRACT

This work presents the design, implementation and training of a spatial sound localization model for broadband sound in an anechoic environment, inspired by the human auditory system and implemented using artificial neural networks. The data were acquired experimentally. The model consists of a nonlinear transformer with two modules: the first extracts the ITD, ILD and ISLD, and the second is a neural network that estimates the sound source position in azimuth and elevation angle. A comparative study of the model performance using three different bank filters and a sensitivity analysis of the neural network inputs are also presented. The average error is 2.3º. This work was supported by the FONDEI fund of the Universidad Tecnológica de Chile.

1. INTRODUCTION

Spatial sound localization is the ability to identify the spatial location of a particular sound and, eventually, to associate it with the sound source that generates it. Two perspectives exist for the study of perceptual phenomena: cognition theory and the natural sciences [1].

This work focuses mainly on the natural sciences perspective, which includes physics, physiology and psychology. We use information obtained in these fields to build a computational model inspired by them. The objective is to emulate a human ability in a machine, not to explain or validate the human perceptual phenomenon.

Venegas et al.

Spatial Sound Localization model using NN

Spatial sound localization is, in general, a non-uniqueness problem; one of the main reasons is that each person has a different physiology and consequently a unique Head Related Transfer Function (HRTF) [2, 3, 4, 5]. Computational intelligence techniques such as artificial neural networks, with their learning ability, offer a powerful tool for solving the sound localization problem [6, 7, 8, 9] and, at the same time, provide a methodology for its resolution. The data for training the system were measured in the anechoic chamber of the Universidad Tecnológica de Chile. The measurement procedure and results are presented in reference [10].

The principal cues for sound localization in azimuth are the Interaural Time Differences (ITD) and the Interaural Level Differences (ILD) [1, 11, 12]. For elevation, the Interaural Spectral Level Differences (ISLD), which are the energy differences per frequency band between the left and right ear signals obtained by filtering those signals with a bank filter, give relevant information [1, 11, 13]. A spatial sound localization model for broadband sound in an anechoic environment is presented. The model is a nonlinear transformer whose inputs are the sounds received by each ear and whose output is the estimated sound source position in azimuth and elevation angle. The model consists of two modules. The first module extracts the ITD, ILD and ISLD; from these cues we obtain the feature vector for each direction. The second module is a three-layer feed-forward neural network whose inputs are the feature vectors and which estimates the sound source position.

A comparative study of the configuration performance using three different bank filters for the ISLD calculation is presented. The bank filters were: 18 filters of the Equivalent Rectangular Bandwidth Patterson-Holdsworth auditory filter (for simplicity, ERB filter or ERB bank filter from now on) [14, 15, 16], 18 filters of the Lyon passive cochlear model (Lyon filter or Lyon bank filter) [16, 17, 18] and 12 third-octave filters [19].

An error measure for the neural network configuration applied to the sound localization problem [20], the search process for an adequate configuration and a sensitivity analysis of the inputs of the best configuration are also presented. The sensitivity analysis examines how many neural network inputs are needed to obtain results similar to those of the network with 20 inputs, and how the error varies as the number of inputs decreases. The whole process was implemented in the Matlab environment.

2. REFERENCE SYSTEM

In this work the elevation angle varies from -90º to 90º (bottom to top) and the azimuth angle varies from 0º to 360º (right to left). This reference system is adopted in order to avoid discontinuities and redundancies in the absolute values of the neural network target encoding.

Figure 1 Reference System

3. SOUNDS AND MEASUREMENT PROCEDURE

Table 1 shows the sounds measured for the model construction. A total of 1708 sounds were measured.

Sound   Description
01      Logarithmic swept sine without emphasis, 1 [s]
02      White noise, 100 [ms]
03      White noise, 250 [ms]
04      White noise, 500 [ms]

Table 1 Sounds used in the model construction

AES 120th Convention, Paris, France, 2006 May 20–23 Page 2 of 21


At the center of the anechoic chamber an artificial head was kept in a fixed position during the measurement procedure. The sound source (loudspeaker) was placed at 1.2 meters from the origin of the reference system. The measurements for the training set were made in 5º azimuth steps for 0º elevation, 10º azimuth steps for elevation angles between -60º and 60º (in 15º steps) and 20º azimuth steps for 75º elevation. The test set was built from measurements in random directions. The sampling frequency was 44.1 kHz. Additionally, the Head Related Transfer Functions (HRTF) were obtained from the logarithmic swept sine. For more details and results see references [10, 21].

4. MODEL DESCRIPTION

Figure 2 shows a scheme of the model.

Figure 2 Block Diagram of the model

4.1. Interaural Time Differences - ITD

The Interaural Time Differences are the differences in the arrival time of the sound at the two ears. The calculation method is based on the one proposed by Jeffress [22]. The cross-correlation between the signals of both ears is used, since the lag of the cross-correlation maximum gives the ITD [22, 23]. Mathematically:

$$C_{x_R x_L}[m] = \begin{cases} \sum_{n=1}^{N-m-1} x_R[m+n]\, x_L[n], & m \geq 0 \\ C_{x_R x_L}[-m], & m < 0 \end{cases} \tag{1}$$

$$ITD = \arg\max_{m} \left( C_{x_R x_L}[m] \right) - \mathrm{nint}\left( \frac{2N-1}{2} \right) \tag{2}$$

where $x_R[n]$ is the right ear signal, $x_L[n]$ is the left ear signal, $N$ is the length of the signal in samples, and $\mathrm{nint}(\,)$ is the nearest integer function. Figure 3 shows the ITD for the logarithmic swept sine (see table 1).

4.2. Interaural Level Differences - ILD

The Interaural Level Differences are the differences in level of the sound arriving at both ears [1]. Figure 4 shows the ILD for the logarithmic swept sine (see table 1).

$$ILD = 10 \log \left( \frac{\sum_{n=0}^{N} \left( x_R[n] \right)^2}{\sum_{n=0}^{N} \left( x_L[n] \right)^2} \right) \tag{3}$$

4.3. Interaural Spectral Level Differences - ISLD

The Interaural Spectral Level Differences are the energy differences per frequency band between the left and right ear signals, obtained by filtering these signals with a bank filter. The procedure to calculate ISLD is described by equations 4 to 8.

$$X^{(R,L)}(f) = 20 \log \left| \mathrm{FFT}\left( x^{(R,L)}[n] \right) \right| \tag{4}$$
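The time and level cues of equations 1 to 3 can be sketched in a few lines. The following is a minimal, unoptimized illustration rather than the authors' Matlab implementation: the function names are hypothetical, the cross-correlation is evaluated directly for every lag (positive and negative), and because Python indexes from zero the centering offset is N - 1 instead of the nint((2N - 1)/2) used with one-based indexing.

```python
import math


def cross_correlation(x_r, x_l):
    """Full cross-correlation of two equal-length signals.

    Returns 2N-1 values; entry i corresponds to lag m = i - (N - 1),
    with C[m] = sum over n of x_r[m + n] * x_l[n].
    """
    n = len(x_r)
    corr = []
    for m in range(-(n - 1), n):
        total = 0.0
        for k in range(n):
            j = k + m
            if 0 <= j < n:
                total += x_r[j] * x_l[k]
        corr.append(total)
    return corr


def itd_samples(x_r, x_l):
    """ITD in samples: lag of the cross-correlation maximum.

    A positive value means the right-ear signal lags the left-ear signal.
    """
    corr = cross_correlation(x_r, x_l)
    peak_index = max(range(len(corr)), key=lambda i: corr[i])
    return peak_index - (len(x_r) - 1)


def ild_db(x_r, x_l):
    """ILD in dB: ratio of right-ear to left-ear signal energy."""
    energy_r = sum(v * v for v in x_r)
    energy_l = sum(v * v for v in x_l)
    return 10.0 * math.log10(energy_r / energy_l)
```

For signals of realistic length the double loop is slow; an FFT-based correlation would be the usual choice, but the direct form keeps the link to equation 1 visible.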

After obtaining the signal spectrum with an FFT that gives 1 Hz resolution, the spectrum is smoothed with a moving average (eq. 5). If the smoothing is not strong, it does not influence the localization [24]. For this work BW = 160 [Hz]; an analysis of this parameter is presented in reference [21].

$$X_s^{(R,L)}(f) = \frac{1}{BW} \sum_{f_p = f - \frac{BW}{2}}^{f + \frac{BW}{2}} X^{(R,L)}(f_p) \tag{5}$$

With: $f \in [f_0, f_1]$; $f_0 > 80$ [Hz]; $f_1 < 21.97$ [kHz]; $f_s = 44.1$ [kHz].
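The moving-average smoothing of equation 5 can be sketched as follows, assuming the spectrum is stored as a list with one level value per hertz. The function name and the data layout are assumptions made for illustration. The sketch divides by the number of bins actually summed (BW + 1 for an even BW), which differs marginally from the 1/BW factor of equation 5 but leaves a flat spectrum unchanged.

```python
def smooth_spectrum(level_db, bw, f0, f1):
    """Moving average of a spectrum sampled in 1 Hz steps.

    level_db[f] is the level at f Hz. Returns a dict {f: smoothed level}
    for every integer f in [f0, f1]. The window spans f - bw/2 .. f + bw/2.
    """
    half = bw // 2
    if f0 - half < 0 or f1 + half >= len(level_db):
        raise ValueError("window exceeds the available spectrum")
    count = 2 * half + 1
    # Running sum: initialise on the first window, then slide by one bin.
    window_sum = sum(level_db[f0 - half : f0 + half + 1])
    smoothed = {}
    for f in range(f0, f1 + 1):
        smoothed[f] = window_sum / count
        if f < f1:
            window_sum += level_db[f + half + 1] - level_db[f - half]
    return smoothed
```

The range check mirrors the limits stated with equation 5: with BW = 160 Hz the window only fits for frequencies at least 80 Hz away from both ends of the spectrum.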


Figure 3 ITD for Logarithmic swept in [samples]

Figure 4 ILD for Logarithmic swept in [dB]

The smoothed spectra of the two ears are filtered with a bank filter. Three different bank filters are used in this paper: the ERB Patterson-Holdsworth auditory bank filter [14, 15, 16], the Lyon bank filter [16, 17, 18] and the third-octave bank filter [19]. Figures 5 to 7 show the frequency responses of the three bank filters.

Filters with a central frequency greater than 1.25 kHz were used. The purpose of this process is to provide significant information for elevation detection, since for frequencies below 1.5 kHz the localization is dominated by ITD and ILD [1]. Finally, ISLD is obtained from equations 6, 7 and 8.
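The band-energy computation that equations 6 to 8 describe can be sketched as below. This is an illustration under stated assumptions, not the original Matlab code: spectra and filter magnitude responses are taken to be lists indexed by frequency in hertz, the bandwidth of a filter is taken as the difference between its cut-off frequencies, and the function names are hypothetical.

```python
import math


def band_energy_db(xs_db, filt_mag, f_low, f_high):
    """Energy per band in dB for one ear and one filter (eqs. 6 and 7).

    xs_db[f]    smoothed spectrum level in dB at f Hz
    filt_mag[f] magnitude response of the filter at f Hz (must be > 0)
    """
    bandwidth = f_high - f_low  # assumption: BW = f_H - f_L
    total = 0.0
    for f in range(f_low, f_high + 1):
        # Eq. 6: apply the filter in the dB domain.
        y = xs_db[f] + 20.0 * math.log10(filt_mag[f])
        # Eq. 7: accumulate in the linear power domain.
        total += 10.0 ** (0.1 * y)
    return 10.0 * math.log10(total / bandwidth)


def isld(xs_right_db, xs_left_db, filters):
    """Eq. 8: right-minus-left band energy for every filter.

    filters is a list of (filt_mag, f_low, f_high) tuples.
    """
    return [
        band_energy_db(xs_right_db, mag, lo, hi)
        - band_energy_db(xs_left_db, mag, lo, hi)
        for mag, lo, hi in filters
    ]
```

Because the same filter is applied to both ears, a uniform level offset between the ears yields that offset in every band, whatever the filter shape.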


$$Y_{(k,F)}^{(R,L)}(f) = X_s^{(R,L)}(f) + 20 \log \left( BF_{(k,F)}(f) \right) \tag{6}$$

$$E_F^{(R,L)}(k) = 10 \cdot \log \left( \frac{1}{BW_{(k,F)}} \sum_{f = f_L(k,F)}^{f_H(k,F)} 10^{\,0.1 \cdot Y_{(k,F)}^{(R,L)}(f)} \right) \tag{7}$$

$$ISLD_F(k) = E_F^{(R)}(k) - E_F^{(L)}(k) \tag{8}$$

Where:

$F$: Bank filter {ERB, Lyon or Third Octave}.

$BF_{(k,F)}(f)$: Magnitude of the frequency response of the kth filter of the bank filter F.

$BW_{(k,F)}$: Bandwidth of the kth filter of the bank filter F.

$E_F^{(R,L)}$: Energy per frequency band of the right (R) or left (L) ear signal filtered with the bank filter F.

$f_L(k,F)$: Low cut-off frequency of the kth filter of the bank filter F.

$f_H(k,F)$: High cut-off frequency of the kth filter of the bank filter F.

For the ERB and Lyon bank filters k=1:18. For the third-octave bank filter k=1:12.

5. FEATURE VECTOR

Equations 1 to 8 were applied to all database elements to obtain ILD, ITD and ISLD; the feature vector is constructed from these cues. The feature vector is a function of the spatial location of the sound source.

$$V(\theta, \varphi) = [\, ISLD \;\; ILD \;\; ITD \,]_{k \times 1} \tag{9}$$

With k=20 for ISLD calculated using the ERB and Lyon bank filters and k=14 for ISLD calculated using the third-octave bank filter.

The feature vector components are the neural network inputs. To normalize these inputs, the average of the maximum absolute values over the sounds in the database was calculated after the ISLD extraction, independently for each of the three bank filters. All inputs were normalized, which yielded three different training and test sets.

6. NEURAL NETWORKS

A complete theoretical and practical development is explained in references [25, 26].

The normalized feature vector components are the neural network inputs. The neural network targets are the azimuth and elevation angles, normalized to the interval [0, 1] for azimuth and [-1, 1] for elevation. The neural networks are trained with features extracted from measured sounds, not with the HRTF [9] or with features obtained from the convolution of different signals with the HRTF [6]. The reason is that the influence of signal level reported by Jost [27] for HRTF measurements on artificial heads and by Vliegen [28] for human sound localization is not completely captured by the auralization process; in addition, the effect of background noise is ignored or held constant. Extracting the features from experimental data is therefore a more realistic basis for training the neural network. Figures 8 to 10 show the initial training set for ISLD calculated with the ERB, Lyon and third-octave bank filters respectively. The initial training set contained 79.7% of the database elements (representing 1361 directions) and the test set contained 20.3% (representing 347 directions).
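The target normalization can be illustrated with a short sketch. The text states only the target intervals; the linear scaling used here (azimuth divided by 360º, elevation divided by 90º) is an assumption, and both function names are hypothetical.

```python
def encode_target(azimuth_deg, elevation_deg):
    """Map azimuth in [0, 360] to [0, 1] and elevation in [-90, 90] to [-1, 1].

    Assumes a plain linear scaling, which the paper does not state explicitly.
    """
    return azimuth_deg / 360.0, elevation_deg / 90.0


def decode_target(azimuth_norm, elevation_norm):
    """Inverse mapping: network outputs back to degrees."""
    return azimuth_norm * 360.0, elevation_norm * 90.0
```

With this encoding a source at 180º azimuth and -45º elevation becomes the target pair (0.5, -0.5), and network outputs are converted back to degrees before any error in degrees is computed.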


Figure 5 Frequency Response of ERB Bank Filter

Figure 6 Frequency Response of Lyon Bank Filter


Figure 7 Frequency Response of Third Octave Bank Filter

Figure 8 Initial Training Set – ISLD calculated using ERB Bank Filter


Figure 9 Initial Training Set – ISLD calculated using Lyon Bank Filter

Figure 10 Initial Training Set – ISLD calculated using Third Octave Bank Filter


It is important to analyze the difference between the training sets. Figure 11 shows the absolute difference between the initial training sets whose inputs were calculated using the ERB and Lyon bank filters. From input number 10 onwards the differences are significant. The periodic structure is due to the nature of the sounds (broadband sounds). After the feature extraction process the inputs are similar; however, in the non-normalized ISLD, differences of around 5 [dB] were found for some directions [21].

Figure 11 Absolute differences between initial training sets calculated using ERB and Lyon Bank Filter

The training algorithms used in this investigation are presented in table 2.

Algorithm                                        Name
Quasi-Newton Back propagation                    trainbfg
Scaled Conjugate Gradient                        trainscg
Conjugate Gradient with Powell/Beale Restarts    traincgb
Fletcher-Powell Conjugate Gradient               traincgf
Polak-Ribiére Conjugate Gradient                 traincgp
One-Step Secant                                  trainoss

Table 2 Training algorithms

The neural networks studied were three-layer feed-forward nets with 20 neurons in the input layer, 13 neurons in the hidden layer and 2 neurons in the output layer (20-13-2), as well as 20-15-2 and 20-17-2 nets, with sigmoidal hyperbolic tangent (tansig), sigmoidal logarithmic (logsig) and linear (purelin) transfer functions respectively. Configurations 20-13-2, 20-15-2 and 20-17-2 with sigmoidal logarithmic (logsig), sigmoidal hyperbolic tangent (tansig) and linear (purelin) transfer functions respectively were also studied. Each configuration was trained three independent times for each training set (inputs calculated with the different bank filters) with random weight initialization.

7. ERROR MEASURES

A global error measure [20] is described by equations 10 to 13.

$$E^{(TR,TE)}(\theta, \varphi) = y^{(TR,TE)}(\theta, \varphi) - t^{(TR,TE)}(\theta, \varphi) \tag{10}$$

$$E_c^{(TR,TE)}(\theta, \varphi) = \frac{1}{H} \sum_{i=1}^{H} \left( E_i^{(TR,TE)}(\theta, \varphi) \right) \tag{11}$$

$$E_p(TR, TE) = \frac{1}{2} \left[ E_c^{(TR,TE)}(\theta) + E_c^{(TR,TE)}(\varphi) \right] \tag{12}$$

$$E_g = \frac{1}{2} \left[ E_p(TR) + E_p(TE) \right] \tag{13}$$

Where:

$E$: Mean localization error for $\theta$ and $\varphi$ in training (TR) or test (TE).

$\theta$: Azimuth angle.

$\varphi$: Elevation angle.

$y$: Neural network output for $\theta$ and $\varphi$.

$t$: Target for $\theta$ and $\varphi$.

$H$: Number of trainings, H = 3.

$E_c$: Average error of the net configuration for $\theta$ and $\varphi$ in training (TR) or test (TE).

$E_p$: Total error in training (TR) or test (TE) of the net configuration.

$E_g$: Global error.
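The chain of averages in equations 10 to 13 can be sketched as follows. The sketch assumes that the mean localization error of equation 10 is the mean absolute difference between network output and target over a data set; that reading, like the function names, is an assumption rather than something the text confirms.

```python
def mean_localization_error(outputs, targets):
    """Eq. 10 (assumed form): mean absolute output-target difference."""
    return sum(abs(y - t) for y, t in zip(outputs, targets)) / len(targets)


def configuration_error(run_errors):
    """Eq. 11: average of the error over the H independent trainings."""
    return sum(run_errors) / len(run_errors)


def total_error(ec_azimuth, ec_elevation):
    """Eq. 12: mean of the azimuth and elevation errors for one data set."""
    return 0.5 * (ec_azimuth + ec_elevation)


def global_error(ep_train, ep_test):
    """Eq. 13: mean of the training and test total errors."""
    return 0.5 * (ep_train + ep_test)
```

A configuration trained H = 3 times would supply three values per angle and per data set to `configuration_error`, and the two resulting totals are then combined into a single global figure.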

Another error measure, used to obtain the surface of configurations, is the average Euclidean distance between the neural network output and the target:

$$D = \frac{1}{N} \sum_{i=1}^{N} \sqrt{ \left( y_\theta - t_\theta \right)_i^2 + \left( y_\varphi - t_\varphi \right)_i^2 } \tag{14}$$

where N is the number of elements.

8. RESULTS AND DISCUSSIONS

A heuristic search of the configuration was made. The first test analyzed the number of neurons in the hidden layer and the transfer functions of the input and hidden layers. Figures 12 to 17 show the global error for configurations with tansig-logsig-purelin and logsig-tansig-purelin transfer functions.

For this problem, neural nets with tansig-logsig-purelin transfer functions always perform better than neural nets with logsig-tansig-purelin transfer functions, independently of the number of neurons in the hidden layer and of the training algorithm. The performance does not increase significantly when the number of neurons in the hidden layer increases. A greater number of neurons in the hidden layer makes the weight surface more complex, with many local minima, which makes the convergence of the training process more difficult and increases the computational cost [25, 26, 29]. Subsequent analysis was carried out using the 20-15-2 tansig-logsig-purelin configuration. To reduce the error, the test set elements with large errors were included in the training set.

Figures 18, 19 and 20 show the global error of the 20-15-2 tansig-logsig-purelin configuration using different training and test sets. The number of elements in the training and test sets is expressed as a percentage of the total number of database elements (1708 elements).
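To make the 20-15-2 tansig-logsig-purelin configuration concrete, the sketch below runs a forward pass through such a network. It assumes that each of the three sizes denotes a layer of neurons with its own weights and biases, fed by a 20-component feature vector. The weights are random and untrained, and all function names are hypothetical; only the transfer functions follow their standard definitions.

```python
import math
import random


def tansig(x):
    """Hyperbolic tangent sigmoid."""
    return math.tanh(x)


def logsig(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def purelin(x):
    """Linear transfer function."""
    return x


def build_network(n_inputs=20, sizes=(20, 15, 2), seed=0):
    """Create untrained layers as (weights, biases) with small random values."""
    rng = random.Random(seed)
    layers = []
    fan_in = n_inputs
    for size in sizes:
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(fan_in)]
                   for _ in range(size)]
        biases = [rng.uniform(-0.5, 0.5) for _ in range(size)]
        layers.append((weights, biases))
        fan_in = size
    return layers


def predict(layers, features, transfers=(tansig, logsig, purelin)):
    """Propagate one feature vector through the layers in order."""
    signal = features
    for (weights, biases), transfer in zip(layers, transfers):
        signal = [
            transfer(sum(w * v for w, v in zip(row, signal)) + b)
            for row, b in zip(weights, biases)
        ]
    return signal
```

The two outputs correspond to the normalized azimuth and elevation estimates; swapping the first two entries of `transfers` gives the logsig-tansig-purelin variant that was also studied.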

Figure 12 Global Error – ISLD calculated using ERB Bank Filter - Neural Net Configuration with tansig-logsig-purelin transfer functions.


Figure 13 Global Error – ISLD calculated using ERB Bank Filter - Neural Net Configuration with logsig-tansig-purelin transfer functions.

Figure 14 Global Error – ISLD calculated using Lyon Bank Filter - Neural Net Configuration with tansig-logsig-purelin transfer functions.


Figure 15 Global Error – ISLD calculated using Lyon Bank Filter - Neural Net Configuration with logsig-tansig-purelin transfer functions.

Figure 16 Global Error – ISLD calculated using Third Octave Bank Filter - Neural Net Configuration with tansig-logsig-purelin transfer functions.


Figure 17 Global Error – ISLD calculated using Third Octave Bank Filter - Neural Net Configuration with logsig-tansig-purelin transfer functions.

Figure 18 Global error – ISLD calculated using ERB Bank Filter – 20-15-2 tansig-logsig-purelin for different Training/Test set percentages.


Figure 19 Global error – ISLD calculated using Lyon Bank Filter – 20-15-2 tansig-logsig-purelin for different Training/Test set percentages.

Figure 20 Global error – ISLD calculated using Third Octave Bank Filter – 20-15-2 tansig-logsig-purelin for different Training/Test set percentages


The reference for the analysis of the training/test set percentages was the pair of training and test sets calculated with the ERB bank filter. In the first modification, 18 test set elements were moved into the ERB training set; likewise, 18 test elements were moved into the respective training sets for the Lyon and third-octave cases. However, the elements included in each training set were entirely different, since the directions with the greatest error in the neural net response vary with the training set. This is due to the ISLD: the central frequencies of the filters differ between bank filters, so the ISLD gives better information for certain directions. The results are nevertheless comparable in terms of the training/test set percentages. When the number of training set elements increases, the global error decreases, but after four modifications the improvement is not significant. The best training algorithm for this problem is the Quasi-Newton Back propagation (trainbfg) (see figures 12 to 20).

The space of configurations is a high-dimensional space composed of the neural network parameters (neurons, layers, transfer functions, aggregation methods, connections, etc.). In this paper a surface of configurations is constructed from all configurations studied, the training algorithms and an error measure. To obtain the best configuration, the argument that minimizes the surface of configurations was found. The error measure used is the average Euclidean distance (eq. 14).

Figures 21 and 22 show the surface of configurations for the test set and the training set respectively; figure 23 shows the surface of configurations considering the test and training sets as a single set (see Appendix A for the description of the neural net configurations). Table 3 presents the parameters and characteristics of the best model. Table 4 presents the median of the average Euclidean distances for configurations trained with the ERB, Lyon and third-octave training sets (42 configurations for each bank filter). It confirms that the best algorithm for this problem is Quasi-Newton Back propagation. On the other hand, the ERB and Lyon configurations perform better than the third-octave configurations. The performance difference between the ERB and Lyon configurations is not significant, which indicates the robustness of the neural network in solving the broadband sound localization problem: the inputs are different (see figure 11) and the final results are similar.

A sensitivity analysis with the parameters of the best configuration (see table 3) was carried out. It consists in analyzing how many neural network inputs are needed to obtain results similar to those of the best configuration, and how the error varies as the number of inputs decreases. Each configuration was trained three independent times with random weight initialization.

Neurons                    20-15-2
Transfer Function          Sigmoidal hyperbolic tangent (tansig)
                           Sigmoidal logarithmic (logsig)
                           Linear (purelin)
Inputs (feature vector)    ISLD calculated using ERB Bank Filter (18 filters)
                           ILD
                           ITD
Train Set                  1433 Directions (Total=1708)
Test Set                   275 Directions (Total=1708)
Train Algorithm            Quasi-Newton Backpropagation
Error Measure (Average Euclidean Distances) [º/direction]
                           Test Set: 3.9
                           Train Set: 1.5
                           Train Set + Test Set: 2.3

Table 3 Parameters and characteristics of the best model
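The selection of the best configuration, as the argument that minimizes the average Euclidean distance of equation 14, can be sketched as follows. The function names and the dictionary layout are hypothetical, and outputs and targets are assumed to be (azimuth, elevation) pairs in degrees.

```python
import math


def average_euclidean_distance(outputs, targets):
    """Eq. 14: mean distance between output and target direction pairs."""
    total = 0.0
    for (y_az, y_el), (t_az, t_el) in zip(outputs, targets):
        total += math.hypot(y_az - t_az, y_el - t_el)
    return total / len(targets)


def best_configuration(error_by_configuration):
    """Return the key of the configuration with the smallest error."""
    return min(error_by_configuration, key=error_by_configuration.get)
```

Evaluating `average_euclidean_distance` for every trained configuration and passing the results to `best_configuration` reproduces, in miniature, the search for the minimum of the surface of configurations.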


Error average in [º/(direction · configuration)]

                      ERB Bank Filter                  Lyon Bank Filter                 Third Octave Bank Filter
Training Algorithm    Test    Training  Training+Test  Test    Training  Training+Test  Test    Training  Training+Test
Trainbfg              11.15   2.04      3.82           11.2    2.37      4.06           13.12   3.55      5.37
Trainscg              11.97   5.23      6.54           11.61   5.59      6.75           14.37   7.58      8.89
Traincgb              12.75   5.75      7.1            13.01   6.77      7.96           14.69   8.71      9.85
Traincgf              12.73   6.66      7.83           12.39   7.17      8.18           14.98   9.74      10.76
Traincgp              13.47   7.52      8.66           13.46   7.98      9.02           16.05   10.58     11.63
Trainoss              15.8    10.21     11.28          15.17   11.16     11.94          17.66   13.81     14.56

Table 4 Median of average Euclidean distances for configurations trained using ERB, Lyon and Third Octave Bank Filter

Figure 21 Surface of Configurations for Test Set – All configurations.


Figure 22 Surface of Configurations for Training Set – All configurations

Figure 23 Surface of Configurations for Training and Test Set like a single set – All configurations.


Figure 24 Sensitivity analysis – Median of Average Euclidean error vs number of inputs

The inputs of the first training were ITD, ILD and ISLD calculated using 17 ERB filters, those of the second training were ITD, ILD and ISLD calculated using 16 ERB filters, and so on. The training with three inputs used ITD, ILD and the ISLD calculated only with the filter of highest central frequency of the ERB bank filter (see figure 5). The inputs of the last training were only ITD and ILD.

Figure 24 shows the median of the average Euclidean distance versus the number of inputs.

The analysis confirms that 20 inputs are necessary to obtain an error of approximately 4º with the test set. Considering the test and training sets as a single set, an average error of around 4º is obtained from 14 inputs onwards. This result is comparable with the best performance of the third-octave neural net configuration; however, this interesting coincidence needs more study. More tests are necessary to quantify the real influence of the ISLD components, also considering combinations of the inputs.

9. CONCLUSIONS

This work describes the design, implementation and training of a spatial sound localization model based on interaural time, level and spectral level differences for broadband sound in an anechoic environment using neural networks.

A total of 804 configurations, including the sensitivity analysis, were studied.

The signals for training and test were a logarithmic swept sine and white noise of different lengths. Due to the nature of the sounds, the system architecture is invariant to the length of the signal.

The final model takes 1 [s] on a personal computer without exclusive dedication.

An error measure is presented. According to this measure, the performance does not increase significantly when the hidden layer has more than 15 neurons. The 20 neurons in the input layer were initially chosen by heuristic search, and the sensitivity analysis confirms this choice. The best training algorithm for this problem is the Quasi-Newton Back propagation (trainbfg).


From table 4, the ERB and Lyon configurations perform better than the third-octave configurations. The difference in performance between the ERB and Lyon configurations is, on average, not significant, which indicates the robustness of the neural network in solving the sound localization problem for broadband sound: with different inputs the final results are similar. The number of neural network inputs is smaller than that used in other works [7, 8] and proves sufficient on the basis of the results obtained by the model.

The average error of the final model, measured with the average Euclidean distance, is 2.3 [º/direction]. This error does not include the experimental error, estimated at 3º [21]; in practice, the real estimated average error is 5.3º. The error is comparable to that presented by other authors in models and measurements of human sound localization for broadband sounds [30, 31, 12].

Neural networks offer a powerful tool for modeling and for generating resolution methodologies for non-uniqueness problems, or for problems whose mathematical model, in the classic sense, is very complex or has not been developed yet.

The principal limitations of the presented model are that it was tested only with broadband sound and only in an anechoic environment. Future work will address these limitations and will include a more accurate sensitivity analysis to quantify the real influence of each input.

10. ACKNOWLEDGEMENTS

This work was supported by the Fondo de Desarrollo de la Investigación (FONDEI) of the Universidad Tecnológica de Chile.

11. REFERENCES

[1] J. Blauert, “Spatial Hearing, The Psychophysics of Human Sound Localization”, MIT Press, 2nd ed., Cambridge MA, USA, 1983.

[2] V. Algazi, P. Diventi, R. Duda, “Subject dependent Transfer Function in Spatial Hearing”, in Proc. 1997 IEEE MWSCAS Conference, 1997.

[3] S. Carlile, D. Pralong, “The location-dependent nature of perceptually salient features of the human related transfer functions”, J. Acoust. Soc. Am., Vol. 95 No. 6, pp: 3445-3459, 1994, June.

[4] V. Algazi, R. Duda, D. Thompson, C. Avendano, “The CIPIC HRTF Database”, Proceedings IEEE Workshop on Applications of Signal Processing to Audio and Electroacoustics, pp: 99-102, Mohonk Mountain House, New Paltz, NY, 2001, October 21-24.

[5] E. M. Wenzel, M. Arruda, D. J. Kistler, F. L. Wightman, “Localization using nonindividualized head related transfer functions”, J. Acoust. Soc. Am., Vol. 94 No. 1, pp: 111-123, 1993, July.

[6] T. R. Anderson, J. A. Janko, R. H. Gilkey, “An Artificial neural network model of human sound”, J. Acoust. Soc. Am., Vol. 92 No. 4, pp: 2298, 1992, October.

[7] J. Backman, M. Karjalainen, “Modelling of Human directional and Spatial Hearing using neural networks”, in Proceedings IEEE International Conference Acoustics Speech and Signal Processing (ICASSP’93), Vol. 1, pp: 125-128, Minneapolis, Minnesota, USA, 1993, April.

[8] C. Jin, M. Schenkel, S. Carlile, “Neural system identification model of human sound localization”, J. Acoust. Soc. Am., Vol. 108 No. 3 Pt. 1, pp: 1215-1235, 2000, September.

[9] D. Nandy, J. Ben-Arie, “Neural models for auditory localization based on spectral cues”, Neurological Research, Vol. 23 No. 5, pp: 489-500, 2000, July.

[10] R. Venegas, M. Lara, R. Correa, S. Floody, “Medición de HRTF y extracción de características significativas para localización sonora”, Semacus 2005, Universidad Pérez Rosales, 2005, Octubre.

[11] D. Begault, “3-D Sound for Virtual Reality and Multimedia”, NASA Ames Research Center, Moffett Field, California, 2000.


[12] K. Martin, “A computational model of Spatial Hearing”, MSc. dissertation, Massachusetts Institute of Technology MIT, 1995. [13] . Roffler, R. Butler, “Factors that influence the localization of sound in the vertical plane”, J. Acoust. Soc. Am., Vol 43, No 6, pp: 1255-1259, 1968. [14] R.D. Patterson, K. Robinson, J. Holdsworth, D. McKeown, C. Zhang, H. Allerhand, “Complex sounds and auditory images” in Auditory Physiology and Perception, (Eds) Cazals, Y., Demany, K., and Horner, K., Pergamon, Exford, 1992. [15] M. Slaney, “An efficient Implementation of the Patterson-Holdsworth Auditory Filter Bank”, Apple Technical Report No. 35, Advanced Technology Group, Apple Computer, Inc, Cupertino. CA, 1993. [16] M. Slaney, “Auditory Toolbox Version 2”, Technical Report #1998-010, Interval Research Corporation, 1998. [17] R. F. Lyon, “A computational model of filtering, detection, and compression in the cochlea,” in Proc. IEEE Inr. Conf. Acousr., Speech, Signal Processing, Pans, France, 1982, May. [18] M Slaney, Lyon’s Cochlear Model, Technical Report #13, Apple Computer, Inc, 1988. [19] IEC 61260, Electroacoustic - Octave-band and fractional octave-band filters, 1995. [20] R. Venegas, M. Lara, R. Correa, S. Floody, “Modelo de localización sonora espacial 2D para sonidos de banda ancha”, EMFIMIN 2005, Universidad de Santiago, 2005, Noviembre. [21] R. Venegas, M. Lara, R. Correa, S. Floody, “Diseño, Implementación y Entrenamiento de un sistema de Identificación Sonora Espacial Neurodifuso”, Reporte Técnico, Proyecto FONDEI, Universidad Tecnológica de Chile, 2005. [22] L. A. Jeffress, ‘‘A place theory of sound localization,’’ J. Comp. Physiol. Psychol. Vol. 41, pp: 35–39., 1948.

[23] W. Lindemann, “Extension of a binaural cross-correlation model by contralateral inhibition. I. Simulation of lateralization for stationary signals”, J. Acoust. Soc. Am., Vol. 80, pp. 1608-1622, 1986, December.
[24] A. Kulkarni, H. S. Colburn, “Role of spectral detail in sound-source localization”, Nature, Vol. 396 No. 6713, pp. 747-749, 1998, December.
[25] N. Kasabov, “Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering”, A Bradford Book, MIT Press, Cambridge, MA, 1998.
[26] M. Gupta, L. Jin, N. Homma, “Static and Dynamic Neural Networks”, IEEE Press, John Wiley and Sons, 2003.
[27] A. Jost, D. Begault, “Observed effects of HRTF measurement signal level”, presented at the AES 21st International Conference, St. Petersburg, Russia, 2002, June.
[28] J. Vliegen, J. Van Opstal, “The influence of duration and level on human sound localization”, J. Acoust. Soc. Am., Vol. 115 No. 4, pp. 1705-1713, 2004, April.
[29] S. Lawrence, C. L. Giles, A. C. Tsoi, “What Size Neural Network Gives Optimal Generalization? Convergence Properties of Backpropagation”, Technical Report UMIACS-TR-96-22 and CS-TR-3617, Institute for Advanced Computer Studies, University of Maryland, 1996.
[30] R. Butler, “Monaural and binaural localization of noise bursts vertically in the median sagittal plane”, J. Aud. Res., Vol. 3, pp. 245-246, 1969.
[31] J. Hebrank, D. Wright, “Are two ears necessary for median plane localization”, J. Acoust. Soc. Am., Vol. 56, pp. 935-938, 1974, September.

12. APPENDIX A

Table 5 shows all the neural network configurations and their parameters. Each training process is considered to be a different configuration because random weight initialization was used.


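| Index | Neurons | Transfer Function | % Training/Test Elements | Filter bank used in ISLD calculation |
| 1:3 | 20-13-2 | tansig-logsig-purelin | 79.7-20.3 | ERB filter bank |
| 4:6 | 20-13-2 | logsig-tansig-purelin | 79.7-20.3 | ERB filter bank |
| 7:9 | 20-15-2 | tansig-logsig-purelin | 79.7-20.3 | ERB filter bank |
| 10:12 | 20-15-2 | logsig-tansig-purelin | 79.7-20.3 | ERB filter bank |
| 13:15 | 20-17-2 | tansig-logsig-purelin | 79.7-20.3 | ERB filter bank |
| 16:18 | 20-17-2 | logsig-tansig-purelin | 79.7-20.3 | ERB filter bank |
| 19:21 | 20-15-2 | tansig-logsig-purelin | 80.9-19.1 | ERB filter bank |
| 22:24 | 20-15-2 | tansig-logsig-purelin | 81.6-18.4 | ERB filter bank |
| 25:27 | 20-15-2 | tansig-logsig-purelin | 83.3-16.7 | ERB filter bank |
| 28:30 | 20-15-2 | tansig-logsig-purelin | 83.9-16.1 | ERB filter bank |
| 31:33 | 20-15-2 | tansig-logsig-purelin | 84.2-15.8 | ERB filter bank |
| 34:36 | 20-15-2 | tansig-logsig-purelin | 84.5-15.5 | ERB filter bank |
| 37:39 | 20-15-2 | tansig-logsig-purelin | 84.8-15.2 | ERB filter bank |
| 40:42 | 20-15-2 | tansig-logsig-purelin | 85.2-14.8 | ERB filter bank |
| 43:45 | 20-13-2 | tansig-logsig-purelin | 79.7-20.3 | Lyon filter bank |
| 46:48 | 20-13-2 | logsig-tansig-purelin | 79.7-20.3 | Lyon filter bank |
| 49:51 | 20-15-2 | tansig-logsig-purelin | 79.7-20.3 | Lyon filter bank |
| 52:54 | 20-15-2 | logsig-tansig-purelin | 79.7-20.3 | Lyon filter bank |
| 55:57 | 20-17-2 | tansig-logsig-purelin | 79.7-20.3 | Lyon filter bank |
| 58:60 | 20-17-2 | logsig-tansig-purelin | 79.7-20.3 | Lyon filter bank |
| 61:63 | 20-15-2 | tansig-logsig-purelin | 80.9-19.1 | Lyon filter bank |
| 64:66 | 20-15-2 | tansig-logsig-purelin | 81.6-18.4 | Lyon filter bank |
| 67:69 | 20-15-2 | tansig-logsig-purelin | 83.3-16.7 | Lyon filter bank |
| 70:72 | 20-15-2 | tansig-logsig-purelin | 83.9-16.1 | Lyon filter bank |
| 73:75 | 20-15-2 | tansig-logsig-purelin | 84.2-15.8 | Lyon filter bank |
| 76:78 | 20-15-2 | tansig-logsig-purelin | 84.5-15.5 | Lyon filter bank |
| 79:81 | 20-15-2 | tansig-logsig-purelin | 84.8-15.2 | Lyon filter bank |
| 82:84 | 20-15-2 | tansig-logsig-purelin | 85.2-14.8 | Lyon filter bank |
| 85:87 | 20-13-2 | tansig-logsig-purelin | 79.7-20.3 | Third-octave filter bank |
| 88:90 | 20-13-2 | logsig-tansig-purelin | 79.7-20.3 | Third-octave filter bank |
| 91:93 | 20-15-2 | tansig-logsig-purelin | 79.7-20.3 | Third-octave filter bank |
| 94:96 | 20-15-2 | logsig-tansig-purelin | 79.7-20.3 | Third-octave filter bank |
| 97:99 | 20-17-2 | tansig-logsig-purelin | 79.7-20.3 | Third-octave filter bank |
| 100:102 | 20-17-2 | logsig-tansig-purelin | 79.7-20.3 | Third-octave filter bank |
| 103:105 | 20-15-2 | tansig-logsig-purelin | 80.9-19.1 | Third-octave filter bank |
| 106:108 | 20-15-2 | tansig-logsig-purelin | 81.6-18.4 | Third-octave filter bank |
| 109:111 | 20-15-2 | tansig-logsig-purelin | 83.3-16.7 | Third-octave filter bank |
| 112:114 | 20-15-2 | tansig-logsig-purelin | 83.9-16.1 | Third-octave filter bank |
| 115:117 | 20-15-2 | tansig-logsig-purelin | 84.2-15.8 | Third-octave filter bank |
| 118:120 | 20-15-2 | tansig-logsig-purelin | 84.5-15.5 | Third-octave filter bank |
| 121:123 | 20-15-2 | tansig-logsig-purelin | 84.8-15.2 | Third-octave filter bank |
| 124:126 | 20-15-2 | tansig-logsig-purelin | 85.2-14.8 | Third-octave filter bank |

Table 5 Neural network configurations and their parameters (the index corresponds to the x axis of Figures 21 to 23). Parameters of the sensitivity analysis configurations are not included.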
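As an illustration of how one row of Table 5 could be read, the following minimal Python sketch builds a feed-forward network from a "Neurons" entry and a "Transfer Function" entry and initializes it with different random seeds. It is only a sketch under stated assumptions, not the implementation used in this work: each entry is assumed to list the neurons and transfer function per layer, the input size N_INPUTS and the uniform initialization range are hypothetical, and the helper names init_network and forward are introduced here for illustration. The transfer functions follow their usual definitions (tansig as hyperbolic tangent, logsig as the logistic function, purelin as the identity).

```python
import numpy as np

# Usual definitions of the three transfer functions named in Table 5.
def tansig(x):
    return np.tanh(x)

def logsig(x):
    return 1.0 / (1.0 + np.exp(-x))

def purelin(x):
    return x

TRANSFER = {"tansig": tansig, "logsig": logsig, "purelin": purelin}

def init_network(n_inputs, neurons, seed):
    """Random weights and biases for a layout such as '20-15-2' (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    sizes = [n_inputs] + [int(n) for n in neurons.split("-")]
    # Assumed initialization range [-1, 1); the source does not state the one used.
    return [(rng.uniform(-1.0, 1.0, (n_out, n_in)), rng.uniform(-1.0, 1.0, n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(params, transfer, x):
    """Propagate one input vector through the layers (hypothetical helper)."""
    funcs = [TRANSFER[name] for name in transfer.split("-")]
    for (weights, bias), func in zip(params, funcs):
        x = func(weights @ x + bias)
    return x

N_INPUTS = 8  # hypothetical number of input features, not taken from the paper
x = np.linspace(-1.0, 1.0, N_INPUTS)  # placeholder input vector

# Three random initializations of the same row, e.g. indices 7:9 of Table 5.
runs = [init_network(N_INPUTS, "20-15-2", seed) for seed in (0, 1, 2)]
outputs = [forward(p, "tansig-logsig-purelin", x) for p in runs]
```

Each element of outputs has two components, matching the two output neurons (elevation and azimuth estimates in the model). Because the seeds differ, the untrained outputs differ as well, which is the reason each training process is counted as a separate configuration.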

