Audio Programming in C & Swift: HRTF Rendering in a Simple App

If you’ve read the previous topics on convolution and binaural rendering, you are probably wondering: ‘What if we could do this processing with all the virtual loudspeakers simultaneously, rendering our sound for all directions in real time?’

Let’s take a look at one of the HRTF databases first!



This is the KEMAR database from MIT. As the elevation goes up, the number of measurement positions on each elevation layer decreases. The filenames also clearly encode the azimuth and elevation of each measurement, so a program can parse the filenames and then read all of the files into memory!
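As a rough illustration of the filename idea (a sketch, not the function used in this project), the C snippet below assumes names that follow the MIT KEMAR pattern, such as `H-10e005a.wav` for elevation -10 and azimuth 5, with a leading `H`, `L`, or `R`. The `HrirKey` struct and `parse_hrir_filename` are hypothetical names introduced only for this example, and the function expects a bare filename without a directory prefix.

```c
#include <stdio.h>

/* Hypothetical record describing one measurement position. */
typedef struct {
    char side;      /* 'H' (compact set), 'L' or 'R' (full set): assumed prefixes */
    int elevation;  /* degrees, parsed from the digits before 'e' */
    int azimuth;    /* degrees, parsed from the digits before 'a' */
} HrirKey;

/* Parses names such as "H-10e005a.wav".
 * Returns 1 on success and 0 if the name does not match the assumed pattern. */
int parse_hrir_filename(const char *name, HrirKey *out)
{
    char side = 0;
    int elevation = 0, azimuth = 0, consumed = 0;

    /* %n is only stored if the literal "a.wav" suffix matched as well. */
    if (sscanf(name, "%c%de%da.wav%n", &side, &elevation, &azimuth, &consumed) != 3)
        return 0;
    if (consumed == 0 || name[consumed] != '\0')
        return 0;
    if (side != 'H' && side != 'L' && side != 'R')
        return 0;

    out->side = side;
    out->elevation = elevation;
    out->azimuth = azimuth;
    return 1;
}
```

Calling `parse_hrir_filename("H0e090a.wav", &key)` would fill `key` with elevation 0 and azimuth 90, and a loader could then use those two numbers as the index under which the impulse response is stored.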

The function below parses the filenames and reads all the impulse responses in. I offer two versions here:

  • In C:


  • In Swift:


Once we run this function, all the files are loaded into memory like this:


So now you can pick any one of them as the HRIR to convolve with your dry signal, just as we did in the previous topic. I think the most fascinating question here is how to define a method that picks the right HRIR.
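One possible selection method, sketched below under my own assumptions, is a nearest-neighbour lookup: first prefer the closest elevation layer, then the closest azimuth on that layer, wrapping around at 360 degrees. The `HrirDirection` struct and `nearest_hrir_index` are hypothetical names for this example, and the table is assumed to hold the angles parsed from the filenames.

```c
#include <stdlib.h>

/* Hypothetical table entry: the direction of one loaded HRIR, in degrees. */
typedef struct {
    int elevation;
    int azimuth;
} HrirDirection;

/* Shortest distance between two azimuths on a circle, in degrees (0..180). */
static int wrapped_azimuth_distance(int a, int b)
{
    int d = abs(a - b) % 360;
    return d > 180 ? 360 - d : d;
}

/* Returns the index of the entry closest to the requested direction,
 * comparing elevation first and azimuth second, or -1 for an empty table. */
int nearest_hrir_index(const HrirDirection *table, int count,
                       int elevation, int azimuth)
{
    int best = -1, best_elev = 0, best_azim = 0;

    for (int i = 0; i < count; ++i) {
        int de = abs(table[i].elevation - elevation);
        int da = wrapped_azimuth_distance(table[i].azimuth, azimuth);
        if (best < 0 || de < best_elev || (de == best_elev && da < best_azim)) {
            best = i;
            best_elev = de;
            best_azim = da;
        }
    }
    return best;
}
```

Comparing elevation before azimuth suits a database whose layers hold different numbers of positions, because near the top layer the azimuth carries very little information. A true angular distance on the sphere would be more accurate but needs trigonometry for every lookup.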

Imagine a sound object close to you in a virtual 3D environment: the distance and direction between you and the object change whenever your head or body moves. If we interpolate between the measured positions, so that this finite set of loudspeakers approximates an infinite one, and also select the HRIR for convolution in real time based on the current positions of the listener and the object, does that mean we can create a sound object that moves?
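To give a feeling for what such interpolation could look like, here is a minimal sketch that blends two neighbouring HRIRs sample by sample. It assumes both responses have the same length and uses plain time-domain linear interpolation, which is only a rough approximation and not necessarily what a full implementation would use. `interpolate_hrir` is a hypothetical name for this example.

```c
#include <stddef.h>

/* Blends two equally long HRIRs into out.
 * t = 0 gives a, t = 1 gives b; values outside 0..1 are clamped. */
void interpolate_hrir(const float *a, const float *b, size_t length,
                      float t, float *out)
{
    if (t < 0.0f) t = 0.0f;
    if (t > 1.0f) t = 1.0f;

    for (size_t i = 0; i < length; ++i)
        out[i] = (1.0f - t) * a[i] + t * b[i];
}
```

For a source at azimuth 7 degrees lying between measurements at 5 and 10 degrees, `t` would be (7 - 5) / (10 - 5) = 0.4.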

The answer is yes, but it is very complex to implement.

An important reason why I use Swift here is that the Core Motion framework offered by Apple is helpful for simulating head tracking.

● Core Motion
Core Motion is a framework offered by Apple for processing accelerometer, gyroscope, pedometer, and environment-related events. In this case, I only use:
1. Heading
As the most important parameter, heading is an instance property that returns a Double giving the heading angle, in the range 0.0 to 360.0 degrees, relative to the current reference frame. The user therefore has to define a reference frame for it to work. In this project, the program uses the type property xMagneticNorthZVertical as the reference frame for the motion updates. As a result, the heading reads 0.0/360.0 degrees when the user points the back of the phone toward north.
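The heading can then be turned into the azimuth used for the HRIR lookup. The sketch below is my own illustration and assumes that the sound source sits at a fixed compass bearing, and that both the heading and the HRIR azimuth increase clockwise, with 0 straight ahead and 90 to the right. `relative_azimuth` is a hypothetical helper name.

```c
/* Azimuth of a source relative to the listener's nose, in degrees [0, 360).
 * source_bearing_deg: fixed compass bearing of the source (assumption).
 * heading_deg: current heading reported by the motion sensors. */
double relative_azimuth(double source_bearing_deg, double heading_deg)
{
    double r = source_bearing_deg - heading_deg;

    /* Wrap into [0, 360); the loops run at most a few times for sane input. */
    while (r < 0.0)
        r += 360.0;
    while (r >= 360.0)
        r -= 360.0;
    return r;
}
```

For example, with the source at bearing 10 and the listener facing 350, the source appears 20 degrees to the right, so the HRIR measured nearest to azimuth 20 would be the one to pick.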


2. Gravity
The gravity property is useful for telling whether the user is looking up or down. It returns three floating-point values, the x, y, and z components of a CMAcceleration (referred to here as the roll, yaw, and pitch axes), which describe how the phone is currently oriented relative to gravity. We usually place the phone horizontally in the VR headset with its screen facing the user’s eyes, as the picture above shows. On the roll axis, the value reaches its positive maximum when the user looks down at their feet and its negative minimum when the user looks straight up. Similarly, the yaw axis indicates how the user tilts their head from shoulder to shoulder.


3. Rotation Rate
This property returns how fast the device is rotating around each of the three axes over a short period, and it is used for emulation. For example, we can adjust the filter coefficients according to how fast the user moves their head. Also, if we want the sound object to be heard accurately at different distances, knowing the facing angle relative to the reference frame is not enough. We still need to calculate speed and range to build a virtual sound field in which the impression of distance feels authentic.

● AVFoundation
AVFoundation is a foundational framework that lets developers easily play, create, and edit audio and video streams. In this project, we reinforce the spatialized audio experience by using four classes:

1. AVAudioPlayerNode
This class schedules playback of the audio files that the user bundles into the project. In this project, besides scheduling playback, it is also where we define the output format of the node, which should match the format of the input file.
2. AVAudioMixerNode
The primary function of AVAudioMixerNode is to mix multiple inputs into a single output. It is usually used to unify the sample rate and to up- or down-mix to a specific channel count. Since it conforms to the AVAudioMixing protocol, we can use a particular structure to define the mixer node as a point in 3D space. For example, the mixer node can be treated as the point source that our audio comes from; by tracking our head, we can use the data from Core Motion to move it along the coordinate axes of a 3D virtual world, just like moving a mono audio source around the listener in Unity.


3. AVAudioEnvironmentNode
This is a relatively new and essential class in this project. It has multiple properties for simulating a 3D audio environment, and it also conforms to the AVAudioMixing protocol.
1) renderingAlgorithm


As its name suggests, this enumeration defines how the application renders the audio for each input bus. Currently, we use .HRTF, a CPU-intensive algorithm that filters the audio to emulate the real auditory experience.

From what I tested, however, this built-in rendering algorithm is not very good: the sound is very ‘in-head’. So personally I would suggest using the HRTF rendering function we built earlier instead. Hopefully Apple will improve this part in the future.

2) listenerPosition


Just as we defined the position of the mixer node (the audio source) earlier, we can use AVAudio3DPoint to set the listener’s position, which is usually (0,0,0), the origin of the 3D virtual world.

3) listenerVectorOrientation/listenerAngularOrientation

These two properties describe the orientation of our virtual listener using different data. listenerVectorOrientation expresses it with three-axis vectors: the forward vector gives the direction the virtual listener faces and is written like (0,0,-1), the default orientation, in which the listener looks directly along the negative z-axis. listenerAngularOrientation expresses the same orientation as yaw, pitch, and roll angles in degrees.


(A helpful slide found online that describes the relationship between the listener and the sound source in this case)

4. AVAudioEngine
This is not a node itself but the object that manages the graph of audio nodes and performs audio input and output; once all of the setup above is done, we only have to attach those nodes to the AVAudioEngine, connect them, and start it.


In the end, you can put your phone in the VR headset and run the application you built. As you move your head, the sound object adjusts its position relative to your motion and is heard binaurally through the rendering algorithm we built, updated for the current frame.



