'Next Gen' Audio

Submitted by Anonymous (not verified) on
Forum

Just wanted to start this thread up here.

Eyepopping Graphics + Earbleeding Audio Design + Pantwetting Gameplay = Good.

What kind of advanced audio experiences can be had - or are on the horizon for games?

Audio isn't my area of expertise, so anyone out there who can shed some light on this area - go for it! I'm really interested in ideas and effects for a more immersive sound experience to accompany some bleeding-edge-looking environments and characters.

Submitted by lorien on Sat, 27/08/05 - 9:31 PM Permalink

A dangerous topic with me on sumea, HazarD [:)]. I want 3 things from game audio:

sub-millisecond latency (like pro audio software on Linux and MacOS has had for quite a while). This is simply impossible on Microsoft OSs- they aren't RTOS (real time operating systems), nor are Linux or MacOS for that matter, but they are close enough.

real-time low-latency scriptable sound synthesis i.e. making sound on the fly rather than playing back recordings. This will be a similar difference to what happened to graphics when they went from sprite-based to full 3d.

real geometry support: IMHO EAX is based on a completely broken series of dodgy hacks piled on top of each other.

My MSc research is largely about the first 2.

Submitted by Kalescent on Wed, 31/08/05 - 2:38 AM Permalink

Can you elaborate on how this makes my sound experience better in a game environment, Lorien?

Elaborate as much as you'd like - fill pages if you don't mind, I'll read it [:)]

I'm quite interested in developing the sound/graphics relationship.

Submitted by lorien on Wed, 31/08/05 - 9:24 AM Permalink

I could fill pages and pages you know, but I won't [:)] I'll take each point separately, with pauses between each because I'm rather busy this week...

The overall goal of all this work is to bring to game audio some of the characteristics of commercial and research music and audio software - my real passion is intelligent interactive music systems (software you can jam with in realtime) and computer-enhanced musical instruments.

1)Latency
Latency is the lag between the time you expect something to happen and the time it actually happens. In game audio it's often between 50 and 100 milliseconds, which is quite noticeable. Some of the reasons for the lag lie deep in the Windows kernels, and I think the audio section here isn't the right place to go into detailed explanations of why Windows is so bad at it.

This lag is one of the things that makes game audio seem fake: in physical reality we are accustomed to a near-instantaneous response between performing an action that creates sound and the sound reaching our ears (actual latency is affected by distance, altitude, humidity, and other factors). It is also bad because it makes sophisticated interactive music systems very difficult to pull off (i.e. it's very hard to closely tie music and gameplay together effectively), and so it makes games less fun for musicians to work with than they could be.

One of the clearest demos of Windows latency can be done in any realtime audio package (Adobe Audition, SAW, Logic, Cubase VST, etc.): put an effect on a track and change some effect parameters (ideally with a slider) while it is playing - better still, try doing it while copying a really big file and playing a DVD [:)]. You will notice a sizeable lag between moving the slider and hearing the result, and you may get audio breakup. Do the same thing on MacOS X or a specially prepared Linux and the change is instant: timing accurate to 2/1000ths of a second is quite possible, 1 millisecond is very difficult but almost possible, and as machines get more CPUs it is likely to end up quite possible. This timing would make games far easier for audio people to work with, and is likely to get more of the interactive music systems and audio synthesis geeks interested in doing game work.

The attitude of game developers seems to be that audio is good enough running at 30-60 FPS. I'm talking about audio running at around 1000 FPS, and those are not normal game FPS: game FPS varies up and down, but audio needs each frame to take a constant amount of time, so 1000 FPS means each frame has to have finished before its millisecond is up (because then the buffers get swapped and the frame gets sent to the soundcard, which makes a hideous noise if you overrun). Of the next-gen consoles this should be possible on the PS3 at least (I hope) [:)]
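To put rough numbers on it (back-of-the-envelope arithmetic only, with made-up buffer sizes rather than anyone's real engine settings):

[code]
# Back-of-the-envelope buffer/latency arithmetic (illustrative numbers only)
SAMPLE_RATE = 48_000  # samples per second

def output_latency_ms(samples_per_buffer, num_buffers=2):
    """Worst-case output latency for a double-buffered, block-based mixer."""
    return num_buffers * samples_per_buffer * 1000.0 / SAMPLE_RATE

for samples in (2048, 1024, 256, 48):
    print(f"{samples:5d} samples/buffer -> {output_latency_ms(samples):6.2f} ms")

# 48 samples/buffer is the ~1 ms "audio frame" discussed above: every frame of
# mixing and synthesis has to finish inside that window, every single time.
[/code]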

When you design software for low-latency, RTOS-like timing like this it doesn't mean it won't run at higher latencies, so it can be cross-platform. But software that isn't designed for low latency generally needs a complete re-design and re-write to make it work in a low-latency environment.

To summarise: low-latency audio will make games more immersive. It will make a game respond much more like a musical instrument. This latency requires very tightly disciplined software development, as well as having an effect on system performance, but that effect is minimised by multiple and multicore CPUs.

Submitted by Kalescent on Wed, 31/08/05 - 10:32 AM Permalink

Nice! [:)]

Okay, with latency in mind... what kind of application can this be put to good and obvious use in within a typical game environment? Am I right in saying that what you're talking about would allow accurate reproduction of the sound of every bullet being fired from an M-16, rather than just a prerecorded loop that cuts off when I stop firing?

And if and when it is slowed down, everything about the gun and the sounds it emits could be accurate to within 1-2 thousandths of a second?

Being a fan of playing games with a massive home theatre - earth-shaking bass, bullet ricochets whistling past my ears etc - I'm not sure how much difference that pinpoint accuracy is going to make to the whole experience.

Hearing would be believing, that's for sure - but I do know that if and when it happens, the implications it has for animating the M-16, for example, and having particle effects emitted all in harmony with the sound, accurate to within that 1000 FPS mark... well, golly [:0].

*boggle*

Excellent - hit me with more. [:)]

Submitted by lorien on Wed, 31/08/05 - 10:40 AM Permalink

You got it exactly [:)] The particle effect you're talking about is called Granular Synthesis http://en.wikipedia.org/wiki/Granular_synthesis (well worth a read, and you'll get some ideas of what I mean by realtime scriptable audio synthesis).
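If you want a feel for the idea before reading the wiki page, here is a bare-bones sketch of a granular "cloud" (Python/numpy, arbitrary grain sizes and counts, nothing like a realtime implementation):

[code]
import numpy as np

def grain_cloud(source, sr, num_grains=200, grain_ms=40.0, out_seconds=2.0, seed=0):
    """Scatter short Hann-windowed grains of `source` across an output buffer."""
    rng = np.random.default_rng(seed)
    grain_len = int(sr * grain_ms / 1000.0)
    window = np.hanning(grain_len)
    out = np.zeros(int(sr * out_seconds))
    for _ in range(num_grains):
        read = rng.integers(0, len(source) - grain_len)   # where to take a grain from
        write = rng.integers(0, len(out) - grain_len)     # where to drop it in time
        out[write:write + grain_len] += source[read:read + grain_len] * window
    peak = np.max(np.abs(out))
    return out / peak if peak > 0 else out

# e.g. turn a one-second noise burst into a two-second granular texture
sr = 44100
texture = grain_cloud(np.random.default_rng(1).standard_normal(sr), sr)
[/code]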

This is quite doable on Linux, MacOS X, BeOS, QNX, and IRIX right now, and has been for years [:D] (at a CPU cost that's prevented it for games though). It's also been 2-4 milliseconds per frame on Linux until recently.

You are likely to find this interesting too http://en.wikipedia.org/wiki/Real-time

Submitted by lorien on Thu, 01/09/05 - 9:51 AM Permalink

Apologies for quoting myself- it feels rather odd... [:)] I'm just low on time.

2)Realtime scriptable audio synthesis

The following is from the proposal for my masters degree candidature:

quote:

One of the key differences between actual reality and the virtual realities of computer games is the comparatively static nature of the virtual realities. In graphics this can be exemplified by the BSP (binary space partition) tree, which needs re-compiling when any geometry changes. In audio this static nature comes about from the use of a small number of samples to represent a large number of events. This leads to the environments simulated in games seeming fake, and hence makes them less immersive.

The best solution to the audio side of this problem involves replacing direct sample playback with real-time synthesis, generating sounds unique to each event. Real-time synthesis takes considerably more CPU time than sample playback, but the results are well worth it; for example, instead of hearing exactly the same sound every time a sword hits a shield, the sound gets created from, or affected by, parameters such as the materials and size of the sword and shield, the position in each at which the collision occurs, the amount of force involved in the collision, etc. Real-time synthesis can bring some of the infinite variation of sound that occurs in reality into virtual reality.

Common techniques for audio synthesis include:
Additive: summing sine waves with potentially time variant frequency and amplitude to obtain a desired spectrum.
Subtractive: creating a sound by carving away undesired parts of a spectrally rich source with filters.
Distortion: frequency modulation, amplitude modulation, wave-shaping and discrete-summation all have the characteristic of producing spectra dependent on the amplitude of the input signal.
Wave-table: the combining of many real-time manipulated pre-recorded sounds into an instrument. This has been the dominant form of synthesis used by musical instrument manufacturers over the last 10 years.
Granular: mixing together a large number of grains or quanta of sound. Resembles a particle system.
Physical modelling: modelling the characteristics of a sound-producing object through equations.
Spectral manipulation: analysing a sound into a frequency/amplitude graph of partials using a transform such as the Fourier, performing manipulations on the analysis, and re-synthesising (the inverse of the transform).

Current games for the most part do not use real-time audio synthesis. If synthesis is used it is done offline, and the results stored in samples, which are simply played back in response to events.

The most spectacular (and most CPU-intensive) of the synthesis techniques are:

Physical modelling - with the sword and shield example above, this means representing the acoustic properties of the sword and the shield, and the precise effects of the collision, mathematically in code that generates a soundwave which is sent to the soundcard as it's being calculated [:)]
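The simplest physical model you can actually hear working is the classic Karplus-Strong plucked string - nowhere near a full sword-and-shield collision model, but it shows the principle: the sound comes out of an equation describing a vibrating object, and changing the model parameters changes the "material". A tiny sketch (Python/numpy, toy parameters):

[code]
import numpy as np

def karplus_strong(freq, seconds, sr=44100, damping=0.996):
    """Tiny physical model: a noise burst exciting a damped, lowpassed delay line."""
    delay = int(sr / freq)                      # delay length sets the pitch
    line = np.random.uniform(-1, 1, delay)      # excitation: a burst of noise
    out = np.empty(int(sr * seconds))
    for i in range(len(out)):
        out[i] = line[i % delay]
        # averaging neighbouring samples is a crude lowpass; damping decays the energy
        line[i % delay] = damping * 0.5 * (line[i % delay] + line[(i + 1) % delay])
    return out

bright_string = karplus_strong(220.0, 2.0, damping=0.999)   # rings longer, brighter
dull_string = karplus_strong(220.0, 2.0, damping=0.990)     # darker, decays quickly
[/code]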

Spectral manipulation lets you do beautiful and crazy things like truly morphing between sounds (spectral morphing); taking the parts of a voice that make it recognisable as speech and applying them to another sound like fire - the result is quite literally talking fire (cross-synthesis); changing the length of a sound without changing its pitch; changing the pitch of a sound without changing its length; and many others. It's too complicated for me to explain how all this magic works here atm.
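The "talking fire" trick is the easiest of these to sketch: analyse both sounds frame by frame with an FFT, keep the carrier's phases but impose the modulator's magnitudes, and resynthesise with overlap-add. A bare-bones version (Python/numpy, no refinements at all - a real implementation would use a proper phase vocoder):

[code]
import numpy as np

def cross_synthesis(modulator, carrier, frame=1024, hop=256):
    """Impose the frame-by-frame magnitude spectrum of `modulator` (e.g. speech)
    onto `carrier` (e.g. crackling fire). Bare-bones STFT overlap-add; both
    inputs should be at least `frame` samples long."""
    n = min(len(modulator), len(carrier)) - frame
    window = np.hanning(frame)
    out = np.zeros(n + frame)
    for pos in range(0, n, hop):
        m = np.fft.rfft(modulator[pos:pos + frame] * window)
        c = np.fft.rfft(carrier[pos:pos + frame] * window)
        hybrid = np.abs(m) * np.exp(1j * np.angle(c))   # modulator magnitudes, carrier phases
        out[pos:pos + frame] += np.fft.irfft(hybrid) * window
    peak = np.max(np.abs(out))
    return out / peak if peak > 0 else out
[/code]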

If you are interested in audio synthesis, you need to check out the software synthesis and linux distro sections on http://linux-sound.org .

The following is a quote from a research paper by my supervisor (Dr John Rankin) and myself, which is published in the proceedings of the ADCOG 2004 conference, City University, Hong Kong, where I presented it. ADCOG is an annual games research conference http://www.adcog.org . I've included the references for the quoted section. Edit: just noticed the adcog site is down. Here's the google cache
[code]
Paste this into your browser, sumea refuses to link this url properly
www.google.com/search?q=cache:3bA3Sf1j8qwJ:www.adcog.org/+adcog&hl=en&i…

[/code]
quote:

Scriptable Audio Synthesis for Computer Games

Audio in computer games lags far behind graphics in every aspect. To someone from the computer music field this seems very odd, as computers have been used to synthesise sound since the days of punched cards, and leading companies such as Hewlett-Packard, Bell Telephone Laboratories and IBM have conducted years of research on computer synthesis of music, and indeed have had composer-programmers on staff for the purpose of research into digital audio. Thus there is a rich literature of computer audio research, much of which relates to real-time interactive systems (which of course is what games are), just waiting to be made use of. To date the results of this extensive research are not being used to best advantage in the games industry.

Occupying a key place in this literature are the MusicN type audio synthesis languages [1]. The first MusicN language was Music1, developed by Max Matthews at Bell in the late 1950s. Matthews made several more revisions (Music2, 3 and 4) before giving the source code to two American universities - Princeton and Stanford - where audio synthesis research proliferated (as it does to this day). Each customised the software to its needs, resulting in Music6 and Music10 at Stanford, and Music4B and Music4BF at Princeton. Matthews also continued his work at Bell, producing Music5. The Princeton branch is the distant ancestor of the most widely known modern implementations: CSound [2] from the MIT Media Lab and SAOL [3], which forms part of the MPEG-4 standard, both of which were designed and originally implemented by the composer-programmer Barry Vercoe. MusicN languages are for the most part specialised scripting languages with high-performance compiled Unit Generators, the arrangements of which are determined by scripts. These scripts are often called Orchestras and/or Scores, an orchestra containing Instruments (sound generators) made up from unit generators connected via Cables, and a score containing instructions for the playing of each instrument [4].

In this paper we are primarily concerned with the computer synthesis of sound effects rather than music within games.

In physical reality sound is produced by the vibration of objects. There is an inseparable coupling between an object and the sonic signatures it produces when induced to vibrate. In current virtual reality there is no such coupling, and the sounds a virtual object appears to produce have no real relation at all to the object. A pre-recorded sound is merely triggered by an interaction. We propose that this is a key factor in game-audio seeming fake.

In most MusicN languages the orchestra must be given the instructions of "when to play" and "how to sound" from the score. It is the latter that is missing in game-audio: the "orchestra" is stored samples, and the "score" is the stream of game-play events and parameters. Games make great use of the events for triggering audio, and minimal use of the parameters.

Current generation computer games for the most part use fairly direct sample (pre-recorded sound) playback in response to events, the most basic being a collision. It is fairly direct because the actual sound played for an event may be selected by means of a probability distribution, and the selected sound may have comparatively simple transformations applied. If synthesis and/or complex transformations are used at all it is during the creation of the samples, where it is frozen in place.

The score for a piece of music created with a MusicN language is also made up of events, which may be dynamically generated through algorithmic procedures and/or interaction with a performer via a transducer (an "interactive music system"- in many ways very similar to a computer game).

Thus one way to improve audio in games is to synthesise sound in real-time in response to game-play events, and such a synthesis system must be fully scriptable so that someone trained in music programming (as is commonly taught in university music departments) may take full advantage of the system. This is part of the rationale for SAOL. However it is common for games to embed a scripting language for controlling entity behaviour, and this behaviour invariably involves the production of sound. Using SAOL with its orchestra and score paradigm produces a weaker coupling between object and sound than is desirable.

References

1. Dodge, C.; Jerse, T. A.: Computer Music. pp. 12-13. Schirmer Books, New York, 1985.
2. Vercoe, B.: The Public CSound Reference Manual. Available online at http://www.lakewoodsound.com/csound/hypertext/manual.htm
3. Scheirer, E. D.; Vercoe, B.: SAOL: The MPEG-4 Structured Audio Language. Computer Music Journal 23:2, pp. 31-51. MIT Press, Cambridge, MA, 1999.
4. Dodge, C.; Jerse, T. A.: Computer Music. pp. 12-193. Schirmer Books, New York, 1985.

The main change in my work since then is the discovery that scripting languages can be made that run multithreaded and at low latency, which has a bunch of implications for much better sound still.

edit: here is an updated URL for the CSound manual http://www.csounds.com/manual/

Submitted by lorien on Tue, 06/09/05 - 8:31 AM Permalink

This one is going to take me awhile I'm afraid- it's full of the physics and psychophysics of sound. It takes time for me to figure out how to explain it concisely without using loads of audio technobabble...

3)Real geometry support

This is where the real processing power is needed [:)] To understand why geometry is important for audio you have to understand how current 3d game audio works: it's a crude physical model. Your brain uses "cues" (really these are hints) from the way sounds arrive slightly differently at each eardrum; for example, a close directional sound pointing straight at your right ear is going to be muffled by your head by the time it reaches your left eardrum, and the speed of sound dictates that there is a slight time delay before it gets there too. Expand into 3 dimensions and add some other cues as well and you have a basic understanding of how we perceive the position of a sound.

The way game audio models these cues is with the Fast Fourier Transform (FFT). 3d sound in games is a kind of spectral manipulation synthesis: in an anechoic chamber (a room with walls that absorb all sound without a single echo) they take a physical dummy head, put small microphones in its ears, and start recording and cataloguing "impulses" (short, percussive-type sounds with a large frequency range) played in positions all around it. By figuring out the differences between these recordings and the original impulse you get much of the information you need to position any sound anywhere using software. The FFT and inverse FFT are the magic that does this, and it is hardware accelerated in your soundcard. This specific application is called the Head Related Transfer Function (HRTF).
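Stripped right down to just two of those cues (the interaural time delay and the head shadow), the idea looks something like this toy sketch - nothing like a real HRTF, which convolves the sound with the measured impulse responses instead:

[code]
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
HEAD_RADIUS = 0.0875     # m, a rough average head

def crude_binaural(mono, azimuth_deg, sr=44100):
    """Toy positional cues only: interaural time delay, level difference and a
    mild 'head shadow' lowpass on the far ear. Positive azimuth = source to the right."""
    az = np.radians(azimuth_deg)
    itd = HEAD_RADIUS / SPEED_OF_SOUND * (abs(az) + abs(np.sin(az)))  # Woodworth-style delay
    delay = int(itd * sr)
    near = mono
    far = np.concatenate([np.zeros(delay), mono])[:len(mono)]         # far ear hears it later
    shadowed = np.empty_like(far)
    state, alpha = 0.0, 1.0 - 0.7 * abs(np.sin(az))                   # more shadow off-axis
    for i, x in enumerate(far):                                       # one-pole lowpass
        state += alpha * (x - state)
        shadowed[i] = state
    far = shadowed * (1.0 - 0.4 * abs(np.sin(az)))                    # interaural level difference
    left, right = (far, near) if azimuth_deg > 0 else (near, far)
    return np.stack([left, right], axis=1)
[/code]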

That all seems nice and sensible, but here are some fundamental flaws in this model:

*It only works properly with headphones. With speaker and room acoustics added in 3d sound gets decidedly dodgy.
*Everyone has a different head, and these cues are so tiny that the differences between heads can lead to large differences in audio- you are expecting different cues (from your experience in the real world) from the ones the head modeled by the software provides.
*It is impossible to ever record enough impulses, so an averaging (interpolation) process is used for positions without an impulse, leading to inaccuracies.
*It takes no account of echoes and the sense of acoustic space (the feel you have for the environment you're in from the way it reflects sounds back at you).
*The number of simultaneous audio channels is limited largely by the number of FFTs and IFFTs that can be performed in real-time.

EAX (Creative's Environmental Audio Extensions) is an attempt to solve the problem of acoustic space that has been bolted on top of this model, and it will be my next post.

Submitted by Kalescent on Tue, 06/09/05 - 11:53 PM Permalink

This is all great stuff Lorien - thanks for taking the time to write it down. Spectral manipulation sounds like a fun thing to play with... And the sword and shield example above really is the guts of the stuff I'm thinking of - and was the reason for bringing this topic up - good to know it's all coming along.

Real geometry support - aside from the acoustic space and echoes flaw and the extra channels, what will an overall tightening up of the other flawed areas do for my sound experience while playing a game? How would they affect my ears and what I hear?

*getting into knowledge sponge mode* [:)]

Submitted by souri on Wed, 07/09/05 - 12:20 AM Permalink

There was a website I visited a while ago where you could type in a phrase and a news presenter (it looked like a Flash animation) would speak it out. It did an unbelievable job too, far better than the standard speech ability in Macs or the extension from Microsoft. If anyone knows which site I mean, please post it here [:)]

Anyway, my contribution to this thread is that I'm surprised that believable speech synthesis for games still isn't common at all. We still have NPCs spitting out the same repetitive sentences when you try to interact with them. Can you imagine the improved gameplaying experience (and disc space savings) in RPGs and games like Grand Theft Auto where NPCs' responses are only limited by the number of sentences written out for them by the developer? Couple that with the AI/interactivity experience of [url="http://www.interactivestory.net/"]Facade[/url], and we could be onto some really fantastic immersive games. [:)]

Submitted by lorien on Wed, 07/09/05 - 6:18 AM Permalink

HazarD: no problem, imho this stuff needs to be much more widely understood (as does sound in general) [:)]
Spectral manipulation is my favourite form of sound bending. I feel physical modeling is the ultimate goal of game audio. All the forms of synthesis can be combined into hybrids, and they all work perfectly well alongside sample playback too.

Real geometry support leads to much more than echoes and acoustic space- it has the potential to entirely eliminate the HRTF (the FFT is lossy). That's why I'm going to explain EAX first.

Souri: Speech synthesis is so insanely complicated I wasn't going to bring it up here. Think along the lines of trying to mimic the waveforms produced by the vocal cords, nasal cavity, mouth, tongue and lips. Speech is one of the most complex sounds, and it's one where we are all very good at telling the real from the fake. It would be so cool to be able to do it well in games...

Submitted by Shplorb on Thu, 08/09/05 - 9:51 AM Permalink

Wow, you guys are wanting way too much from game audio. Although it sounds interesting, I seriously doubt that procedural synthesis of sound effects will be viable. It just needs too much physics and will most probably be too hard for sound designers to control to ensure that they always get the sound that they want.

Take a look at FMOD Ex to see where I think next-gen sound is going. In case you don't want to look at it, basically next-gen game sound is going to be about DSP, mix routing and real-time data-driven design.

Granular synthesis is, I guess, something that groovyone has discussed with me recently, except we've labelled it "composite sounds". He'd like to use it for bullet ricochets to reduce SRAM usage whilst maintaining or improving sound variation, but I don't think it would really be viable on the PS2 without hacking FMOD or replacing it with my own low-level sound engine... something I don't have the time for and something I don't want to do when there's much more interesting work to be done at the higher level.

As for latency, of course it will be reduced on next-gen hardware, purely because of the multi-core nature and increased speed of the systems. I think the PS2 has a 12ms mixbuffer, but don't quote me on it. Since the PS2's sound chip (SPU2) is essentially two PlayStation sound chips (SPU), it is rather limited in what it can do compared to the Xbox with its DSP and AC3 encoder.

Improved reverb (can you say "DSP"?) and propagation (which can currently be simulated with low-pass filtering) are two things that I think will add to the realism of next-gen sound, but at the same time I also wonder just how much realism you want. Movies aren't real; they exaggerate sounds to play on your emotions. That's why I think that the real direction of next-gen sound is going to be in giving more control to sound designers, or at least that's what groovyone and his cohort have led me to believe with their constant feature requests. =]

In the current game that groovyone and I are working on, we're implementing something we're calling 'mixgroups', which allow us to have different mixes for the sounds in the game to suit different scenarios like driving versus cutscenes versus a gunbattle. That sits on top of an existing system that allows designers to group contextually relevant sounds and choose what sort of sounds they are and their playback parameters in real time whilst the game is running. We can already produce varying sounds for things like footsteps and collisions by using different sample variations of the sound and playing them back with varying volume, pitch and bandpass filtering, and by making use of crossfaders that are fed values from the physics system.
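In sketch form the crossfader side of that looks roughly like the following - an invented illustration rather than our actual code, with made-up sample names and parameter ranges:

[code]
import random

def collision_voice(impact_speed, soft_samples, hard_samples, max_speed=10.0, rng=random):
    """Physics-driven playback parameters: crossfade a 'soft' and a 'hard' recording
    and randomise pitch/filtering based on how hard the objects hit."""
    t = max(0.0, min(1.0, impact_speed / max_speed))    # normalised crossfader position
    return {
        "soft": {"sample": rng.choice(soft_samples), "gain": 1.0 - t},
        "hard": {"sample": rng.choice(hard_samples), "gain": t},
        "pitch": 1.0 + rng.uniform(-0.05, 0.05),         # small random detune per hit
        "lowpass_hz": 2000.0 + 18000.0 * t,              # harder hit = brighter sound
    }

# a glancing tap vs. a hard clang, drawn from the same pool of recordings
print(collision_voice(1.5, ["tap_a.wav", "tap_b.wav"], ["clang_a.wav", "clang_b.wav"]))
print(collision_voice(9.0, ["tap_a.wav", "tap_b.wav"], ["clang_a.wav", "clang_b.wav"]))
[/code]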

Besides the increased processing power that will allow for more channels, DSP effects and flexibility in mixing (because it's all going to be done in software), the really big bonus of next-gen platforms is the drastically increased amount of RAM we'll have available for samples (I would love to have 8MB or more!), which can then be longer and have higher sampling rates, and the increased storage I/O for streaming.

Anyway, that's my $0.02 rant for now.

Submitted by groovyone on Thu, 08/09/05 - 10:29 AM Permalink

No WAY!

Next Gen is all going SID FARMS with DSP FARMS to support them!!...
128 channels of SID Madness + 16 effects busses! We can digitize speech into 4bit pulse data to be played back on SID chips!

No CPU hit! True synthesis!

Long Live SID!!!

Bleep bleep bleep!

:)

Oh.. as for 5.1 DTS sound, we can just pump banks of SIDS out to different speakers :).

Submitted by Shplorb on Thu, 08/09/05 - 10:45 AM Permalink

quote:Originally posted by groovyone
Next Gen is all going SID FARMS with DSP FARMS to support them!!...
128 channels of SID Madness + 16 effects busses! We can digitize speech into 4bit pulse data to be played back on SID chips!

Oh.. as for 5.1 DTS sound, we can just pump banks of SIDS out to different speakers :).

My god man! Imagine how many filter caps you'd need! Maybe that explains why those PS3 alpha kits are so frickin' huge... they aren't using surface mount components yet. =]

Submitted by lorien on Thu, 08/09/05 - 10:30 PM Permalink

[:D] I've warned before that I'm an audio freak...

For those who think this is too much just wait till you hear the geometry stuff [:)]

Also Intel and AMD have been making noises about putting 32 CPU cores on a single chip in future. Running high end audio stuff on one or two of these cores really won't make much difference. And I suspect the Cell SPEs could do a hell of a job at audio synthesis.

As for it being too hard to control for sound designers, physical modelling isn't too hard for musicians... Commercial physical model synths have been around for 10 years or so, and now on MacOS we are seeing commercial low-latency physical modeling softsynths. Even guitarists can manage them [:D] I don't really understand this point, because synths have been used for music for years and years, and synthesis is what originally made game sounds.

SID did rock- imho it was the Right Way (tm) to do game audio: put a close to pro level (for the time) synthesiser in a game.

I know you were joking groovyone, but multicore CPUs will allow what you describe, all in realtime, while all the rest of a game is running in parallel, and the Cell is a "streaming media processor" i.e. the SPEs could be looked at as being independent floating point vector DSPs.

I know the sort of things that the new fmod can do (it includes putting software effects on sounds in realtime, geometry support in the software mixer, etc, etc). IMHO it's not nearly enough. You can read about some of the new features on the mainpage of http://fmod.org .

My take on scriptable audio synthesis is that it is entirely about giving more control to sound designers. Just a sound designer with some different skills than you commonly see atm. The "scriptable" stuff is to make it useable by audio people rather than dedicated programmers. Musicians have been using this type of tech for a long time.

I completely agree that exact realism isn't always desirable; synthesis allows you to do all sorts of crazy exaggeration, and it's just as easy to provide exaggerated parameters to a realistic physical model as to a crude one - just the results will be better.

I think aspects of what I'm talking about will make it into next-gen, and more aspects will make it into the generation after that.

I also think that if people start doing full-on audio stuff in software in games, the hardware designers are likely to notice (we already have people doing audio on GPUs), and actually make some sensible soundcards (though I'm sure Creative would have their normal try at buying out anyone with better tech and closing them down).

Thanks guys, it's good to have audio people other than me talking here [:)]

Souri: I think that software must be using a database (probably of the building blocks of words rather than words themselves), and combining spoken audio together in realtime. Not speech synthesis as such, and probably not something that you could get to sing a song. Very cool though.

Submitted by groovyone on Thu, 08/09/05 - 11:50 PM Permalink

Physical modelling and synthesis are all great, and I am sure there may be some games that go as far as to use this for certain spot "effects", but as a mainstream thing it's going to require oodles of time to properly program/script them to do anything useful. Not only that, but how many sound events occur in real life and need to be modelled in game? How much time will it take to model something to the point that it's almost recognisable? It only takes a sound designer a few hours to build a complex sound. To generate something by synthesis that sounds anything like it would take days if not a week, and they may still not even come close.

I agree that for instruments, synthesis is useful. Imagine being able to change their filters and timbres in real time to help change the mood of the music (a la SID).

All I say for now is give us:
1) Memory
2) A sound DSP CPU for software DSP
3) HardDrive for buffer/streaming

Submitted by lorien on Fri, 09/09/05 - 1:05 AM Permalink

I've never suggested that a sound designer's job should be to model everything sound by sound... I've been thinking more along the lines of libraries of tweakable scripts, which a sound designer can add to and modify if they choose to.

Most of the synth techniques are nothing like as processor-hungry as physical modelling.

We have different goals I think- yours is directly making games, mine atm is finding different ways to make games (a research degree), hence I'm aiming high and immediate commercial application is not a huge concern.

This engine/scripting language I'm making is a professional sound library that can be used in games (much of it will be open source btw). It's being designed to be good enough to use in creating pro realtime music software, and if people making commercial games don't want this tech yet that's fine by me of course [:)] I'll be using it in art and indie games. There is still probably 1 1/2 years of work before it's ready for release anyway - to get RTOS-like timing you have to start from a very low level indeed [:(] and I finish my MSc halfway through 2007.

I think what you are saying is you want things that will make the job you're doing now easier. I understand that completely, and I sympathise with anyone trying to get good sound out of the current consoles except the Xbox (which has simple audio scripting btw). I think FMod is fantastic for getting the day-to-day needs of commercial game dev looked after very quickly. But from my own experience with FMod 3.x I know that if you want to do something out of the ordinary, all the work Firelight put into making FMod so easy really gets in the way (i.e. you end up fighting it and hacking around some of its features rather than using it).

I'm looking at doing the job differently, and where I'm finding inspiration isn't games, it's pro and research audio and music software.

Submitted by Shplorb on Fri, 09/09/05 - 11:39 AM Permalink

quote:Originally posted by lorien
to get RTOS-like timing you have to start from a very low level indeed
I think you're overblowing the latency thing way too much. At 60Hz you have a 16.7ms mixbuffer. There's no point in reducing latency below the point at which you are triggering sounds, which is purely dictated by the interval at which the game state is updated. Games with parallelised rendering pipelines are already one or two frames behind the current game state, so really there's no issue with syncing sound to gfx.

quote:I sympathise for anyone trying to get good sound out of the current consoles except the xbox (which has simple audio scripting btw).
Simple audio scripting? I don't want to sound rude and arrogant, but do you even know what you're talking about? Are you talking about XACT?

quote:I think FMod is fantastic for getting the day to day needs of commercial game dev looked after very quickly. But from my own experience with FMod 3.x I know if you want to do something that is out of the ordinary, all the work the Firelight put into making FMod so easy really gets in the way (i.e. you end up fighting it and hacking around some of its features rather than using it).
Which is why I was talking about FMOD Ex. =] MacOS X's CoreAudio and Audio Units are pretty much the same thing.

Submitted by lorien on Fri, 09/09/05 - 9:27 PM Permalink

quote:Originally posted by Shplorb
I think you're overblowing the latency thing way too much. At 60Hz you have a 16.7ms mixbuffer. There's no point in reducing latency below the point at which you are triggering sounds, which is purely dictated by the interval at which the game state is updated. Games with parallelised rendering pipelines are already one or two frames behind the current game state, so really there's no issue with syncing sound to gfx.

Fine, no problem [:)] I just think otherwise... Latency is one of the cues used to determine the position of a sound, and I'm working in multiple threads anyway, so it's independent of game state, i.e. I have an audio rate (processing samples using SSE vector operations), a realtime control rate (single-float processing, 1 per frame), and a game control rate (completely asynchronous to audio), and latency on the realtime audio and control rates is a killer. The thing about 60 Hz on Windows is you can't do it. Sure you can get way over 60 FPS, but just try to get each frame to take a constant amount of time, so you can actually get that 16.7 msec latency. Also, while I agree ~17 milliseconds is acceptable for triggering samples, that's not what I'm trying to do.

quote:
Simple audio scripting? I don't want to sound rude and arrogant, but do you even know what you're talking about? Are you talking about XACT?

I'm talking about AudioVBScript, which is also part of the DirectX SDK and DirectMusic Producer. It's simple because it's a toy language and doesn't let you touch most of DirectSound/Music, but it seems to be designed to give more control to sound designers: the idea seems to be that programmers call script functions which are written by a sound designer. It doesn't do synthesis at all.

[snip my stuff about fmod]
quote:
Which is why I was talking about FMOD Ex. =] MacOS X's CoreAudio and Audio Units are pretty much the same thing.

Stick with FMod then! [:)] All I'm saying is it's not for me (nor is Core Audio for that matter). You won't be getting any marketing division trying to push synthesis on you (at least not from me): this software is going to be free, people will be able to use it if they want, and if they don't it's no skin off my nose.

But if you would like to experiment you will be able to (actually you can now, though with a system that is not designed for games at all- CSound has become a VST plugin, FMod Ex supports VST plugins, therefore you could probably use CSound in a game if for some strange reason you felt like it).

Submitted by Mick1460 on Sun, 18/09/05 - 3:51 AM Permalink

LOL

Ok, here is what I want with next-gen...

1 - More bloody disk space so I can stop listening to 22kHz sample rates.

That's pretty much it, sorry guys. There is no way that we will have realistic voice synth anytime soon (thank God, to be honest), nor real-time sound generation (i.e. when the player knocks a can off a wall, the engine GENERATES that sound rather than playing back a pre-recorded .wav file). To me, it's the same as pretty high-poly models: they are nice to look at but they add NOTHING to gameplay. I think that the only sound element, apart from surround sound, that will enhance gameplay is disk space.

Submitted by pb on Sun, 18/09/05 - 9:57 PM Permalink

quote:Originally posted by Shplorb
I think you're overblowing the latency thing way too much. At 60Hz you have a 16.7ms mixbuffer. There's no point in reducing latency below the point at which you are triggering sounds, which is purely dictated by the interval at which the game state is updated. Games with parallelised rendering pipelines are already one or two frames behind the current game state, so really there's no issue with syncing sound to gfx.

Yeah, I completely agree with this point, glad someone typed it out. The order in which parts of the game state get updated within a frame has nothing to do with real time. That's why both graphics and audio treat the frame as a quantum, firing off everything at the same time at the end of the frame.

Games with floating frame rates have all sorts of additional sync issues that dwarf a few extra milliseconds of latency. There's the fact that the physical time that a frame or two of lag produces changes with the frame rate. In addition, the update code can only scale the time for a state update by that of the previous frame, not the current one (since it doesn't know how long that will take to render - there's just an implicit guess that it will be as long as the last frame).

But floating frame rates blow - I really hope that the TRC/TCR requirements for next-gen mandate 60Hz (or at least fixed frame rates). It might screw some PC developers, but if they want to do console stuff they should do it right or stick to PC.

As for what HW features would make good audio, I reckon disk space, 6 channel output and a ludicrous amount of programmable DSP power with the bandwidth to keep it processing.

pb

Submitted by lorien on Mon, 19/09/05 - 5:09 AM Permalink

I've actually been talking about running this stuff in a separate thread all along (which means it runs at a frame rate completely independent of the game frame rate).

Mick, my favourite games are actually Elite and Nethack [:)] It's very rare for me to play a AAA title. I wholeheartedly agree that high-poly models and audio synthesis do nothing for (normal) gameplay. IMHO for games they are fluff. I've been interested in sound and audio synthesis for much longer than I have in games, and I plan to use this engine for plenty of things which aren't games too [:)]

Ludicrous amounts of disk space and DSP power work for me, pb [:)]

Submitted by pb on Wed, 21/09/05 - 5:01 AM Permalink

quote:Originally posted by lorien

I've actually been talking about running this stuff in a separate thread all along (which means it runs at a frame rate completely independent of the game frame rate).

But the events that trigger game sounds all quantize to the game frame rate (or at least they should since the physical order of update within a frame is meaningless).

The best you could do is put the game logic on a separate thread (pure server style) and run it as fast as you can, then put graphics and sound on their own threads so they can pick out discrete frames from the game logic. If you can update game logic 600 times per second, your sound thread could grab each update and the graphics thread might only get 1 out of 10 (or however many it can manage). But that's a lot of processing to burn to shorten the latency by an amount that will be different to the latency of the graphics pipeline.
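For the sake of argument, the shape of that split is something like this toy sketch (made-up rates, and it ignores all the real synchronisation headaches):

[code]
import queue
import threading
import time

state_for_audio = queue.Queue()   # audio wants every discrete game state

def game_logic(updates_per_sec=600, total_frames=300):
    """Fast 'pure server' game-state thread: publish every frame for consumers."""
    for frame in range(total_frames):
        state = {"frame": frame, "sound_events": []}    # placeholder game state
        state_for_audio.put(state)
        time.sleep(1.0 / updates_per_sec)
    state_for_audio.put(None)                           # tell the audio thread to stop

def audio_consumer():
    """Audio picks out every state; graphics would only grab the newest one."""
    while (state := state_for_audio.get()) is not None:
        pass   # trigger/update synthesis from state["sound_events"] here

threading.Thread(target=audio_consumer, daemon=True).start()
game_logic()
[/code]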

These days game logic consumes a larger portion of the frame than it used to. This is largely a result of DMA-driven graphics acceleration shortening the render side, whilst physics, AI, big complex worlds and lots of random access on architectures that have massive CPU:memory speed ratios result in a slower game update. That means it's not practical to have an ultra-fast game update thread. If it were, you would've seen more (any?) PS2 games doing all their game logic on the IOP and leaving the rest for rendering.

pb

Submitted by lorien on Wed, 21/09/05 - 9:36 AM Permalink

A big concern for me is a high-resolution "control rate" (k-rate in CSound terms). Yes, of course the game sounds getting triggered will quantise to the game frame rate, but things like fades, envelopes, and synth parameters should often (imho almost always) be independent of game FPS anyway. A bunch of functions get triggered by the game logic thread, and continue on their own in the audio thread.

With a control rate much below 1000 fps it's fairly easy to hear quantisation artifacts in audio parameters, hence the non-realtime softsynths don't have a control rate at all- it's the same as the sampling rate.
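You can hear what I mean with a very small experiment (numpy sketch, arbitrary numbers): apply the same fade with the gain updated 1000 times a second versus 30 times a second, and the second one has an audible stair-step "zipper" to it.

[code]
import numpy as np

def faded_tone(control_rate_hz, sr=44100, seconds=0.5, freq=440.0):
    """Linear fade-out with the gain only updated `control_rate_hz` times per second.
    Low control rates hold the gain in audible stair-steps ('zipper noise')."""
    n = int(sr * seconds)
    t = np.arange(n) / sr
    tone = np.sin(2 * np.pi * freq * t)
    block = sr // control_rate_hz                        # samples between gain updates
    gain = 1.0 - (np.arange(n) // block) * block / n     # gain held constant within each block
    return tone * gain

smooth = faded_tone(1000)   # ~1 kHz control rate: effectively smooth
steppy = faded_tone(30)     # game-frame-rate control: clearly audible steps
[/code]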

No argument about CPU/memory bandwidth issues; afaik it shouldn't be so much of a problem on the PS3. In the normal computer world things are going NUMA (non-uniform memory architecture) to solve the bandwidth problem - this is how Sun's Opteron-based servers work now. NUMA means each CPU gets its own memory, and whilst every CPU is able to use every other CPU's memory, it is much faster to use its own, so it heavily favours it. NUMA is part of the Linux kernel right now (it's one of the things that got SCO so upset).

Submitted by Brett on Wed, 21/09/05 - 1:06 PM Permalink

The FMOD Ex sound designer tool puts all of the control into the sound designer's hands. Complex audio models can be built in it, and the programmer interaction is minimal.

The low-level FMOD Ex engine is extremely advanced; the DSP engine can do virtually anything, including realtime sound synthesis (which there are demos for). If you only take a quick look at the API you might think it looks simple to start with (actually that's our aim), but dig deeper and there is a whole range of possibilities available.

We are only halfway through our feature list, and we've built it to be very flexible, so I really think it is well on its way to providing 'next gen' audio. I'm pretty excited by it :)

Submitted by Brett on Wed, 21/09/05 - 1:31 PM Permalink

quote:Originally posted by Shplorb

I think the PS2 has a 12ms mixbuffer, but don't quote me on it. Since the PS2's sound chip (SPU2) is essentially two PlayStation sound chips (SPU), it is rather limited in what it can do compared to the Xbox with its DSP and AC3 encoder.

The granularity on the PS2 is blocks of 256 samples @ 48kHz, so the minimum latency is 5.33ms. We usually mix on most platforms now using a 5ms buffer, so I don't think latency is an issue except on Windows and Linux, but we find 50ms is generally acceptable and stable enough these days. Lorien has mentioned 'latency' but I think he actually means 'granularity', which is a trivial matter to solve and not always necessarily a problem that needs to be solved (i.e. a car engine pitch-bending at 60Hz is perfectly smooth to the majority of ears).
pb is right about the audio syncing to the framerate, and especially with things like weapon sounds, even the graphics can take a few frames to actually kick in and display the firing graphics anyway.

As for the PS2, it is fairly low-spec now, but we're really pushing it to the limit. We're hoping to allow a benchmark of 32 voices with multiple DSP filters per voice and keep the EE CPU usage under 5%. This is thanks to VU0.

Submitted by lorien on Thu, 22/09/05 - 3:35 AM Permalink

Get that marketing out there Brett (just teasing) [:)] Good to see you on sumea btw.

I'm talking about both latency and granularity - they are different, but related. One kind of latency is the lag between input and output; this is what prevents people from using Windows as a guitar effects processor, for example: a big lag between playing a note and hearing the result. Granularity is the control rate stuff. Latency and granularity are related - certainly you can have 1 msec granularity and 1 second latency, but you can't really have 1 msec latency and 1 sec granularity...

I'm not bagging FMod at all - I truly think you guys have done great work - but I'm not going to use FMod in my work (no matter how much the API has improved) unless it goes dual-license (open source and commercial). A free binary for non-commercial use doesn't cut it with me, I'm afraid. I remember from a discussion we had on the FMod forums quite a while ago that you really don't want to do this.

Submitted by Brett on Thu, 22/09/05 - 4:38 AM Permalink

quote:Originally posted by lorien

One kind of latency is the lag between input and output; this is what prevents people from using Windows as a guitar effects processor, for example.

I don't know about that - we have a 'record' example that records, processes the audio and plays it back with about 3ms turnaround using the ASIO output driver, for realtime voice FX. That is pretty good for user-level hardware.

quote:
A free binary for non-commercial use doesn't cut it with me, I'm afraid. I remember from a discussion we had on the FMod forums quite a while ago that you really don't want to do this.

It doesn't really bother me, but I don't see why having the source is so necessary. Would you use memcpy if it was closed source? Most people wouldn't touch the source because they don't need to; the interface does everything that is needed.

Posted by Anonymous (not verified) on
Forum

Just wanted to start this thread here up.

Eyepopping Graphics + Earbleeding Audio Design + Pantwetting Gameplay = Good.

What kind of advanced audio experiences can be had - or are on the horizon for games?

Audio isnt my area of total expertise so anyone out there who could shed a light on this area - go for it! Im really interested in ideas and effects for a more immersive sound experience to accompany some bleeding edge looking environments and characters.


Submitted by lorien on Sat, 27/08/05 - 9:31 PM Permalink

A dangerous topic with me on sumea HazarD [:)]. I want 3 things from game audio:

sub-millisecond latency (like pro audio software on Linux and MacOS has had for quite a while). This is simply impossible on Microsoft OSs- they aren't RTOS (real time operating systems), nor are Linux or MacOS for that matter, but they are close enough.

real-time low-latency scriptable sound synthesis i.e. making sound on the fly rather than playing back recordings. This will be a similar difference to what happened to graphics when they went from sprite-based to full 3d.

real geometry support: IMHO EAX is based on a completely broken series of dodgy hacks piled on top of each other.

My MSc research is largely about the first 2.

Submitted by Kalescent on Wed, 31/08/05 - 2:38 AM Permalink

Can you elaborate on how this makes my sound experience better in a game environment Lorien ?

Elaborate as much as youd like, fill pages if you dont mind, I'll read it [:)]

Im quite interested in developing the sound / graphics relationship.

Submitted by lorien on Wed, 31/08/05 - 9:24 AM Permalink

I could fill pages and pages you know, but I won't [:)] I'll take each point separately, with pauses between each because I'm rather busy this week...

The overall goal of all this work is to bring to game audio some of the characteristics of commercial and research music and audio software- my real passion is intelligent interactive music systems (software you can jam with in realtime) and computer enhanced musical instruments.

1)Latency
Latency is lag between the time you expect something to happen, and the time it actually happens. In game audio it's often between 50 and 100 milliseconds, which is quite noticeable. Some of the reasons for the lag lie deep in the windows kernels, and I think the audio section here isn't the right place to go into detailed explanations of why windows is so bad at it.

This lag is one of the things that makes game audio seem fake, in physical reality we are accustomed to a near instantaneous response between performing an action that creates sound, and the sound reaching our ears (actually latency is affected by distance, altitude, humidity, and other factors). It is also bad because it makes sophisticated interactive music systems very difficult to pull off (i.e. it's very hard to closely tie music and gameplay together effectively), and so it makes games not as much fun to work with as they could be for musicians.

One of the clearest demos of windows latency can be done in any realtime audio package (adobe audition, saw, logic, cubase vst, etc): put an effect on a track and change some effect parameters (ideally with a slider) while it is playing- better still, try doing it while copying a really big file and playing a DVD [:)]. You will notice a sizeable lag between moving the slider and hearing the result, and you may get audio breakup. Do the same thing on MacOSX or a specially prepared Linux and the change is instant- timing accurate to 2/1000ths of a sec is quite possible, 1 millisecond is very difficult, but almost possible, and as machines get more CPUs it is likely to end up quite possible. This timing would make games far easier for audio people to work with, and is likely to get more of the interactive music systems and audio synthesis geeks interested in doing game work.

The attitude of game developers seems to be that audio is good enough running at 30-60 FPS. I'm talking about audio running at around 1000 FPS, and those are not normal game FPS: game FPS vary up and down, but audio needs each frame to take a constant amount of time, so this 1000 FPS means each frame has to have finished before it's millisecond is up (because then the buffers get swapped and it gets sent to the soundcard, which makes a hideous noise if you overrun). Of the next-gen consoles this should be possible on the PS3 at least (I hope) [:)]

When you design software for low end RTOS timing like this it doesn't mean it won't run at higher latencies, so it can be cross-platform. But software that isn't designed for low latency generally needs a complete re-design and re-write to make it work with in a low latency environment.

To summarise: low latency audio will make games more immerse. It will make a game respond much more like a musical instrument. This latency requires very tightly disciplined software development, as well as having an effect on system performance, but this effect is minimised by multiple and multicore CPUs.

Submitted by Kalescent on Wed, 31/08/05 - 10:32 AM Permalink

Nice! [:)]

Okay with latency in mind... What kind of application can this be put to good and obvious use in a typical game environment - am I right in saying that what your talking about would allow accurate reproduction of the sound of every bullet being fired from an m-16. rather than just a prerecorded loop that cuts off when I stop firing.

And when and if possible it is slowed down, everything about the gun and the sounds it emits could be accurate to within 1-2 1000ths of a second?

Being a fan of playing games with a massive home theatre - earth shaking bass, bullet ricochets whistling past my ears etc - Im not sure how much difference that pinpoint accuracy is going to make to the whole experience.

Hearing would be believing thats for sure - but I do know that if and when it happens, the implications it has on animating the m16 for example, and having particle effects emmited all in harmony with the sound and being accurate to within that 1000 fps mark... well golly [:0].

*boggle*

Excellent - hit me with more. [:)]

Submitted by lorien on Wed, 31/08/05 - 10:40 AM Permalink

You got it exactly [:)] The particle effect your talking about is called Granular Synthesis http://en.wikipedia.org/wiki/Granular_synthesis (well worth a read, and you'll get some ideas of what I'm meaning by realtime scriptable audio synthesis).

This is quite doable on Linux, MacosX, BeOS, QNX, and IRIX right now, and has been for years [:D] (at a cpu cost that's prevented it for games though). It's also been 2-4 milliseconds per frame for Linux until recently.

You are likely to find this interesting too http://en.wikipedia.org/wiki/Real-time

Submitted by lorien on Thu, 01/09/05 - 9:51 AM Permalink

Apologies for quoting myself- it feels rather odd... [:)] I'm just low on time.

2)Realtime scriptable audio synthesis

The following is from the proposal for my masters degree candidature:

quote:

One of the key differences between actual reality and the virtual realities of computer games is the comparatively static nature of the virtual realities. In graphics this can be exemplified by the BSP (binary space partition) tree, which needs re-compiling when any geometry changes. In audio this static nature comes about from the use of a small number of samples to represent a large number of events. This leads to the environments simulated in games seeming fake, and hence makes them less immersive.

The best solution to the audio side of this problem involves replacing direct sample playback with real-time synthesis, generating sounds unique to each event. Real-time synthesis takes considerably more CPU time than sample playback, but the results are well-worth it; for example instead of hearing exactly the same sound every time a sword hits a shield, the sound gets created from, or affected by, parameters such as the materials and size of the sword and shield, the position in each at which the collision occurs, the amount of force involved in the collision, etc. Real-time synthesis can bring some of the infinite variation of sound that occurs in reality into virtual reality.

Common techniques for audio synthesis include:
Additive: summing sine waves with potentially time variant frequency and amplitude to obtain a desired spectrum.
Subtractive: creating a sound by carving away undesired parts of a spectrally rich source with filters.
Distortion: frequency modulation, amplitude modulation, wave-shaping and discrete-summation all have the characteristic of producing spectra dependant on the amplitude of the input signal.
Wave-table: the combining of many real-time manipulated pre-recorded sounds into an instrument. This had been the dominant form of synthesis used by musical instrument manufacturers over the last 10 years.
Granular: mixing together a large number of grains or quanta of sound. Resembles a particle system.
Physical modelling: modelling the characteristics of a sound-producing object through equations.
Spectral manipulation: analysing a sound into a frequency/amplitude graph of partials using a transform such as the Fourier, performing manipulations on the analysis, and re-synthesising (the inverse of the transform).

Current games for the most part do not use real-time audio synthesis. If synthesis is used it is done offline, and the results stored in samples, which are simply played back in response to events.

The most spectacular (and most CPU intensive) of the synthesis techniques are

Physical modeling- with the sword and the shield example above, representing the acoustic properties of the sword and the shield, and the precise effects of the collision, mathematically in code that generates a soundwave that is sent to the soundcard as it's being calculated [:)]

Spectral manipulation lets you do beautiful and crazy things like truly morphing between sounds (spectral morphing), taking the parts of a voice that make it recognisable as speech, and applying it to another sound like fire- the result is quite literally talking fire (cross-synthesis), changing the length of a sound without changing it's pitch, changing the pitch of a sound without changing it's length, and many others. It's too complicated to me to explain how this magic works here atm

If you are interested in audio synthesis, you need to check out the software synthesis and linux distro sections on http://linux-sound.org .

The following is a quote from a research paper by my supervisor (Dr John Rankin) and myself, that is published in the proceedings of the ADCOG 2004 conferece, City University, Hong Kong, where I presented it. ADCOG is an anual games research conference http://www.adcog.org . I've included the references for the quoted section. Edit: just noticed the adcog site is down. Here's the google cache
[code]
Paste this into your browser, sumea refuses to link this url properly
www.google.com/search?q=cache:3bA3Sf1j8qwJ:www.adcog.org/+adcog&hl=en&i…

[/code]
quote:

Scriptable Audio Synthesis for Computer Games

Audio in computer games lags far behind graphics in every aspect. To someone from the computer music field this seems very odd, as computers have been used to synthesise sound since the days of punched cards, and leading companies such Hewlett-Packard, Bell Telephone Laboratories and IBM have conducted years of research on computer synthesis of music, and indeed have had composer-programmers on staff for the purpose of research into digital audio. Thus there is a rich literature of computer audio research, much of which relates to real-time interactive systems (which of course is what games are) just waiting to be made use of. To-date the results of this extensive research are not being used to the best advantage in the games industry.

Occupying a key place in this literature are the MusicN type audio synthesis languages [1]. The first MusicN language was Music1, developed by Max Matthews at Bell in the late 1950s. Matthews made several more revisions (Music2, 3 and 4) before giving the source-code to two American universities- Princeton and Stanford, where audio synthesis research proliferated (as it does to this day). Each customised the software to their needs, resulting in Music6 and Music10 at Stanford, and Music4B and Music4BF at Princeton. Matthews also continued his work at Bell, producing Music5. The Princeton branch is the distant ancestor of the most widely known modern implementations: CSound [2] from the MIT Media Lab and SAOL [3] which forms part of the MPEG-4 standard, both of which were designed and originally implemented by the composer-programmer Barry Vercoe. MusicN languages are for the most part specialised scripting languages with high performance compiled Unit Generators, the arrangements of which are determined by scripts. These scripts are often called Orchestras and/or Scores, an orchestra containing Instruments (sound generators) made up from unit generators connected via Cables, and a score containing instructions for the playing of each instrument [4].

In this paper we are primarily concerned with the computer synthesis of sound effects rather than music within games.

In physical reality sound is produced by the vibration of objects. There is an inseparable coupling between an object and the sonic signatures it produces when induced to vibrate. In current virtual reality there is no such coupling, and the sounds a virtual object appears to produce have no real relation at all to the object. A pre-recorded sound is merely triggered by an interaction. We propose that this is a key factor in game-audio seeming fake.

In most MusicN languages the orchestra must be given the instructions of "when to play" and "how to sound" from the score. It is the latter that is missing in game-audio: the "orchestra" is stored samples, and the "score" is the stream of game-play events and parameters. Games make great use of the events for triggering audio, and minimal use of parameters.

Current generation computer games for the most part use fairly direct sample (pre-recorded sound) playback in response to events, the most basic being a collision. It is fairly direct because the actual sound played for an event may be selected by means of a probability distribution, and the selected sound may have comparatively simple transformations applied. If synthesis and/or complex transformations are used at all it is during the creation of the samples, where it is frozen in place.

The score for a piece of music created with a MusicN language is also made up of events, which may be dynamically generated through algorithmic procedures and/or interaction with a performer via a transducer (an "interactive music system"- in many ways very similar to a computer game).

Thus one way to improve audio in games is to synthesise sound in real-time in response to game-play events, and such a synthesis system must be fully scriptable so that someone trained in music programming (as is commonly taught in university music departments) may take full advantage of the system. This is part of the rationale for SAOL. However it is common for games to embed a scripting language for controlling entity behaviour, and this behaviour invariably involves the production of sound. Using SAOL with its orchestra and score paradigm produces a weaker coupling between object and sound than is desirable.

References

1. Dodge, C.; Jerse, T. A.: Computer Music, pp. 12-13. Schirmer Books, New York, 1985.
2. Vercoe, B.: The Public CSound Reference Manual. Available online at http://www.lakewoodsound.com/csound/hypertext/manual.htm
3. Scheirer, E. D.; Vercoe, B.: SAOL: The MPEG-4 Structured Audio Language. Computer Music Journal 23:2, pp. 31-51. MIT Press, Cambridge, MA, 1999.
4. Dodge, C.; Jerse, T. A.: Computer Music, pp. 12-193. Schirmer Books, New York, 1985.

The main change in my work since then is the discovery that scripting languages can be made that run multithreaded and at low latency, which has a bunch of implications for much better sound still.

edit: here is an updated URL for the CSound manual http://www.csounds.com/manual/
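
For anyone wondering what a "unit generator" actually looks like in code, here's a deliberately tiny C++ illustration (my own, not CSound's or SAOL's internals): an "instrument" built from two unit generators wired in series, and a "score" that is just a list of timed events poking its parameters. The whole point of the MusicN languages is that a sound designer describes this wiring and the score in a script rather than in compiled code.
[code]
// unitgens.cpp - toy illustration of the unit-generator / orchestra / score idea.
// An "instrument" is a chain of unit generators; the "score" is a list of timed
// events that set parameters on it. Output: raw float mono, 44100 Hz.
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

const float SR = 44100.0f;

struct SineOsc {                      // unit generator: sine oscillator
    float phase = 0.0f, freq = 440.0f;
    float tick() {
        float s = std::sin(phase);
        phase += 2.0f * 3.14159265f * freq / SR;
        return s;
    }
};

struct OnePole {                      // unit generator: one-pole lowpass
    float z = 0.0f, coeff = 0.99f;
    float tick(float in) { z = (1.0f - coeff) * in + coeff * z; return z; }
};

struct Event { float time, freq, cutoffCoeff; };   // one line of the "score"

int main()
{
    SineOsc osc;
    OnePole lp;

    // The "score": at each time, retune the oscillator and open/close the filter.
    std::vector<Event> score = {
        { 0.0f, 220.0f, 0.995f },
        { 0.5f, 330.0f, 0.98f  },
        { 1.0f, 440.0f, 0.9f   },
        { 1.5f, 110.0f, 0.999f },
    };

    const int numSamples = int(SR * 2.0f);
    std::vector<float> out(numSamples);
    std::size_t next = 0;

    for (int n = 0; n < numSamples; ++n)
    {
        float t = n / SR;
        if (next < score.size() && t >= score[next].time) {   // fire due score events
            osc.freq = score[next].freq;
            lp.coeff = score[next].cutoffCoeff;
            ++next;
        }
        out[n] = 0.5f * lp.tick(osc.tick());                  // instrument: osc -> lowpass
    }

    FILE* f = std::fopen("orchestra.raw", "wb");
    std::fwrite(out.data(), sizeof(float), out.size(), f);
    std::fclose(f);
    return 0;
}
[/code]
In a game, the "score" events would be generated by gameplay rather than written out in advance - that's the whole argument of the paper.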

Submitted by lorien on Tue, 06/09/05 - 8:31 AM Permalink

This one is going to take me a while I'm afraid- it's full of the physics and psychophysics of sound. It takes time for me to figure out how to explain it concisely without using loads of audio technobabble...

3)Real geometry support

This is where the real processing power is needed [:)] To understand why geometry is important for audio you have to understand how current 3d game audio works: it's a crude physical model. Your brain uses "cues" (really these are hints) from the way sounds arrive slightly differently at each eardrum. For example, a close directional sound pointing straight at your right ear is going to be muffled by your head when it reaches your left eardrum, and the speed of sound dictates that there is a slight time delay before it gets there too. Expand into 3 dimensions and add some other cues as well and you have a basic understanding of how we perceive the position of a sound.
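
To make those cues concrete, here's a toy C++ sketch of my own covering just the two simplest ones - the interaural time difference (the far ear hears the sound a fraction of a millisecond later) and a very crude head-shadow attenuation. Real HRTF processing does far more than this (full frequency-dependent filtering per ear); the head radius and the spherical-head delay formula here are textbook approximations, not anything a real implementation would stop at.
[code]
// itd_ild.cpp - toy binaural cues: interaural time difference + crude head shadow.
// Positions a mono click to the listener's right using only a per-ear delay and gain.
// Output: raw interleaved stereo float, 44100 Hz. Listen on headphones.
#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    const int   sampleRate   = 44100;
    const float headRadius   = 0.09f;    // metres, rough average
    const float speedOfSound = 343.0f;   // m/s
    const float angle = 90.0f * 3.14159265f / 180.0f;   // source hard right

    // Spherical-head approximation for the interaural time difference.
    float itdSeconds = (headRadius / speedOfSound) * (angle + std::sin(angle));
    int   itdSamples = int(itdSeconds * sampleRate + 0.5f);   // ~30 samples here

    // Crude head shadow: far ear simply quieter. (Reality: frequency-dependent filtering.)
    float nearGain = 1.0f, farGain = 0.35f;

    // Mono source: a short decaying buzzy click.
    const int numSamples = sampleRate / 2;
    std::vector<float> mono(numSamples, 0.0f);
    for (int n = 0; n < 200; ++n)
        mono[n] = std::exp(-n / 30.0f) * ((n % 7 < 3) ? 1.0f : -1.0f);

    // Source on the right: right ear is near (no delay), left ear is delayed + attenuated.
    std::vector<float> stereo(numSamples * 2, 0.0f);
    for (int n = 0; n < numSamples; ++n)
    {
        stereo[2 * n + 1] = nearGain * mono[n];                   // right channel
        if (n - itdSamples >= 0)
            stereo[2 * n + 0] = farGain * mono[n - itdSamples];   // left, later and quieter
    }

    FILE* f = std::fopen("itd_ild.raw", "wb");
    std::fwrite(stereo.data(), sizeof(float), stereo.size(), f);
    std::fclose(f);
    return 0;
}
[/code]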

Game audio models these cues with the Fast Fourier Transform (FFT). 3d sound in games is a kind of spectral manipulation synthesis- in an anechoic chamber (a room with walls that absorb all sound without a single echo) they take a physical dummy head, put small microphones in its ears, and start recording and cataloguing "impulses" (short, percussive-type sounds with a large frequency range) played in positions all around it. By figuring out the differences between these recordings and the original impulse you get much of the information you need to position any sound anywhere using software. The FFT and inverse FFT are the magic that does this, and it is hardware accelerated in your soundcard. This specific application is called the Head-Related Transfer Function (HRTF).

That all seems nice and sensible, but here are some fundamental flaws in this model:

*It only works properly with headphones. With speaker and room acoustics added in, 3d sound gets decidedly dodgy.
*Everyone has a different head, and these cues are so tiny that the differences between heads can lead to large differences in audio- you are expecting different cues (from your experience in the real world) from the ones the head modelled by the software provides.
*It is impossible to ever record enough impulses, so an averaging (interpolation) process is used for positions without an impulse, leading to inaccuracies.
*It takes no account of echoes and the sense of acoustic space (the feel you have for the environment you're in from the way it reflects sounds back at you).
*The number of simultaneous audio channels is limited largely by the number of FFTs and IFFTs that can be performed in real-time.

EAX (Creative's Environmental Audio Extensions) is an attempt to solve the problem of acoustic space that has been bolted on top of this model, and it will be the subject of my next post.

Submitted by Kalescent on Tue, 06/09/05 - 11:53 PM Permalink

This is all great stuff Lorien - thanks for taking the time to write this stuff down. Spectral manipulation sounds like a fun thing to play with..... And the sword and shield example above really is the guts of the stuff that I'm thinking of - and was the reason for bringing this topic up - good to know it's all coming along.

Real geometry support - aside from the acoustic space and echoes flaw and the extra channels, what will an overall tightening up of the other flawed areas do for my sound experience while playing a game? How would they affect my ears and what I hear?

*getting into knowledge sponge mode* [:)]

Submitted by souri on Wed, 07/09/05 - 12:20 AM Permalink

There was a website I visited a while ago where you could type in a phrase and the news presenter (it looked like a flash animation) would speak it out. It did an unbelievable job too, far better than your standard speech ability in Macs or the extension from Microsoft. If anyone knows which site I mean, please post it here [:)]

Anyway, my contribution to this thread is that I'm surprised that believable speech synthesis for games still isn't common at all. We still have NPCs spitting out the same repetitive sentences when you try to interact with them. Can you imagine the improved gameplaying experience (and disc space savings) in RPGs and games like Grand Theft Auto where NPCs' responses are only limited by the number of sentences written out for them by the developer? Couple that with the AI/interactivity experience of [url="http://www.interactivestory.net/"]Facade[/url], and we could be onto some really fantastic immersive games. [:)]

Submitted by lorien on Wed, 07/09/05 - 6:18 AM Permalink

HazarD: no problem, imho this stuff needs to be much more widely understood (as does sound in general) [:)]
Spectral manipulation is my favourite form of sound bending. I feel physical modeling is the ultimate goal of game audio. All the forms of synthesis can be combined into hybrids, and they all work perfectly well alongside sample playback too.

Real geometry support leads to much more than echoes and acoustic space- it has the potential to entirely eliminate the HRTF (the FFT is lossy). That's why I'm going to explain EAX first.

Souri: Speech synthesis is so insanely complicated I wasn't going to bring it up here. Think along the lines of trying to mimic the waveforms produced by the vocal cords, nasal cavity, mouth, tongue and lips. Speech is one of the most complex sounds, and it's one where we are all very good at telling the real from the fake. It would be so cool to be able to do it well in games...

Submitted by Shplorb on Thu, 08/09/05 - 9:51 AM Permalink

Wow, you guys are wanting way too much from game audio. Although it sounds interesting, I seriously doubt that procedural synthesis of sound effects will be viable. It just needs too much physics and will most probably be too hard for sound designers to control to ensure that they always get the sound that they want.

Take a look at FMOD Ex to see where I think next-gen sound is going. In case you don't want to look at it, basically next-gen game sound is going to be about DSP, mix routing and real-time data-driven design.

Granular synthesis is, I guess, something that groovyone has discussed with me recently, except we've labelled it "composite sounds". He'd like to use it for bullet ricochets to reduce SRAM usage whilst maintaining or improving sound variation, but I don't think it would really be viable on the PS2 without hacking FMOD or replacing it with my own low-level sound engine.... something I don't have the time for and something I don't want to do when there's much more interesting work to be done at the higher level.

As for latency, of course it will be reduced on next-gen hardware, purely because of the multi-core nature and increased speed of the systems. I think the PS2 has a 12ms mixbuffer, but don't quote me on it. Since the PS2's sound chip (SPU2) is essentially two PlayStation sound chips (SPU) it is rather limited in what it can do, compared to the Xbox with its DSP and AC3 encoder.

Improved reverb (can you say "DSP"?) and propagation (can be currently simulated with low-pass filtering) are two things that I think will add to the realism of next-gen sound, but at the same time I also wonder just how much realism you want. Movies aren't real, they exaggerate sounds to play your emotions. That's why I think that the real direction of next-gen sound is going to be in giving more control to sound-designers, or at least that's what groovyone and his cohort have led me to believe with their constant feature requests. =]

In the current game that groovyone and I are working on, we're implementing something we're calling 'mixgroups', which allow us to have different mixes for the sounds in the game to suit different scenarios like driving versus cutscenes versus a gunbattle. That sits on top of an existing system that allows designers to group contextually relevant sounds and choose what sort of sounds they are and their playback parameters in real-time whilst the game is running. We can already produce varying sounds for things like footsteps and collisions by using several sample variations of the sound, playing them back with varying volume, pitch and bandpass filtering, and making use of crossfaders that are fed values from the physics system.
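
To give a generic flavour of that kind of data-driven variation (this is not our actual system or any particular engine's API - the playSample() call below is just a stub I made up for the example):
[code]
// variation.cpp - generic sketch of data-driven sample variation: pick one of several
// recorded variations and scale volume/pitch from a physics value (e.g. impact speed).
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

// Stand-in for a real engine call that would start a voice playing.
void playSample(const std::string& name, float volume, float pitch)
{
    std::printf("play %-14s vol %.2f pitch %.2f\n", name.c_str(), volume, pitch);
}

struct SoundEvent {
    std::vector<std::string> variations;   // several recordings of "the same" sound
    float minVolume, maxVolume;            // mapped from the physics value
    float minPitch,  maxPitch;
};

// impact01 is in the 0..1 range, fed from the physics system.
void trigger(const SoundEvent& ev, float impact01)
{
    const std::string& sample = ev.variations[std::rand() % ev.variations.size()];
    float volume = ev.minVolume + impact01 * (ev.maxVolume - ev.minVolume);
    float pitch  = ev.minPitch  + impact01 * (ev.maxPitch  - ev.minPitch);
    // A little random pitch scatter so repeated hits don't sound machine-gunned.
    pitch *= 0.97f + 0.06f * (std::rand() / float(RAND_MAX));
    playSample(sample, volume, pitch);
}

int main()
{
    SoundEvent footstep = { { "step_dirt_01", "step_dirt_02", "step_dirt_03" },
                            0.4f, 1.0f, 0.9f, 1.1f };
    const float impacts[] = { 0.2f, 0.5f, 0.9f };   // e.g. walk, jog, sprint
    for (float impact : impacts)
        trigger(footstep, impact);
    return 0;
}
[/code]
The real thing is all data-driven and tweakable by the sound designer while the game runs, which is where the value is.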

Besides the increased processing power that will allow for more channels, DSP effects and flexibility in mixing (because it's all going to be done in software), the really big bonus of next-gen platforms is the drastically increased amount of RAM we'll have available for samples (I would love to have 8MB or more!), which can then be longer and have higher sampling rates, plus the increased storage I/O for streaming.

Anyway, that's my $0.02 rant for now.

Submitted by groovyone on Thu, 08/09/05 - 10:29 AM Permalink

No WAY!

Next Gen is all going SID FARMS with DSP FARMS to support them!!...
128 channels of SID Madness + 16 effects busses! We can digitize speech into 4bit pulse data to be played back on SID chips!

No CPU hit! True synthesis!

Long Live SID!!!

Bleep bleep bleep!

:)

Oh.. as for 5.1 DTS sound, we can just pump banks of SIDS out to different speakers :).

Submitted by Shplorb on Thu, 08/09/05 - 10:45 AM Permalink

quote:Originally posted by groovyone
Next Gen is all going SID FARMS with DSP FARMS to support them!!...
128 channels of SID Madness + 16 effects busses! We can digitize speech into 4bit pulse data to be played back on SID chips!

Oh.. as for 5.1 DTS sound, we can just pump banks of SIDS out to different speakers :).

My god man! Imagine how many filter caps you'd need! Maybe that explains why those PS3 alpha kits are so frickin' huge... they aren't using surface mount components yet. =]

Submitted by lorien on Thu, 08/09/05 - 10:30 PM Permalink

[:D] I've warned before that I'm an audio freak...

For those who think this is too much just wait till you hear the geometry stuff [:)]

Also Intel and AMD have been making noises about putting 32 CPU cores on a single chip in future. Running high end audio stuff on one or two of these cores really won't make much difference. And I suspect the Cell SPEs could do a hell of a job at audio synthesis.

As for it being too hard to control for sound designers, physical modelling isn't too hard for musicians... Commercial physical model synths have been around for 10 years or so, and now on MacOS we are seeing commercial low-latency physical modeling softsynths. Even guitarists can manage them [:D] I don't really understand this point, because synths have been used for music for years and years, and synthesis is what originally made game sounds.

SID did rock- imho it was the Right Way (tm) to do game audio: put a close to pro level (for the time) synthesiser in a game.

I know you were joking groovyone, but multicore CPUs will allow what you describe, all in realtime, while all the rest of a game is running in parallel, and the Cell is a "streaming media processor" i.e. the SPEs could be looked at as being independent floating point vector DSPs.

I know the sort of things that the new fmod can do (it includes putting software effects on sounds in realtime, geometry support in the software mixer, etc, etc). IMHO it's not nearly enough. You can read about some of the new features on the mainpage of http://fmod.org .

My take on scriptable audio synthesis is that it is entirely about giving more control to sound designers. Just a sound designer with some different skills than you commonly see atm. The "scriptable" stuff is to make it useable by audio people rather than dedicated programmers. Musicians have been using this type of tech for a long time.

I completely agree that exact realism isn't always desirable. Synthesis allows you to do all sorts of crazy exaggeration, and it's just as easy to provide exaggerated parameters to a realistic physical model as to a crude one- the results will just be better.

I think aspects of what I'm talking about will make it into next-gen, and more aspects will make it into the generation after that.

I also think that if people start doing full-on audio stuff in software in games, the hardware designers are likely to notice (we already have people doing audio on GPUs), and actually make some sensible soundcards (though I'm sure Creative would have their normal try at buying out anyone with better tech and closing them down).

Thanks guys, it's good to have audio people other than me talking here [:)]

Souri: I think that software must be using a database (probably of the building blocks of words rather than words themselves), and combining spoken audio together in realtime. Not speech synthesis as such, and probably not something that you could get to sing a song. Very cool though.

Submitted by groovyone on Thu, 08/09/05 - 11:50 PM Permalink

Physical modelling and synthesis is all great, and I am sure there may be some games which go as far as using this for certain spot "effects", but as a mainstream thing it's going to require oodles of time to properly program/script them to do anything useful. Not only that, but how many sound events occur in real life and need to be modelled in game? How much time will it take to model something to the point that it's almost recognizable? It only takes a sound designer a few hours to build a complex sound. To generate something by synthesis that sounds anything like it would take days if not a week, and they may still not even come close.

I agree that for instruments, synthesis is useful. Imagine being able to change their filters and timbres in real time to help change the mood of the music (a la SID).

All I say for now is give us:
1) Memory
2) A sound DSP CPU for software DSP
3) HardDrive for buffer/streaming

Submitted by lorien on Fri, 09/09/05 - 1:05 AM Permalink

I've never suggested that a sound designer's job should be to model everything sound by sound... I've more been thinking along the lines of libraries of tweakable scripts, which a sound designer can add to and modify if they choose to.

Most of the synth techniques are nothing like as processor-hungry as physical modelling.

We have different goals I think- yours is directly making games, mine atm is finding different ways to make games (a research degree), hence I'm aiming high and immediate commercial application is not a huge concern.

This engine/scripting language I'm making is a professional sound library that can be used in games (much of it will be open source btw). It's being designed to be good enough to use in creating pro realtime music software, and if people making commercial games don't want this tech yet that's fine by me of course [:)] I'll be using it in art and indie games. There is still probably 1 1/2 years of work before it's ready for release anyway- to get RTOS-like timing you have to start from a very low level indeed [:(] and I finish my MSc halfway through 2007.

I think what you are saying is you want things that will make the job you're doing now easier. I understand that completely and I sympathise with anyone trying to get good sound out of the current consoles except the Xbox (which has simple audio scripting btw). I think FMod is fantastic for getting the day to day needs of commercial game dev looked after very quickly. But from my own experience with FMod 3.x I know that if you want to do something out of the ordinary, all the work Firelight put into making FMod so easy really gets in the way (i.e. you end up fighting it and hacking around some of its features rather than using it).

I'm looking at doing the job differently, and where I'm finding inspiration isn't games, it's pro and research audio and music software.

Submitted by Shplorb on Fri, 09/09/05 - 11:39 AM Permalink

quote:Originally posted by lorien
to get RTOS like timing you have to start from a very low level indeed
I think you're overblowing the latency thing way too much. At 60Hz you have a 16.7ms mixbuffer. There's no point in reducing latency below the point at which you are triggering sounds, which is purely dictated by interval at which the game state is updated. Games with parallelised rendering pipelines are already one or two frames behind the current game state, so really there's no issue with syncing sound to gfx.

quote:I sympathise for anyone trying to get good sound out of the current consoles except the xbox (which has simple audio scripting btw).
Simple audio scripting? I don't want to sound rude and arrogant, but do you even know what you're talking about? Are you talking about XACT?

quote:I think FMod is fantastic for getting the day to day needs of commercial game dev looked after very quickly. But from my own experience with FMod 3.x I know if you want to do something that is out of the ordinary, all the work the Firelight put into making FMod so easy really gets in the way (i.e. you end up fighting it and hacking around some of its features rather than using it).
Which is why I was talking about FMOD Ex. =] MacOS X's CoreAudio and Audio Units are pretty much the same thing.

Submitted by lorien on Fri, 09/09/05 - 9:27 PM Permalink

quote:Originally posted by Shplorb
I think you're overblowing the latency thing way too much. At 60Hz you have a 16.7ms mixbuffer. There's no point in reducing latency below the point at which you are triggering sounds, which is purely dictated by interval at which the game state is updated. Games with parallelised rendering pipelines are already one or two frames behind the current game state, so really there's no issue with syncing sound to gfx.

Fine, no problem [:)] I just think otherwise... Latency is one of the cues used to determine the position of a sound, and I'm working in multiple threads anyway, so it's independent of game state, i.e. I have audio rate (processing samples using SSE vector operations), realtime control rate (single float processing, 1 per frame), and game control rate (completely asynchronous to audio), and latency on the realtime audio and control rates is a killer. The thing about 60 Hz on Windows is that you can't do it. Sure, you can get way over 60fps, but just try to get each frame to take a constant amount of time so you can actually get that 16.7 msec latency. Also, while I agree ~17 milliseconds is acceptable for triggering samples, that's not what I'm trying to do.

quote:
Simple audio scripting? I don't want to sound rude and arrogant, but do you even know what you're talking about? Are you talking about XACT?

I'm talking about AudioVBScript, which is also part of the DirectX SDK and DirectMusic Producer. It's simple because it's a toy language and doesn't let you touch most of DirectSound/Music, but it seems to be designed to give more control to sound designers: the idea seems to be that programmers call script functions which are written by a sound designer. It doesn't do synthesis at all.

[snip my stuff about fmod]
quote:
Which is why I was talking about FMOD Ex. =] MacOS X's CoreAudio and Audio Units are pretty much the same thing.

Stick with FMod then! [:)] All I'm saying is it's not for me (nor is Core Audio for that matter). You won't be getting any marketing division trying to push synthesis on you (at least not from me), this software is going to be free, people will be able to use it if they want, and if they don't it's no skin off my nose.

But if you would like to experiment you will be able to (actually you can now, though with a system that is not designed for games at all- CSound has become a VST plugin, FMod Ex supports VST plugins, therefore you could probably use CSound in a game if for some strange reason you felt like it).

Submitted by Mick1460 on Sun, 18/09/05 - 3:51 AM Permalink

LOL

Ok, here is what I want with next-gen...

1 - More bloody disk space so I can stop listening to 22kHz sample rates.

That's pretty much it, sorry guys. There is no way that we will have realistic voice synth anytime soon (thank God to be honest), nor real-time sound generation (ie, when the player knocks a can off a wall the engine GENERATES that sound rather than playing back a pre-recorded .wav file). To me, it's the same as pretty high-poly models: they are nice to look at but they add NOTHING to gameplay. I think that the only sound element, apart from surround sound, that will enhance gameplay is disk space.

Submitted by pb on Sun, 18/09/05 - 9:57 PM Permalink

quote:Originally posted by Shplorb
I think you're overblowing the latency thing way too much. At 60Hz you have a 16.7ms mixbuffer. There's no point in reducing latency below the point at which you are triggering sounds, which is purely dictated by interval at which the game state is updated. Games with parallelised rendering pipelines are already one or two frames behind the current game state, so really there's no issue with syncing sound to gfx.

Yeah, I completely agree with this point, glad someone typed it out. The order in which parts of the game state get updated within a frame has nothing to do with real time. That's why both graphics and audio treat the frame as a quantum, firing off everything at the same time at the end of the frame.

Games with floating frame rates have all sorts of additional sync issues that dwarf a few extra milliseconds of latency. There's the fact that the physical time that a frame or two of lag produces changes with the frame rate. In addition the update code can only scale the time for a state update by that of the previous frame, not the current one (since it doesn't know how long that will take to render - there's just an implicit guess that it will be as long as the last frame).
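
To spell out the "previous frame" point, a stripped-down variable-timestep loop looks something like this (generic illustration, not any engine's actual code):
[code]
// timestep.cpp - stripped-down variable-timestep loop. Each update is integrated with
// the *previous* frame's measured duration, because we can't know in advance how long
// the current frame will take to simulate and render.
#include <chrono>
#include <cstdio>
#include <thread>

static float position = 0.0f;   // trivial stand-in for "game state"

void updateGame(float dt) { position += 2.0f * dt; }   // move at 2 units/sec
void renderFrame()        { std::this_thread::sleep_for(std::chrono::milliseconds(16)); }

int main()
{
    using clock = std::chrono::steady_clock;
    auto  last = clock::now();
    float dt   = 1.0f / 60.0f;                 // guess for the very first frame

    for (int frame = 0; frame < 120; ++frame)  // roughly 2 seconds of frames
    {
        updateGame(dt);                        // scaled by the PREVIOUS frame's duration
        renderFrame();                         // pretend rendering: ~16 ms

        auto now = clock::now();
        dt   = std::chrono::duration<float>(now - last).count();
        last = now;                            // a slow frame stretches the NEXT update
    }
    std::printf("position after ~2s: %.2f (should be roughly 4)\n", position);
    return 0;
}
[/code]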

But floating frame rates blow - I really hope that the TRC/TCR requirements for next-gen mandate 60Hz (or at least fixed frame rates). Might screw some PC developers but if they want to do console stuff they should do it right or stick to PC.

As for what HW features would make good audio, I reckon disk space, 6 channel output and a ludicrous amount of programmable DSP power with the bandwidth to keep it processing.

pb

Submitted by lorien on Mon, 19/09/05 - 5:09 AM Permalink

I've actually been talking about running this stuff in a separate thread all along (meaning it runs at a frame rate completely independent of the game frame rate).

Mick, my favourite games are actually Elite and Nethack [:)] It's very rare for me to play a AAA title. I wholeheartedly agree about high-poly models and audio synthesis doing nothing for (normal) gameplay. IMHO for games they are fluff. I've been interested in sound and audio synthesis for much longer than I have in games, and I plan to use this engine for plenty of things which aren't games too [:)]

Ludicrous amounts of disk space and DSP power are for me pb [:)]

Submitted by pb on Wed, 21/09/05 - 5:01 AM Permalink

quote:Originally posted by lorien

I've actually been talking about running this stuff in a separate thread all along (meaning it runs at a frame rate completely independent of the game frame rate).

But the events that trigger game sounds all quantize to the game frame rate (or at least they should since the physical order of update within a frame is meaningless).

The best you could do is put the game logic on a separate thread (pure server style) and run as fast as you can, then put graphics and sound on their own threads so they can pick out discrete frames from the game logic. If you can update game logic 600 times per second your sound thread could grab each update and the graphics thread might only get 1 out of 10 (or however fast it can manage). But that's a lot of processing to burn to shorten the latency by an amount that will be different to the latency of the graphics pipeline.

These days game logic consumes a larger portion of the frame than it used to. This is largely a result of DMA-driven graphics acceleration shortening the render side, whilst physics, AI, big complex worlds and lots of random access on architectures that have massive CPU:memory speed ratios result in a slower game update. That means it's not practical to have an ultra-fast game update thread. If it were, you would've seen more (any?) PS2 games doing all their game logic on the IOP and leaving the rest for rendering.

pb

Submitted by lorien on Wed, 21/09/05 - 9:36 AM Permalink

A big concern for me is a high-resolution "control rate" (k-rate in CSound terms). Yes, of course the game sounds getting triggered will quantise to the game frame rate, but things like fades, envelopes, and synth parameters should often (imho almost normally) be independent of game FPS anyway. A bunch of functions get triggered by the game logic thread, and continue on their own in the audio thread.

With a control rate much below 1000 fps it's fairly easy to hear quantisation artifacts in audio parameters, hence the non-realtime softsynths don't have a control rate at all- it's the same as the sampling rate.
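
If you want to hear what control-rate quantisation sounds like, here's a little C++ toy of my own: the same one-second fade-out applied to a sine tone twice - first stepped at a 30Hz control rate (think "once per game frame"), then ramped every sample. On headphones the stepped version clicks audibly; the ramped one doesn't.
[code]
// krate.cpp - audible difference between a stepped (low control rate) gain fade and a
// per-sample ramp. Writes both versions back to back as raw float mono at 44100 Hz.
#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    const int   sampleRate = 44100;
    const int   numSamples = sampleRate;      // 1 second per version
    const float freq  = 440.0f;
    const float twoPi = 6.28318530718f;

    const int controlRate = 30;                                  // "once per game frame"
    const int samplesPerControlTick = sampleRate / controlRate;  // 1470 samples per step

    std::vector<float> out;
    out.reserve(numSamples * 2);

    // Version 1: gain only updated at the control rate -> staircase -> zipper clicks.
    float gain = 1.0f;
    for (int n = 0; n < numSamples; ++n)
    {
        if (n % samplesPerControlTick == 0)
            gain = 1.0f - float(n) / numSamples;                 // stepped fade-out
        out.push_back(gain * std::sin(twoPi * freq * n / sampleRate));
    }

    // Version 2: gain recomputed every sample -> smooth fade, no steps.
    for (int n = 0; n < numSamples; ++n)
    {
        float smooth = 1.0f - float(n) / numSamples;
        out.push_back(smooth * std::sin(twoPi * freq * n / sampleRate));
    }

    FILE* f = std::fopen("krate.raw", "wb");
    std::fwrite(out.data(), sizeof(float), out.size(), f);
    std::fclose(f);
    return 0;
}
[/code]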

No argument about CPU/memory bandwidth issues; afaik it shouldn't be so much of a problem on the PS3. In the normal computer world things are going NUMA (non-uniform memory architecture) to solve the bandwidth problem - this is how Sun's Opteron-based servers work now. NUMA means each CPU gets its own memory, and whilst every CPU is able to use every other CPU's memory, it is much faster for each to use its own, so they heavily favour it. NUMA is part of the Linux kernel right now (it's one of the things that got SCO so upset).

Submitted by Brett on Wed, 21/09/05 - 1:06 PM Permalink

The FMOD Ex sound designer tool puts all of the control into the sound designer's hands. Complex audio models can be built in it, and the programmer interaction is minimal.

The low level FMOD Ex engine is extremely advanced, and the DSP engine can do virtually anything, including realtime sound synthesis (which there are demos for). If you only take a quick look at the API you might think it looks simple to start with (actually that's our aim), but dig deeper and there is a whole range of possibilities available.

We are only halfway into our feature list, and we've built it to be very flexible, so I really think it is well on its way to providing 'next gen' audio. I'm pretty excited by it :)

Submitted by Brett on Wed, 21/09/05 - 1:31 PM Permalink

quote:Originally posted by Shplorb

I think the PS2 has a 12ms mixbuffer, but don't quote me on it. Since the PS2's sound chip (SPU2) is essentially two PlayStation sound chips (SPU) it is rather limited in what it can do, compared to the Xbox with it's DSP and AC3 encoder.

The granularity on the PS2 is blocks of 256 samples @ 48kHz, so the minimum latency is 5.33ms. We usually mix on most platforms now using a 5ms buffer, so I don't think latency is an issue except on Windows and Linux, but we find 50ms is generally acceptable and stable enough these days. Lorien has mentioned 'latency' but I think he actually means 'granularity', which is a trivial matter to solve and not always necessarily a problem that needs to be solved (i.e. a car engine pitch bending at 60Hz is perfectly smooth to the majority of ears).
pb is right about the audio syncing to the framerate, and especially with things like weapon sounds, even the graphics can take a few frames to actually kick in and display the firing graphics anyway.

As for the PS2, it is fairly low spec now, but we're really pushing it to the limit. We're hoping to allow a benchmark of 32 voices with multiple DSP filters per voice and keep the EE cpu usage under 5%. This is thanks to VU0.

Submitted by lorien on Thu, 22/09/05 - 3:35 AM Permalink

Get that marketing out there Brett (just teasing) [:)] Good to see you on sumea btw.

I'm talking about both latency and granularity; they are different, but related. One kind of latency is the lag between input and output - this is what prevents people from using Windows as a guitar effects processor, for example. There's a big lag between playing a note and hearing the result. Granularity is the control rate stuff. The two are certainly related - you can have 1 msec granularity and 1 second latency, but you can't really have 1 msec latency and 1 sec granularity...

I'm not bagging FMod at all, I truly think you guys have done great work, but I'm not going to use FMod in my work (no matter how much the API has improved) unless it goes dual-license (open source and commercial). A free binary for non-commercial use doesn't cut it with me I'm afraid. I remember from a discussion we had on the FMod forums quite a while ago that you really don't want to do this.

Submitted by Brett on Thu, 22/09/05 - 4:38 AM Permalink

quote:Originally posted by lorien

One kind of latency is the lag between input and output - this is what prevents people from using Windows as a guitar effects processor, for example.

I don't know about that - we have a 'record' example that records, processes the audio and plays it back with about a 3ms turnaround using the ASIO output driver, for realtime voice fx. That is pretty good for user-level hardware.

quote:
Free binary for non commercial use doesn't cut it with me I'm afraid. I remember from a discussion we had on the fmod forums quite a while ago that you really don't want to do this.

It doesn't really bother me, but I don't see why having the source is so necessary. Would you use memcpy if it was closed source? Most people wouldn't touch the source because they don't need to; the interface does everything that is needed.