Kristin and I were talking about the BP oil spill this weekend, and I had some thoughts:
- In the Seattle area, much of the business nowadays is high tech. The actual work is done on a computer. Theoretically, there is no reason this couldn’t be done on a laptop at home.
- However, most people still commute to work five days a week. At many jobs, at least half the day is spent in meetings.
- If people were able to attend at least some of these meetings remotely, this would greatly cut down on the amount of commuting that is necessary. In addition, a lot of business trips could be avoided. This would greatly reduce the amount of fossil fuels used around here.
- Yet people hate attending meetings remotely. The reason that is always cited: It is usually difficult to impossible to understand what is being said by the people in the conference room.
Let’s look at a typical conference room (and if you haven’t experienced this situation in your business life, consider yourself lucky). There is a large table, made of a hard material like wood or Formica. There is a single conference phone in the middle of the table, which acts as both speaker and microphone. The conference phone is connected to the outside world via phone lines, which have a 4 kHz bandwidth, mono. The conference phone usually shares the table with a number of laptops, each with its own fan, and is often located close to a projector with a fan of its own. The room usually has at least one wall that is highly reflective to sound, due to a whiteboard being there. In addition, it is common to have one wall be made of glass windows to the outside world (or to elsewhere in the office). Glass reflects sound to a greater degree than the typical drywall found in offices.
The resulting sound is HORRIBLE, especially when you dial into the meeting. The room is reverberant, the table is reverberant, and the fans from the laptops and projector add a noise floor that the reverberant voices need to be heard above. People who are not right next to the conference phone have their voices swallowed by the early reflections. Compressing all of this information into a mono 4 kHz stream makes it difficult to differentiate one voice from another – for the people in the room, the spatial information from both ears allows them to discriminate between sound sources.
To add insult to injury, the codecs used by the phone companies for voice transmission assume that the sound they are transmitting comes from a single human voice. Instead of transmitting a downsampled version of the voice, the codec separates the voice into a set of filter coefficients (representing the resonances of the vocal tract) and a “residual” signal that excites that filter at a given pitch (representing the glottal pulses from the larynx). Trying to transmit several voices speaking at the same time through this model would be like asking a single person to speak in several voices at once, with reverb added to each of them. The result is a burbling, blippy mess that anyone who owns a cell phone would instantly recognize.
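To make that model concrete, here is a minimal sketch of the analysis half of such a codec. All the numbers (sample rate, pole radius, pitch) are assumptions for illustration, not anything a real phone codec uses: a synthetic one-resonance “voice” is built, the “filter coefficients” are estimated from it by linear prediction, and inverse filtering recovers the “residual.”

```python
import numpy as np

def lpc(signal, order):
    """Autocorrelation-method LPC: solve the normal equations for the
    predictor coefficients (the "resonances of the vocal tract")."""
    r = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def inverse_filter(signal, a):
    """Remove the predicted (resonant) part, leaving the excitation residual."""
    order = len(a)
    res = signal.copy()
    for n in range(order, len(signal)):
        res[n] -= a @ signal[n - order:n][::-1]
    return res

# Model a "vocal tract" as a single resonance: an AR(2) filter with
# poles at radius 0.95 near 500 Hz (fs = 8000 Hz) -- assumed values.
fs, f0, radius = 8000, 500.0, 0.95
a_true = np.array([2 * radius * np.cos(2 * np.pi * f0 / fs), -radius**2])

# Excite it with a sparse "glottal" impulse train at a 100 Hz pitch.
excitation = np.zeros(4000)
excitation[::80] = 1.0
voice = np.zeros_like(excitation)
for n in range(len(voice)):
    voice[n] = excitation[n]
    if n >= 1:
        voice[n] += a_true[0] * voice[n - 1]
    if n >= 2:
        voice[n] += a_true[1] * voice[n - 2]

a_est = lpc(voice, order=2)          # what the codec sends as filter coefficients
res = inverse_filter(voice, a_est)   # what it sends as the residual
print("true:", np.round(a_true, 3), " estimated:", np.round(a_est, 3))
```

With a single voice, the estimated coefficients land very close to the true resonance, and the residual collapses back to something impulse-like – exactly the compact description the codec is counting on. Feed the same machinery a sum of voices plus room reverb and no low-order filter fits; the residual carries most of the signal, which is the burbling failure mode described above.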
What is the point of my rant? I believe that if corporations prioritized high-quality audio reproduction and transmission for meetings, then the number of meetings that people have to drive and/or fly to could be greatly reduced. The primary mode of real-time communication between people is speech. The current phone networks are designed around the assumption that speech is being transmitted between two people, and the bandwidth and codecs used for this transmission are inadequate for communication between groups of people, where many of those people share the same acoustic space.
In order to allow conference calls to be tolerable, the issue of reverberation within the conference room needs to be dealt with. There are a variety of solutions that would improve the situation:
- Conference rooms could be acoustically treated. Adding sound-absorbing panels can cut down reflections from the walls.
- More, and higher quality, microphones could be used. By placing microphones closer to the people speaking, the ratio of direct to reverberant sound can be increased.
- People can use individual wireless headsets. This would put the microphone for each person at an ideal location (i.e. right in front of their mouth), largely eliminating the influence of room reverberation. In addition, this would eliminate the need for a speaker in the room, which gets around the feedback issues that can happen with speakerphones.
- Instead of using phone lines, the audio codec could be built into software that allows audio and video conferencing. This allows the codec to be optimized for the transmission of several simultaneous voices, without being squeezed into a 4 kHz bandwidth. A higher-bandwidth stereo codec, such as MP3 or AAC, would allow easy localization and discrimination of voices within the typical conference room environment.
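To put a rough number on the microphone-placement point, here is a back-of-the-envelope sketch using the standard statistical-acoustics formulas: direct level falls off as 1/r², while the reverberant level is set by the room constant R = Sα/(1−α). The room dimensions, absorption coefficients, and talker directivity below are assumptions for illustration:

```python
import math

def room_constant(surface_area, alpha):
    """Room constant R = S * alpha / (1 - alpha), assuming a diffuse field."""
    return surface_area * alpha / (1 - alpha)

def drr_db(r, Q, R):
    """Direct-to-reverberant ratio in dB: direct Q/(4 pi r^2) vs reverberant 4/R."""
    return 10 * math.log10(Q * R / (16 * math.pi * r**2))

# Hypothetical 6 m x 4 m x 3 m conference room
S = 2 * (6 * 4 + 6 * 3 + 4 * 3)   # total surface area, m^2
Q = 2.0                           # rough directivity factor for a talker

for alpha in (0.15, 0.40):        # bare drywall/glass vs added absorption panels
    R = room_constant(S, alpha)
    for r in (0.3, 1.0, 2.5):     # headset-distance mic, nearby mic, far end of table
        print(f"alpha={alpha:.2f}  r={r:.1f} m  DRR={drr_db(r, Q, R):+5.1f} dB")
```

Because the direct sound falls as 1/r², halving the mic distance buys 6 dB of direct-to-reverberant ratio, which is why a headset mic a few centimeters from the mouth beats any amount of room treatment – though treatment helps everyone in the room at once.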
None of the technology described above is new or radical. The big change in the corporate world would be the prioritization of audio quality, as a way of making online meetings a viable alternative to in-person meetings. If corporations were encouraged to switch over to online meetings, as a way of reducing fossil fuel consumption, the need for higher quality acoustics and codecs would become immediately apparent. Better acoustics = better world.
point 1. Yes, most rooms suck. Treatment would also help with bleed from other rooms. I can’t say that I’ve ever seen a treated room, and I doubt most corporate facilities would spring for it.
point 2. Without question. I am surprised, given how expensive most business phones are, that the speakers/mics are not better. Polycom is the best of the bunch, which is why they dominate conference rooms via the ‘star phone’.
point 3. Yeah, not gonna happen, for the simple reason that it would look dorky.
On your last point, the move to SIP is already helping introduce better codecs in the form of G.722 (and variants). In particular, the Polycom HD series (which makes use of their G.722.1 implementation) sounds fantastic.
On the flip side, there is G.729, which is favored by many VoIP providers due to its low bandwidth consumption. In theory it is equivalent to G.711 in quality, but my experience is that in implementation that doesn’t hold true (think an MP3 version of G.711).
None of the codecs are stereo, though I wonder if that would be valuable, since most people use a single earpiece.
In many companies, the push to video has raised audio quality standards. When companies install video, they generally take the time to also install a better mic and speaker system. The better implementations tend to be hung from the ceiling rather than placed on the tabletop. Tabletop implementations suffer from additional noises such as laptop typing and vibrating cellphones.
To the main point, why con calls suck. The main problem I’ve seen comes down to conversation dynamics. There are two main scenarios: a) the majority in the room with a few remote, and b) all remote.
When only a few people are remote, the conversation is dominated by the room dynamics. The secondary visual cues offered by physical presence move the conversation along and people jump in and out of the conversation more freely. The remote conference members have a harder time entering and leaving the flow. It is hard on the people in the room as well.
When nearly everyone is dialing into a conference call, the dynamic is much different. The conversations tend to develop more structure. Conference calls benefit from a facilitator who keeps things flowing and makes sure everyone is engaged. In companies that have this culture, conference calls tend to be much more productive, as they develop a new set of conversational and behavioral expectations.
The other problem with remote calls is that not being in the room makes it easier not to pay attention during a meeting. You can read your email, surf the web, grab a sandwich, etc., and before you know it you have missed important details.
Interesting points, bp. To address a few of your comments:
– Yeah, headsets are dorky, but it would allow stereo placement of sound sources. If we could get people to bring back the 1991 Madonna tour look, we’d be set.
– I agree about conference calls where everyone dials in. If you combine these with a desktop-sharing program like GoToMeeting, you have the visual stimuli as well as the audio.
– Visual stimuli would also help with paying attention. I don’t think that having a camera on every participant is necessary, but having shared visuals on everyone’s screens makes it easier to focus on the topic at hand.
Thanks for listing the codecs – I haven’t kept up on specific ones. I will research g.722 and g.722.1 tonight. It would be interesting to test various codecs for comprehensibility in reverberant environments. My guess is that the LPC-style codecs would fare poorly, compared with other transforms. I wouldn’t be surprised if these tests have already been undertaken – telecommunications is not my area of expertise. I know how to reverberate, but de-reverberation is another matter entirely.
You may be even more right than you know! I am actually planning to research this topic in the near future and spread the truth of what you are saying. I participate in a lot of teleseminars and podcasts, and I see some people make the misdirected EXTRA effort and expense to add reverb to their voices, usually making the voice lose most of the appeal it had with no reverb. Others use the dreaded speakerphone when they are presenting. After years of spending lots of time in recording studios, I know exactly what I am hearing, and sometimes I just want to scream: turn DOWN that reverb! If you or your wife have an interest in pursuing this topic, feel welcome to get in touch with me.
I’ll see your average conference room and raise you the following: a typical mall-style ice cream store, with glass and tile in every direction … and a grand piano for customers to play while waiting. Luckily I didn’t work there!