Article – From the horse’s mouth
Microsoft has published research on combining audio recordings of a meeting, captured by different recording devices, into a single higher-quality recording that covers the whole event. It can be seen as the audio equivalent of experiments and projects that aggregate multiple camera views of the same object, or as a way to create a “Claytons microphone array” out of multiple recording devices, each with its own microphones.
The technique creates an audio fingerprint of each recording, in a similar vein to what Shazam and its allies do to “name that song”. Here, though, the fingerprints are matched against each other to time-align the recordings and identify the material they captured in common, allowing for the fact that each person may have started or stopped their recording device earlier or later than the others.
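The article doesn’t detail Microsoft’s fingerprint-matching algorithm, but the underlying time-alignment step can be sketched with plain cross-correlation of the two waveforms. This is a minimal illustration only: the function name is made up, and real systems would correlate compact fingerprints rather than raw samples to keep the comparison fast and robust.

```python
import numpy as np

def estimate_offset(a, b, sample_rate):
    """Estimate where the content of recording `a` appears inside
    recording `b`, in seconds (positive means the shared material
    turns up later in b's file, i.e. device b was started earlier
    relative to the event).

    Illustrative sketch: uses FFT-based cross-correlation of raw
    samples, not the fingerprint matching the research describes."""
    size = 1 << (len(a) + len(b)).bit_length()   # zero-pad to a power of two
    fa = np.fft.rfft(a, size)
    fb = np.fft.rfft(b, size)
    # Correlation theorem: peak of irfft(FB * conj(FA)) marks the best lag.
    corr = np.fft.irfft(fb * np.conj(fa), size)
    shift = int(np.argmax(corr))
    if shift > size // 2:        # wrap-around corresponds to a negative offset
        shift -= size
    return shift / sample_rate

# Toy check: recording b holds the same material delayed by 0.5 s at 1 kHz.
rate = 1000
rng = np.random.default_rng(0)
a = rng.standard_normal(3 * rate)
b = np.concatenate((np.zeros(rate // 2), a))[:3 * rate]
print(estimate_offset(a, b, rate))   # → 0.5
```

Once each pair of recordings is offset-matched like this, the overlapping portions can be lined up sample-for-sample before any blending takes place.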
The equipment assumed in this context is standalone file-based digital notetaker recorders, or the audio-recording function built into many a smartphone or tablet, typically by way of an app. These devices usually record the same event through their integrated microphones, apply automatic gain control and, in some cases, pick up their “own” background noise.
But you could extend this concept to audio recordings made on legacy media like audio tape using standalone devices, or to the soundtracks of video recordings of the same event that are subsequently “dubbed” to audio files for use in the process. A good example would be someone who uses a “shoebox” or handheld cassette recorder to make a reliable recording of the meeting with equipment they are familiar with, or someone videoing the meeting with that trusty old camcorder.
Further research is planned on this topic to cater for recorded music, such as when the same concert performance or religious service is recorded by two or more people with equipment of differing capabilities.
A good question the research raises is how to “time-align”, or synchronise, a combination of audio and video recordings of the same event made at the same time with equipment of different recording capabilities. This would be done without the need to record synchronisation data on each device during production, and would allow for the equipment commonly used by consumers, hobbyists / prosumers and small organisations.
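Once each recording’s offset against a common reference moment has been estimated (by fingerprint matching, cross-correlation or otherwise), lining the files up comes down to trimming each one to the stretch of the event they all captured. The helper below is a hypothetical sketch of that trimming step, not anything described in the research:

```python
import numpy as np

def common_overlap(tracks, offsets, sample_rate):
    """Trim each track so they all cover the same stretch of the event,
    sample for sample. offsets[i] gives where a shared reference moment
    falls within track i, in seconds, as estimated beforehand.

    Hypothetical helper for illustration; assumes all tracks share one
    sample rate and that the reference moment lies inside every file."""
    starts = [int(round(o * sample_rate)) for o in offsets]
    # Usable span runs from the reference moment to the earliest stop.
    length = min(len(t) - s for t, s in zip(tracks, starts))
    return [t[s:s + length] for t, s in zip(tracks, starts)]

# Toy check: two devices capture the same 2 s event; one was started
# 0.3 s early, the other 0.1 s late. The reference moment sits 0.5 s
# into the event, i.e. at 0.8 s and 0.4 s in the respective files.
rate = 1000
event = np.random.default_rng(1).standard_normal(2 * rate)
early = np.concatenate((np.zeros(300), event))
late = event[100:]
aligned = common_overlap([early, late], [0.8, 0.4], rate)
print(np.array_equal(aligned[0], aligned[1]))   # → True
```

In practice the trimmed tracks would then be blended, or the best-sounding one chosen for each passage, rather than simply compared.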
The reality that can surface is that one person records the event using top-shelf gear yielding excellent audio, while others film from different angles using camcorders, digital cameras and smartphones that record not-so-good sound thanks to automatic gain control and average integrated microphones. Meanwhile, the good digital cameras and camcorders still put their excellent optics and sensors to work capturing good-quality vision.
Once this is worked out, it could allow a small-time video producer, or a business’s or church’s in-house video team, to move towards “big-time” quality: top-shelf audio gear to catch the sound, and one or two camcorders operated by different people to create “TV-studio-grade” multi-camera video.
Who knows whether the idea of post-production audio synchronising and “blending” will catch on for both conference recordings and small-time video production.