The Real-time Note Processor
Version 1.0 of AudioExplorer's NoteProcessor represents a first attempt to
bridge the gap between the complexities of musical audio and the simple,
discrete events of MIDI. Each of the processes described below was
developed ad hoc, which is to say that I am an audio newbie (not an
expert) armed with a toolbag of mathematical tricks and some common
sense. My own impression of
this version of the NoteProcessor (and the MIDI Assembler) is that it is
"pretty good". The generated MIDI is often an
"impressionist" version of the audio input. Cool, but inexact.
Note that the discussion below pertains to real-time generation of
MIDI. Batch generation of MIDI is rather simpler and cleaner, as it is
possible to pre-analyze the entire audio file and determine a reasonable set of
parameters.
The purpose of this document is to describe as clearly as possible the inner
workings of the NoteProcessor so that you can both put it to its best use in its
current form, and understand it well enough to come up with new ideas for making
it better. I will continue my own research and experimentation, but I also
want your ideas. Please, let me know what you're
thinking!
Note Buffers
AudioExplorer's NoteProcessor consists of 128 note buffers - one for
each MIDI note. Each note buffer is responsible for processing frequencies ranging from a
quarter step below its note's center frequency to a quarter step above. The
general purpose of the note buffer is to maintain information regarding the
history and current status of signals relevant to its note.
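The frequency range covered by each note buffer can be sketched as follows. This is only an illustration, not AudioExplorer's actual code: it assumes 12-tone equal temperament with MIDI note 69 (A4) tuned to 440 Hz, neither of which is stated above, and the function names are hypothetical.

```python
A4_MIDI_NOTE = 69   # assumption: MIDI note 69 is A4
A4_HZ = 440.0       # assumption: A4 tuned to 440 Hz, equal temperament

def note_center_hz(midi_note):
    """Center frequency of a MIDI note under the assumptions above."""
    return A4_HZ * 2.0 ** ((midi_note - A4_MIDI_NOTE) / 12.0)

def note_buffer_range_hz(midi_note):
    """(low, high) edges: a quarter step below and above the center."""
    quarter_step = 2.0 ** (1.0 / 24.0)  # half of a semitone, as a frequency ratio
    center = note_center_hz(midi_note)
    return center / quarter_step, center * quarter_step

# One range per MIDI note, 0..127, mirroring the 128 note buffers.
ranges = [note_buffer_range_hz(n) for n in range(128)]
low, high = ranges[60]
print(f"note 60 covers {low:.2f} Hz to {high:.2f} Hz")
```

Under these assumptions the upper edge of one note's range coincides with the lower edge of the next, so the 128 ranges tile the frequency axis without gaps.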
The following figure illustrates processing of frequency information by the
NoteProcessor and its note buffers. Notice that all signals are plotted
using logarithmic scaling, so as to better reveal the changes applied at each
step.
| Mean Signal Calculation
After each chunk of audio data is analyzed to generate a new frequency
spectrum, each of the NoteProcessor's
note buffers examines the new spectrum. Signals for all
frequencies inside the note buffer's range are weighted (based on the mean
geometric distance of the input frequency from the note's center frequency) and
summed, and a normalized mean signal is calculated.
The note buffer
also calculates a weighted mean frequency, which can then be compared to
the ideal note frequency and used to generate MIDI pitch-bend events.
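A minimal sketch of this calculation is shown below. The exact weighting function is not given above, so the triangular weight in log-frequency used here (1 at the note's center, 0 at a quarter step away) is purely an assumption, as are the function names and the choice to average frequency in the log domain.

```python
import math

QUARTER_STEP_OCTAVES = 1.0 / 24.0  # half-width of a note buffer, in octaves

def note_buffer_means(bins, center_hz):
    """bins: iterable of (frequency_hz, signal) pairs from one spectrum.

    Returns (normalized mean signal, weighted mean frequency in Hz).
    """
    weight_sum = 0.0
    weighted_signal_sum = 0.0
    weighted_log_freq_sum = 0.0
    for freq, signal in bins:
        distance = abs(math.log2(freq / center_hz))
        if distance > QUARTER_STEP_OCTAVES:
            continue  # outside this note buffer's range
        weight = 1.0 - distance / QUARTER_STEP_OCTAVES  # hypothetical weighting
        weight_sum += weight
        weighted_signal_sum += weight * signal
        weighted_log_freq_sum += weight * signal * math.log2(freq)
    if weight_sum == 0.0 or weighted_signal_sum == 0.0:
        return 0.0, center_hz
    mean_signal = weighted_signal_sum / weight_sum
    mean_freq = 2.0 ** (weighted_log_freq_sum / weighted_signal_sum)
    return mean_signal, mean_freq

def cents_off(mean_freq, center_hz):
    """Deviation from the ideal note frequency, in cents."""
    return 1200.0 * math.log2(mean_freq / center_hz)

signal, freq = note_buffer_means([(430.0, 0.5), (440.0, 2.0), (445.0, 1.0)], 440.0)
print(f"mean signal {signal:.3f}, detuned by {cents_off(freq, 440.0):+.1f} cents")
```

The deviation in cents is the quantity a pitch-bend event would encode; how it maps to an actual MIDI pitch-bend value depends on the receiving synthesizer's bend range, which is not covered here.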
| Shoulder Merging
If you examine the frequency spectrum in the region of a particularly
loud note, you will often observe that the note's signal
"bleeds" into neighboring notes. In some cases, this
phenomenon may be real - as when a dense note cluster has been played,
or when a particular instrumental sound is highly
"detuned". In many cases, however, this phenomenon
results from limitations in the frequency resolution of the spectrum
analyzer. Shoulder Merging is an attempt to
compensate for this limitation.
If Shoulder Merging is enabled, the NoteProcessor examines the mean
signal of each note. If the signal is above threshold:
- Calculate a cutoff signal equal to the note's signal multiplied by the
cutoff percentage.
- Examine the signals of the notes immediately below and above
the center note.
- If a neighboring signal is below the cutoff signal, merge that
neighbor's signal into the center note's signal.
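The steps above can be sketched roughly as follows. Several details are assumptions rather than documented behavior: the cutoff percentage is passed as a fraction between 0 and 1, a merged neighbor is set to 0 after its signal is added to the center note, and a note that has already been absorbed is not processed again. The function name is hypothetical.

```python
def merge_shoulders(signals, threshold, cutoff_fraction):
    """Return a new list of per-note mean signals with shoulders merged."""
    merged = list(signals)
    absorbed = set()  # notes whose signal was already merged into a neighbor
    for note, signal in enumerate(signals):
        if signal < threshold or note in absorbed:
            continue
        cutoff = signal * cutoff_fraction
        for neighbor in (note - 1, note + 1):
            if not 0 <= neighbor < len(signals) or neighbor in absorbed:
                continue
            if signals[neighbor] < cutoff:
                # Assumption: merging adds the neighbor's signal to the
                # center note and clears the neighbor.
                merged[note] += merged[neighbor]
                merged[neighbor] = 0.0
                absorbed.add(neighbor)
    return merged

print(merge_shoulders([0.0, 1.0, 10.0, 1.5, 0.0], threshold=2.0, cutoff_fraction=0.2))
```

In this example the loud center note picks up both of its weaker neighbors, while a neighbor at or above the cutoff would be left untouched as a possible note in its own right.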
| Overtone Processing
Handling of overtones is one of the greatest challenges of extracting
MIDI from musical audio. The unique quality of instrumental sounds
- e.g., what distinguishes a piano "C" from a violin
"C" from a vocal "C" from a guitar "C" -
results from the mixture of overtones produced by the instrument.
In other words, a musical note is not simply a note. A musical
note is a rich combination of some fundamental frequency plus any number
of overtones which the human ear (more or less unconsciously) recombines
into a distinctive instrumental sound.
Suppose that the NoteProcessor examines a frequency spectrum and
finds strong signals at frequencies corresponding to C3, C4, and
G4. It is possible, and in fact not at all uncommon, that all 3 of
these notes are actual played notes in the performance. It is also
quite possible that the C3 was the only note actually played, and that
C4 and G4 show up as first and second overtones of C3. Given no
additional information, it is impossible for AudioExplorer to
decide which of these signals correspond to played notes and which are
overtones. In fact, for some instruments (notoriously, the
piano), a note's first overtone can be stronger than the note's
fundamental.
AudioExplorer allows you to select which (if any) overtones to
examine and merge into either a) the note's fundamental; or b) the
note's strongest overtone.
If one or more overtones have been selected for overtone processing,
the NoteProcessor examines the mean signal of each note in ascending
order. If the signal is above threshold:
- Calculate the overtone series for the note.
- Sum the signals for the fundamental and each selected
overtone.
- Assign the summed signal to either the fundamental note or the
strongest overtone, and set all other signals in the selected
series to 0.
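As an illustration of these steps, here is a rough sketch under stated assumptions. Overtone n is taken to be harmonic n + 1 of the fundamental and is mapped to the nearest MIDI note; the example numbers C3, C4, and G4 as MIDI notes 48, 60, and 67 (the convention in which middle C is C4); and "strongest overtone" is read as the strongest member of the selected series, which may be the fundamental itself. None of these details are confirmed above, and the function names are hypothetical.

```python
import math

def overtone_note(fundamental, overtone_index):
    """MIDI note nearest the given overtone (overtone 1 = 2nd harmonic)."""
    return fundamental + round(12 * math.log2(overtone_index + 1))

def merge_overtones(signals, threshold, selected_overtones, to_strongest=False):
    """Return a new list of per-note signals with overtone series merged."""
    merged = list(signals)
    for note in range(len(merged)):  # ascending order
        if merged[note] < threshold:
            continue
        # Build the series: the fundamental plus each selected overtone
        # that still falls inside the MIDI note range.
        series = {note}
        for index in selected_overtones:
            candidate = overtone_note(note, index)
            if candidate < len(merged):
                series.add(candidate)
        series = sorted(series)
        total = sum(merged[n] for n in series)
        # Assumption: the "strongest" target is the strongest series member.
        target = max(series, key=lambda n: merged[n]) if to_strongest else note
        for n in series:
            merged[n] = 0.0
        merged[target] = total
    return merged

signals = [0.0] * 128
signals[48], signals[60], signals[67] = 5.0, 3.0, 2.0  # C3, C4, G4
result = merge_overtones(signals, threshold=1.0, selected_overtones=[1, 2])
print(result[48], result[60], result[67])
```

With the first and second overtones selected, the C4 and G4 signals are folded into C3. This is exactly the trade-off described above: if C4 and G4 were really played, they are lost.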
| Threshold
In essence, the threshold is the value above which a
"signal" is considered to be "a note". The
final action of the NoteProcessor is to compare the processed mean
signal of each of its note buffers to the threshold. Signals
greater than or equal to the threshold are considered to be active
notes, and this information is passed on to the MIDI Assembler.
Notice, however, that the threshold has already been used several times
above to decide whether to merge a note's shoulders and whether to
examine and process a note's overtone series. The threshold is
clearly a most important parameter in the audio-to-MIDI conversion
process. Choosing a threshold for a particular piece of music is
substantially a trial-and-error process. AudioExplorer does
provide a signal histogram (accessible from the Spectrum Window; see the
figure below), which shows signal distribution for all frequencies at
any given time in a musical selection. In the music that I've
examined, I've never seen a clearly identifiable boundary between the signals
corresponding to played notes and the signals related to sub-audible
overtones, etc.
| Floating Threshold
A further complication in selecting a
threshold has to do with musical dynamics. A note played softly
during a very soft musical phrase might be easily perceived by the human
ear, but the same note played the same way during a much louder phrase
would not be heard as a discrete note. To accommodate highly
dynamic music, AudioExplorer optionally implements a
"floating" threshold. When using a floating threshold,
the NoteProcessor monitors the maximum signal, and adjusts the threshold
to be a fixed factor below the maximum. During moments of silence
in the music, this could cause the threshold to drop to very low values,
allowing undesired noise to be interpreted as
"notes". To prevent this, when the maximum note signal
falls below the base threshold value, the NoteProcessor does not adjust
the floating threshold.
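A floating threshold along these lines might look like the sketch below. It assumes that "a fixed factor below the maximum" means multiplying the maximum note signal by a ratio between 0 and 1, and that "does not adjust" means the previous threshold value is kept; both are my reading, and the names are hypothetical.

```python
def floating_threshold(note_signals, base_threshold, ratio, previous_threshold):
    """Return the threshold to use for the current spectrum."""
    peak = max(note_signals, default=0.0)
    if peak < base_threshold:
        # Quiet passage: leave the threshold where it was, so that noise
        # is not promoted to "notes".
        return previous_threshold
    return peak * ratio  # assumption: "fixed factor below" is a simple ratio

threshold = 1.0  # start at the base threshold
for frame in ([0.2, 8.0, 1.0], [0.1, 0.3, 0.2], [0.5, 4.0, 0.9]):
    threshold = floating_threshold(frame, base_threshold=1.0, ratio=0.25,
                                   previous_threshold=threshold)
    print(threshold)
```

In the second frame the loudest note is below the base threshold, so the value computed for the first frame is carried over unchanged.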
| Maximum Signal
Just as the NoteProcessor needs a threshold to tell it the signal
value at which a note "starts", it also needs to know an upper
limit for the signal. The MIDI Assembler uses the threshold and
the maximum signal to calculate note velocities for the note-on MIDI
events.
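The mapping from signal to velocity is not spelled out here, so the following is only one plausible sketch: a linear interpolation between the threshold (velocity 1) and the maximum signal (velocity 127). The linear shape, the lower bound of 1, and the function name are all assumptions. A lower bound of 1 is chosen because a MIDI note-on with velocity 0 is conventionally treated as a note-off.

```python
def note_on_velocity(signal, threshold, max_signal):
    """Map a note buffer's mean signal to a MIDI velocity (1..127).

    Returns None when the signal is below the threshold (no note).
    """
    if signal < threshold:
        return None
    if max_signal <= threshold:
        return 127  # degenerate range: treat any active note as full velocity
    fraction = min((signal - threshold) / (max_signal - threshold), 1.0)
    return 1 + round(fraction * 126)  # hypothetical linear mapping

for s in (0.5, 1.0, 5.5, 10.0, 20.0):
    print(s, note_on_velocity(s, threshold=1.0, max_signal=10.0))
```

Signals above the maximum are clamped to 127, which is why the choice of maximum signal matters as much as the choice of threshold.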
Like the threshold, the maximum signal can be either fixed or
floating. A fixed maximum signal is exactly that - a fixed
value. MIDI note-on velocities are calculated based on a note
buffer's mean signal, the current threshold, and the (fixed) maximum
signal. A "floating" maximum signal tracks the maximum note
signal observed in the current spectral input. However, if the
maximum signal tracked the observed maximum exactly, there would always
be at least one MIDI note generated with the maximum note velocity
(127). Since this is not a realistic outcome, I've
introduced the concept of a "change rate". The change rate is a fractional value between 0 and 1.0 which
determines how closely the NoteProcessor's maximum signal chases the
maximum note signal. A value of 1.0 causes the tracking to be
exact. A value of 0.5 means that, at each update, the NoteProcessor's
maximum signal moves one half of the way toward the observed maximum
signal. The change rate functions as a
damper - small values cause the NoteProcessor's maximum signal to change
more slowly in response to changes in the level of the input audio
signal.
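The change rate described above amounts to a simple damped update, sketched here with a hypothetical function name. It assumes the update is applied once per incoming spectrum.

```python
def update_max_signal(current_max, observed_max, change_rate):
    """Move the tracked maximum toward the observed maximum.

    change_rate = 1.0 tracks exactly; smaller values respond more slowly.
    """
    if not 0.0 <= change_rate <= 1.0:
        raise ValueError("change_rate must be between 0 and 1.0")
    return current_max + change_rate * (observed_max - current_max)

tracked = 2.0
for observed in (10.0, 10.0, 10.0):
    tracked = update_max_signal(tracked, observed, change_rate=0.5)
    print(tracked)
```

With a change rate of 0.5, the tracked maximum closes half of the remaining gap on each update, so a sudden loud passage raises it gradually instead of pinning a note to velocity 127 right away.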