The Real-time Note Processor

 

Version 1.0 of AudioExplorer's NoteProcessor represents a first attempt to bridge the gap between the complexities of musical audio and the simple, discrete events of MIDI.  Each of the processes described below was developed ad hoc - which is to say: I am an audio newbie (not an expert), armed with a toolbag of mathematical tricks and some common sense.  My own impression of this version of the NoteProcessor (and the MIDI Assembler) is that it is "pretty good".  The generated MIDI is often an "impressionist" version of the audio input.  Cool, but inexact.

Note that the discussion below pertains to real-time generation of MIDI.  Batch generation of MIDI is rather simpler and cleaner, as it is possible to pre-analyze the entire audio file and determine a reasonable set of parameters.

The purpose of this document is to describe as clearly as possible the inner workings of the NoteProcessor so that you can both put it to its best use in its current form, and understand it well enough to come up with new ideas for making it better.  I will continue my own research and experimentation, but I also want your ideas.  Please, let me know what you're thinking!


Note Buffers

AudioExplorer's NoteProcessor consists of 128 note buffers - one for each MIDI note.  Each note buffer is responsible for processing frequencies ranging from a quarter step below its note's center frequency to a quarter step above.  The general purpose of the note buffer is to maintain information regarding the history and current status of signals relevant to its note.
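
As a rough illustration of how the 128 frequency ranges could be laid out (this is a sketch, not AudioExplorer's actual code), the snippet below assumes equal temperament with A4 = 440 Hz at MIDI note 69, and takes a "quarter step" to mean half a semitone.  The names note_center_hz, note_range_hz and NOTE_RANGES are made up for this example.

```python
A4_MIDI = 69    # MIDI note number of A4 (assumed reference note)
A4_HZ = 440.0   # assumed tuning reference

# Assumption: a "quarter step" is half a semitone, i.e. 1/24 of an octave.
QUARTER_STEP = 2.0 ** (1.0 / 24.0)


def note_center_hz(note):
    """Center frequency of a MIDI note in equal temperament."""
    return A4_HZ * 2.0 ** ((note - A4_MIDI) / 12.0)


def note_range_hz(note):
    """(low, high) frequency limits handled by one note buffer."""
    center = note_center_hz(note)
    return center / QUARTER_STEP, center * QUARTER_STEP


# One range per MIDI note, 0 through 127.
NOTE_RANGES = [note_range_hz(n) for n in range(128)]
```

Under these assumptions the ranges tile the frequency axis with no gaps: the upper limit of one note is the lower limit of the next.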

The following figure illustrates processing of frequency information by the NoteProcessor and its note buffers.  Notice that all signals are plotted using logarithmic scaling, so as to better reveal the changes applied at each step.

 


 

Mean Signal Calculation

After each chunk of audio data is analyzed to generate a new frequency spectrum, each of the NoteProcessor's note buffers examines the new spectrum.  Signals for all frequencies inside the note buffer's range are weighted (based on the mean geometric distance of the input frequency from the note's center frequency) and summed, and a normalized mean signal is calculated.
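
To make the idea concrete, here is a minimal sketch of a weighted, normalized mean.  The exact weighting used by AudioExplorer is not spelled out above, so the weight function here is purely an assumption: it falls off linearly with distance measured on a logarithmic (octave) scale, from 1.0 at the note's center to 0.5 at the edge of its range.  The function name and the (frequency, signal) input format are also hypothetical.

```python
import math

HALF_WIDTH_OCTAVES = 1.0 / 24.0  # assumed quarter step, expressed in octaves


def weighted_mean_signal(bins, center_hz):
    """Normalized weighted mean of the signals inside one note's range.

    bins: list of (frequency_hz, signal) pairs already known to fall
    inside the note's range.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for freq_hz, signal in bins:
        # Distance from the center on a logarithmic frequency scale.
        distance = abs(math.log2(freq_hz / center_hz))
        # Assumed weighting: 1.0 at the center, 0.5 at the range edge.
        weight = max(0.0, 1.0 - 0.5 * distance / HALF_WIDTH_OCTAVES)
        weighted_sum += weight * signal
        total_weight += weight
    if total_weight == 0.0:
        return 0.0
    return weighted_sum / total_weight
```

Dividing by the sum of the weights is what makes the result "normalized": a note whose range happens to contain more spectrum bins does not get a larger signal just for that reason.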

The note buffer also calculates a weighted mean frequency, which can then be compared to the ideal note frequency and used to generate MIDI pitch-bend events.
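
The sketch below shows one way a mean frequency could be turned into a pitch-bend value.  It is an illustration only: the signal-weighted average, the function names, and the bend range of +/- 2 semitones (a common synthesizer default) are all assumptions.  The 14-bit value range with 8192 meaning "no bend" comes from the MIDI specification.

```python
import math


def weighted_mean_frequency(bins):
    """Signal-weighted mean frequency of a list of (frequency_hz, signal) pairs."""
    total_signal = sum(signal for _, signal in bins)
    if total_signal <= 0.0:
        return None  # nothing to measure
    return sum(freq_hz * signal for freq_hz, signal in bins) / total_signal


def pitch_bend_value(mean_hz, center_hz, bend_range_semitones=2.0):
    """Map the deviation from the ideal note frequency to a 14-bit bend value."""
    # Deviation from the ideal frequency, in semitones.
    offset = 12.0 * math.log2(mean_hz / center_hz)
    value = 8192 + round(offset / bend_range_semitones * 8192)
    # Clamp to the legal 14-bit range.
    return max(0, min(16383, value))
```

For example, a mean frequency exactly on the note's center gives 8192 (no bend), and a mean frequency a quarter step sharp gives 10240 with the assumed bend range.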

 

Shoulder Merging

If you examine the frequency spectrum in the region of a particularly loud note, you will often observe that the note's signal "bleeds" into neighboring notes.  In some cases, this phenomenon may be real - as when a dense note cluster has been played, or when a particular instrumental sound is highly "detuned".  In many cases, however, this phenomenon results from limitations in the frequency resolution of the spectrum analyzer.  Shoulder Merging is an attempt to compensate for this limitation.

If Shoulder Merging is enabled, the NoteProcessor examines the mean signal of each note.  If the signal is above threshold:

  • Calculate a cutoff signal equal to the note's signal * cutoff percentage.
  • Examine the signals of the notes immediately below and above the center note.
  • If a neighboring signal is below the cutoff signal, merge that neighbor's signal into the center note's signal.
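
The steps above can be sketched roughly as follows.  This is not the actual AudioExplorer code.  I am assuming here that "merge" means adding the neighbor's signal to the center note and zeroing the neighbor, and that the cutoff percentage is given as a fraction between 0 and 1.  The function and parameter names are made up.

```python
def merge_shoulders(signals, threshold, cutoff_fraction):
    """Return a copy of the note signals with weak shoulders merged.

    signals: list of mean signals, indexed by MIDI note number.
    """
    merged = list(signals)
    for note, signal in enumerate(signals):
        if signal < threshold:
            continue  # only notes at or above threshold absorb neighbors
        cutoff = signal * cutoff_fraction
        for neighbor in (note - 1, note + 1):
            if not 0 <= neighbor < len(merged):
                continue  # no neighbor beyond note 0 or the last note
            if 0.0 < merged[neighbor] < cutoff:
                # Assumed meaning of "merge": add to the center, zero the neighbor.
                merged[note] += merged[neighbor]
                merged[neighbor] = 0.0
    return merged
```

With a threshold of 5.0 and a cutoff fraction of 0.5, the signals [1.0, 10.0, 2.0] become [0.0, 13.0, 0.0], while a neighbor of 8.0 would be left alone because it is above the cutoff of 5.0.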

Overtone Processing

Handling of overtones is one of the greatest challenges of extracting MIDI from musical audio.  The unique quality of instrumental sounds - e.g., what distinguishes a piano "C" from a violin "C" from a vocal "C" from a guitar "C" - results from the mixture of overtones produced by the instrument.  In other words, a musical note is not simply a note.  A musical note is a rich combination of some fundamental frequency plus any number of overtones which the human ear (more or less unconsciously) recombines into a distinctive instrumental sound.

Suppose that the NoteProcessor examines a frequency spectrum and finds strong signals at frequencies corresponding to C3, C4, and G4.  It is possible, and in fact not at all uncommon, that all 3 of these notes are actual played notes in the performance.  It is also quite possible that the C3 was the only note actually played, and that C4 and G4 show up as first and second overtones of C3.  Given no additional information, it is impossible for AudioExplorer to decide which of these signals correspond to played notes and which are overtones.  In fact, for some instruments (notoriously, the piano), a note's first overtone can be stronger than the note's fundamental.
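
To see where C4 and G4 come from, here is a small sketch that maps overtones onto MIDI note numbers.  It assumes ideal harmonic overtones (the k-th overtone lies at k + 1 times the fundamental frequency), equal temperament, and the numbering convention in which C3 is MIDI note 48.  The function name is hypothetical.

```python
import math


def overtone_notes(fundamental, count):
    """MIDI notes nearest to the first `count` overtones of a note."""
    notes = []
    for overtone in range(1, count + 1):
        harmonic = overtone + 1  # first overtone = 2 x fundamental frequency
        # An interval of `harmonic` times the frequency spans 12 * log2(harmonic) semitones.
        note = fundamental + round(12.0 * math.log2(harmonic))
        if note <= 127:
            notes.append(note)
    return notes
```

Under these assumptions, overtone_notes(48, 2) gives [60, 67], which is C4 and G4 when C3 is note 48.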

AudioExplorer allows you to select which (if any) overtones to examine and merge into either a) the note's fundamental; or b) the note's strongest overtone.

If one or more overtones have been selected for overtone processing, the NoteProcessor examines the mean signal of each note in ascending order.  If the signal is above threshold:

  • Calculate the overtone series for the note
  • Sum the signals for the fundamental and each selected overtone.
  • Assign the summed signal to either the fundamental note or the strongest overtone, and set all other signals in the selected series to 0.
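
Here is a minimal sketch of the steps above, again not the real implementation.  Several details are my assumptions for the sake of the example: overtones are ideal harmonics rounded to the nearest MIDI note, "strongest overtone" excludes the fundamental, and the sum stays on the fundamental when none of the selected overtones carries any signal.  The function name, the parameter names and the "fundamental"/"strongest" strings are made up.

```python
import math


def process_overtones(signals, threshold, selected_overtones, assign_to="fundamental"):
    """Merge each note's selected overtones; returns a new list of signals.

    selected_overtones: overtone indices to examine (1 = first overtone).
    assign_to: "fundamental" or "strongest" (hypothetical option names).
    """
    result = list(signals)
    for note in range(len(result)):  # ascending order
        if result[note] < threshold:
            continue
        # Calculate the overtone series for the note.
        series = [note]
        for overtone in selected_overtones:
            target = note + round(12.0 * math.log2(overtone + 1))
            if target < len(result) and target not in series:
                series.append(target)
        # Sum the signals for the fundamental and each selected overtone.
        total = sum(result[n] for n in series)
        # Pick the note that receives the sum.
        sounding = [n for n in series[1:] if result[n] > 0.0]
        if assign_to == "strongest" and sounding:
            winner = max(sounding, key=lambda n: result[n])
        else:
            winner = note
        # Zero the whole series, then assign the sum to the winner.
        for n in series:
            result[n] = 0.0
        result[winner] = total
    return result
```

With signals of 5, 3 and 2 at notes 48, 60 and 67, and the first two overtones selected, merging into the fundamental leaves 10 at note 48, while merging into the strongest overtone leaves 10 at note 60.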

Threshold

In essence, the threshold is the value above which a "signal" is considered to be "a note".  The final action of the NoteProcessor is to compare the processed mean signal of each of its note buffers to the threshold.  Signals greater than or equal to the threshold are considered to be active notes, and this information is passed on to the MIDI Assembler.  Notice however that the threshold has already been used several times above to make decisions about whether to merge a note's shoulders and to examine and process a note's overtone series.  The threshold is clearly a most important parameter in the audio-to-MIDI conversion process.
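
The final comparison itself is simple.  A sketch, with a made-up function name, assuming the processed mean signals are held in a list indexed by MIDI note number:

```python
def active_notes(note_signals, threshold):
    """MIDI note numbers whose processed signal is at or above the threshold."""
    return [note for note, signal in enumerate(note_signals) if signal >= threshold]
```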

Choosing a threshold for a particular piece of music is substantially a trial-and-error process.  AudioExplorer does provide a signal histogram (accessible from the Spectrum Window; see the figure below), which shows signal distribution for all frequencies at any given time in a musical selection.  In the music that I've examined, I've never seen a clearly identifiable boundary between the signals corresponding to played notes and the signals related to sub-audible overtones, etc.

 

Floating Threshold

A further complication in selecting a threshold has to do with musical dynamics.  A note played softly during a very soft musical phrase might be easily perceived by the human ear, but the same note played the same way during a much louder phrase would not be heard as a discrete note.  To accommodate highly dynamic music, AudioExplorer optionally implements a "floating" threshold.  When using a floating threshold, the NoteProcessor monitors the maximum signal, and adjusts the threshold to be a fixed factor below the maximum.  During moments of silence in the music, this could cause the threshold to drop to very low values, allowing undesired noise to be interpreted as "notes".  To prevent this, when the maximum note signal falls below the base threshold value, the NoteProcessor does not adjust the floating threshold.
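
The floating threshold logic might look something like the sketch below.  I am assuming here that "a fixed factor below the maximum" means multiplying the maximum by a fraction between 0 and 1, and that "does not adjust" means the previous threshold value is kept.  All names are hypothetical.

```python
def floating_threshold(note_signals, base_threshold, factor, current_threshold):
    """Return the threshold to use for the current spectrum.

    factor: assumed to be a fraction of the maximum signal (e.g. 0.25).
    current_threshold: the threshold used for the previous spectrum.
    """
    peak = max(note_signals, default=0.0)
    if peak < base_threshold:
        # Near-silence: leave the threshold where it was, so that
        # low-level noise is not promoted to "notes".
        return current_threshold
    return peak * factor
```

For example, with a base threshold of 1.0 and a factor of 0.25, a loud passage peaking at 40.0 moves the threshold to 10.0, and a following moment of silence leaves it at 10.0.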


Maximum Signal

Just as the NoteProcessor needs a threshold to tell it the signal value at which a note "starts", it also needs to know an upper limit for the signal.  The MIDI Assembler uses the threshold and the maximum signal to calculate note velocities for the note-on MIDI events.

Like the threshold, the maximum signal can be either fixed or floating.  A fixed maximum signal is exactly that - a fixed value.  MIDI note-on velocities are calculated based on a note buffer's mean signal, the current threshold, and the (fixed) maximum signal.
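
The exact velocity formula is not given here, so the following is only one plausible sketch: a linear mapping in which a signal at the threshold gets velocity 1 and a signal at (or above) the maximum gets velocity 127.  The function name and the choice of a linear scale are assumptions.

```python
def note_velocity(mean_signal, threshold, max_signal):
    """Map a note's mean signal to a MIDI note-on velocity (1..127)."""
    if max_signal <= threshold:
        return 127  # degenerate range: treat every active note as full velocity
    # Position of the signal between threshold (0.0) and maximum (1.0).
    fraction = (mean_signal - threshold) / (max_signal - threshold)
    velocity = 1 + round(fraction * 126)
    # Clamp, since the signal may exceed a fixed maximum.
    return max(1, min(127, velocity))
```

With a threshold of 1.0 and a maximum signal of 10.0, a mean signal of 5.5 lands halfway and gets velocity 64.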

A "floating" maximum signal tracks the maximum note signal observed in the current spectral input.  However, if the maximum signal tracked the observed maximum exactly, there would always be at least one MIDI note generated with the maximum note velocity (127).  Since this is not a realistic outcome, I've introduced the concept of a "change rate".  The change rate is a fractional value between 0 and 1.0 which determines how closely the NoteProcessor's maximum signal chases the maximum note signal.  A value of 1.0 causes the tracking to be exact.  A value of 0.5 means that one half of the difference between the NoteProcessor's maximum signal and the observed maximum signal is applied.  The change rate functions as a damper - small values cause the NoteProcessor's maximum signal to change more slowly in response to changes in the level of the input audio signal.
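
In code, the change rate amounts to a one-line update applied for each new spectrum.  A minimal sketch (the function and parameter names are made up):

```python
def update_max_signal(current_max, observed_max, change_rate):
    """Move the maximum signal part of the way toward the observed maximum.

    change_rate: fraction of the difference to apply, between 0 and 1.0.
    """
    return current_max + change_rate * (observed_max - current_max)
```

Starting from a maximum signal of 10.0 with an observed maximum of 20.0, a change rate of 1.0 jumps straight to 20.0, a change rate of 0.5 moves to 15.0, and a change rate of 0 stays at 10.0.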