Voice and Staff Separation in Symbolic Piano Music with GNNs


The big question is: how can we make automatic transcription models better?

To build a system that separates musical notes into voices and staves, particularly for complex piano music, we rethink the problem from a fresh perspective. Our goal is to improve the readability of music transcribed from a quantized MIDI file, which matters both for producing high-quality score engravings and for helping musicians perform from the result.

For a score to be readable, two elements are particularly important:

  • the separation of staves, which distributes the notes between the top and bottom staff;
  • and the separation of voices, highlighted in the picture below with lines of different colours.
Voice streams in a piano score

In piano scores, as mentioned above, voices are not strictly monophonic but homophonic, which means a single voice can contain one or multiple notes playing at the same time. From now on, we call these chords. You can see some examples of chords highlighted in purple in the bottom staff of the image above.

From a machine-learning perspective, we have two tasks to solve:

  • The first is staff separation, which is simple: we just need to predict a binary label for every note, i.e., top or bottom staff in the case of piano scores.
  • The voice separation task may seem similar: after all, if we could predict the voice number for every note with a multiclass classifier, the problem would be solved!

However, directly predicting voice labels is problematic. We would need to fix the maximum number of voices the system can accept, but this creates a trade-off between the system's flexibility and the class imbalance in the data.

For example, if we set the maximum number of voices to eight, to account for four in each staff as is commonly done in music notation software, we can expect only a few occurrences of labels 8 and 4 in our dataset.

Voice Separation with absolute labels

Looking specifically at the score excerpt above, voices 3, 4, and 8 are completely missing. Highly imbalanced data will degrade the performance of a multiclass classifier, and if we set a lower maximum number of voices, we would lose system flexibility.

The solution to these problems is to be able to transfer what the system learns on some voices to other voices. For this, we abandon the idea of the multiclass classifier and frame voice prediction as a link prediction problem: we want to link two notes if they are consecutive in the same voice. This has the advantage of breaking a complex problem into a set of very simple problems, where for every pair of notes we again predict a binary label telling whether the two notes are linked or not. This approach remains valid for chords, as you can see in the lower voice of the picture above.
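As a minimal sketch of such a pairwise classifier in PyTorch (the name LinkPredictor, the dimensions, and the MLP design are illustrative assumptions, not the exact implementation):

```python
import torch
import torch.nn as nn

class LinkPredictor(nn.Module):
    """Scores a pair of note embeddings: are they consecutive in the same voice?"""
    def __init__(self, emb_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # one logit per candidate pair
        )

    def forward(self, src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
        # src, dst: (num_pairs, emb_dim) embeddings of the candidate note pairs
        return self.mlp(torch.cat([src, dst], dim=-1)).squeeze(-1)

# Training then reduces to binary cross-entropy over candidate pairs, e.g.:
# loss = F.binary_cross_entropy_with_logits(predictor(src, dst), labels)
```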

This process creates a graph, which we call the output graph. To find the voices, we can simply compute the connected components of the output graph!
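Concretely, assuming the predicted links arrive as pairs of note indices, the voices fall out of networkx in a few lines:

```python
import networkx as nx

def voices_from_links(num_notes: int, predicted_links: list[tuple[int, int]]) -> list[set[int]]:
    """Each connected component of the output graph is one voice."""
    g = nx.Graph()
    g.add_nodes_from(range(num_notes))   # isolated notes become single-note voices
    g.add_edges_from(predicted_links)    # links predicted by the binary classifier
    return list(nx.connected_components(g))

# Example: links 0-1, 1-2 and 3-4 yield two voices, {0, 1, 2} and {3, 4}
print(voices_from_links(5, [(0, 1), (1, 2), (3, 4)]))
```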

To recap, we formulate the problem of voice and staff separation as two binary prediction tasks:

  • For staff separation, we predict the staff number for every note;
  • and to separate voices, we predict links between pairs of notes.

While not strictly necessary, we found it useful for the performance of our system to add an extra task:

  • Chord prediction, where, similarly to voice separation, we link a pair of notes if they belong to the same chord.

Let's recap what our system looks like so far: we have three binary classifiers, one that takes single notes as input, and two that take pairs of notes. What we need now are good input features, so our classifiers can use contextual information in their predictions. In deep-learning vocabulary, we need a good note encoder!

We choose to use a Graph Neural Network (GNN) as the note encoder, since GNNs often excel at symbolic music processing. Therefore, we need to create a graph from the musical input.

For this, we deterministically build a new graph from the quantized MIDI, which we call the input graph.

Creating these input graphs can be done easily with tools such as GraphMuse.
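We won't reproduce GraphMuse's exact API here; instead, here is a hand-rolled sketch of the deterministic construction. The edge types (onset, consecutive, during) are a common convention for score graphs and an assumption on our part:

```python
import numpy as np

def build_input_graph(onsets: np.ndarray, durations: np.ndarray):
    """Deterministic input graph over a quantized note list.

    onsets and durations are integer arrays (in ticks), one entry per note.
    Returns edges as (src, dst, edge_type) tuples. Quadratic scan for clarity;
    a real implementation would sort by onset and restrict candidate pairs.
    """
    offsets = onsets + durations
    edges = []
    for i in range(len(onsets)):
        for j in range(len(onsets)):
            if i == j:
                continue
            if onsets[i] == onsets[j]:
                edges.append((i, j, "onset"))        # the two notes start together
            elif offsets[i] == onsets[j]:
                edges.append((i, j, "consecutive"))  # j starts exactly when i ends
            elif onsets[i] < onsets[j] < offsets[i]:
                edges.append((i, j, "during"))       # j starts while i is sounding
    return edges
```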

Now, putting everything together, our model looks something like this: a GNN note encoder over the input graph, followed by the three binary prediction heads.
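A minimal sketch of that overall shape, assuming PyTorch Geometric's SAGEConv for the encoder (the layer count, dimensions, and head designs are illustrative assumptions, not the exact architecture):

```python
import torch
import torch.nn as nn
from torch_geometric.nn import SAGEConv

class VoiceStaffModel(nn.Module):
    """GNN note encoder feeding three binary heads: staff, voice links, chord links."""
    def __init__(self, in_dim: int, emb_dim: int = 64):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, emb_dim)
        self.conv2 = SAGEConv(emb_dim, emb_dim)
        self.staff_head = nn.Linear(emb_dim, 1)      # per note: top vs. bottom staff
        self.voice_head = nn.Linear(2 * emb_dim, 1)  # per pair: consecutive in same voice?
        self.chord_head = nn.Linear(2 * emb_dim, 1)  # per pair: same chord?

    def forward(self, x, edge_index, cand_pairs):
        # x: (num_notes, in_dim) note features; edge_index: input-graph edges;
        # cand_pairs: (2, num_pairs) indices of candidate note pairs
        h = torch.relu(self.conv1(x, edge_index))
        h = self.conv2(h, edge_index)
        pair = torch.cat([h[cand_pairs[0]], h[cand_pairs[1]]], dim=-1)
        return (
            self.staff_head(h).squeeze(-1),     # staff logits per note
            self.voice_head(pair).squeeze(-1),  # voice-link logits per pair
            self.chord_head(pair).squeeze(-1),  # chord-link logits per pair
        )
```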
