This paper presents a novel approach to neural instrument sound synthesis using a two-stage semi-supervised learning framework capable of generating pitch-accurate, high-quality music samples from an expressive timbre latent space. Existing approaches that achieve sufficient quality for music production often rely on high-dimensional latent representations that are difficult to navigate and lead to unintuitive user experiences. We address this limitation with a two-stage training paradigm: first, we train a Variational Autoencoder to learn a 2D representation of audio samples in which pitch and timbre are disentangled; second, we use this representation as conditioning input for a Transformer-based generative model. The learned 2D latent space serves as an intuitive interface for navigating and exploring the sound landscape. We demonstrate that the proposed method effectively learns a disentangled timbre space, enabling expressive and controllable audio generation with reliable pitch conditioning. Experimental results show the model's ability to capture subtle variations in timbre while maintaining a high degree of pitch accuracy. The usability of our method is demonstrated in this interactive web application, highlighting its potential as a step towards future music production environments that are both intuitive and creatively empowering.
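To make the two-stage design concrete, below is a minimal sketch in PyTorch. All module names, layer sizes, and the assumption of frame-wise spectral input are our own illustrative choices, not the authors' architecture; the sketch only shows how a 2D VAE bottleneck can feed a pitch-conditioned Transformer.

import torch
import torch.nn as nn

class TimbreVAE(nn.Module):
    """Stage 1: compress a sample into a 2D timbre latent (illustrative)."""
    def __init__(self, n_feats=128, z_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_feats, 64), nn.ReLU())
        self.to_mu = nn.Linear(64, z_dim)      # mean of q(z|x)
        self.to_logvar = nn.Linear(64, z_dim)  # log-variance of q(z|x)
        self.decoder = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(),
                                     nn.Linear(64, n_feats))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.decoder(z), mu, logvar

class ConditionedGenerator(nn.Module):
    """Stage 2: a Transformer conditioned on MIDI pitch and the 2D latent."""
    def __init__(self, d_model=64, n_pitches=128, z_dim=2):
        super().__init__()
        self.pitch_emb = nn.Embedding(n_pitches, d_model)  # pitch conditioning
        self.z_proj = nn.Linear(z_dim, d_model)            # timbre conditioning
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, tokens, pitch, z):
        # Both conditioning signals enter as a two-token memory sequence.
        memory = torch.stack([self.pitch_emb(pitch), self.z_proj(z)], dim=1)
        return self.decoder(tokens, memory)

At generation time, a point picked on the 2D plane would supply z, while the slider or keyboard would supply pitch.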
Our interactive interface lets you generate musical samples by selecting points on a 2D plane. To choose the pitch of the generated sample, you have two options: (1) use the slider to select the desired note, or (2) play your computer keyboard, which is mapped to a MIDI piano starting with note C3 on the 'a' key; the 'q' key toggles between two octaves. A fraction of the training data is displayed for orientation. Click a location on the plane to play the corresponding sample (speakers or headphones recommended).
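For readers implementing a similar interface, here is a small illustrative sketch of the keyboard mapping in Python. Only 'a' = C3 and the 'q' octave toggle are stated above; the piano-style key row and the MIDI number used for C3 (48, in the convention where C4 = 60) are assumptions.

from typing import Optional

C3 = 48  # assumed MIDI number for C3 (convention where C4 = 60)
KEY_ROW = "awsedftgyhujk"  # assumed piano-style layout: semitones 0..12 above C3

octave_shift = 0  # flipped between 0 and 1 by the 'q' key

def key_to_midi(key: str) -> Optional[int]:
    """Return the MIDI pitch for a pressed key, or None if unmapped."""
    global octave_shift
    if key == "q":                 # toggle between the two octaves
        octave_shift = 1 - octave_shift
        return None
    if key in KEY_ROW:
        return C3 + KEY_ROW.index(key) + 12 * octave_shift
    return None

assert key_to_midi("a") == 48      # 'a' plays C3, as described above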
This schematic depicts the two-stage training procedure of our model. In the first stage, a VAE with a 2D latent bottleneck is trained. In the second stage, the Transformer is trained, using the pre-trained VAE as a conditioning model.
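The figure's two stages could be realized with a training schedule like the following sketch, reusing the illustrative TimbreVAE and ConditionedGenerator modules from the sketch above; the specific losses and the choice to freeze the VAE in stage two are assumptions, not the authors' exact procedure.

import torch
import torch.nn.functional as F

def stage_one_step(vae, x, beta=1.0):
    # Stage 1: standard VAE objective (reconstruction + beta-weighted KL).
    x_hat, mu, logvar = vae(x)
    recon = F.mse_loss(x_hat, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

def stage_two_step(vae, gen, tokens, pitch, x, targets):
    # Stage 2: the trained VAE is held fixed (assumed) and only supplies
    # the 2D timbre latent that conditions the Transformer.
    with torch.no_grad():
        _, mu, _ = vae(x)          # posterior mean as the timbre coordinate
    pred = gen(tokens, pitch, mu)  # pitch + latent condition the generator
    return F.mse_loss(pred, targets)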
@inproceedings{DAFx25_paper_58,
  author    = "Limberg, Christian and Schulz, Fares and Zhang, Zhe and Weinzierl, Stefan",
  title     = "{Pitch-Conditioned Instrument Sound Synthesis From an Interactive Timbre Latent Space}",
  booktitle = "Proceedings of the 28-th Int. Conf. on Digital Audio Effects (DAFx25)",
  editor    = "Gabrielli, L. and Cecchi, S.",
  location  = "Ancona, Italy",
  eventdate = "2025-09-02/2025-09-05",
  year      = "2025",
  month     = "Sept",
  publisher = "",
  issn      = "2413-6689",
  doi       = "",
  pages     = ""
}