
[Repost] Shared review comments: Modeling Temporal Dependencies in High

 

Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription

by Nicolas Boulanger-Lewandowski, Yoshua Bengio, Pascal Vincent at ICML 2012
We investigate the problem of modeling symbolic sequences of polyphonic music in a completely general piano-roll representation. We introduce a probabilistic model based on distribution estimators conditioned on a recurrent neural network that is able to discover temporal dependencies in high-dimensional sequences. Our approach outperforms many traditional models of polyphonic music on a variety of realistic datasets. We show how our musical language model can serve as a symbolic prior to improve the accuracy of polyphonic transcription.

Review comments

Posted on behalf of anonymous ICML reviewer.

Summary:
The paper's main strengths are the generality of the proposed methods, their clear improvement of the state of the art, the clarity of the presentation, and the thoroughness of the experimental results.

Perhaps the greatest weakness is the degree of novelty: the only new model is the RNN-RBM, which is an incremental change from an RTRBM. However, some of the model combinations (such as RNN-NADE) appear to be novel, as is the application to music.
--------------------------------------------------------
Detailed Comments:
The paper addresses the problem of modelling polyphonic musical notes. More specifically, it aims to learn a predictive distribution over musical notes at the next timestep, given those at previous timesteps. As well as being interesting in its own right, the task is compelling because it combines (relatively) high-dimensional, unconstrained density modelling with long-range sequence modelling. Unlike most existing approaches (which use domain-specific representations to reduce the output space), the proposed system allows any note to be emitted at any time, which suggests that it will transfer well to other kinds of high-dimensional sequence.
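To make the "any note at any time" representation concrete, here is a minimal sketch of a binary piano roll in plain Python. The function names, the `(onset, offset, pitch)` note format, and the 88-key MIDI range 21 to 108 are illustrative assumptions, not details taken from the paper or the review.

```python
def to_piano_roll(notes, n_steps, low=21, high=108):
    """Build a binary piano roll: roll[t][p] == 1 iff pitch (low + p) sounds at step t.

    `notes` is a list of (onset, offset, pitch) triples with offset exclusive.
    The tuple format and the pitch range are assumptions made for this sketch.
    """
    n_pitches = high - low + 1
    roll = [[0] * n_pitches for _ in range(n_steps)]
    for onset, offset, pitch in notes:
        if not low <= pitch <= high:
            continue  # ignore pitches outside the assumed range
        for t in range(max(onset, 0), min(offset, n_steps)):
            roll[t][pitch - low] = 1
    return roll


def active_pitches(frame, low=21):
    """Return the pitches that are switched on in one time frame."""
    return [low + p for p, on in enumerate(frame) if on]


# Hypothetical example: a three-note chord for two steps, then a single note.
notes = [(0, 2, 60), (0, 2, 64), (0, 2, 67), (2, 4, 62)]
roll = to_piano_roll(notes, n_steps=4)
```

Each frame is an unconstrained binary vector, so a chord is simply several ones in the same row. A sequence model of the kind the review describes would predict the distribution of row `t` given rows `0` to `t - 1`.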

The paper also demonstrates that the predictive model can be used to improve the performance of polyphonic transcription from audio data, the musical equivalent of combining a language model with an audio model in speech recognition. The combination is performed in a somewhat ad hoc manner, and requires the audio data to be preprocessed. However, it clearly outperforms the baseline HMM smoothing model.

The system is based on various flavours of recurrent neural network. This is attractive for several reasons: RNNs place no restrictions on the amount of previous context used; they can be easily adapted to different kinds of sequential data; and they allow the density modelling and the sequence modelling to be jointly optimised. As well as evaluating several RNN variants found in the literature, the paper introduces a new RNN: the RNN-RBM. Although this is only a slight modification of an existing architecture (the RTRBM), the change is clearly motivated and its advantages are demonstrated by the experiments.

The experimental results in Section 6 are exceptionally thorough, comparing the six proposed variants of RNN models with seven common baselines across four different datasets. This makes it possible to tease out the relative pros and cons of the different architectures. I also liked the comparison with the non-temporal frame-level models, which clearly shows how much benefit the sequential models bring. The best proposed systems outperform the baselines on all the datasets, often by a large margin.

The paper is very clearly written, well organized, and commendably concise, given the large amount of material covered.

Suggestions:

Abstract: 
'in a completely general representation' doesn't say much. Either remove this or reword it to be more specific.
', that is appropriate to discover temporal dependencies in such high-dimensional sequences with the help of Hessian-free optimization and pretraining techniques.' -> 'that is able to discover temporal dependencies in high-dimensional sequences.'

Introduction:
'designed for the multiple classification task': Does this mean 1-of-K classification? Please clarify.

Section 2:
Line 217: If possible, put the exact gradient into the paper to keep it self-contained.

Section 4.1
Is the cross entropy cost in equation (12) the objective function used for the RNN and RNN (HF) models in Section 6? Please clarify.
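For readers without the paper to hand, the sketch below shows what a frame-level cross-entropy cost over a piano roll usually looks like. It assumes the common form (a sum of per-note binary cross-entropies, averaged over time steps); the review does not reproduce equation (12), so this may differ from the paper's exact definition, and the function names are hypothetical.

```python
import math


def frame_cross_entropy(targets, probs, eps=1e-12):
    """Binary cross-entropy of one frame, summed over notes.

    targets: 0/1 values, one per note.
    probs:   predicted probability that each note is on.
    Assumes notes are conditionally independent given the past, which is
    an assumption of this sketch rather than a statement about the paper.
    """
    total = 0.0
    for v, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= v * math.log(p) + (1 - v) * math.log(1.0 - p)
    return total


def sequence_cross_entropy(target_frames, prob_frames):
    """Average the per-frame cost over all time steps of a sequence."""
    costs = [frame_cross_entropy(v, p) for v, p in zip(target_frames, prob_frames)]
    return sum(costs) / len(costs)
```

With every prediction at 0.5, each note contributes log 2 to the cost, which gives a quick sanity check when comparing numbers across models.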

Section 5
What does Figure 3 add besides an obligatory receptive field picture? I think the space could be better used.

Section 6:
Line 560: What does 'transposed in a common tonality' mean?
Lines 592-598: This sentence isn't clear to me. Which datasets are complex and which are simple, and why?
Line 629: There's an additional n in 'additionnal'.
Line 632: Why do you say RNN-NADE is more robust? RNN-RBM has the best log-loss on 2/4 datasets and the best accuracy on 3.
Table 1: Put the best score in each column in bold. Maybe also reorder the table according to some overall performance measure... average log-loss?
How big was N for the note N-gram and N-gram models, and how many different N-grams were needed? I would like to see this recorded for each dataset.
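The count of distinct N-grams asked for above could be reported with something like the following sketch. It treats each whole frame (the set of notes sounding at one time step) as a single symbol, which is an assumption made for illustration; the paper's N-gram and note N-gram baselines may be defined differently, and `frame_ngrams` is a hypothetical helper.

```python
from collections import Counter


def frame_ngrams(roll, n):
    """Count N-grams over a piano roll, treating each frame as one symbol.

    roll: list of binary frames (lists of 0/1), one per time step.
    Returns a Counter mapping each N-gram (a tuple of n frames) to its count.
    """
    frames = [tuple(frame) for frame in roll]
    return Counter(
        tuple(frames[i:i + n]) for i in range(len(frames) - n + 1)
    )


# Hypothetical two-note roll that alternates between the two notes.
toy_roll = [[1, 0], [0, 1], [1, 0], [0, 1]]
bigrams = frame_ngrams(toy_roll, n=2)
n_distinct = len(bigrams)
```

Reporting `n_distinct` for each dataset and each value of N would show how quickly the table of contexts grows, which is the information the comment requests.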

Conclusions:
Make sure you take out the bold text for the final version.

 

Original link

 

http://icml.cc/discuss/2012/

http://icml.cc/discuss/2012/590.html